Reference: "Data Science 100 Knocks (Structured Data Processing)" by the Data Scientist Society Skill Definition Committee (データサイエンティスト協会スキル定義委員会)
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society (データサイエンティスト協会スキル定義委員会) and translated from Japanese to English with DeepL.
-
I am sharing this file to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers.
-
Since this data was created for Japanese users, you may run into language issues while practicing, but they will not affect the exercises much.
You can get the preprocess_knock files for practicing Python, SQL, and R from my GitHub account.
Introduction
- The database is PostgreSQL 13
- First, execute the following cell
- You can issue SQL by writing %%sql in the cell
- The table structure cannot be checked with a describe command from Jupyter, so if you want to inspect a table's structure, use a SELECT with a LIMIT, for example.
- You can use any familiar SQL client (connection information is as follows)
- IP address: localhost for Docker Desktop, 192.168.99.100 for Docker toolbox
- Port:5432
- Database name: dsdojo_db
- User name: padawan
- Password: padawan12345
- It is recommended to limit the number of rows you output, as Jupyter may freeze when a large amount of output is produced (the expected number of output rows is also mentioned in each question).
- Data processing also requires the ability to control how many results are displayed for confirmation and to keep the work lightweight.
- If a large number of results are output, the file may become so heavy that it cannot be opened afterwards.
- In that case the results of your work will be lost, but you can retrieve the original file from GitHub.
- You can also delete large output ranges with an editor such as vim.
- Names, addresses etc. are dummy data and do not exist.
%load_ext sql
import os
pgconfig = {
'host': 'db',
'port': os.environ['PG_PORT'],
'database': os.environ['PG_DATABASE'],
'user': os.environ['PG_USER'],
'password': os.environ['PG_PASSWORD'],
}
dsl = 'postgresql://{user}:{password}@{host}:{port}/{database}'.format(**pgconfig)
# Set up Magic commands to write SQL
%sql $dsl
'Connected: padawan@dsdojo_db'
How to use
- You can execute SQL against PostgreSQL from Jupyter by writing %%sql on the first line of a cell and the SQL on the second and subsequent lines.
%%sql
SELECT 'You can run it like this' AS sample;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
sample |
---|
You can run it like this |
100 data processing knocks
S-081: Create new product data in which the missing values of unit price (unit_price) and unit cost (unit_cost) are imputed with the average of each column. Round the averages to the nearest yen (rounding half up or rounding half to even is acceptable). After completion, also confirm that there are no missing values in either column.
%%sql
DROP TABLE IF EXISTS product_2;
CREATE TABLE product_2 AS (
SELECT
product_cd,
category_major_cd,
category_medium_cd,
category_small_cd,
COALESCE(unit_price, unit_avg) AS unit_price,
COALESCE(unit_cost, cost_avg) AS unit_cost
FROM
product
CROSS JOIN
(
SELECT
ROUND(AVG(unit_price)) AS unit_avg,
ROUND(AVG(unit_cost)) AS cost_avg
FROM
product
) stats_product
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 10030 rows affected.
[]
Commentary :
This code is written in SQL and performs the following steps:
If a table named product_2 already exists in the database, it is dropped. This is done to ensure that we start with a clean slate.
A new table named product_2 is created using the CREATE TABLE statement.
The data for the new table is derived from the product table using a SELECT statement.
In the SELECT statement, the columns product_cd, category_major_cd, category_medium_cd, and category_small_cd are selected from the product table.
Two new columns are created in the SELECT statement using the COALESCE function. COALESCE returns the first non-null value in a list of expressions. In this case, if unit_price is null, the average unit price of all products (unit_avg) is used instead. Similarly, if unit_cost is null, the average unit cost of all products (cost_avg) is used instead.
The FROM clause in the SELECT statement performs a CROSS JOIN with a subquery that calculates the average unit price and unit cost of all products in the product table. The ROUND function rounds the average values to the nearest integer.
The result of the SELECT statement populates the product_2 table as part of the CREATE TABLE ... AS statement.
In summary, the code creates a new table named product_2 that contains the same columns as the product table, but with null values in the unit_price and unit_cost columns replaced by the average unit price and unit cost, respectively, calculated across all products.
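As a quick illustration of how COALESCE picks the first non-null argument, the following standalone sketch (my addition, not part of the original exercise) can be run in the same notebook:
%%sql
-- COALESCE returns its first non-null argument:
-- the first column yields 10 (the NULL is skipped), the second yields 5.
SELECT
    COALESCE(NULL, 10) AS filled_value,
    COALESCE(5, 10)    AS untouched_value;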
%%sql
SELECT
SUM(CASE WHEN unit_price IS NULL THEN 1 ELSE 0 END) AS unit_price,
SUM(CASE WHEN unit_cost IS NULL THEN 1 ELSE 0 END) AS unit_cost
FROM
product_2
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
unit_price | unit_cost |
---|---|
0 | 0 |
Commentary :
This code is written in SQL and performs the following steps:
A SELECT statement is executed to query the product_2 table.
The SUM function is used in combination with a CASE expression to count the number of null values in the unit_price and unit_cost columns separately.
In each CASE expression, WHEN unit_price IS NULL or WHEN unit_cost IS NULL checks whether the value of the respective column is null. If the column value is null, the CASE expression evaluates to 1, otherwise it evaluates to 0.
The SUM function then adds up the 1's and 0's from each row to calculate the total number of null values in the unit_price and unit_cost columns, respectively.
The AS keyword gives the resulting columns meaningful names, i.e., unit_price and unit_cost.
Finally, the LIMIT clause restricts the output to the first 10 rows.
In summary, this code returns a result set that displays the number of null values in the unit_price and unit_cost columns of the product_2 table. The output indicates how many rows have missing data in these columns, which is useful for identifying data quality issues that need to be addressed.
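The same null counts can also be written with PostgreSQL's aggregate FILTER clause, which some readers may find more readable. This is only an alternative sketch, not part of the original answer:
%%sql
-- Count NULLs per column with an aggregate FILTER clause (PostgreSQL 9.4+).
SELECT
    COUNT(*) FILTER (WHERE unit_price IS NULL) AS unit_price_nulls,
    COUNT(*) FILTER (WHERE unit_cost  IS NULL) AS unit_cost_nulls
FROM
    product_2;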
S-082: Create new product data in which the missing values of unit price (unit_price) and unit cost (unit_cost) are imputed with the median of each column. Round the medians to the nearest yen (rounding half up or rounding half to even is acceptable). After completion, also confirm that there are no missing values in either column.
%%sql
DROP TABLE IF EXISTS product_3;
CREATE TABLE product_3 AS
(
SELECT
product_cd,
category_major_cd,
category_medium_cd,
category_small_cd,
COALESCE(unit_price, unit_med) AS unit_price,
COALESCE(unit_cost, cost_med) AS unit_cost
FROM
product
CROSS JOIN (
SELECT
ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY unit_price)) AS unit_med,
ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY unit_cost)) AS cost_med
FROM
product
) stats_product
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 10030 rows affected.
[]
Commentary :
This code is written in SQL and performs the following steps:
If a table named product_3 already exists in the database, it is dropped. This is done to ensure that we start with a clean slate.
A new table named product_3 is created using the CREATE TABLE statement.
The data for the new table is derived from the product table using a SELECT statement.
In the SELECT statement, the columns product_cd, category_major_cd, category_medium_cd, and category_small_cd are selected from the product table.
Two new columns are created in the SELECT statement using the COALESCE function. COALESCE returns the first non-null value in a list of expressions. In this case, if unit_price is null, the median unit price of all products (unit_med) is used instead. Similarly, if unit_cost is null, the median unit cost of all products (cost_med) is used instead.
The FROM clause in the SELECT statement performs a CROSS JOIN with a subquery that calculates the median unit price and unit cost of all products in the product table. The PERCENTILE_CONT function calculates the medians, and the WITHIN GROUP clause specifies the ordering of the values so that the function knows which values to use.
The ROUND function rounds the median values to the nearest integer.
The result of the SELECT statement populates the product_3 table as part of the CREATE TABLE ... AS statement.
In summary, the code creates a new table named product_3 that contains the same columns as the product table, but with null values in the unit_price and unit_cost columns replaced by the median unit price and unit cost, respectively, calculated across all products using the PERCENTILE_CONT function with a WITHIN GROUP clause.
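For readers unfamiliar with ordered-set aggregates, the following standalone sketch (my addition, not part of the original answer) contrasts PERCENTILE_CONT, which interpolates between values, with PERCENTILE_DISC, which returns an actual value from the column:
%%sql
-- PERCENTILE_CONT interpolates (may return a fractional value),
-- while PERCENTILE_DISC returns an existing value from the column.
SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY unit_price) AS median_interpolated,
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY unit_price) AS median_discrete
FROM
    product;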
%%sql
SELECT
SUM(CASE WHEN unit_price IS NULL THEN 1 ELSE 0 END) AS unit_price,
SUM(CASE WHEN unit_cost IS NULL THEN 1 ELSE 0 END) AS unit_cost
FROM
product_3
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
unit_price | unit_cost |
---|---|
0 | 0 |
Commentary :
This code is written in SQL and performs the following steps:
A SELECT statement is executed to query the product_3 table.
The SUM function is used in combination with a CASE expression to count the number of null values in the unit_price and unit_cost columns separately.
In each CASE expression, WHEN unit_price IS NULL or WHEN unit_cost IS NULL checks whether the value of the respective column is null. If the column value is null, the CASE expression evaluates to 1, otherwise it evaluates to 0.
The SUM function then adds up the 1's and 0's from each row to calculate the total number of null values in the unit_price and unit_cost columns, respectively.
The AS keyword gives the resulting columns meaningful names, i.e., unit_price and unit_cost.
Finally, the LIMIT clause restricts the output to the first 10 rows.
In summary, this code returns a result set that displays the number of null values in the unit_price and unit_cost columns of the product_3 table. The output indicates how many rows have missing data in these columns, which is useful for identifying data quality issues. The product_3 table was created in the previous step, with null values in the unit_price and unit_cost columns replaced by the median unit price and unit cost, respectively, calculated across all products.
S-083: Create new product data in which the missing values of unit price (unit_price) and unit cost (unit_cost) are imputed with the median calculated for each product sub-category code (category_small_cd). Round the medians to the nearest yen (rounding half up or rounding half to even is acceptable). After completion, also confirm that there are no missing values in either column.
%%sql
DROP TABLE IF EXISTS product_4;
CREATE TABLE product_4 AS (
WITH category_median AS
(
SELECT
category_small_cd,
ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY unit_price)) AS unit_med,
ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY unit_cost)) AS cost_med
FROM
product
GROUP BY
category_small_cd
)
SELECT
product_cd,
category_major_cd,
category_medium_cd,
category_small_cd,
COALESCE(unit_price, unit_med) AS unit_price,
COALESCE(unit_cost, cost_med) AS unit_cost
FROM
product
JOIN
category_median
USING(category_small_cd)
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 10030 rows affected.
[]
Commentary :
This code is written in SQL and performs the following steps:
A DROP TABLE statement drops the product_4 table if it already exists.
A CREATE TABLE statement creates the product_4 table.
A common table expression (CTE) named category_median is defined. This CTE orders the unit_price and unit_cost values within each category via the WITHIN GROUP clause and calculates the median of unit_price and unit_cost for each category using the PERCENTILE_CONT function. The GROUP BY clause groups the data by the category_small_cd column.
The main SELECT statement retrieves data from the product table and joins it with the category_median CTE on the category_small_cd column. The COALESCE function replaces null values in the unit_price and unit_cost columns with the median values of their corresponding categories.
Finally, the resulting data is inserted into the product_4 table using the CREATE TABLE AS syntax.
In summary, this code creates a new table product_4 by joining the product table with the CTE category_median. The medians of unit_price and unit_cost are calculated per category using the PERCENTILE_CONT function. The resulting table product_4 contains the data from the product table with null values in the unit_price and unit_cost columns replaced by the median values of their corresponding categories.
%%sql
SELECT
SUM(CASE WHEN unit_price IS NULL THEN 1 ELSE 0 END) AS unit_price,
SUM(CASE WHEN unit_cost IS NULL THEN 1 ELSE 0 END) AS unit_cost
FROM
product_4
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
unit_price | unit_cost |
---|---|
0 | 0 |
Commentary :
This code is also written in SQL and performs the following steps:
A SELECT statement retrieves the sum of null values in the unit_price and unit_cost columns of the product_4 table.
The CASE expression inside each SUM function checks whether the value in the unit_price or unit_cost column is null. If it is null, it returns 1, otherwise it returns 0. In this way, the SUM function calculates the total number of null values in each column.
The AS keyword renames the result columns to unit_price and unit_cost, respectively.
The LIMIT clause limits the number of rows returned by the query to 10.
In summary, this code calculates the total number of null values in the unit_price and unit_cost columns of the product_4 table. This helps to identify missing data in the table and to decide whether further data cleaning or imputation is necessary.
S-084: For all customers in the customer data (df_customer), create new data containing the ratio of the 2019 sales amount to the sales amount over the whole period. If a customer has no sales record, treat the amount as 0. Then extract the customers whose calculated ratio exceeds 0 and display 10 of them. Also confirm that the data created contains no missing values.
%%sql
DROP TABLE IF EXISTS sales_rate;
CREATE TABLE sales_rate AS(
WITH sales_amount_2019 AS
(
SELECT
customer_id,
SUM(amount) AS sum_amount_2019
FROM
receipt
WHERE
sales_ymd BETWEEN 20190101 AND 20191231
GROUP BY
customer_id
),
sales_amount_all AS
(
SELECT
customer_id,
SUM(amount) AS sum_amount_all
FROM
receipt
GROUP BY
customer_id
)
SELECT
a.customer_id,
COALESCE(b.sum_amount_2019, 0) AS sales_amount_2019,
COALESCE(c.sum_amount_all, 0) AS sales_amount_all,
CASE COALESCE(c.sum_amount_all, 0) WHEN 0 THEN 0 ELSE COALESCE(b.sum_amount_2019, 0) * 1.0 / c.sum_amount_all END AS sales_rate
FROM
customer a
LEFT JOIN
sales_amount_2019 b
ON
a.customer_id = b.customer_id
LEFT JOIN
sales_amount_all c
ON
a.customer_id = c.customer_id);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 21971 rows affected.
[]
Commentary :
This code is also written in SQL and performs the following steps:
A SELECT statement retrieves data from the receipt and customer tables.
Two subqueries calculate the total sales amount for each customer in the year 2019 and over all years.
The SELECT statement joins the customer table with the two subqueries using the LEFT JOIN operator. This ensures that all customers in the customer table appear in the output, even if they have no sales data in the receipt table.
The COALESCE function handles customers with no sales data in the receipt table by replacing NULL values with 0.
The CASE expression calculates the sales rate for each customer. It checks whether the total sales amount over all years is 0, in which case it returns 0. Otherwise, it divides the sales amount for 2019 by the total sales amount over all years.
The results are inserted into a new table called sales_rate.
In summary, this code calculates the sales rate for each customer based on their sales data in the receipt table. The sales rate is the ratio of the 2019 sales amount to the total sales amount over all years. The results are saved in a new table called sales_rate.
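The division-by-zero guard can also be expressed with NULLIF, which turns a zero denominator into NULL so the division yields NULL instead of an error; wrapping the result in COALESCE restores the 0. This is a hedged alternative sketch, not the original answer:
%%sql
-- NULLIF(x, 0) returns NULL when x = 0, so the division is skipped safely.
SELECT
    a.customer_id,
    COALESCE(
        COALESCE(b.sum_amount_2019, 0) * 1.0 / NULLIF(c.sum_amount_all, 0),
        0
    ) AS sales_rate
FROM
    customer a
LEFT JOIN
    (SELECT customer_id, SUM(amount) AS sum_amount_2019
     FROM receipt
     WHERE sales_ymd BETWEEN 20190101 AND 20191231
     GROUP BY customer_id) b
ON a.customer_id = b.customer_id
LEFT JOIN
    (SELECT customer_id, SUM(amount) AS sum_amount_all
     FROM receipt
     GROUP BY customer_id) c
ON a.customer_id = c.customer_id
LIMIT 10;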
%%sql
SELECT
*
FROM
sales_rate
WHERE
sales_rate > 0
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | sales_amount_2019 | sales_amount_all | sales_rate |
---|---|---|---|
CS031415000172 | 2971 | 5088 | 0.58392295597484276730 |
CS015414000103 | 874 | 3122 | 0.27994875080076873799 |
CS011215000048 | 248 | 3444 | 0.07200929152148664344 |
CS029415000023 | 3767 | 5167 | 0.72904973872653377201 |
CS035415000029 | 5823 | 7504 | 0.77598614072494669510 |
CS023513000066 | 208 | 771 | 0.26977950713359273671 |
CS035513000134 | 463 | 1565 | 0.29584664536741214058 |
CS001515000263 | 216 | 216 | 1.00000000000000000000 |
CS006415000279 | 229 | 229 | 1.00000000000000000000 |
CS031415000106 | 215 | 7741 | 0.02777418938121689704 |
%%sql
SELECT
SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS customer_id,
SUM(CASE WHEN sales_amount_2019 IS NULL THEN 1 ELSE 0 END) AS sales_amount_2019,
SUM(CASE WHEN sales_amount_all IS NULL THEN 1 ELSE 0 END) AS sales_amount_all,
SUM(CASE WHEN sales_rate IS NULL THEN 1 ELSE 0 END) AS sales_rate
FROM
sales_rate
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
customer_id | sales_amount_2019 | sales_amount_all | sales_rate |
---|---|---|---|
0 | 0 | 0 | 0 |
Commentary :
This code is also written in SQL and performs the following steps:
A SELECT statement retrieves all rows from the sales_rate table where the sales_rate value is greater than 0.
The LIMIT clause limits the output to the first 10 rows.
The results are returned as output.
In summary, this code selects all rows from the sales_rate table where the sales rate is greater than 0, which means that the customer made at least one purchase in 2019. The LIMIT clause limits the output to the first 10 rows.
S-085: Create new customer data by linking the geocode data (df_geocode) to all customers in the customer data (df_customer), using the postcode (postal_cd) as the key. However, if multiple longitude (longitude) and latitude (latitude) values are associated with a single postcode (postal_cd), calculate and use their average values. Also display 10 results to check the output.
%%sql
DROP TABLE IF EXISTS customer_1;
CREATE TABLE customer_1 AS (
WITH geocode_avg AS
(
SELECT
postal_cd,
AVG(longitude) AS m_longitude,
AVG(latitude) AS m_latitude
FROM
geocode
GROUP BY
postal_cd
)
SELECT
*
FROM
customer c
JOIN
geocode_avg g
USING(postal_cd)
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 21971 rows affected.
[]
Commentary :
This code is written in SQL and performs the following steps:
A WITH clause defines a subquery that calculates the average longitude and latitude for each postal code using the geocode table.
The SELECT statement joins the customer table with the subquery defined in the WITH clause using the postal_cd column.
The result of the join is then inserted into a new table called customer_1 using the CREATE TABLE statement. If a table with the same name already exists, it is first dropped using the DROP TABLE IF EXISTS statement.
In summary, this code creates a new table called customer_1 by joining the customer table with the geocode table on the postal_cd column and appending the average longitude and latitude of each postal code to the customer data. The resulting table has all the columns from the customer table plus two additional columns for the average longitude and latitude.
%%sql
SELECT
*
FROM
customer_1
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
postal_cd | customer_id | customer_name | gender_cd | gender | birth_day | age | address | application_store_cd | application_date | status_cd | m_longitude | m_latitude |
---|---|---|---|---|---|---|---|---|---|---|---|---|
136-0076 | CS037613000071 | 六角 雅彦 | 9 | 不明 | 1952-04-01 | 66 | 東京都江東区南砂********** | S13037 | 20150414 | 0-00000000-0 | 139.8350200000000000 | 35.6719300000000000 |
151-0053 | CS031415000172 | 宇多田 貴美子 | 1 | 女性 | 1976-10-04 | 42 | 東京都渋谷区代々木********** | S13031 | 20150529 | D-20100325-C | 139.6896500000000000 | 35.6737400000000000 |
245-0016 | CS028811000001 | 堀井 かおり | 1 | 女性 | 1933-03-27 | 86 | 神奈川県横浜市泉区和泉町********** | S14028 | 20160115 | 0-00000000-0 | 139.4836000000000000 | 35.3912500000000000 |
144-0055 | CS001215000145 | 田崎 美紀 | 1 | 女性 | 1995-03-29 | 24 | 東京都大田区仲六郷********** | S13001 | 20170605 | 6-20090929-2 | 139.7077500000000000 | 35.5408400000000000 |
136-0073 | CS015414000103 | 奥野 陽子 | 1 | 女性 | 1977-08-09 | 41 | 東京都江東区北砂********** | S13015 | 20150722 | B-20100609-B | 139.8360100000000000 | 35.6781800000000000 |
136-0073 | CS015804000004 | 松谷 米蔵 | 0 | 男性 | 1931-05-02 | 87 | 東京都江東区北砂********** | S13015 | 20150607 | 0-00000000-0 | 139.8360100000000000 | 35.6781800000000000 |
276-0022 | CS007403000016 | 依田 満 | 0 | 男性 | 1975-08-18 | 43 | 千葉県八千代市上高野********** | S12007 | 20150914 | 0-00000000-0 | 140.1326000000000000 | 35.7326400000000000 |
154-0015 | CS035614000014 | 板倉 菜々美 | 1 | 女性 | 1954-07-16 | 64 | 東京都世田谷区桜新町********** | S13035 | 20150804 | 0-00000000-0 | 139.6429700000000000 | 35.6288900000000000 |
223-0062 | CS011215000048 | 芦田 沙耶 | 1 | 女性 | 1992-02-01 | 27 | 神奈川県横浜市港北区日吉本町********** | S14011 | 20150228 | C-20100421-9 | 139.6294600000000000 | 35.5537000000000000 |
226-0021 | CS040412000191 | 川井 郁恵 | 1 | 女性 | 1977-01-05 | 42 | 神奈川県横浜市緑区北八朔町********** | S14040 | 20151101 | 1-20091025-4 | 139.5396400000000000 | 35.5276300000000000 |
Commentary :
This code is written in SQL and performs a simple SELECT statement on the customer_1 table.
The SELECT statement retrieves all the columns and rows from the customer_1 table using the * wildcard character.
The LIMIT clause limits the number of rows returned by the query to the first 10 rows.
In summary, this code retrieves the first 10 rows of the customer_1 table, which contains all the columns from the customer table along with two additional columns for the average longitude and latitude of each customer's postal code.
S-086: Combine the customer data with latitude (latitude) and longitude (longitude) created in 085 with the store data (df_store), using the member application store code (application_store_cd) as the key. Calculate the distance (in km) between the application store and the customer's address from the latitude and longitude of the store and of the customer's address, and display it together with the customer ID (customer_id), the customer address (address) and the store address (address). Use the simple formula below; libraries implementing more accurate methods are also acceptable. Display 10 results.
Latitude (radians): φ, Longitude (radians): λ
Distance: L = 6371 × arccos( sin φ1 · sin φ2 + cos φ1 · cos φ2 · cos(λ1 − λ2) )
%%sql
SELECT
c.customer_id,
c.address AS customer_address,
s.address AS store_address,
6371 * ACOS
(
SIN(RADIANS(c.m_latitude))
* SIN(RADIANS(s.latitude))
+ COS(RADIANS(c.m_latitude))
* COS(RADIANS(s.latitude))
* COS(RADIANS(c.m_longitude) - RADIANS(s.longitude))
) AS distance
FROM
customer_1 c
JOIN
store s
ON
c.application_store_cd = s.store_cd
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | customer_address | store_address | distance |
---|---|---|---|
CS037613000071 | 東京都江東区南砂********** | 東京都江東区南砂一丁目 | 1.4511822099658445 |
CS031415000172 | 東京都渋谷区代々木********** | 東京都渋谷区初台二丁目 | 0.4117334789298223 |
CS028811000001 | 神奈川県横浜市泉区和泉町********** | 神奈川県横浜市瀬谷区二ツ橋町 | 8.065196026704987 |
CS001215000145 | 東京都大田区仲六郷********** | 東京都大田区仲六郷二丁目 | 1.2684209720729687 |
CS015414000103 | 東京都江東区北砂********** | 東京都江東区南砂二丁目 | 1.449673414532165 |
CS015804000004 | 東京都江東区北砂********** | 東京都江東区南砂二丁目 | 1.449673414532165 |
CS007403000016 | 千葉県八千代市上高野********** | 千葉県佐倉市上志津 | 1.9208032538419055 |
CS035614000014 | 東京都世田谷区桜新町********** | 東京都世田谷区用賀四丁目 | 1.0214681484997588 |
CS011215000048 | 神奈川県横浜市港北区日吉本町********** | 神奈川県横浜市港北区日吉本町四丁目 | 0.8182767808775093 |
CS040412000191 | 神奈川県横浜市緑区北八朔町********** | 神奈川県横浜市緑区長津田みなみ台五丁目 | 3.6641233580823287 |
Commentary :
This SQL code retrieves data from two tables, customer_1 and store, and calculates the distance between each customer's address and the corresponding store address.
The SELECT statement retrieves the following columns: customer_id from the customer_1 table; address from the customer_1 table, renamed to customer_address; and address from the store table, renamed to store_address.
It also calculates the distance between the customer and the store using the spherical law of cosines given in the question. The ACOS, SIN, and COS functions compute the arc cosine of the combined sines and cosines of the latitudes and longitudes of both addresses, and the result is multiplied by the radius of the Earth (6371 km) to obtain the distance in kilometres.
The JOIN clause combines the rows from the customer_1 and store tables based on the matching application_store_cd and store_cd columns, respectively. The LIMIT clause limits the number of rows returned to 10.
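If the cube and earthdistance extensions are available in the database (an assumption; they are not set up by this notebook), PostgreSQL can compute great-circle distances directly. A minimal sketch, converting the metres returned by earth_distance into kilometres:
%%sql
-- Requires: CREATE EXTENSION cube; CREATE EXTENSION earthdistance;
-- ll_to_earth(latitude, longitude) builds a point on the Earth's surface,
-- and earth_distance returns the great-circle distance in metres.
SELECT
    c.customer_id,
    earth_distance(
        ll_to_earth(c.m_latitude, c.m_longitude),
        ll_to_earth(s.latitude, s.longitude)
    ) / 1000.0 AS distance_km
FROM
    customer_1 c
JOIN
    store s
ON
    c.application_store_cd = s.store_cd
LIMIT 10;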
S-087: In the customer data (df_customer), the same customer is registered more than once, e.g. because of applications at different stores. Create deduplicated (name-merged) customer data in which customers with the same name (customer_name) and postcode (postal_cd) are treated as the same customer, so that there is one record per customer, and calculate the number of customer records, the number of deduplicated records, and the number of duplicates. For duplicate customers, retain the record with the highest total sales amount; for customers with the same total sales amount or no sales record, retain the one with the smallest customer ID (customer_id).
%%sql
DROP TABLE IF EXISTS customer_u;
CREATE TABLE customer_u AS (
WITH sales_amount AS
(
SELECT
c.customer_id,
c.customer_name,
c.postal_cd,
COALESCE(SUM(r.amount), 0) AS sum_amount
FROM
customer c
LEFT JOIN
receipt r
ON
c.customer_id = r.customer_id
GROUP BY
c.customer_id,
c.customer_name,
c.postal_cd
),
sales_ranking AS
(
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY customer_name, postal_cd ORDER BY sum_amount desc, customer_id ) AS ranking
FROM
sales_amount
)
SELECT
c.*
FROM
customer c
JOIN
sales_ranking r
ON
c.customer_id = r.customer_id
AND
r.ranking = 1
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 21941 rows affected.
[]
Commentary :
This SQL code creates a new table named "customer_u" by joining data from two tables "customer" and "receipt". It first defines a Common Table Expression (CTE) named "sales_amount" which computes the total sales amount for each customer by joining the "customer" table with the "receipt" table using a left join. The CTE also uses the COALESCE function to return 0 for customers who have no sales.
Then another CTE named "sales_ranking" is defined by computing the sales ranking for each customer based on the total sales amount, within each customer_name and postal_cd group. The ROW_NUMBER() function is used to assign a rank to each customer based on the sum_amount in descending order.
Finally, the SELECT statement joins the "customer" table with the "sales_ranking" CTE, filtering only the rows where the ranking is 1. This effectively selects only the top-ranking customer for each customer_name and postal_cd group. The resulting data is then inserted into a new table named "customer_u".
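PostgreSQL's DISTINCT ON offers a more compact way to keep one row per (customer_name, postal_cd) group; this is an alternative sketch I am adding for comparison, using the same tie-breaking rules as the original answer:
%%sql
-- DISTINCT ON keeps the first row of each (customer_name, postal_cd) group
-- according to the ORDER BY: highest total sales first, then smallest customer_id.
SELECT DISTINCT ON (c.customer_name, c.postal_cd)
    c.*
FROM
    customer c
LEFT JOIN
    (SELECT customer_id, SUM(amount) AS sum_amount
     FROM receipt
     GROUP BY customer_id) r
ON c.customer_id = r.customer_id
ORDER BY
    c.customer_name,
    c.postal_cd,
    COALESCE(r.sum_amount, 0) DESC,
    c.customer_id;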
%%sql
SELECT
customer_cnt,
customer_u_cnt,
customer_cnt - customer_u_cnt AS diff
FROM
(
SELECT
COUNT(1) AS customer_cnt
FROM
customer
) customer
CROSS JOIN
(SELECT
COUNT(1) AS customer_u_cnt
FROM
customer_u
) customer_u
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
customer_cnt | customer_u_cnt | diff |
---|---|---|
21971 | 21941 | 30 |
Commentary :
This SQL code queries the number of customers in two different tables and calculates the difference between them.
The first part of the code is a subquery that selects the total count of customers from the "customer" table and aliases it as "customer_cnt".
The second part of the code is another subquery that selects the total count of customers from the "customer_u" table and aliases it as "customer_u_cnt".
Finally, the outer query selects the "customer_cnt" and "customer_u_cnt" from the two subqueries, and calculates the difference between them as "diff". This is done by subtracting the "customer_u_cnt" from the "customer_cnt".
The result is a single row with three columns - "customer_cnt", "customer_u_cnt", and "diff". The "customer_cnt" column displays the total number of customers in the "customer" table, "customer_u_cnt" column displays the total number of customers in the "customer_u" table, and "diff" column displays the difference between these two counts.
S-088: Based on the data created in 087, add an integration ID to the customer data. The integration ID shall be assigned according to the following specification.
- Non-duplicated customers: set their own customer ID (customer_id)
- Duplicated customers: set the customer ID of the record retained in the previous question
Also check the difference between the number of unique customer IDs and the number of unique integration IDs.
%%sql
DROP TABLE IF EXISTS customer_n;
CREATE TABLE customer_n AS
(
SELECT
c.*,
u.customer_id AS integration_id
FROM
customer c
JOIN
customer_u u
ON
c.customer_name = u.customer_name
AND
c.postal_cd = u.postal_cd
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 21971 rows affected.
[]
Commentary :
This SQL code creates a new table called customer_n by joining the existing customer and customer_u tables.
The JOIN condition matches rows of the customer and customer_u tables that have the same values in the customer_name and postal_cd columns. For these matching rows, the integration_id column of the resulting customer_n table is set to the customer_id from the customer_u table.
The customer_n table therefore includes the rows from the customer table along with an additional integration_id column containing the corresponding customer_id from the customer_u table. Because an inner JOIN is used, any customer row without a matching customer_name and postal_cd in customer_u would be excluded; in this dataset every customer has a match, so all 21,971 rows are retained.
%%sql
SELECT
COUNT(1) AS id_count_diff
FROM
customer_n
WHERE
customer_id != integration_id;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
id_count_diff |
---|
30 |
Commentary :
This SQL code queries the number of rows in the "customer_n" table where the "customer_id" column is not equal to the "integration_id" column.
The "customer_n" table was created in the previous code block by joining the "customer" table with the "customer_u" table using the customer's name and postal code as the join criteria. This effectively integrated the two tables by adding a new "integration_id" column to the "customer" table, which contains the corresponding "customer_id" value from the "customer_u" table.
The current query counts the number of rows in the "customer_n" table where the "customer_id" value is not equal to the "integration_id" value. This means that there are records in the "customer" table that do not have a corresponding record in the "customer_u" table or that have a different "customer_id" value in the "customer_u" table. This can be used to identify data integration issues and investigate the discrepancies further.
S-089: You want to split the customers with sales records into training data and test data for building a forecasting model. Randomly split the data at a ratio of 8:2.
%%sql
SELECT SETSEED(0.1);
CREATE TEMP TABLE IF NOT EXISTS sales_customer AS
(
SELECT
customer_id ,
ROW_NUMBER() OVER(ORDER BY RANDOM()) AS row
FROM
customer
JOIN
receipt
USING(customer_id)
GROUP BY customer_id
HAVING SUM(AMOUNT) > 0
);
DROP TABLE IF EXISTS customer_train;
CREATE TABLE customer_train AS
SELECT
customer.*
FROM
sales_customer
JOIN
customer
USING(customer_id)
WHERE
sales_customer.row <= (SELECT
COUNT(1)
FROM
sales_customer) * 0.8
;
DROP TABLE IF EXISTS customer_test;
CREATE TABLE customer_test AS
SELECT
customer.*
FROM
sales_customer
JOIN
customer
USING(customer_id)
EXCEPT
SELECT
*
FROM
customer_train
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected. 8306 rows affected. Done. 6644 rows affected. Done. 1662 rows affected.
[]
Commentary :
This code is using SQL to create a training and testing dataset for a machine learning model based on customer sales data. Here is a breakdown of what each line does:
%%sql SELECT SETSEED(0.1);
This sets the seed for the random number generator used in the query. This is done to ensure that the same results are obtained each time the query is run.
CREATE TEMP TABLE IF NOT EXISTS sales_customer AS (
SELECT
customer_id,
ROW_NUMBER() OVER(ORDER BY RANDOM()) AS row
FROM
customer
JOIN receipt USING(customer_id)
GROUP BY
customer_id
HAVING
SUM(AMOUNT) > 0
);
This creates a temporary table called sales_customer that contains each customer's customer_id and a randomly assigned row number. It selects data from two tables, customer and receipt, using a JOIN. It groups the data by customer_id and filters the result to include only customers who have made a purchase (SUM(AMOUNT) > 0).
DROP TABLE IF EXISTS customer_train;
CREATE TABLE customer_train AS
SELECT customer.*
FROM sales_customer
JOIN customer USING(customer_id)
WHERE sales_customer.row <= (SELECT COUNT(1) FROM sales_customer) * 0.8 ;
This creates a table called customer_train that contains 80% of the customers in the sales_customer table. It selects all columns from the customer table for customers whose row number in the sales_customer table is less than or equal to 80% of the total number of rows in the sales_customer table.
DROP TABLE IF EXISTS customer_test;
CREATE TABLE customer_test AS
SELECT customer.*
FROM sales_customer
JOIN customer USING(customer_id)
EXCEPT
SELECT * FROM customer_train ;
This creates a table called customer_test that contains the remaining 20% of the customers from the sales_customer table. It selects all columns from the customer table for customers whose customer_id is in the sales_customer table but not in the customer_train table. This ensures that the test set contains only customers that are not in the training set.
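An alternative way to make the same 8:2 split, shown here only as a hedged sketch of my own, is to deal the shuffled customers into buckets with NTILE and keep buckets 1-8 for training:
%%sql
-- NTILE(10) deals the randomly ordered customers into 10 roughly equal buckets;
-- buckets 1-8 (80%) can serve as training data, buckets 9-10 (20%) as test data.
-- LIMIT 10 only keeps the displayed output light.
SELECT
    customer_id,
    NTILE(10) OVER (ORDER BY RANDOM()) AS bucket
FROM
    customer
JOIN
    receipt
USING (customer_id)
GROUP BY
    customer_id
HAVING
    SUM(amount) > 0
LIMIT 10;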
%%sql
SELECT
train_cnt * 1.0 / all_cnt AS train_ratio,
test_cnt * 1.0 / all_cnt AS test_ratio
FROM
(SELECT
COUNT(1) AS all_cnt
FROM
sales_customer) all_data
CROSS JOIN
(SELECT
COUNT(1) AS train_cnt
FROM
customer_train) train_data
CROSS JOIN
(SELECT
COUNT(1) AS test_cnt
FROM customer_test) test_data
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
train_ratio | test_ratio |
---|---|
0.79990368408379484710 | 0.20009631591620515290 |
Commentary :
This code is calculating the percentage of data in the training set and the test set, based on the tables created in the previous SQL code. Here is a breakdown of what each line does:
%%sql SELECT train_cnt * 1.0 / all_cnt AS train_ratio, test_cnt * 1.0 / all_cnt AS test_ratio
This is the main query, which selects the percentage of data in the training set and in the test set. It uses the train_cnt, test_cnt, and all_cnt values, which are calculated in the following subqueries:
FROM
(SELECT COUNT(1) AS all_cnt FROM sales_customer) all_data
CROSS JOIN (SELECT COUNT(1) AS train_cnt FROM customer_train) train_data
CROSS JOIN (SELECT COUNT(1) AS test_cnt FROM customer_test) test_data ;
These subqueries calculate the total number of rows in the sales_customer table (all_cnt), the number of rows in the customer_train table (train_cnt), and the number of rows in the customer_test table (test_cnt).
The CROSS JOIN operator combines these subqueries into a single table. Since each subquery returns only one row, the result of the cross join is a single row with three columns: all_cnt, train_cnt, and test_cnt.
The main query then calculates the percentage of data in the training set and in the test set by dividing train_cnt and test_cnt by all_cnt, respectively. The * 1.0 expression converts the result to a floating-point number, which ensures that the division is done with floating-point precision.
S-090: The receipt details data (df_receipt) covers the period from 1 January 2017 to 31 October 2019. Aggregate the sales amount (amount) by month and create three datasets for building time-series models, each consisting of 12 months of training data and 6 months of test data.
%%sql
-- Because this task is not well suited to SQL, the solution is written somewhat forcefully (with many splits the SQL gets long, so this approach is not really practical).
-- Also be careful when the time series is fine-grained (e.g. per-second data) over a long period, because the data becomes huge.
-- In such cases, a language that can train models in a loop is preferable.
-- As a preliminary step, total the sales amount by year and month and assign sequential numbers.
CREATE TEMP TABLE IF NOT EXISTS ts_amount AS
(
SELECT
SUBSTR(CAST(sales_ymd AS VARCHAR), 1, 6) AS sales_ym,
SUM(amount) AS sum_amount,
ROW_NUMBER() OVER(ORDER BY SUBSTR(CAST(sales_ymd AS VARCHAR), 1, 6)) AS rn
FROM
receipt
GROUP BY
sales_ym
);
-- SQL has its limits here, but the query should be kept as reusable as possible as the number of datasets grows.
-- The LAG function in the WITH clause is written so that the same query can be reused by simply changing the lag period.
DROP TABLE IF EXISTS series_data_1 ;
CREATE TABLE series_data_1 AS (
WITH lag_amount AS
(
SELECT
sales_ym,
sum_amount,
LAG(rn, 0) OVER (ORDER BY rn) AS rn
FROM
ts_amount
)
SELECT
sales_ym,
sum_amount,
CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
FROM
lag_amount
WHERE
rn BETWEEN 1 AND 18);
DROP TABLE IF EXISTS series_data_2 ;
CREATE TABLE series_data_2 AS (
WITH lag_amount AS
(
SELECT
sales_ym,
sum_amount,
LAG(rn, 6) OVER (ORDER BY rn) AS rn
FROM
ts_amount
)
SELECT
sales_ym,
sum_amount,
CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
FROM
lag_amount
WHERE
rn BETWEEN 1 AND 18);
DROP TABLE IF EXISTS series_data_3 ;
CREATE TABLE series_data_3 AS (
WITH lag_amount AS
(
SELECT
sales_ym,
sum_amount,
LAG(rn, 12) OVER (ORDER BY rn) AS rn
FROM
ts_amount
)
SELECT
sales_ym,
sum_amount,
CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
FROM
lag_amount
WHERE
rn BETWEEN 1 AND 18);
* postgresql://padawan:***@db:5432/dsdojo_db 34 rows affected. Done. 18 rows affected. Done. 18 rows affected. Done. 18 rows affected.
[]
Commentary :
This code creates three new tables (series_data_1, series_data_2, and series_data_3) by applying a window function to a summary table (ts_amount) of sales data. Here is a breakdown of what each part does:
%%sql
CREATE TEMP TABLE IF NOT EXISTS ts_amount AS
(
SELECT
SUBSTR(CAST(sales_ymd AS VARCHAR), 1, 6) AS sales_ym,
SUM(amount) AS sum_amount,
ROW_NUMBER() OVER(ORDER BY SUBSTR(CAST(sales_ymd AS VARCHAR), 1, 6)) AS rn
FROM
receipt
GROUP BY
sales_ym
);
This code creates a temporary table (ts_amount) that summarizes the sales data by month (using the SUBSTR function to extract the year and month from the sales_ymd column), calculates the total amount of sales (SUM(amount)), and assigns a row number to each row (ROW_NUMBER() OVER(...) AS rn). The row numbers are used in the subsequent queries to select the data for each time period.
DROP TABLE IF EXISTS series_data_1 ;
CREATE TABLE series_data_1 AS (
WITH lag_amount AS
(
SELECT
sales_ym,
sum_amount,
LAG(rn, 0) OVER (ORDER BY rn) AS rn
FROM
ts_amount
)
SELECT
sales_ym,
sum_amount,
CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
FROM
lag_amount
WHERE
rn BETWEEN 1 AND 18);
This code creates a new table (series_data_1) by shifting the row numbers of the monthly sales data with the LAG window function and assigning a binary flag (test_flg) to each row based on whether it belongs to the training set (test_flg = 0) or the test set (test_flg = 1). This is done by comparing the row number (rn) with a threshold value (12), which splits the data into two parts: the first 12 rows are used for training and the remaining 6 rows for testing. The resulting table has 18 rows in total. The code then repeats this process two more times with different time periods:
DROP TABLE IF EXISTS series_data_2 ;
CREATE TABLE series_data_2 AS (
WITH lag_amount AS (
SELECT
sales_ym,
sum_amount,
LAG(rn, 6) OVER (ORDER BY rn) AS rn
FROM ts_amount
)
SELECT
sales_ym,
sum_amount,
CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
FROM lag_amount
WHERE rn BETWEEN 1 AND 18);
DROP TABLE IF EXISTS series_data_3 ;
CREATE TABLE series_data_3 AS (
WITH lag_amount AS (
SELECT
sales_ym,
sum_amount,
LAG(rn, 12) OVER (ORDER BY rn) AS rn
FROM ts_amount
)
SELECT
sales_ym,
sum_amount,
CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
FROM lag_amount
WHERE rn BETWEEN 1 AND 18);
series_data_2 shifts the reference window by six months (LAG(rn, 6)), while series_data_3 shifts it by one year (LAG(rn, 12)).
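The three almost identical queries can also be parameterised. As a hedged sketch of my own (it reuses the ts_amount temporary table created above), a prepared statement can take the shift in months as a parameter:
%%sql
-- A parameterised version of the windowing logic: $1 is the number of months
-- to shift the 18-month window (0, 6, or 12 reproduces series_data_1..3).
-- Run DEALLOCATE series_window; first if re-executing in the same session.
PREPARE series_window(int) AS
SELECT
    sales_ym,
    sum_amount,
    CASE WHEN rn - $1 <= 12 THEN 0 ELSE 1 END AS test_flg
FROM
    ts_amount
WHERE
    rn - $1 BETWEEN 1 AND 18;
EXECUTE series_window(6);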
%%sql
-- series_data_2 and series_data_3 are omitted.
SELECT
*
FROM
series_data_1
;
* postgresql://padawan:***@db:5432/dsdojo_db 18 rows affected.
sales_ym | sum_amount | test_flg |
---|---|---|
201701 | 902056 | 0 |
201702 | 764413 | 0 |
201703 | 962945 | 0 |
201704 | 847566 | 0 |
201705 | 884010 | 0 |
201706 | 894242 | 0 |
201707 | 959205 | 0 |
201708 | 954836 | 0 |
201709 | 902037 | 0 |
201710 | 905739 | 0 |
201711 | 932157 | 0 |
201712 | 939654 | 0 |
201801 | 944509 | 1 |
201802 | 864128 | 1 |
201803 | 946588 | 1 |
201804 | 937099 | 1 |
201805 | 1004438 | 1 |
201806 | 1012329 | 1 |
Commentary :
This code is a SQL query that selects all columns and rows from the table series_data_1. The table series_data_1 was created in the previous SQL code block using the CREATE TABLE statement.
When creating series_data_1, the query first builds a temporary table ts_amount, which calculates the sum of amounts for each month and assigns a row number based on the month. Using the row number and the LAG() window function, the query then creates the table series_data_1, which contains the sales amount for each month and a flag indicating whether the row is part of the training data or the test data.
Finally, in this code block, the query selects all columns and rows from series_data_1, which lets you view the result of the previous SQL statements.
S-091: For the customer data (df_customer), under-sample customers with sales records and customers without sales records so that their numbers are in a 1:1 ratio.
%%sql
SELECT SETSEED(0.1);
CREATE TEMP TABLE IF NOT EXISTS down_sampling AS (
WITH pre_table_1 AS
(
SELECT
c.*,
COALESCE(r.sum_amount,0) AS sum_amount
FROM
customer c
LEFT JOIN
(
SELECT
customer_id,
SUM(amount) AS sum_amount
FROM
receipt
GROUP BY
customer_id
) r
ON
c.customer_id=r.customer_id
)
,pre_table_2 AS(
SELECT
*
,CASE WHEN sum_amount > 0 THEN 1 ELSE 0 END AS is_buy_flag
,CASE WHEN sum_amount = 0 THEN 1 ELSE 0 END AS is_not_buy_flag
FROM
pre_table_1
)
,pre_table_3 AS
(
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY is_buy_flag ORDER BY RANDOM())
FROM
pre_table_2
CROSS JOIN
(SELECT SUM(is_buy_flag) AS buying FROM pre_table_2) AS t1
CROSS JOIN
(SELECT SUM(is_not_buy_flag) AS not_buying FROM pre_table_2) AS t2
)
SELECT
*
FROM
pre_table_3
WHERE
row_number <= buying
AND row_number <= not_buying
);
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected. 16612 rows affected.
[]
Commentary :
This code is a SQL query that performs downsampling of customer data. Downsampling is a process of randomly removing some data points from a dataset to balance the classes or reduce the dataset size.
In the query, the first statement sets the random seed value. The next statement creates a temporary table called down_sampling that includes the customer data and the sum of amounts for each customer. The LEFT JOIN in the first CTE combines the customer table with the receipt table, calculating the sum of amounts for each customer.
In the second CTE, two additional columns are created using a CASE expression. These columns are flags that indicate whether a customer has made a purchase or not. Then, in the third CTE, the ROW_NUMBER() window function assigns a row number to each customer record within each value of the is_buy_flag column, in random order.
Finally, in the last statement of the query, rows are selected only where the row number is less than or equal to both the number of buying customers and the number of non-buying customers. This ensures that the number of rows for each class is balanced in the resulting table.
%%sql
SELECT
is_buy_flag,
COUNT(1)
FROM
down_sampling
GROUP BY
is_buy_flag
;
* postgresql://padawan:***@db:5432/dsdojo_db 2 rows affected.
is_buy_flag | count |
---|---|
0 | 8306 |
1 | 8306 |
Commentary :
This code runs a SQL query that groups the records in the down_sampling temporary table by their is_buy_flag column and counts the number of records in each group.
The down_sampling table was created in the previous code cell by performing a left join between the customer table and a subquery that aggregates the amount column from the receipt table by customer_id. The resulting table has columns for the customer information as well as the total sum of amounts purchased by each customer.
The pre_table_2 subquery creates two new columns based on the sum_amount column. The is_buy_flag column is set to 1 for customers who have made at least one purchase and 0 for customers who have not made any purchases. The is_not_buy_flag column is set to 1 for customers who have not made any purchases and 0 for customers who have made at least one purchase.
The pre_table_3 subquery generates a row number for each row in the pre_table_2 table, partitioned by the is_buy_flag column and ordered randomly. It also calculates the total number of rows in the pre_table_2 table where is_buy_flag is 1 and where it is 0, using cross joins with two separate subqueries.
Finally, the main SELECT statement selects all rows from pre_table_3 where the row number is less than or equal to both the total number of rows where is_buy_flag is 1 and the total where it is 0. This ensures that there are an equal number of records for each value of is_buy_flag.
Therefore, the output of the SELECT statement shows the count of records in the down_sampling table where is_buy_flag is 0 and where it is 1.
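A more direct way to draw the balanced sample, shown only as a hedged sketch of my own, is to keep every buyer and an equally sized random sample of non-buyers; it assumes buyers are the minority class, which is the case in this dataset (8,306 buyers vs. 13,665 non-buyers):
%%sql
-- Keep all buyers and an equally sized random sample of non-buyers.
WITH flagged AS (
    SELECT
        c.*,
        COALESCE(r.sum_amount, 0) AS sum_amount
    FROM
        customer c
    LEFT JOIN
        (SELECT customer_id, SUM(amount) AS sum_amount
         FROM receipt
         GROUP BY customer_id) r
    ON c.customer_id = r.customer_id
),
ranked AS (
    SELECT
        *,
        -- random order within each class, plus the size of the buyer class
        ROW_NUMBER() OVER (PARTITION BY (sum_amount > 0) ORDER BY RANDOM()) AS rn,
        COUNT(*) FILTER (WHERE sum_amount > 0) OVER () AS n_buyers
    FROM
        flagged
)
SELECT
    COUNT(*) FILTER (WHERE sum_amount > 0) AS buyers,
    COUNT(*) FILTER (WHERE sum_amount = 0) AS non_buyers
FROM
    ranked
WHERE
    rn <= n_buyers;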
S-092: Normalise the gender information in the customer data (df_customer) to third normal form.
%%sql
DROP TABLE IF EXISTS customer_std;
CREATE TABLE customer_std AS
(
SELECT
customer_id,
customer_name,
gender_cd,
birth_day,
age,
postal_cd,
application_store_cd,
application_date,
status_cd
FROM
customer
);
DROP TABLE IF EXISTS gender_std;
CREATE TABLE gender_std AS
(
SELECT DISTINCT
gender_cd,
gender
FROM
customer
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 21971 rows affected. Done. 3 rows affected.
[]
Commentary :
This SQL code creates two new tables from the existing table named customer.
The first part of the code drops a table named customer_std if it exists and then creates a new table named customer_std. The new table has the columns customer_id, customer_name, gender_cd, birth_day, age, postal_cd, application_store_cd, application_date, and status_cd from the customer table; the gender name column is deliberately left out so that gender information is held only as a code.
The second part of the code drops a table named gender_std if it exists and then creates a new table named gender_std. The new table has two columns, gender_cd and gender. The gender_cd column is a distinct list of all gender codes found in the customer table, while the gender column holds the corresponding gender description. This table can be used to map gender codes to their gender descriptions.
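To confirm that no information was lost by the normalisation, the original denormalised shape can be reconstructed by joining the two tables back together; a small sketch of my own:
%%sql
-- Rebuild the original (denormalised) shape by joining the code table back in.
SELECT
    c.customer_id,
    c.gender_cd,
    g.gender
FROM
    customer_std c
JOIN
    gender_std g
USING (gender_cd)
LIMIT 3;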
%%sql
-- Data content checks
SELECT
*
FROM
customer_std
LIMIT
3
;
* postgresql://padawan:***@db:5432/dsdojo_db 3 rows affected.
customer_id | customer_name | gender_cd | birth_day | age | postal_cd | application_store_cd | application_date | status_cd |
---|---|---|---|---|---|---|---|---|
CS021313000114 | 大野 あや子 | 1 | 1981-04-29 | 37 | 259-1113 | S14021 | 20150905 | 0-00000000-0 |
CS037613000071 | 六角 雅彦 | 9 | 1952-04-01 | 66 | 136-0076 | S13037 | 20150414 | 0-00000000-0 |
CS031415000172 | 宇多田 貴美子 | 1 | 1976-10-04 | 42 | 151-0053 | S13031 | 20150529 | D-20100325-C |
Commentary :
This code selects the first 3 rows from the customer_std table, which was created earlier. The customer_std table is a copy of the customer table with a subset of its columns, so this code simply displays the values of those columns for the first three customers in the table. This is a basic check to confirm that the data has been correctly written to the new table.
%%sql
-- Data content checks
SELECT
*
FROM
gender_std
LIMIT
3
;
* postgresql://padawan:***@db:5432/dsdojo_db 3 rows affected.
gender_cd | gender |
---|---|
0 | 男性 |
9 | 不明 |
1 | 女性 |
Commentary :
This code executes an SQL query to select the first 3 rows from the gender_std table that was created in the previous code block.
The LIMIT clause restricts the number of rows returned by the query to 3. Since the gender_std table was created using a SELECT DISTINCT statement, it only contains the unique combinations of gender_cd and the corresponding gender values from the customer table. Therefore, this query returns the first 3 unique rows of the gender_std table.
The output of this query shows the gender_cd and gender values for the first 3 rows of the gender_std table.
S-093: Product data (df_product) only holds the code values for each category, but not the category names. Combine with category data (df_category) and denormalise to create new product data that holds category names.
%%sql
DROP TABLE IF EXISTS product_full;
CREATE TABLE product_full AS
(
SELECT
p.product_cd,
p.category_major_cd,
c.category_major_name,
p.category_medium_cd,
c.category_medium_name,
p.category_small_cd,
c.category_small_name,
p.unit_price,
p.unit_cost
FROM
product p
JOIN
category c
USING(category_small_cd)
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 10030 rows affected.
[]
Commentary :
This code creates a new table called "product_full" by joining two tables "product" and "category".
The "product" table contains information about each product, such as its product code, category codes, unit price, and unit cost.
The "category" table contains category codes and category names for each category level.
The "JOIN" clause connects the two tables based on the category_small_cd column. By using the "USING" keyword, the query matches rows where the values in category_small_cd are equal in both tables.
The new table "product_full" includes all the columns from the "product" table, plus additional columns for category names, which are obtained by looking up the category codes in the "category" table.
%%sql
-- Data content checks
SELECT
*
FROM
product_full
LIMIT
3
;
* postgresql://padawan:***@db:5432/dsdojo_db 3 rows affected.
product_cd | category_major_cd | category_major_name | category_medium_cd | category_medium_name | category_small_cd | category_small_name | unit_price | unit_cost |
---|---|---|---|---|---|---|---|---|
P040101001 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 198 | 149 |
P040101002 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 218 | 164 |
P040101003 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 230 | 173 |
Commentary :
This code selects data from two tables, product and category, and creates a new table called product_full.
The JOIN clause joins the two tables on a common column, category_small_cd. This means the resulting table only has rows where the value of category_small_cd is the same in both tables.
The columns selected from the two tables are: product_cd, category_major_cd, category_major_name, category_medium_cd, category_medium_name, category_small_cd, category_small_name, unit_price, and unit_cost.
The resulting table product_full contains these columns for every row that meets the join criteria.
The LIMIT clause restricts the number of rows displayed in the output to just 3.
S-094: Export the product data with category names created in 093 to a file with the following specifications.

| File format | Header | Character encoding |
|:--:|:--:|:--:|
| CSV (comma-separated values) | With | UTF-8 |

The output destination path should be as follows (permission to use the COPY command has been granted).

| Output destination |
|:--:|
| /tmp/data |
- By specifying “/tmp/data”, the data is shared with Jupyter’s “/work/data”.
%%sql
COPY product_full TO '/tmp/data/S_product_full_UTF-8_header.csv'
WITH CSV HEADER ENCODING 'UTF-8';
* postgresql://padawan:***@db:5432/dsdojo_db 10030 rows affected.
[]
Commentary :
This code exports the contents of the product_full table to a CSV file named "S_product_full_UTF-8_header.csv" located in the "/tmp/data" directory. The COPY command is a PostgreSQL command used to copy data between files and tables. In this case, the TO keyword specifies the file to which the data is copied. The CSV keyword specifies the file format as comma-separated values, and the HEADER keyword specifies that the first row of the file should contain the column headers. Finally, the ENCODING keyword specifies the character encoding of the output file as UTF-8.
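A server-side COPY writes the file on the database host, which is why the shared /tmp/data mount matters here (psql's client-side \copy meta-command would be an alternative when only client access is available). COPY also accepts a query instead of a table name, which is handy for exporting a filtered subset; a hedged sketch with a hypothetical file name:
%%sql
-- COPY can take a query instead of a table name; the file name below is hypothetical.
COPY (
    SELECT product_cd, category_major_name, unit_price
    FROM product_full
    WHERE category_major_cd = '04'
) TO '/tmp/data/S_product_subset_UTF-8_header.csv'
WITH CSV HEADER ENCODING 'UTF-8';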
S-095: Export the product data with category names created in 093 to a file with the following specifications.

| File format | Header | Character encoding |
|:--:|:--:|:--:|
| CSV (comma-separated values) | With | CP932 |

In PostgreSQL, specifying Shift_JIS (SJIS) is treated as equivalent to CP932. The output destination path should be as follows (permission to use the COPY command has been granted).

| Output destination |
|:--:|
| /tmp/data |
- By specifying “/tmp/data”, the data is shared with Jupyter’s “/work/data”.
%%sql
COPY product_full TO '/tmp/data/S_product_full_SJIS_header.csv'
WITH CSV HEADER ENCODING 'SJIS';
* postgresql://padawan:***@db:5432/dsdojo_db 10030 rows affected.
[]
Commentary :
This code uses the COPY command in PostgreSQL to export the contents of the product_full table to a CSV file named S_product_full_SJIS_header.csv. The WITH CSV HEADER option specifies that the output should include a header row with column names, and the ENCODING 'SJIS' option specifies that the output file should be encoded using the Shift-JIS character encoding.
Shift-JIS is a character encoding commonly used in Japan, and is one of several possible character encodings that can be used for CSV files. By exporting the data to a CSV file with Shift-JIS encoding, it allows the file to be easily imported into other applications or databases that require this specific encoding.
S-096: Export the product data with category names created in 093 to a file with the following specifications.

| File format | Header | Character encoding |
|:--:|:--:|:--:|
| CSV (comma-separated values) | Without | UTF-8 |

The output destination path should be as follows (permission to use the COPY command has been granted).

| Output destination |
|:--:|
| /tmp/data |
- By specifying “/tmp/data”, the data is shared with Jupyter’s “/work/data”.
%%sql
COPY product_full TO '/tmp/data/S_product_full_UTF-8_noh.csv'
WITH CSV ENCODING 'UTF-8';
* postgresql://padawan:***@db:5432/dsdojo_db 10030 rows affected.
[]
Commentary :
This code exports the contents of the `product_full` table to a CSV file named `S_product_full_UTF-8_noh.csv` in the `/tmp/data/` directory without including a header row in the output file.
The `COPY` command is used in PostgreSQL to copy data to or from a file. In this case, the `product_full` table is being copied to a file. The `WITH CSV` option specifies that the file is in comma-separated value (CSV) format, and the `ENCODING` option specifies the character encoding of the file. The value `UTF-8` indicates that the file uses UTF-8 encoding, which supports a wide range of characters from different languages.
Since the `HEADER` option is not specified, the output file does not include a header row; it contains only the rows of data from the `product_full` table in CSV format.
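If you want a quick confirmation that the headerless file covers the whole table, an optional check (not part of the exercise) is to compare the table's row count with the "rows affected" figure reported by `COPY`:
%%sql
-- Optional check: this count should match the "10030 rows affected" reported by
-- the COPY command above; a headerless export writes exactly one line per row.
SELECT COUNT(*) AS table_rows FROM product_full;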
S-097: Load the file in the following format created in 094, display three data items and check that they have been imported correctly.
| File format | With/without header | Character encoding |
|:--:|:--:|:--:|
| CSV (comma separated) | With | UTF-8 |
%%sql
DROP TABLE IF EXISTS product_full;
CREATE TABLE product_full
(
product_cd VARCHAR(10),
category_major_cd VARCHAR(2),
category_major_name VARCHAR(20),
category_medium_cd VARCHAR(4),
category_medium_name VARCHAR(20),
category_small_cd VARCHAR(6),
category_small_name VARCHAR(20),
unit_price INTEGER,
unit_cost INTEGER
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. Done.
[]
Commentary :
This SQL code drops the table `product_full` if it exists and creates a new table with the same name. The new table has nine columns: `product_cd`, `category_major_cd`, `category_major_name`, `category_medium_cd`, `category_medium_name`, `category_small_cd`, `category_small_name`, `unit_price`, and `unit_cost`. The first seven columns are of type `VARCHAR` with varying lengths, while the last two are of type `INTEGER`.
By specifying the data types and lengths of the columns, the code defines the schema of the `product_full` table. This ensures that each column has a specific data type and length, which can improve the performance of queries on this table and prevent errors that would occur if data of the wrong type or length were inserted.
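Because the describe command is not available from Jupyter, one way to confirm the schema just defined is to query `information_schema`. This is only an optional sketch; the expected output is simply the column list declared in the `CREATE TABLE` above.
%%sql
-- Optional sketch: list the columns of product_full via information_schema,
-- since \d / describe cannot be used from Jupyter.
SELECT
    column_name,
    data_type,
    character_maximum_length
FROM
    information_schema.columns
WHERE
    table_name = 'product_full'
ORDER BY
    ordinal_position;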
%%sql
COPY product_full FROM '/tmp/data/S_product_full_UTF-8_header.csv'
WITH CSV HEADER ENCODING 'UTF-8';
* postgresql://padawan:***@db:5432/dsdojo_db 10030 rows affected.
[]
Commentary :
This code imports data from the CSV file located at '/tmp/data/S_product_full_UTF-8_header.csv' into the `product_full` table created in the previous step.
The `COPY` command in PostgreSQL allows you to copy data from a file to a table or vice versa. The `FROM` keyword specifies the path to the CSV file, and the `WITH` keyword provides options for the `COPY` command. In this case, `CSV` specifies that the file is in comma-separated value format, `HEADER` specifies that the first row of the file contains column names (and is therefore skipped on import), and `ENCODING 'UTF-8'` specifies the character encoding used in the file.
Since the `product_full` table already exists in the database, the `COPY` command inserts the data from the CSV file into the existing table.
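As an optional extra check (not required by the exercise), you could confirm the number of imported rows and how many prices were parsed. Note that the original product data does contain some missing unit prices, so the second count need not equal the first.
%%sql
-- Optional sketch: row count after the import plus the number of non-NULL
-- unit_price values (the source data legitimately contains some missing prices).
SELECT
    COUNT(*) AS row_count,
    COUNT(unit_price) AS non_null_prices
FROM
    product_full;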
%%sql
SELECT
*
FROM
product_full
LIMIT
3
;
* postgresql://padawan:***@db:5432/dsdojo_db 3 rows affected.
product_cd | category_major_cd | category_major_name | category_medium_cd | category_medium_name | category_small_cd | category_small_name | unit_price | unit_cost |
---|---|---|---|---|---|---|---|---|
P040101001 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 198 | 149 |
P040101002 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 218 | 164 |
P040101003 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 230 | 173 |
Commentary :
This SQL code is a simple SELECT statement that retrieves the first three rows of data from the product_full table. The data was previously loaded into the table using the COPY command from a CSV file. The query is using the LIMIT clause to limit the number of rows returned to 3.
The SELECT statement retrieves all columns from the product_full table. The table contains information about products, including their codes, categories, and prices. The data is organized into several columns, including product_cd, category_major_cd, category_major_name, category_medium_cd, category_medium_name, category_small_cd, category_small_name, unit_price, and unit_cost.
S-098: Load the file in the following format created in 096, display three data items and check that they are imported correctly.
| File format | With/without header | Character encoding |
|:--:|:--:|:--:|
| CSV (comma separated) | Without | UTF-8 |
%%sql
DROP TABLE IF EXISTS product_full;
CREATE TABLE product_full
(
product_cd VARCHAR(10),
category_major_cd VARCHAR(2),
category_major_name VARCHAR(20),
category_medium_cd VARCHAR(4),
category_medium_name VARCHAR(20),
category_small_cd VARCHAR(6),
category_small_name VARCHAR(20),
unit_price INTEGER,
unit_cost INTEGER
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. Done.
[]
Commentary :
This code drops the table namedproduct_full
if it exists, and creates a new table with the same name and schema. The new table has 9 columns:product_cd
column is of type VARCHAR(10)category_major_cd
column is of type VARCHAR(2)category_major_name
column is of type VARCHAR(20)category_medium_cd
column is of type VARCHAR(4)category_medium_name
column is of type VARCHAR(20)category_small_cd
column is of type VARCHAR(6)category_small_name
column is of type VARCHAR(20)unit_price
column is of type INTEGERunit_cost
column is of type INTEGER
TheVARCHAR
data type is used to store variable-length character strings with the specified maximum length. TheINTEGER
data type is used to store whole numbers.
This code creates an empty table with the specified schema, and no data is added to the table.
%%sql
COPY product_full FROM '/tmp/data/S_product_full_UTF-8_noh.csv'
WITH CSV ENCODING 'UTF-8';
* postgresql://padawan:***@db:5432/dsdojo_db 10030 rows affected.
[]
Commentary :
This code is importing data from a CSV file named 'S_product_full_UTF-8_noh.csv' into the table 'product_full' using the COPY command in PostgreSQL.
The WITH CSV option specifies that the data is in CSV format. The ENCODING option specifies the character encoding of the file, which is UTF-8 in this case.
Since the file doesn't have a header row, the column names and types are specified explicitly in a preceding SQL statement which creates the 'product_full' table.
Once the data is imported into the table, the SELECT statement is used to retrieve the first 3 rows of the 'product_full' table to confirm that the data has been imported correctly.
%%sql
SELECT
*
FROM
product_full
LIMIT
3
;
* postgresql://padawan:***@db:5432/dsdojo_db 3 rows affected.
product_cd | category_major_cd | category_major_name | category_medium_cd | category_medium_name | category_small_cd | category_small_name | unit_price | unit_cost |
---|---|---|---|---|---|---|---|---|
P040101001 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 198 | 149 |
P040101002 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 218 | 164 |
P040101003 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 230 | 173 |
Commentary :
This code selects the first three rows from the "product_full" table using a SQL SELECT statement. The statement retrieves all columns (product_cd, category_major_cd, category_major_name, category_medium_cd, category_medium_name, category_small_cd, category_small_name, unit_price, unit_cost) and limits the output to the first three rows using the LIMIT keyword.
S-099: File out the product data with category names created in 093 with the following specifications.
| file format | with/without header | character encoding |
|:--:|:--:|:--:|
| TSV (Tab Separated Values) | With | UTF-8 |
The path to the file output destination should be as follows (COPY command authorisation has been granted).
| Output destination |
|:--:|
| /tmp/data |
- By specifying “/tmp/data”, the data is shared with Jupyter’s “/work/data”.
%%sql
COPY product_full TO '/tmp/data/S_product_full_UTF-8_header.tsv'
WITH CSV HEADER DELIMITER E'\t' ENCODING 'UTF-8';
* postgresql://padawan:***@db:5432/dsdojo_db 10030 rows affected.
[]
Commentary :
This code exports the contents of the `product_full` table in CSV format, with a tab (`\t`) used as the delimiter instead of a comma. The exported file is saved in the `/tmp/data` directory and named `S_product_full_UTF-8_header.tsv`. The `HEADER` option is included, so the first row of the file contains the column headers.
Here's a breakdown of the individual components of the code:
- `COPY product_full`: specifies that the data from the `product_full` table will be copied to a file.
- `TO '/tmp/data/S_product_full_UTF-8_header.tsv'`: specifies the path and name of the exported file.
- `WITH CSV`: specifies that the exported file will use CSV formatting rules.
- `HEADER`: specifies that the first row of the file will contain the column headers.
- `DELIMITER E'\t'`: specifies that the tab character will be used as the delimiter.
- `ENCODING 'UTF-8'`: specifies the character encoding to use when exporting the file.

An equivalent export written with `COPY`'s option-list syntax is sketched after this list.
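This is only an optional sketch, not part of the original answer; the output file name with the `_alt` suffix is made up for illustration. As of PostgreSQL 13 the `HEADER` option is only accepted together with CSV formatting, which is why `FORMAT csv` with an explicit tab delimiter is used rather than the plain text format.
%%sql
-- Optional sketch: equivalent TSV export using the option-list COPY syntax.
-- FORMAT csv with DELIMITER E'\t' produces tab-separated output with a header.
COPY product_full TO '/tmp/data/S_product_full_UTF-8_header_alt.tsv'
WITH (FORMAT csv, HEADER true, DELIMITER E'\t', ENCODING 'UTF-8');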
S-100: Load the file in the following format created in 099, display three items of data and check that they have been imported correctly.
| File format | With/without header | Character encoding |
|:--:|:--:|:--:|
| TSV (Tab Separated) | With | UTF-8 |
%%sql
DROP TABLE IF EXISTS product_full;
CREATE TABLE product_full
(
product_cd VARCHAR(10),
category_major_cd VARCHAR(2),
category_major_name VARCHAR(20),
category_medium_cd VARCHAR(4),
category_medium_name VARCHAR(20),
category_small_cd VARCHAR(6),
category_small_name VARCHAR(20),
unit_price INTEGER,
unit_cost INTEGER
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. Done.
[]
Commentary :
This SQL code block first drops the table namedproduct_full
if it exists, and then creates a new table with the same name and schema.
The schema of theproduct_full
table has 9 columns with the following data types:product_cd
:VARCHAR(10)
(i.e., a variable-length character string with a maximum length of 10)category_major_cd
:VARCHAR(2)
category_major_name
:VARCHAR(20)
category_medium_cd
:VARCHAR(4)
category_medium_name
:VARCHAR(20)
category_small_cd
:VARCHAR(6)
category_small_name
:VARCHAR(20)
unit_price
:INTEGER
unit_cost
:INTEGER
The purpose of this code block is to define the structure of theproduct_full
table before data is loaded into it.
%%sql
COPY product_full FROM '/tmp/data/S_product_full_UTF-8_header.tsv'
WITH CSV HEADER DELIMITER E'\t' ENCODING 'UTF-8';
* postgresql://padawan:***@db:5432/dsdojo_db 10030 rows affected.
[]
Commentary :
This code copies data from a TSV (tab-separated values) file located at '/tmp/data/S_product_full_UTF-8_header.tsv' and inserts it into the 'product_full' table in the current database.
The COPY command specifies that the input file should be parsed using CSV rules, with a header row in the first line of the file and a tab character ('\t') as the delimiter. The ENCODING parameter specifies that the file is encoded in UTF-8.
This command assumes that the 'product_full' table already exists and has the same structure as the data in the input file (product_cd, category_major_cd, category_major_name, category_medium_cd, category_medium_name, category_small_cd, category_small_name, unit_price, unit_cost). It inserts one row into the table for each data row in the file.
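Because the Japanese category names make a round trip through the UTF-8 TSV file, an optional sanity check (not part of the original exercise) is to aggregate on one of those text columns and confirm the values are still readable after the import:
%%sql
-- Optional sketch: count products per major category to confirm the multibyte
-- category names survived the export/import round trip.
SELECT
    category_major_name,
    COUNT(*) AS products
FROM
    product_full
GROUP BY
    category_major_name
ORDER BY
    products DESC
LIMIT
    5;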
%%sql
SELECT
*
FROM
product_full
LIMIT
3
;
* postgresql://padawan:***@db:5432/dsdojo_db 3 rows affected.
product_cd | category_major_cd | category_major_name | category_medium_cd | category_medium_name | category_small_cd | category_small_name | unit_price | unit_cost |
---|---|---|---|---|---|---|---|---|
P040101001 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 198 | 149 |
P040101002 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 218 | 164 |
P040101003 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 230 | 173 |
Commentary :
This code executes a SQL query to select the first three rows from theproduct_full
table. TheSELECT
statement is followed by*
, which means to select all columns from the table.LIMIT 3
limits the result set to the first three rows of the table.
The result will display the data in theproduct_full
table according to the columns specified in the table. The exact output will depend on the data stored in the table.
This is the end of the 100 knocks. Well done!