参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック(構造化データ加工編)」
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
-
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers.
-
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
You can get preprocess_knock files to practice python, SQL, R from my github account
Introduction
- The database is PostgreSQL13
- First, execute the following cell
- You can issue SQL by writing %%sql in the cell
- You cannot check the table structure with the describe command from jupyter, so if you want to check the table structure, use a SELECT with a limit, for example.
- You can use any familiar SQL client (connection information is as follows)
- IP address: localhost for Docker Desktop, 192.168.99.100 for Docker toolbox
- Port:5432
- Database name: dsdojo_db
- User name: padawan
- Password: padawan12345
- It is recommended to limit the number of outputs, as Jupyter may freeze if a large amount of outputs are performed (the number of outputs is also mentioned in the question).
- Data processing also requires the ability to control the amount of results to be displayed for confirmation and to make light work of them.
- If a large number of results are output, the file may become so heavy that it cannot be opened afterwards.
- In this case, the results of your work will be lost, but you can retrieve the file from GitHub.
- You can also delete large output ranges using a vim editor, for example.
- Names, addresses etc. are dummy data and do not exist.
%load_ext sql
import os
pgconfig = {
'host': 'db',
'port': os.environ['PG_PORT'],
'database': os.environ['PG_DATABASE'],
'user': os.environ['PG_USER'],
'password': os.environ['PG_PASSWORD'],
}
dsl = 'postgresql://{user}:{password}@{host}:{port}/{database}'.format(**pgconfig)
# Set up Magic commands to write SQL
%sql $dsl
'Connected: padawan@dsdojo_db'
How to use
- You can execute SQL from Jupyter against PostgreSQL by writing %%sql at the beginning of the cell and SQL in the second and subsequent lines.
%%sql
SELECT 'You can run it like this' AS sample;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
| sample |
|---|
| You can run it like this |
100 data processing knocks
S-081: Create new product data for the missing values of unit_price(unit_price) and unit_cost(unit_cost), supplemented by the average value of each. Note that for the average values, round off values less than one yen (rounding off or rounding to even numbers is acceptable). After completion, also check that there are no missing values for each item.
%%sql
DROP TABLE IF EXISTS product_2;
CREATE TABLE product_2 AS (
SELECT
product_cd,
category_major_cd,
category_medium_cd,
category_small_cd,
COALESCE(unit_price, unit_avg) AS unit_price,
COALESCE(unit_cost, cost_avg) AS unit_cost
FROM
product
CROSS JOIN
(
SELECT
ROUND(AVG(unit_price)) AS unit_avg,
ROUND(AVG(unit_cost)) AS cost_avg
FROM
product
) stats_product
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 10030 rows affected.
[]
Commentary :
This code is written in SQL language and it performs the following steps:
If a table namedproduct_2already exists in the database, it will be dropped. This is done to ensure that we start with a clean slate.
A new table namedproduct_2is created using theCREATE TABLEstatement.
The data for the new table is derived from theproducttable using aSELECTstatement.
In theSELECTstatement, the columnsproduct_cd,category_major_cd,category_medium_cd, andcategory_small_cdare selected from theproducttable.
Two new columns are created in theSELECTstatement using theCOALESCEfunction.COALESCEreturns the first non-null value in a list of expressions. In this case, ifunit_priceis null, the average unit price of all products (unit_avg) is used instead. Similarly, ifunit_costis null, the average unit cost of all products (cost_avg) is used instead.
TheFROMclause in theSELECTstatement performs aCROSS JOINwith a subquery that calculates the average unit price and unit cost of all products in theproducttable. TheROUNDfunction is used to round the average values to the nearest integer.
The result of theSELECTstatement is used to populate theproduct_2table using theINSERT INTOstatement that is implicitly executed by theCREATE TABLEstatement.
In summary, the code creates a new table namedproduct_2that contains the same columns as theproducttable, but with null values in theunit_priceandunit_costcolumns replaced by the average unit price and unit cost values, respectively, calculated across all products.
%%sql
SELECT
SUM(CASE WHEN unit_price IS NULL THEN 1 ELSE 0 END) AS unit_price,
SUM(CASE WHEN unit_cost IS NULL THEN 1 ELSE 0 END) AS unit_cost
FROM
product_2
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
| unit_price | unit_cost |
|---|---|
| 0 | 0 |
Commentary :
This code is written in SQL language and it performs the following steps:
ASELECTstatement is executed to query theproduct_2table.
TheSUMfunction is used in combination with aCASEexpression to count the number of null values in theunit_priceandunit_costcolumns separately.
In eachCASEexpression,WHEN unit_price IS NULLorWHEN unit_cost IS NULLchecks if the value of the respective column is null.
If the column has a null value, theCASEexpression evaluates to 1, otherwise it evaluates to 0.
TheSUMfunction then sums up the 1's and 0's from each row to calculate the total number of null values in theunit_priceandunit_costcolumns, respectively.
TheASkeyword is used to give the resulting columns in the result set meaningful names, i.e.,unit_priceandunit_cost.
Finally, theLIMITclause is used to restrict the output to the first 10 rows.
In summary, this code returns a result set that displays the number of null values in theunit_priceandunit_costcolumns of theproduct_2table. The output indicates how many rows have missing data in these columns, which can be useful in identifying data quality issues that need to be addressed.
S-082: Create new product data for missing values of unit_price(unit_price) and unit_cost(unit_cost), supplemented by the median value of each. Note that for the median value, round off the figures to the nearest yen (rounding off or rounding to even numbers is acceptable). After completion, also check that there are no missing values for each item.
%%sql
DROP TABLE IF EXISTS product_3;
CREATE TABLE product_3 AS
(
SELECT
product_cd,
category_major_cd,
category_medium_cd,
category_small_cd,
COALESCE(unit_price, unit_med) AS unit_price,
COALESCE(unit_cost, cost_med) AS unit_cost
FROM
product
CROSS JOIN (
SELECT
ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY unit_price)) AS unit_med,
ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY unit_cost)) AS cost_med
FROM
product
) stats_product
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 10030 rows affected.
[]
Commentary :
This code is written in SQL language and it performs the following steps:
If a table namedproduct_3already exists in the database, it will be dropped. This is done to ensure that we start with a clean slate.
A new table namedproduct_3is created using theCREATE TABLEstatement.
The data for the new table is derived from theproducttable using aSELECTstatement.
In theSELECTstatement, the columnsproduct_cd,category_major_cd,category_medium_cd, andcategory_small_cdare selected from theproducttable.
Two new columns are created in theSELECTstatement using theCOALESCEfunction.COALESCEreturns the first non-null value in a list of expressions. In this case, ifunit_priceis null, the median unit price of all products (unit_med) is used instead. Similarly, ifunit_costis null, the median unit cost of all products (cost_med) is used instead.
TheFROMclause in theSELECTstatement performs aCROSS JOINwith a subquery that calculates the median unit price and unit cost of all products in theproducttable. ThePERCENTILE_CONTfunction is used to calculate the median values. TheWITHIN GROUPclause specifies the ordering of the rows, so that the function knows which values to use to calculate the median.
TheROUNDfunction is used to round the median values to the nearest integer.
The result of theSELECTstatement is used to populate theproduct_3table using theINSERT INTOstatement that is implicitly executed by theCREATE TABLEstatement.
In summary, the code creates a new table namedproduct_3that contains the same columns as theproducttable, but with null values in theunit_priceandunit_costcolumns replaced by the median unit price and unit cost values, respectively, calculated across all products. The median values are calculated using thePERCENTILE_CONTfunction with aWITHIN GROUPclause that specifies the ordering of the rows.
%%sql
SELECT
SUM(CASE WHEN unit_price IS NULL THEN 1 ELSE 0 END) AS unit_price,
SUM(CASE WHEN unit_cost IS NULL THEN 1 ELSE 0 END) AS unit_cost
FROM
product_3
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
| unit_price | unit_cost |
|---|---|
| 0 | 0 |
Commentary :
This code is written in SQL language and it performs the following steps:
ASELECTstatement is executed to query theproduct_3table.
TheSUMfunction is used in combination with aCASEexpression to count the number of null values in theunit_priceandunit_costcolumns separately.
In eachCASEexpression,WHEN unit_price IS NULLorWHEN unit_cost IS NULLchecks if the value of the respective column is null.
If the column has a null value, theCASEexpression evaluates to 1, otherwise it evaluates to 0.
TheSUMfunction then sums up the 1's and 0's from each row to calculate the total number of null values in theunit_priceandunit_costcolumns, respectively.
TheASkeyword is used to give the resulting columns in the result set meaningful names, i.e.,unit_priceandunit_cost.
Finally, theLIMITclause is used to restrict the output to the first 10 rows.
In summary, this code returns a result set that displays the number of null values in theunit_priceandunit_costcolumns of theproduct_3table. The output indicates how many rows have missing data in these columns, which can be useful in identifying data quality issues that need to be addressed. Theproduct_3table was created in a previous step, with null values in theunit_priceandunit_costcolumns replaced by the median unit price and unit cost values, respectively, calculated across all products.
S-083: Create new product data for the missing values of unit price(unit_price) and unit cost(unit cost), supplemented by the median value calculated for each product category sub-category code (category_small_cd). Note that the median value should be rounded to the nearest yen (rounding off or rounding to even numbers is acceptable). After completion, also check that there are no deficiencies for each item.
%%sql
DROP TABLE IF EXISTS product_4;
CREATE TABLE product_4 AS (
WITH category_median AS
(
SELECT
category_small_cd,
ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY unit_price)) AS unit_med,
ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY unit_cost)) AS cost_med
FROM
product
GROUP BY
category_small_cd
)
SELECT
product_cd,
category_major_cd,
category_medium_cd,
category_small_cd,
COALESCE(unit_price, unit_med) AS unit_price,
COALESCE(unit_cost, cost_med) AS unit_cost
FROM
product
JOIN
category_median
USING(category_small_cd)
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 10030 rows affected.
[]
Commentary :
This code is written in SQL language and it performs the following steps:
ADROP TABLEstatement is executed to drop theproduct_4table if it already exists.
ACREATE TABLEstatement is executed to create theproduct_4table.
A common table expression (CTE) namedcategory_medianis defined. This CTE uses aSELECTstatement with anORDER BYclause to sort the rows in theproducttable by theunit_priceandunit_costcolumns within each category, and then calculates the median values ofunit_priceandunit_costfor each category using thePERCENTILE_CONTfunction. TheGROUP BYclause groups the data bycategory_small_cdcolumn.
The mainSELECTstatement retrieves data from theproducttable and joins it with thecategory_medianCTE on thecategory_small_cdcolumn. TheCOALESCEfunction is used to replace null values in theunit_priceandunit_costcolumns with the median values for their corresponding categories.
Finally, the resulting data is inserted into theproduct_4table using theCREATE TABLE ASsyntax.
In summary, this code creates a new tableproduct_4by joining theproducttable with a CTEcategory_median. The median values ofunit_priceandunit_costare calculated for each category in theproducttable using thePERCENTILE_CONTfunction. The resulting tableproduct_4contains data from theproducttable with null values in theunit_priceandunit_costcolumns replaced by the median values for their corresponding categories.
%%sql
SELECT
SUM(CASE WHEN unit_price IS NULL THEN 1 ELSE 0 END) AS unit_price,
SUM(CASE WHEN unit_cost IS NULL THEN 1 ELSE 0 END) AS unit_cost
FROM
product_4
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
| unit_price | unit_cost |
|---|---|
| 0 | 0 |
Commentary :
This code is also written in SQL and it performs the following steps:
ASELECTstatement is executed to retrieve the sum of null values in theunit_priceandunit_costcolumns of theproduct_4table.
TheCASEstatement in theSUMfunction checks whether each value in theunit_priceandunit_costcolumns is null or not. If it is null, then it returns 1, otherwise, it returns 0. This way, theSUMfunction calculates the total number of null values in each column.
TheASkeyword is used to rename the columns asunit_priceandunit_costrespectively.
TheLIMITclause is used to limit the number of rows returned by the query to 10.
In summary, this code calculates the total number of null values in theunit_priceandunit_costcolumns of theproduct_4table. This can help to identify missing data in the table and determine whether further data cleaning or imputation is necessary.
S-084: Prepare new data for all customers in the customer data (df_customer) by calculating the percentage of the 2019 sales value to the sales value for the whole period. However, if there are no sales results, treat them as 0. Then, extract the calculated percentages that exceed 0 and display the results for 10 cases. Also check that there are no missing data in the data created.
%%sql
DROP TABLE IF EXISTS sales_rate;
CREATE TABLE sales_rate AS(
WITH sales_amount_2019 AS
(
SELECT
customer_id,
SUM(amount) AS sum_amount_2019
FROM
receipt
WHERE
sales_ymd BETWEEN 20190101 AND 20191231
GROUP BY
customer_id
),
sales_amount_all AS
(
SELECT
customer_id,
SUM(amount) AS sum_amount_all
FROM
receipt
GROUP BY
customer_id
)
SELECT
a.customer_id,
COALESCE(b.sum_amount_2019, 0) AS sales_amount_2019,
COALESCE(c.sum_amount_all, 0) AS sales_amount_all,
CASE COALESCE(c.sum_amount_all, 0) WHEN 0 THEN 0 ELSE COALESCE(b.sum_amount_2019, 0) * 1.0 / c.sum_amount_all END AS sales_rate
FROM
customer a
LEFT JOIN
sales_amount_2019 b
ON
a.customer_id = b.customer_id
LEFT JOIN
sales_amount_all c
ON
a.customer_id = c.customer_id);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 21971 rows affected.
[]
Commentary :
This code is also written in SQL and it performs the following steps:
ASELECTstatement is executed to retrieve data from thereceiptandcustomertables.
Two subqueries are created to calculate the total sales amount for each customer in the year 2019 and for all years.
TheSELECTstatement joins thecustomertable with the two subqueries using theLEFT JOINoperator. This ensures that all customers in thecustomertable are included in the output, even if they have no sales data in thereceipttable.
TheCOALESCEfunction is used to handle cases where a customer has no sales data in thereceipttable. It replacesNULLvalues with0.
TheCASEstatement calculates the sales rate for each customer. It checks if the total sales amount for all years is0, in which case it returns0. Otherwise, it divides the sales amount for the year 2019 by the total sales amount for all years.
The results are inserted into a new table calledsales_rate.
In summary, this code calculates the sales rate for each customer based on their sales data from thereceipttable. The sales rate is the ratio of the sales amount for the year 2019 to the total sales amount for all years. The results are saved in a new table calledsales_rate.
%%sql
SELECT
*
FROM
sales_rate
WHERE
sales_rate > 0
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| customer_id | sales_amount_2019 | sales_amount_all | sales_rate |
|---|---|---|---|
| CS031415000172 | 2971 | 5088 | 0.58392295597484276730 |
| CS015414000103 | 874 | 3122 | 0.27994875080076873799 |
| CS011215000048 | 248 | 3444 | 0.07200929152148664344 |
| CS029415000023 | 3767 | 5167 | 0.72904973872653377201 |
| CS035415000029 | 5823 | 7504 | 0.77598614072494669510 |
| CS023513000066 | 208 | 771 | 0.26977950713359273671 |
| CS035513000134 | 463 | 1565 | 0.29584664536741214058 |
| CS001515000263 | 216 | 216 | 1.00000000000000000000 |
| CS006415000279 | 229 | 229 | 1.00000000000000000000 |
| CS031415000106 | 215 | 7741 | 0.02777418938121689704 |
%%sql
SELECT
SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS unit_price,
SUM(CASE WHEN sales_amount_2019 IS NULL THEN 1 ELSE 0 END) AS unit_price,
SUM(CASE WHEN sales_amount_all IS NULL THEN 1 ELSE 0 END) AS unit_cost,
SUM(CASE WHEN sales_rate IS NULL THEN 1 ELSE 0 END) AS unit_cost
FROM
sales_rate
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
| unit_price | unit_price_1 | unit_cost | unit_cost_1 |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
Commentary :
This code is also written in SQL and it performs the following steps:
ASELECTstatement is executed to retrieve all rows from thesales_ratetable where thesales_ratevalue is greater than0.
TheLIMITclause limits the output to the first10rows.
The results are returned as output.
In summary, this code selects all rows from thesales_ratetable where the sales rate is greater than0, which means that the customer has made at least one sale in the year 2019. TheLIMITclause limits the output to the first10rows.
S-085: Create new customer data by linking geocode data (df_geocode) to all customers in the customer data (df_customer) using postcode (postal_cd). However, if multiple longitude (longitude) and latitude (latitude) information is associated with a single postcode (postal_cd), calculate and use the average values of longitude (longitude) and latitude (latitude). Also, display 10 results to check the creation results.
%%sql
DROP TABLE IF EXISTS customer_1;
CREATE TABLE customer_1 AS (
WITH geocode_avg AS
(
SELECT
postal_cd,
AVG(longitude) AS m_longitude,
AVG(latitude) AS m_latitude
FROM
geocode
GROUP BY
postal_cd
)
SELECT
*
FROM
customer c
JOIN
geocode_avg g
USING(postal_cd)
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 21971 rows affected.
[]
Commentary :
This code is written in SQL and it performs the following steps:
AWITHclause is used to define a subquery that calculates the average longitude and latitude for each postal code using thegeocodetable.
TheSELECTstatement joins thecustomertable with the subquery defined in theWITHclause using thepostal_cdcolumn.
The result of the join operation is then inserted into a new table calledcustomer_1using theCREATE TABLEstatement. If a table with the same name already exists, it is first dropped using theDROP TABLE IF EXISTSstatement.
In summary, this code creates a new table calledcustomer_1by joining thecustomertable with thegeocodetable using thepostal_cdcolumn and appending the average longitude and latitude for each postal code to thecustomertable. The resulting table has all the columns from thecustomertable along with two additional columns for the average longitude and latitude.
%%sql
SELECT
*
FROM
customer_1
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| postal_cd | customer_id | customer_name | gender_cd | gender | birth_day | age | address | application_store_cd | application_date | status_cd | m_longitude | m_latitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 136-0076 | CS037613000071 | 六角 雅彦 | 9 | 不明 | 1952-04-01 | 66 | 東京都江東区南砂********** | S13037 | 20150414 | 0-00000000-0 | 139.8350200000000000 | 35.6719300000000000 |
| 151-0053 | CS031415000172 | 宇多田 貴美子 | 1 | 女性 | 1976-10-04 | 42 | 東京都渋谷区代々木********** | S13031 | 20150529 | D-20100325-C | 139.6896500000000000 | 35.6737400000000000 |
| 245-0016 | CS028811000001 | 堀井 かおり | 1 | 女性 | 1933-03-27 | 86 | 神奈川県横浜市泉区和泉町********** | S14028 | 20160115 | 0-00000000-0 | 139.4836000000000000 | 35.3912500000000000 |
| 144-0055 | CS001215000145 | 田崎 美紀 | 1 | 女性 | 1995-03-29 | 24 | 東京都大田区仲六郷********** | S13001 | 20170605 | 6-20090929-2 | 139.7077500000000000 | 35.5408400000000000 |
| 136-0073 | CS015414000103 | 奥野 陽子 | 1 | 女性 | 1977-08-09 | 41 | 東京都江東区北砂********** | S13015 | 20150722 | B-20100609-B | 139.8360100000000000 | 35.6781800000000000 |
| 136-0073 | CS015804000004 | 松谷 米蔵 | 0 | 男性 | 1931-05-02 | 87 | 東京都江東区北砂********** | S13015 | 20150607 | 0-00000000-0 | 139.8360100000000000 | 35.6781800000000000 |
| 276-0022 | CS007403000016 | 依田 満 | 0 | 男性 | 1975-08-18 | 43 | 千葉県八千代市上高野********** | S12007 | 20150914 | 0-00000000-0 | 140.1326000000000000 | 35.7326400000000000 |
| 154-0015 | CS035614000014 | 板倉 菜々美 | 1 | 女性 | 1954-07-16 | 64 | 東京都世田谷区桜新町********** | S13035 | 20150804 | 0-00000000-0 | 139.6429700000000000 | 35.6288900000000000 |
| 223-0062 | CS011215000048 | 芦田 沙耶 | 1 | 女性 | 1992-02-01 | 27 | 神奈川県横浜市港北区日吉本町********** | S14011 | 20150228 | C-20100421-9 | 139.6294600000000000 | 35.5537000000000000 |
| 226-0021 | CS040412000191 | 川井 郁恵 | 1 | 女性 | 1977-01-05 | 42 | 神奈川県横浜市緑区北八朔町********** | S14040 | 20151101 | 1-20091025-4 | 139.5396400000000000 | 35.5276300000000000 |
Commentary :
This code is written in SQL and performs a simpleSELECTstatement on thecustomer_1table.
TheSELECTstatement retrieves all the columns and rows from thecustomer_1table using the*wildcard character.
TheLIMITclause limits the number of rows returned by the query to the first 10 rows.
In summary, this code retrieves the first 10 rows of thecustomer_1table, which contains all the columns from thecustomertable along with two additional columns for the average longitude and latitude of each customer's postal code.
P-086: For the customer data with latitude(latitude) and longitude(longitude) created in 085, combine it with shop data (df_store) using the member application shop code (application_store_cd) as key. Calculate the distance (in km) between the application shop and the customer address using the latitude(latitude) and longitude information (longitude) of the application shop and the latitude and longitude of the customer address (address), and display it together with the customer ID (customer_id), customer address and shop address (address). The following simple formula should be used as the calculation formula, but libraries using other highly accurate methods are also acceptable. Display 10 results.
Latitude(radians):φLongitude(radians):λDistance L=6371∗arccos(sinφ1∗sinφ2+cosφ1∗cosφ2∗cos(λ1−λ2))
%%sql
SELECT
c.customer_id,
c.address AS customer_address,
s.address AS store_address,
6371 * ACOS
(
SIN(RADIANS(c.m_latitude))
* SIN(RADIANS(s.latitude))
+ COS(RADIANS(c.m_latitude))
* COS(RADIANS(s.latitude))
* COS(RADIANS(c.m_longitude) - RADIANS(s.longitude))
) AS distance
FROM
customer_1 c
JOIN
store s
ON
c.application_store_cd = s.store_cd
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| customer_id | customer_address | store_address | distance |
|---|---|---|---|
| CS037613000071 | 東京都江東区南砂********** | 東京都江東区南砂一丁目 | 1.4511822099658445 |
| CS031415000172 | 東京都渋谷区代々木********** | 東京都渋谷区初台二丁目 | 0.4117334789298223 |
| CS028811000001 | 神奈川県横浜市泉区和泉町********** | 神奈川県横浜市瀬谷区二ツ橋町 | 8.065196026704987 |
| CS001215000145 | 東京都大田区仲六郷********** | 東京都大田区仲六郷二丁目 | 1.2684209720729687 |
| CS015414000103 | 東京都江東区北砂********** | 東京都江東区南砂二丁目 | 1.449673414532165 |
| CS015804000004 | 東京都江東区北砂********** | 東京都江東区南砂二丁目 | 1.449673414532165 |
| CS007403000016 | 千葉県八千代市上高野********** | 千葉県佐倉市上志津 | 1.9208032538419055 |
| CS035614000014 | 東京都世田谷区桜新町********** | 東京都世田谷区用賀四丁目 | 1.0214681484997588 |
| CS011215000048 | 神奈川県横浜市港北区日吉本町********** | 神奈川県横浜市港北区日吉本町四丁目 | 0.8182767808775093 |
| CS040412000191 | 神奈川県横浜市緑区北八朔町********** | 神奈川県横浜市緑区長津田みなみ台五丁目 | 3.6641233580823287 |
Commentary :
This SQL code retrieves data from two tables,customer_1andstore, and calculates the distance between each customer's address and the corresponding store address.
TheSELECTstatement retrieves the following columns:customer_idfrom thecustomer_1table.addressfrom thecustomer_1table, renamed tocustomer_address.addressfrom thestoretable, renamed tostore_address.
A calculated value for the distance between the customer and the store, using the Haversine formula. TheACOS,SIN, andCOSfunctions are used to calculate the arc-cosine of the sine and cosine of the latitudes and longitudes of both addresses, respectively. The result is then multiplied by the radius of the Earth (6371 km) to obtain the distance in kilometers.
TheJOINstatement combines the rows from thecustomer_1andstoretables based on the matchingapplication_store_cdandstore_cdcolumns, respectively. TheLIMITclause limits the number of rows returned to 10.
S-087: In the customer data (df_customer), the same customer is registered more than once, e.g. due to applications at different shops. Create the named customer data, in which customers with the same name (customer_name) and postcode (postal_cd) are regarded as the same customer and are collated so that there is one record per customer, and calculate the number of customer data, the number of named customer data and the number of duplicates. However, for the same customer, the one with the highest total sales amount shall be retained, and for customers with the same total sales amount or no sales record, the one with the lowest customer ID (customer_id) number shall be retained.
%%sql
DROP TABLE IF EXISTS customer_u;
CREATE TABLE customer_u AS (
WITH sales_amount AS
(
SELECT
c.customer_id,
c.customer_name,
c.postal_cd,
COALESCE(SUM(r.amount), 0) AS sum_amount
FROM
customer c
LEFT JOIN
receipt r
ON
c.customer_id = r.customer_id
GROUP BY
c.customer_id,
c.customer_name,
c.postal_cd
),
sales_ranking AS
(
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY customer_name, postal_cd ORDER BY sum_amount desc, customer_id ) AS ranking
FROM
sales_amount
)
SELECT
c.*
FROM
customer c
JOIN
sales_ranking r
ON
c.customer_id = r.customer_id
AND
r.ranking = 1
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 21941 rows affected.
[]
Commentary :
This SQL code creates a new table named "customer_u" by joining data from two tables "customer" and "receipt". It first defines a Common Table Expression (CTE) named "sales_amount" which computes the total sales amount for each customer by joining the "customer" table with the "receipt" table using a left join. The CTE also uses the COALESCE function to return 0 for customers who have no sales.
Then another CTE named "sales_ranking" is defined by computing the sales ranking for each customer based on the total sales amount, within each customer_name and postal_cd group. The ROW_NUMBER() function is used to assign a rank to each customer based on the sum_amount in descending order.
Finally, the SELECT statement joins the "customer" table with the "sales_ranking" CTE, filtering only the rows where the ranking is 1. This effectively selects only the top-ranking customer for each customer_name and postal_cd group. The resulting data is then inserted into a new table named "customer_u".
%%sql
SELECT
customer_cnt,
customer_u_cnt,
customer_cnt - customer_u_cnt AS diff
FROM
(
SELECT
COUNT(1) AS customer_cnt
FROM
customer
) customer
CROSS JOIN
(SELECT
COUNT(1) AS customer_u_cnt
FROM
customer_u
) customer_u
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
| customer_cnt | customer_u_cnt | diff |
|---|---|---|
| 21971 | 21941 | 30 |
Commentary :
This SQL code queries the number of customers in two different tables and calculates the difference between them.
The first part of the code is a subquery that selects the total count of customers from the "customer" table and aliases it as "customer_cnt".
The second part of the code is another subquery that selects the total count of customers from the "customer_u" table and aliases it as "customer_u_cnt".
Finally, the outer query selects the "customer_cnt" and "customer_u_cnt" from the two subqueries, and calculates the difference between them as "diff". This is done by subtracting the "customer_u_cnt" from the "customer_cnt".
The result is a single row with three columns - "customer_cnt", "customer_u_cnt", and "diff". The "customer_cnt" column displays the total number of customers in the "customer" table, "customer_u_cnt" column displays the total number of customers in the "customer_u" table, and "diff" column displays the difference between these two counts.
P-088: Based on the data created in 087, create data with integrated name IDs for the customer data. However, the integrated name ID shall be assigned according to the following specifications.
- Non-duplicated customers: set customer ID (customer_id)
- Duplicate customers: set the customer ID of the record extracted in the previous question
The difference between the number of unique customer IDs and the number of unique integration name-drop IDs shall also be checked.
%%sql
DROP TABLE IF EXISTS customer_n;
CREATE TABLE customer_n AS
(
SELECT
c.*,
u.customer_id AS integration_id
FROM
customer c
JOIN
customer_u u
ON
c.customer_name = u.customer_name
AND
c.postal_cd = u.postal_cd
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 21971 rows affected.
[]
Commentary :
This SQL code creates a new table calledcustomer_nby joining the existingcustomerandcustomer_utables.
TheJOINcondition matches rows betweencustomerandcustomer_utables with the same values forcustomer_nameandpostal_cdcolumns. For these matching rows, theintegration_idcolumn of the resulting tablecustomer_nis set to thecustomer_idfrom thecustomer_utable.
Therefore, thecustomer_ntable includes all the rows from thecustomertable, along with an additionalintegration_idcolumn that contains thecustomer_idvalues from thecustomer_utable. Rows that do not have matchingcustomer_nameandpostal_cdvalues in thecustomer_utable will have aNULLvalue in theintegration_idcolumn.
%%sql
SELECT
COUNT(1) AS ID difference in numbers
FROM
customer_n
WHERE
customer_id != integration_id;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
| Difference in the number of id |
|---|
| 30 |
Commentary :
This SQL code queries the number of rows in the "customer_n" table where the "customer_id" column is not equal to the "integration_id" column.
The "customer_n" table was created in the previous code block by joining the "customer" table with the "customer_u" table using the customer's name and postal code as the join criteria. This effectively integrated the two tables by adding a new "integration_id" column to the "customer" table, which contains the corresponding "customer_id" value from the "customer_u" table.
The current query counts the number of rows in the "customer_n" table where the "customer_id" value is not equal to the "integration_id" value. This means that there are records in the "customer" table that do not have a corresponding record in the "customer_u" table or that have a different "customer_id" value in the "customer_u" table. This can be used to identify data integration issues and investigate the discrepancies further.
S-089: You want to split customers with sales records into training data and test data for building a forecasting model. Split the data randomly in the ratio of 8:2 respectively.
%%sql
SELECT SETSEED(0.1);
CREATE TEMP TABLE IF NOT EXISTS sales_customer AS
(
SELECT
customer_id ,
ROW_NUMBER() OVER(ORDER BY RANDOM()) AS row
FROM
customer
JOIN
receipt
USING(customer_id)
GROUP BY customer_id
HAVING SUM(AMOUNT) > 0
);
DROP TABLE IF EXISTS customer_train;
CREATE TABLE customer_train AS
SELECT
customer.*
FROM
sales_customer
JOIN
customer
USING(customer_id)
WHERE
sales_customer.row <= (SELECT
COUNT(1)
FROM
sales_customer) * 0.8
;
DROP TABLE IF EXISTS customer_test;
CREATE TABLE customer_test AS
SELECT
customer.*
FROM
sales_customer
JOIN
customer
USING(customer_id)
EXCEPT
SELECT
*
FROM
customer_train
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected. 8306 rows affected. Done. 6644 rows affected. Done. 1662 rows affected.
[]
Commentary :
This code is using SQL to create a training and testing dataset for a machine learning model based on customer sales data. Here is a breakdown of what each line does:
%%sql SELECT SETSEED(0.1);
This sets the seed for the random number generator used in the query. This is done to ensure that the same results are obtained each time the query is run.
CREATE TEMP TABLE IF NOT EXISTS sales_customer AS (
SELECT
customer_id,
ROW_NUMBER() OVER(ORDER BY RANDOM()) AS row
FROM
customer
JOIN receipt USING(customer_id)
GROUP BY
customer_id
HAVING
SUM(AMOUNT) > 0
);
This creates a temporary table called sales_customer that contains each customer's customer_id and a randomly assigned row number. It selects data from two tables: customer and receipt using JOIN. It groups data by customer_id and filters the result to only include customers who have made a purchase (SUM(AMOUNT) > 0).
DROP TABLE IF EXISTS customer_train;
CREATE TABLE customer_train AS
SELECT customer.*
FROM sales_customer
JOIN customer USING(customer_id)
WHERE sales_customer.row <= (SELECT COUNT(1) FROM sales_customer) * 0.8 ;
This creates a table called customer_train that contains 80% of the customers in the sales_customer table. It selects all columns from the customer table for customers whose row number in the sales_customer table is less than or equal to 80% of the total number of rows in the sales_customer table.
DROP TABLE IF EXISTS customer_test;
CREATE TABLE customer_test AS
SELECT customer.*
FROM sales_customer
JOIN customer USING(customer_id)
EXCEPT
SELECT * FROM customer_train ;
This creates a table called customer_test that contains the remaining 20% of the customers from the sales_customer table. It selects all columns from the customer table for customers whose customer_id is in the sales_customer table, but not in the customer_train table. This ensures that the test set contains customers that are not in the training set.
%%sql
SELECT
train_cnt * 1.0 / all_cnt as Percentage of training data,
test_cnt * 1.0 / all_cnt as Test Data Percentage
FROM
(SELECT
COUNT(1) AS all_cnt
FROM
sales_customer) all_data
CROSS JOIN
(SELECT
COUNT(1) AS train_cnt
FROM
customer_train) train_data
CROSS JOIN
(SELECT
COUNT(1) AS test_cnt
FROM customer_test) test_data
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
| Percentage of training data | Percentage of test data |
|---|---|
| 0.79990368408379484710 | 0.20009631591620515290 |
Commentary :
This code is calculating the percentage of data in the training set and the test set, based on the tables created in the previous SQL code. Here is a breakdown of what each line does:
%%sql SELECT train_cnt * 1.0 / all_cnt as Percentage of training data, test_cnt * 1.0 / all_cnt as Test Data Percentage
This is the main query, which selects the percentage of data in the training set and the test set. It uses train_cnt, test_cnt, and all_cnt variables, which are calculated in the following subqueries:
FROM
(SELECT COUNT(1) AS all_cnt FROM sales_customer) all_data
CROSS JOIN (SELECT COUNT(1) AS train_cnt FROM customer_train) train_data
CROSS JOIN (SELECT COUNT(1) AS test_cnt FROM customer_test) test_data ;
These subqueries calculate the total number of rows in the sales_customer table (all_cnt), the number of rows in the customer_train table (train_cnt), and the number of rows in the customer_test table (test_cnt).
The CROSS JOIN operator is used to combine these subqueries into a single table. Since each subquery returns only one row, the result of the cross join is a single row with three columns: all_cnt, train_cnt, and test_cnt.
The main query then calculates the percentage of data in the training set and the test set by dividing train_cnt and test_cnt by all_cnt, respectively. The * 1.0 expression is used to convert the result to a floating-point number, which ensures that the division is done with floating-point precision.
S-090: The receipt details data (df_receipt) has data from 1 Jan 2017 to 31 Oct 2019. Create three sets of data for building time-series models, 12 months for training and 6 months for testing, by aggregating the sales amount (amount) on a monthly basis.
%%sql
-- Because it is not SQL-oriented, it is described somewhat aggressively (when the number of divisions is large, it is not realistic because SQL becomes long).
-- Also, be careful when the time series is detailed, such as data in seconds, and the data is long term, because the data is huge (in such cases, a language that can train models in a loop process is desirable).
-- (In such cases, it is desirable to use a language that enables model learning in a loop process.
-- As a preliminary step, total the sales amount by year and month, and assign sequential numbers.
CREATE TEMP TABLE IF NOT EXISTS ts_amount AS
(
SELECT
SUBSTR(CAST(sales_ymd AS VARCHAR), 1, 6) AS sales_ym,
SUM(amount) AS sum_amount,
ROW_NUMBER() OVER(ORDER BY SUBSTR(CAST(sales_ymd AS VARCHAR), 1, 6)) AS rn
FROM
receipt
GROUP BY
sales_ym
);
-- SQL has its limitations, but should be as reusable as possible as the number of datasets created increases.
-- LAG functions in WITH clauses are described in such a way that they can be used by changing the lag period.
DROP TABLE IF EXISTS series_data_1 ;
CREATE TABLE series_data_1 AS (
WITH lag_amount AS
(
SELECT
sales_ym,
sum_amount,
LAG(rn, 0) OVER (ORDER BY rn) AS rn
FROM
ts_amount
)
SELECT
sales_ym,
sum_amount,
CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
FROM
lag_amount
WHERE
rn BETWEEN 1 AND 18);
DROP TABLE IF EXISTS series_data_2 ;
CREATE TABLE series_data_2 AS (
WITH lag_amount AS
(
SELECT
sales_ym,
sum_amount,
LAG(rn, 6) OVER (ORDER BY rn) AS rn
FROM
ts_amount
)
SELECT
sales_ym,
sum_amount,
CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
FROM
lag_amount
WHERE
rn BETWEEN 1 AND 18);
DROP TABLE IF EXISTS series_data_3 ;
CREATE TABLE series_data_3 AS (
WITH lag_amount AS
(
SELECT
sales_ym,
sum_amount,
LAG(rn, 12) OVER (ORDER BY rn) AS rn
FROM
ts_amount
)
SELECT
sales_ym,
sum_amount,
CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
FROM
lag_amount
WHERE
rn BETWEEN 1 AND 18);
* postgresql://padawan:***@db:5432/dsdojo_db 34 rows affected. Done. 18 rows affected. Done. 18 rows affected. Done. 18 rows affected.
[]
Commentary :
This code creates three new tables (series_data_1, series_data_2, and series_data_3) by applying a window function to a summary table (ts_amount) of sales data. Here is a breakdown of what each line does:
%%sql
CREATE TEMP TABLE IF NOT EXISTS ts_amount AS
(
SELECT
SUBSTR(CAST(sales_ymd AS VARCHAR), 1, 6) AS sales_ym,
SUM(amount) AS sum_amount,
ROW_NUMBER() OVER(ORDER BY SUBSTR(CAST(sales_ymd AS VARCHAR), 1, 6)) AS rn
FROM
receipt
GROUP BY
sales_ym
);
This code creates a temporary table (ts_amount) that summarizes the sales data by month (using the SUBSTR function to extract the year and month from the sales_ymd column), calculates the total amount of sales (SUM(amount)), and assigns a row number to each row (ROW_NUMBER() OVER(...) AS rn). The row numbers are used in the subsequent queries to join the data across multiple time periods.
DROP TABLE IF EXISTS series_data_1 ;
CREATE TABLE series_data_1 AS (
WITH lag_amount AS
(
SELECT
sales_ym,
sum_amount,
LAG(rn, 0) OVER (ORDER BY rn) AS rn
FROM
ts_amount
)
SELECT
sales_ym,
sum_amount,
CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
FROM
lag_amount
WHERE
rn BETWEEN 1 AND 18);
This code creates a new table (series_data_1) by joining the sales data from the current period with the sales data from the previous period (using a window function and the LAG function), and assigning a binary flag (test_flg) to each row based on whether it belongs to the training set (test_flg = 0) or the test set (test_flg = 1). This is done by checking the row number (rn) against a threshold value (12), which splits the data into two parts: the first 12 rows are used for training, and the remaining 6 rows are used for testing. The resulting table has 18 rows in total. The code then repeats this process two more times, with different time periods:
DROP TABLE IF EXISTS series_data_2 ;
CREATE TABLE series_data_2 AS (
WITH lag_amount AS (
SELECT
sales_ym,
sum_amount,
LAG(rn, 6) OVER (ORDER BY rn) AS rn
FROM ts_amount
)
SELECT
sales_ym,
sum_amount,
CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
FROM lag_amount
WHERE rn BETWEEN 1 AND 18);
DROP TABLE IF EXISTS series_data_3 ;
CREATE TABLE series_data_3 AS (
WITH lag_amount AS (
SELECT
sales_ym,
sum_amount,
LAG(rn, 12) OVER (ORDER BY rn) AS rn
FROM ts_amount
)
SELECT
sales_ym,
sum_amount,
CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
FROM lag_amount
WHERE rn BETWEEN 1 AND 18);
series_data_2 uses sales data from six months ago as the reference period, while series_data_3 uses sales data from one year ago.
%%sql
-- series_data_2 and series_data_3 are omitted.
SELECT
*
FROM
series_data_1
;
* postgresql://padawan:***@db:5432/dsdojo_db 18 rows affected.
| sales_ym | sum_amount | test_flg |
|---|---|---|
| 201701 | 902056 | 0 |
| 201702 | 764413 | 0 |
| 201703 | 962945 | 0 |
| 201704 | 847566 | 0 |
| 201705 | 884010 | 0 |
| 201706 | 894242 | 0 |
| 201707 | 959205 | 0 |
| 201708 | 954836 | 0 |
| 201709 | 902037 | 0 |
| 201710 | 905739 | 0 |
| 201711 | 932157 | 0 |
| 201712 | 939654 | 0 |
| 201801 | 944509 | 1 |
| 201802 | 864128 | 1 |
| 201803 | 946588 | 1 |
| 201804 | 937099 | 1 |
| 201805 | 1004438 | 1 |
| 201806 | 1012329 | 1 |
Commentary :
This code is a SQL query that selects all columns and rows from the tableseries_data_1. The tableseries_data_1is created in the previous SQL code block using theCREATE TABLEstatement.
In the creation ofseries_data_1, the query creates a temporary tablets_amountwhich calculates the sum of amounts for each month and also assigns a row number based on the month. Using the row number and theLAG()window function, the query creates a new tableseries_data_1that contains the sales amount for each month, and a flag indicating whether the row is part of the training data or test data.
Finally, in this code block, the query selects all columns and rows fromseries_data_1, which allows you to view the results of the previous SQL statements.
S-091: For each customer in the customer data (df_customer), under-sample the number of customers with sales performance and the number of customers without sales performance so that the ratio is 1:1.
%%sql
SELECT SETSEED(0.1);
CREATE TEMP TABLE IF NOT EXISTS down_sampling AS (
WITH pre_table_1 AS
(
SELECT
c.*,
COALESCE(r.sum_amount,0) AS sum_amount
FROM
customer c
LEFT JOIN
(
SELECT
customer_id,
SUM(amount) AS sum_amount
FROM
receipt
GROUP BY
customer_id
) r
ON
c.customer_id=r.customer_id
)
,pre_table_2 AS(
SELECT
*
,CASE WHEN sum_amount > 0 THEN 1 ELSE 0 END AS is_buy_flag
,CASE WHEN sum_amount = 0 THEN 1 ELSE 0 END AS is_not_buy_flag
FROM
pre_table_1
)
,pre_table_3 AS
(
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY is_buy_flag ORDER BY RANDOM())
FROM
pre_table_2
CROSS JOIN
(SELECT SUM(is_buy_flag) AS buying FROM pre_table_2) AS t1
CROSS JOIN
(SELECT SUM(is_not_buy_flag) AS not_buying FROM pre_table_2) AS t2
)
SELECT
*
FROM
pre_table_3
WHERE
row_number <= buying
AND row_number <= not_buying
);
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected. 16612 rows affected.
[]
Commentary :
This code is a SQL query that performs downsampling of customer data. Downsampling is a process of randomly removing some data points from a dataset to balance the classes or reduce the dataset size.
In the query, the first statement sets the random seed value. The next statement creates a temporary table calleddown_samplingthat includes the customer data and the sum of amounts for each customer. TheLEFT JOINclause in the first CTE combines the customer table with the receipt table, calculating the sum of amounts for each customer.
In the second CTE, two additional columns are created using theCASEstatement. These columns are flags that indicate whether a customer has made a purchase or not. Then, in the third CTE, theROW_NUMBER()window function is used to assign a row number to each customer record based on theis_buy_flagcolumn.
Finally, in the last statement of the query, the temporary tabledown_samplingis selected, but only the rows where the row number is less than or equal to the number of buying customers and the number of non-buying customers. This ensures that the number of rows for each class is balanced in the resulting table.
%%sql
SELECT
is_buy_flag,
COUNT(1)
FROM
down_sampling
GROUP BY
is_buy_flag
;
* postgresql://padawan:***@db:5432/dsdojo_db 2 rows affected.
| is_buy_flag | count |
|---|---|
| 0 | 8306 |
| 1 | 8306 |
Commentary :
This code runs a SQL query that groups the records in thedown_samplingtemporary table by theiris_buy_flagcolumn and counts the number of records in each group.
Thedown_samplingtable was created in the previous code cell by performing a left join between thecustomertable and a subquery that aggregates theamountcolumn from thereceipttable bycustomer_id. The resulting table has columns for customer information as well as the total sum of amounts purchased by each customer.
Thepre_table_2subquery creates two new columns based on thesum_amountcolumn. Theis_buy_flagcolumn is set to 1 for customers who have made at least one purchase and 0 for customers who have not made any purchases. Theis_not_buy_flagcolumn is set to 1 for customers who have not made any purchases and 0 for customers who have made at least one purchase.
Thepre_table_3subquery generates a row number for each row in thepre_table_2table based on theis_buy_flagcolumn and a random order. It also calculates the total number of rows in thepre_table_2table whereis_buy_flagis 1 and where it is 0 using cross joins with two separate subqueries.
Finally, the mainSELECTstatement selects all rows frompre_table_3where the row number is less than or equal to the total number of rows whereis_buy_flagis 1 and where it is 0. This ensures that there are an equal number of records for each value ofis_buy_flag.
Therefore, the output of theSELECTstatement shows the count of records in thedown_samplingtable whereis_buy_flagis 0 and where it is 1.
S-092: Normalise to third normal form for gender in customer data (df_customer).
%%sql
DROP TABLE IF EXISTS customer_std;
CREATE TABLE customer_std AS
(
SELECT
customer_id,
customer_name,
gender_cd,
birth_day,
age,
postal_cd,
application_store_cd,
application_date,
status_cd
FROM
customer
);
DROP TABLE IF EXISTS gender_std;
CREATE TABLE gender_std AS
(
SELECT DISTINCT
gender_cd,
gender
FROM
customer
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 21971 rows affected. Done. 3 rows affected.
[]
Commentary :
This SQL code creates two new tables from an existing table namedcustomer.
The first part of the code drops a table namedcustomer_stdif it exists and then creates a new table namedcustomer_std. The new table has the same columns as thecustomertable, which arecustomer_id,customer_name,gender_cd,birth_day,age,postal_cd,application_store_cd,application_date, andstatus_cd. The purpose of creating this new table is not clear from the code provided. It might be created to standardize or preprocess the data.
The second part of the code drops a table namedgender_stdif it exists and then creates a new table namedgender_std. The new table has two columns namedgender_cdandgender. Thegender_cdcolumn is a distinct list of all gender codes found in thecustomertable, while thegendercolumn is the corresponding gender description. This table might be used for mapping the gender codes to their respective gender descriptions.
%%sql
-- Data content checks
SELECT
*
FROM
customer_std
LIMIT
3
;
* postgresql://padawan:***@db:5432/dsdojo_db 3 rows affected.
| customer_id | customer_name | gender_cd | birth_day | age | postal_cd | application_store_cd | application_date | status_cd |
|---|---|---|---|---|---|---|---|---|
| CS021313000114 | 大野 あや子 | 1 | 1981-04-29 | 37 | 259-1113 | S14021 | 20150905 | 0-00000000-0 |
| CS037613000071 | 六角 雅彦 | 9 | 1952-04-01 | 66 | 136-0076 | S13037 | 20150414 | 0-00000000-0 |
| CS031415000172 | 宇多田 貴美子 | 1 | 1976-10-04 | 42 | 151-0053 | S13031 | 20150529 | D-20100325-C |
Commentary :
This code selects the first 3 rows from thecustomer_stdtable, which was created earlier in the code. Thecustomer_stdtable is a copy of thecustomertable with a subset of columns, so this code is simply displaying the values of those columns for the first three customers in the table. This is a basic check to confirm that the data has been correctly imported into the new table.
%%sql
-- Data content checks
SELECT
*
FROM
gender_std
LIMIT
3
;
* postgresql://padawan:***@db:5432/dsdojo_db 3 rows affected.
| gender_cd | gender |
|---|---|
| 0 | 男性 |
| 9 | 不明 |
| 1 | 女性 |
Commentary :
This code executes an SQL query to select the first 3 rows from thegender_stdtable that was created in the previous code block.
TheLIMITclause is used to restrict the number of rows returned by the query to 3. Since thegender_stdtable was created using aSELECT DISTINCTstatement, it will only contain unique values ofgender_cdand their correspondinggendervalues from thecustomertable. Therefore, this query will return the first 3 unique rows from thegender_stdtable.
The output of this query will show thegender_cdandgendervalues for the first 3 rows of thegender_stdtable.
S-093: Product data (df_product) only holds the code values for each category, but not the category names. Combine with category data (df_category) and denormalise to create new product data that holds category names.
%%sql
DROP TABLE IF EXISTS product_full;
CREATE TABLE product_full AS
(
SELECT
p.product_cd,
p.category_major_cd,
c.category_major_name,
p.category_medium_cd,
c.category_medium_name,
p.category_small_cd,
c.category_small_name,
p.unit_price,
p.unit_cost
FROM
product p
JOIN
category c
USING(category_small_cd)
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 10030 rows affected.
[]
Commentary :
This code creates a new table called "product_full" by joining two tables "product" and "category".
The "product" table contains information about each product, such as its product code, category codes, unit price, and unit cost.
The "category" table contains category codes and category names for each category level.
The "JOIN" clause connects the two tables based on the category_small_cd column. By using the "USING" keyword, the query matches rows where the values in category_small_cd are equal in both tables.
The new table "product_full" includes all the columns from the "product" table, plus additional columns for category names, which are obtained by looking up the category codes in the "category" table.
%%sql
-- Data content checks
SELECT
*
FROM
product_full
LIMIT
3
;
* postgresql://padawan:***@db:5432/dsdojo_db 3 rows affected.
| product_cd | category_major_cd | category_major_name | category_medium_cd | category_medium_name | category_small_cd | category_small_name | unit_price | unit_cost |
|---|---|---|---|---|---|---|---|---|
| P040101001 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 198 | 149 |
| P040101002 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 218 | 164 |
| P040101003 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 230 | 173 |
Commentary :
This code selects data from two tables,productandcategory, and creates a new table calledproduct_full.
TheJOINclause is used to join the two tables based on a common column,category_small_cd. This means that the resulting table will have only those rows where the value incategory_small_cdis the same in both tables.
The columns selected from the two tables are:product_cd,category_major_cd,category_major_name,category_medium_cd,category_medium_name,category_small_cd,category_small_name,unit_price, andunit_cost.
The resulting tableproduct_fullwill contain these columns for every row that meets the join criteria.
TheLIMITclause is used to restrict the number of rows displayed in the output to just 3.
S-094: File out the product data with category names created in 093 with the following specifications.
|file format|with/without header|character encoding| |:–:|:–:|:–:| |CSV (comma-separated value)|With|UTF-8|
The path to the file output destination should be as follows (COPY command authorisation granted).
|Output destination| |:–:| |/tmp/data|
- By specifying “/tmp/data”, the data is shared with Jupyter’s “/work/data”.
%%sql
COPY product_full TO '/tmp/data/S_product_full_UTF-8_header.csv'
WITH CSV HEADER ENCODING 'UTF-8';
* postgresql://padawan:***@db:5432/dsdojo_db 10030 rows affected.
[]
Commentary :
This code exports the contents of theproduct_fulltable to a CSV file named "S_product_full_UTF-8_header.csv" located in the "/tmp/data" directory. TheCOPYcommand is a PostgreSQL command that is used to copy data between files and tables. In this case, theTOkeyword is used to specify the file to which the data is copied. TheCSVkeyword specifies the file format as comma-separated values, and theHEADERkeyword specifies that the first row of the file should contain the column headers. Finally, theENCODINGkeyword is used to specify the character encoding of the output file as UTF-8.
S-095: File out the product data with category names created in 093 with the following specifications.
|file format|with/without header|character encoding| |:–:|:–:|:–:| |CSV (comma-separated value)|With|CP932|
In PostgreSQL, specifying Shift_JIS is equivalent to CP932. The path to the file output destination should be as follows (COPY command permissions have been granted).
|Output destination| |:–:| |/tmp/data|
- By specifying “/tmp/data”, the data is shared with Jupyter’s “/work/data”.
%%sql
COPY product_full TO '/tmp/data/S_product_full_SJIS_header.csv'
WITH CSV HEADER ENCODING 'SJIS';
* postgresql://padawan:***@db:5432/dsdojo_db 10030 rows affected.
[]
Commentary :
This code is using theCOPYcommand in PostgreSQL to export the contents of theproduct_fulltable to a CSV file namedS_product_full_SJIS_header.csv. TheWITH CSV HEADERoption specifies that the output should include a header row with column names, and theENCODING 'SJIS'option specifies that the output file should be encoded using Shift-JIS character encoding.
Shift-JIS is a character encoding commonly used in Japan, and is one of several possible character encodings that can be used for CSV files. By exporting the data to a CSV file with Shift-JIS encoding, it allows the file to be easily imported into other applications or databases that require this specific encoding.
S-096: File out the product data with category names created in 093 with the following specifications.
|file format|with/without header|character encoding| |:–:|:–:|:–:| |CSV (comma-separated value)|Without|UTF-8|
The path to the file output destination should be as follows (COPY command authorisation granted).
|Output destination| |:–:| |/tmp/data|
- By specifying “/tmp/data”, the data is shared with Jupyter’s “/work/data”.
%%sql
COPY product_full TO '/tmp/data/S_product_full_UTF-8_noh.csv'
WITH CSV ENCODING 'UTF-8';
* postgresql://padawan:***@db:5432/dsdojo_db 10030 rows affected.
[]
Commentary :
This code exports the contents of theproduct_fulltable to a CSV file namedS_product_full_UTF-8_noh.csvin the/tmp/data/directory without including the header row in the output file.
TheCOPYcommand is used in Postgres to copy data from/to a file. In this case, theproduct_fulltable is being copied to a file. TheWITH CSVoption specifies that the file is in comma-separated value (CSV) format. TheENCODINGoption specifies the character encoding of the file. TheUTF-8value indicates that the file uses UTF-8 encoding, which supports a wide range of characters from different languages.
Since theHEADERoption is not specified, the output file does not include the header row. Therefore, the output file will only contain the rows of data from theproduct_fulltable in CSV format.
S-097: Load the file in the following format created in 094, display three data items and check that they have been imported correctly.
|File format|With/without header|Character encoding| |:–:|:–:|:–:| |CSV (comma separated)|With|UTF-8|
%%sql
DROP TABLE IF EXISTS product_full;
CREATE TABLE product_full
(
product_cd VARCHAR(10),
category_major_cd VARCHAR(2),
category_major_name VARCHAR(20),
category_medium_cd VARCHAR(4),
category_medium_name VARCHAR(20),
category_small_cd VARCHAR(6),
category_small_name VARCHAR(20),
unit_price INTEGER,
unit_cost INTEGER
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. Done.
[]
Commentary :
This SQL code drops the tableproduct_fullif it exists and creates a new table with the same name. The new table has eight columns:product_cd,category_major_cd,category_major_name,category_medium_cd,category_medium_name,category_small_cd,category_small_name,unit_price, andunit_cost. The first seven columns are of typeVARCHARwith varying lengths, while the last two columns are of typeINTEGER.
By specifying the data types and lengths of the columns, the code is defining the schema of theproduct_fulltable. This ensures that each column has a specific data type and length, which can improve the performance of queries on this table, as well as prevent errors that may occur if data of the wrong type or length is inserted.
%%sql
COPY product_full FROM '/tmp/data/S_product_full_UTF-8_header.csv'
WITH CSV HEADER ENCODING 'UTF-8';
* postgresql://padawan:***@db:5432/dsdojo_db 10030 rows affected.
[]
Commentary :
This code imports data from a CSV file located at '/tmp/data/S_product_full_UTF-8_header.csv' into a new table called 'product_full' that has already been created in the database.
TheCOPYcommand in PostgreSQL allows you to copy data from a file to a table or vice versa. TheFROMkeyword is used to specify the path to the CSV file, and theWITHkeyword is used to provide options for theCOPYcommand. In this case,CSVspecifies that the file is in Comma-Separated Value format,HEADERspecifies that the first row of the file contains column names, andENCODING 'UTF-8'specifies the character encoding used in the file.
Since the table 'product_full' has already been created in the database, theCOPYcommand inserts the data from the CSV file into the existing table.
%%sql
SELECT
*
FROM
product_full
LIMIT
3
;
* postgresql://padawan:***@db:5432/dsdojo_db 3 rows affected.
| product_cd | category_major_cd | category_major_name | category_medium_cd | category_medium_name | category_small_cd | category_small_name | unit_price | unit_cost |
|---|---|---|---|---|---|---|---|---|
| P040101001 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 198 | 149 |
| P040101002 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 218 | 164 |
| P040101003 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 230 | 173 |
Commentary :
This SQL code is a simple SELECT statement that retrieves the first three rows of data from the product_full table. The data was previously loaded into the table using the COPY command from a CSV file. The query is using the LIMIT clause to limit the number of rows returned to 3.
The SELECT statement retrieves all columns from the product_full table. The table contains information about products, including their codes, categories, and prices. The data is organized into several columns, including product_cd, category_major_cd, category_major_name, category_medium_cd, category_medium_name, category_small_cd, category_small_name, unit_price, and unit_cost.
S-098: Load the file in the following format created in 096, display three data items and check that they are imported correctly.
|File format|With/without header|Character encoding| |:–:|:–:|:–:| |CSV (comma separated)|Without|UTF-8|
%%sql
DROP TABLE IF EXISTS product_full;
CREATE TABLE product_full
(
product_cd VARCHAR(10),
category_major_cd VARCHAR(2),
category_major_name VARCHAR(20),
category_medium_cd VARCHAR(4),
category_medium_name VARCHAR(20),
category_small_cd VARCHAR(6),
category_small_name VARCHAR(20),
unit_price INTEGER,
unit_cost INTEGER
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. Done.
[]
Commentary :
This code drops the table namedproduct_fullif it exists, and creates a new table with the same name and schema. The new table has 9 columns:product_cdcolumn is of type VARCHAR(10)category_major_cdcolumn is of type VARCHAR(2)category_major_namecolumn is of type VARCHAR(20)category_medium_cdcolumn is of type VARCHAR(4)category_medium_namecolumn is of type VARCHAR(20)category_small_cdcolumn is of type VARCHAR(6)category_small_namecolumn is of type VARCHAR(20)unit_pricecolumn is of type INTEGERunit_costcolumn is of type INTEGER
TheVARCHARdata type is used to store variable-length character strings with the specified maximum length. TheINTEGERdata type is used to store whole numbers.
This code creates an empty table with the specified schema, and no data is added to the table.
%%sql
COPY product_full FROM '/tmp/data/S_product_full_UTF-8_noh.csv'
WITH CSV ENCODING 'UTF-8';
* postgresql://padawan:***@db:5432/dsdojo_db 10030 rows affected.
[]
Commentary :
This code is importing data from a CSV file named 'S_product_full_UTF-8_noh.csv' into the table 'product_full' using the COPY command in PostgreSQL.
The WITH CSV option specifies that the data is in CSV format. The ENCODING option specifies the character encoding of the file, which is UTF-8 in this case.
Since the file doesn't have a header row, the column names and types are specified explicitly in a preceding SQL statement which creates the 'product_full' table.
Once the data is imported into the table, the SELECT statement is used to retrieve the first 3 rows of the 'product_full' table to confirm that the data has been imported correctly.
%%sql
SELECT
*
FROM
product_full
LIMIT
3
;
* postgresql://padawan:***@db:5432/dsdojo_db 3 rows affected.
| product_cd | category_major_cd | category_major_name | category_medium_cd | category_medium_name | category_small_cd | category_small_name | unit_price | unit_cost |
|---|---|---|---|---|---|---|---|---|
| P040101001 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 198 | 149 |
| P040101002 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 218 | 164 |
| P040101003 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 230 | 173 |
Commentary :
This code selects the first 3 rows from the "product_full" table using SQL SELECT statement. The SELECT statement retrieves all columns (product_cd, category_major_cd, category_major_name, category_medium_cd, category_medium_name, category_small_cd, category_small_name, unit_price, unit_cost) from the "product_full" table and limits the output to the first 3 rows using the LIMIT keyword. The result shows the first 3 rows of the "product_full" table.
S-099: File out the product data with category names created in 093 with the following specifications.
|file format|with/without header|character encoding| |:–:|:–:|:–:| |TSV (Tab Separated Values)|With|UTF-8|
The path to the file output destination should be as follows (COPY command authorisation has been granted).
|Output destination| |:–:| |/tmp/data|
- By specifying “/tmp/data”, the data is shared with Jupyter’s “/work/data”.
%%sql
COPY product_full TO '/tmp/data/S_product_full_UTF-8_header.tsv'
WITH CSV HEADER DELIMITER E'\t' ENCODING 'UTF-8';
* postgresql://padawan:***@db:5432/dsdojo_db 10030 rows affected.
[]
Commentary :
This code exports the contents of theproduct_fulltable in CSV format, with a tab (\t) used as a delimiter instead of a comma. The exported file is saved in the/tmp/datadirectory and namedS_product_full_UTF-8_header.tsv. TheHEADERoption is included, so the first row of the file will contain the column headers.
Here's a breakdown of the individual components of the code:COPY product_full: specifies that the data from theproduct_fulltable will be copied to a file.TO '/tmp/data/S_product_full_UTF-8_header.tsv': specifies the file path and name of the exported file.WITH CSV: specifies that the exported file will be in CSV format.HEADER: specifies that the first row of the file will contain the column headers.DELIMITER E'\t': specifies that the tab character will be used as a delimiter.ENCODING 'UTF-8': specifies the character encoding to use when exporting the file.
S-100: Load the file in the following format created in 099, display three items of data and check that they have been imported correctly.
|File format|With/without header|Character encoding| |:–:|:–:|:–:| |TSV (Tab Separated)|With|UTF-8|
%%sql
DROP TABLE IF EXISTS product_full;
CREATE TABLE product_full
(
product_cd VARCHAR(10),
category_major_cd VARCHAR(2),
category_major_name VARCHAR(20),
category_medium_cd VARCHAR(4),
category_medium_name VARCHAR(20),
category_small_cd VARCHAR(6),
category_small_name VARCHAR(20),
unit_price INTEGER,
unit_cost INTEGER
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. Done.
[]
Commentary :
This SQL code block first drops the table namedproduct_fullif it exists, and then creates a new table with the same name and schema.
The schema of theproduct_fulltable has 9 columns with the following data types:product_cd:VARCHAR(10)(i.e., a variable-length character string with a maximum length of 10)category_major_cd:VARCHAR(2)category_major_name:VARCHAR(20)category_medium_cd:VARCHAR(4)category_medium_name:VARCHAR(20)category_small_cd:VARCHAR(6)category_small_name:VARCHAR(20)unit_price:INTEGERunit_cost:INTEGERThe purpose of this code block is to define the structure of theproduct_fulltable before data is loaded into it.
%%sql
COPY product_full FROM '/tmp/data/S_product_full_UTF-8_header.tsv'
WITH CSV HEADER DELIMITER E'\t' ENCODING 'UTF-8';
* postgresql://padawan:***@db:5432/dsdojo_db 10030 rows affected.
[]
Commentary :
This code copies data from a TSV (tab-separated values) file located at '/tmp/data/S_product_full_UTF-8_header.tsv' and inserts it into the 'product_full' table in the current database.
The COPY command specifies that the input file is a CSV file, with headers included in the first row of the file, and that the delimiter used is a tab character ('\t'). The ENCODING parameter specifies that the file is encoded in UTF-8.
This command assumes that the 'product_full' table already exists and has the same structure as the data in the input file (product_cd, category_major_cd, category_major_name, category_medium_cd, category_medium_name, category_small_cd, category_small_name, unit_price, unit_cost). The command will insert the data from the file into the table, creating new rows in the table for each row of data in the file.
%%sql
SELECT
*
FROM
product_full
LIMIT
3
;
* postgresql://padawan:***@db:5432/dsdojo_db 3 rows affected.
| product_cd | category_major_cd | category_major_name | category_medium_cd | category_medium_name | category_small_cd | category_small_name | unit_price | unit_cost |
|---|---|---|---|---|---|---|---|---|
| P040101001 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 198 | 149 |
| P040101002 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 218 | 164 |
| P040101003 | 04 | 惣菜 | 0401 | 御飯類 | 040101 | 弁当類 | 230 | 173 |
Commentary :
This code executes a SQL query to select the first three rows from theproduct_fulltable. TheSELECTstatement is followed by*, which means to select all columns from the table.LIMIT 3limits the result set to the first three rows of the table.
The result will display the data in theproduct_fulltable according to the columns specified in the table. The exact output will depend on the data stored in the table.
This is the end of the 100. Well done!









Comment