参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック（構造化データ加工編）」

The Data Scientist Society Github :

GitHub - The-Japan-DataScientist-Society/100knocks-preprocess: データサイエンス100本ノック（構造化データ加工編）

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Data Science 100 Knocks (Structured Data Processing) URL :

100knocks-preprocess/docker/work/answer/ans_preprocess_knock_SQL.ipynb at master · The-Japan-DataScientist-Society/100knocks-preprocess

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers.
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

You can get preprocess_knock files to practice python, SQL, R from my github account

ChanhiYasutomi - Repositories

ChanhiYasutomi has 20 repositories available. Follow their code on GitHub.

Introduction
How to use
100 data processing knocks

Introduction

The database is PostgreSQL13
First, execute the following cell
You can issue SQL by writing %%sql in the cell
You cannot check the table structure with the describe command from jupyter, so if you want to check the table structure, use a SELECT with a limit, for example.
You can use any familiar SQL client (connection information is as follows)
- IP address: localhost for Docker Desktop, 192.168.99.100 for Docker toolbox
- Port:5432
- Database name: dsdojo_db
- User name: padawan
- Password: padawan12345
It is recommended to limit the number of outputs, as Jupyter may freeze if a large amount of outputs are performed (the number of outputs is also mentioned in the question).
- Data processing also requires the ability to control the amount of results to be displayed for confirmation and to make light work of them.
If a large number of results are output, the file may become so heavy that it cannot be opened afterwards.
- In this case, the results of your work will be lost, but you can retrieve the file from GitHub.
- You can also delete large output ranges using a vim editor, for example.
Names, addresses etc. are dummy data and do not exist.

In [ ]:

%load_ext sql
import os

pgconfig = {
    'host': 'db',
    'port': os.environ['PG_PORT'],
    'database': os.environ['PG_DATABASE'],
    'user': os.environ['PG_USER'],
    'password': os.environ['PG_PASSWORD'],
}
dsl = 'postgresql://{user}:{password}@{host}:{port}/{database}'.format(**pgconfig)

# Set up Magic commands to write SQL
%sql $dsl

Out[ ]:

'Connected: padawan@dsdojo_db'

How to use

You can execute SQL from Jupyter against PostgreSQL by writing %%sql at the beginning of the cell and SQL in the second and subsequent lines.

In [ ]:

%%sql
SELECT 'You can run it like this' AS sample;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

sample
You can run it like this

100 data processing knocks

S-081: Create new product data for the missing values of unit_price（unit_price） and unit_cost（unit_cost）, supplemented by the average value of each. Note that for the average values, round off values less than one yen (rounding off or rounding to even numbers is acceptable). After completion, also check that there are no missing values for each item.

In [ ]:

%%sql

DROP TABLE IF EXISTS product_2;
CREATE TABLE product_2 AS (
  SELECT
    product_cd, 
    category_major_cd, 
    category_medium_cd, 
    category_small_cd, 
    COALESCE(unit_price, unit_avg) AS unit_price,
    COALESCE(unit_cost, cost_avg) AS unit_cost
  FROM
    product
  CROSS JOIN 
  (
    SELECT
      ROUND(AVG(unit_price)) AS unit_avg,
      ROUND(AVG(unit_cost)) AS cost_avg
    FROM
      product
 ) stats_product
);

 * postgresql://padawan:***@db:5432/dsdojo_db
Done.
10030 rows affected.

Out[ ]:

[]

Commentary :

This code is written in SQL language and it performs the following steps:

If a table named product_2 already exists in the database, it will be dropped. This is done to ensure that we start with a clean slate.

A new table named product_2 is created using the CREATE TABLE statement.

The data for the new table is derived from the product table using a SELECT statement.

In the SELECT statement, the columns product_cd, category_major_cd, category_medium_cd, and category_small_cd are selected from the product table.

Two new columns are created in the SELECT statement using the COALESCE function. COALESCE returns the first non-null value in a list of expressions. In this case, if unit_price is null, the average unit price of all products (unit_avg) is used instead. Similarly, if unit_cost is null, the average unit cost of all products (cost_avg) is used instead.

The FROM clause in the SELECT statement performs a CROSS JOIN with a subquery that calculates the average unit price and unit cost of all products in the product table. The ROUND function is used to round the average values to the nearest integer.

The result of the SELECT statement is used to populate the product_2 table using the INSERT INTO statement that is implicitly executed by the CREATE TABLE statement.

In summary, the code creates a new table named product_2 that contains the same columns as the product table, but with null values in the unit_price and unit_cost columns replaced by the average unit price and unit cost values, respectively, calculated across all products.

In [ ]:

%%sql

SELECT 
  SUM(CASE WHEN unit_price IS NULL THEN 1 ELSE 0 END) AS unit_price,
  SUM(CASE WHEN unit_cost IS NULL THEN 1 ELSE 0 END) AS unit_cost
FROM 
  product_2
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

unit_price	unit_cost
0	0

Commentary :

This code is written in SQL language and it performs the following steps:

A SELECT statement is executed to query the product_2 table.

The SUM function is used in combination with a CASE expression to count the number of null values in the unit_price and unit_cost columns separately.

In each CASE expression, WHEN unit_price IS NULL or WHEN unit_cost IS NULL checks if the value of the respective column is null.

If the column has a null value, the CASE expression evaluates to 1, otherwise it evaluates to 0.

The SUM function then sums up the 1's and 0's from each row to calculate the total number of null values in the unit_price and unit_cost columns, respectively.

The AS keyword is used to give the resulting columns in the result set meaningful names, i.e., unit_price and unit_cost.

Finally, the LIMIT clause is used to restrict the output to the first 10 rows.

In summary, this code returns a result set that displays the number of null values in the unit_price and unit_cost columns of the product_2 table. The output indicates how many rows have missing data in these columns, which can be useful in identifying data quality issues that need to be addressed.

S-082: Create new product data for missing values of unit_price(unit_price) and unit_cost(unit_cost), supplemented by the median value of each. Note that for the median value, round off the figures to the nearest yen (rounding off or rounding to even numbers is acceptable). After completion, also check that there are no missing values for each item.

In [ ]:

%%sql

DROP TABLE IF EXISTS product_3;
CREATE TABLE product_3 AS 
  (
    SELECT
      product_cd, 
      category_major_cd, 
      category_medium_cd, 
      category_small_cd, 
      COALESCE(unit_price, unit_med) AS unit_price,
      COALESCE(unit_cost, cost_med) AS unit_cost
    FROM
      product
    CROSS JOIN (
        SELECT
          ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY unit_price)) AS unit_med,
          ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY unit_cost)) AS cost_med
        FROM
          product
    ) stats_product
);

 * postgresql://padawan:***@db:5432/dsdojo_db
Done.
10030 rows affected.

Out[ ]:

[]

Commentary :

This code is written in SQL language and it performs the following steps:

If a table named product_3 already exists in the database, it will be dropped. This is done to ensure that we start with a clean slate.

A new table named product_3 is created using the CREATE TABLE statement.

The data for the new table is derived from the product table using a SELECT statement.

In the SELECT statement, the columns product_cd, category_major_cd, category_medium_cd, and category_small_cd are selected from the product table.

Two new columns are created in the SELECT statement using the COALESCE function. COALESCE returns the first non-null value in a list of expressions. In this case, if unit_price is null, the median unit price of all products (unit_med) is used instead. Similarly, if unit_cost is null, the median unit cost of all products (cost_med) is used instead.

The FROM clause in the SELECT statement performs a CROSS JOIN with a subquery that calculates the median unit price and unit cost of all products in the product table. The PERCENTILE_CONT function is used to calculate the median values. The WITHIN GROUP clause specifies the ordering of the rows, so that the function knows which values to use to calculate the median.

The ROUND function is used to round the median values to the nearest integer.

The result of the SELECT statement is used to populate the product_3 table using the INSERT INTO statement that is implicitly executed by the CREATE TABLE statement.

In summary, the code creates a new table named product_3 that contains the same columns as the product table, but with null values in the unit_price and unit_cost columns replaced by the median unit price and unit cost values, respectively, calculated across all products. The median values are calculated using the PERCENTILE_CONT function with a WITHIN GROUP clause that specifies the ordering of the rows.

In [ ]:

%%sql

SELECT 
  SUM(CASE WHEN unit_price IS NULL THEN 1 ELSE 0 END) AS unit_price,
  SUM(CASE WHEN unit_cost IS NULL THEN 1 ELSE 0 END) AS unit_cost
FROM 
  product_3
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

unit_price	unit_cost
0	0

Commentary :

This code is written in SQL language and it performs the following steps:

A SELECT statement is executed to query the product_3 table.

The SUM function is used in combination with a CASE expression to count the number of null values in the unit_price and unit_cost columns separately.

In each CASE expression, WHEN unit_price IS NULL or WHEN unit_cost IS NULL checks if the value of the respective column is null.

If the column has a null value, the CASE expression evaluates to 1, otherwise it evaluates to 0.

The SUM function then sums up the 1's and 0's from each row to calculate the total number of null values in the unit_price and unit_cost columns, respectively.

The AS keyword is used to give the resulting columns in the result set meaningful names, i.e., unit_price and unit_cost.

Finally, the LIMIT clause is used to restrict the output to the first 10 rows.

In summary, this code returns a result set that displays the number of null values in the unit_price and unit_cost columns of the product_3 table. The output indicates how many rows have missing data in these columns, which can be useful in identifying data quality issues that need to be addressed. The product_3 table was created in a previous step, with null values in the unit_price and unit_cost columns replaced by the median unit price and unit cost values, respectively, calculated across all products.

S-083: Create new product data for the missing values of unit price(unit_price) and unit cost(unit cost), supplemented by the median value calculated for each product category sub-category code (category_small_cd). Note that the median value should be rounded to the nearest yen (rounding off or rounding to even numbers is acceptable). After completion, also check that there are no deficiencies for each item.

In [ ]:

%%sql

DROP TABLE IF EXISTS product_4;
CREATE TABLE product_4 AS (
    WITH category_median AS
    (
      SELECT
        category_small_cd, 
        ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY unit_price)) AS unit_med,
        ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY unit_cost)) AS cost_med
      FROM 
        product
      GROUP BY 
        category_small_cd
    )
    SELECT
      product_cd, 
      category_major_cd, 
      category_medium_cd, 
      category_small_cd, 
      COALESCE(unit_price, unit_med) AS unit_price,
      COALESCE(unit_cost, cost_med) AS unit_cost
    FROM
      product
    JOIN
      category_median
    USING(category_small_cd)
);

 * postgresql://padawan:***@db:5432/dsdojo_db
Done.
10030 rows affected.

Out[ ]:

[]

Commentary :

This code is written in SQL language and it performs the following steps:

A DROP TABLE statement is executed to drop the product_4 table if it already exists.

A CREATE TABLE statement is executed to create the product_4 table.

A common table expression (CTE) named category_median is defined. This CTE uses a SELECT statement with an ORDER BY clause to sort the rows in the product table by the unit_price and unit_cost columns within each category, and then calculates the median values of unit_price and unit_cost for each category using the PERCENTILE_CONT function. The GROUP BY clause groups the data by category_small_cd column.

The main SELECT statement retrieves data from the product table and joins it with the category_median CTE on the category_small_cd column. The COALESCE function is used to replace null values in the unit_price and unit_cost columns with the median values for their corresponding categories.

Finally, the resulting data is inserted into the product_4 table using the CREATE TABLE AS syntax.

In summary, this code creates a new table product_4 by joining the product table with a CTE category_median. The median values of unit_price and unit_cost are calculated for each category in the product table using the PERCENTILE_CONT function. The resulting table product_4 contains data from the product table with null values in the unit_price and unit_cost columns replaced by the median values for their corresponding categories.

In [ ]:

%%sql

SELECT 
  SUM(CASE WHEN unit_price IS NULL THEN 1 ELSE 0 END) AS unit_price,
  SUM(CASE WHEN unit_cost IS NULL THEN 1 ELSE 0 END) AS unit_cost
FROM 
  product_4
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

unit_price	unit_cost
0	0

Commentary :

This code is also written in SQL and it performs the following steps:

A SELECT statement is executed to retrieve the sum of null values in the unit_price and unit_cost columns of the product_4 table.

The CASE statement in the SUM function checks whether each value in the unit_price and unit_cost columns is null or not. If it is null, then it returns 1, otherwise, it returns 0. This way, the SUM function calculates the total number of null values in each column.

The AS keyword is used to rename the columns as unit_price and unit_cost respectively.

The LIMIT clause is used to limit the number of rows returned by the query to 10.

In summary, this code calculates the total number of null values in the unit_price and unit_cost columns of the product_4 table. This can help to identify missing data in the table and determine whether further data cleaning or imputation is necessary.

S-084: Prepare new data for all customers in the customer data (df_customer) by calculating the percentage of the 2019 sales value to the sales value for the whole period. However, if there are no sales results, treat them as 0. Then, extract the calculated percentages that exceed 0 and display the results for 10 cases. Also check that there are no missing data in the data created.

In [ ]:

%%sql

DROP TABLE IF EXISTS sales_rate;

CREATE TABLE sales_rate AS(
    WITH sales_amount_2019 AS 
    (
      SELECT
        customer_id,
        SUM(amount) AS sum_amount_2019
      FROM
        receipt
      WHERE
        sales_ymd BETWEEN 20190101 AND 20191231
      GROUP BY
        customer_id
    ),
    sales_amount_all AS 
    (
      SELECT
        customer_id,
        SUM(amount) AS sum_amount_all
      FROM
        receipt
      GROUP BY
        customer_id
    )
SELECT
  a.customer_id,
  COALESCE(b.sum_amount_2019, 0) AS sales_amount_2019,
  COALESCE(c.sum_amount_all, 0)  AS sales_amount_all,
  CASE COALESCE(c.sum_amount_all, 0) WHEN 0 THEN 0 ELSE COALESCE(b.sum_amount_2019, 0) * 1.0 / c.sum_amount_all END AS sales_rate
FROM
  customer a
LEFT JOIN
  sales_amount_2019 b
ON 
  a.customer_id = b.customer_id
LEFT JOIN
  sales_amount_all c
ON 
  a.customer_id = c.customer_id);

 * postgresql://padawan:***@db:5432/dsdojo_db
Done.
21971 rows affected.

Out[ ]:

[]

Commentary :

This code is also written in SQL and it performs the following steps:

A SELECT statement is executed to retrieve data from the receipt and customer tables.

Two subqueries are created to calculate the total sales amount for each customer in the year 2019 and for all years.

The SELECT statement joins the customer table with the two subqueries using the LEFT JOIN operator. This ensures that all customers in the customer table are included in the output, even if they have no sales data in the receipt table.

The COALESCE function is used to handle cases where a customer has no sales data in the receipt table. It replaces NULL values with 0.

The CASE statement calculates the sales rate for each customer. It checks if the total sales amount for all years is 0, in which case it returns 0. Otherwise, it divides the sales amount for the year 2019 by the total sales amount for all years.

The results are inserted into a new table called sales_rate.

In summary, this code calculates the sales rate for each customer based on their sales data from the receipt table. The sales rate is the ratio of the sales amount for the year 2019 to the total sales amount for all years. The results are saved in a new table called sales_rate.

In [ ]:

%%sql

SELECT 
  * 
FROM 
  sales_rate
WHERE 
  sales_rate > 0
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	sales_amount_2019	sales_amount_all	sales_rate
CS031415000172	2971	5088	0.58392295597484276730
CS015414000103	874	3122	0.27994875080076873799
CS011215000048	248	3444	0.07200929152148664344
CS029415000023	3767	5167	0.72904973872653377201
CS035415000029	5823	7504	0.77598614072494669510
CS023513000066	208	771	0.26977950713359273671
CS035513000134	463	1565	0.29584664536741214058
CS001515000263	216	216	1.00000000000000000000
CS006415000279	229	229	1.00000000000000000000
CS031415000106	215	7741	0.02777418938121689704

In [ ]:

%%sql

SELECT
  SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS unit_price,
  SUM(CASE WHEN sales_amount_2019 IS NULL THEN 1 ELSE 0 END) AS unit_price,
  SUM(CASE WHEN sales_amount_all IS NULL THEN 1 ELSE 0 END) AS unit_cost,
  SUM(CASE WHEN sales_rate IS NULL THEN 1 ELSE 0 END) AS unit_cost
FROM 
  sales_rate 
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

unit_price	unit_price_1	unit_cost	unit_cost_1
0	0	0	0

Commentary :

This code is also written in SQL and it performs the following steps:

A SELECT statement is executed to retrieve all rows from the sales_rate table where the sales_rate value is greater than 0.

The LIMIT clause limits the output to the first 10 rows.

The results are returned as output.

In summary, this code selects all rows from the sales_rate table where the sales rate is greater than 0, which means that the customer has made at least one sale in the year 2019. The LIMIT clause limits the output to the first 10 rows.

S-085: Create new customer data by linking geocode data (df_geocode) to all customers in the customer data (df_customer) using postcode (postal_cd). However, if multiple longitude (longitude) and latitude (latitude) information is associated with a single postcode (postal_cd), calculate and use the average values of longitude (longitude) and latitude (latitude). Also, display 10 results to check the creation results.

In [ ]:

%%sql

DROP TABLE IF EXISTS customer_1;

CREATE TABLE customer_1 AS (
    WITH geocode_avg AS
    (
       SELECT
         postal_cd,
         AVG(longitude) AS m_longitude,
         AVG(latitude) AS m_latitude
       FROM
         geocode
       GROUP BY 
         postal_cd
    )
SELECT 
  *
FROM
  customer c
JOIN
  geocode_avg g
USING(postal_cd)
);

 * postgresql://padawan:***@db:5432/dsdojo_db
Done.
21971 rows affected.

Out[ ]:

[]

Commentary :

This code is written in SQL and it performs the following steps:

A WITH clause is used to define a subquery that calculates the average longitude and latitude for each postal code using the geocode table.

The SELECT statement joins the customer table with the subquery defined in the WITH clause using the postal_cd column.

The result of the join operation is then inserted into a new table called customer_1 using the CREATE TABLE statement. If a table with the same name already exists, it is first dropped using the DROP TABLE IF EXISTS statement.

In summary, this code creates a new table called customer_1 by joining the customer table with the geocode table using the postal_cd column and appending the average longitude and latitude for each postal code to the customer table. The resulting table has all the columns from the customer table along with two additional columns for the average longitude and latitude.

In [ ]:

%%sql

SELECT 
  * 
FROM 
  customer_1 
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

postal_cd	customer_id	customer_name	gender_cd	gender	birth_day	age	address	application_store_cd	application_date	status_cd	m_longitude	m_latitude
136-0076	CS037613000071	六角雅彦	9	不明	1952-04-01	66	東京都江東区南砂**********	S13037	20150414	0-00000000-0	139.8350200000000000	35.6719300000000000
151-0053	CS031415000172	宇多田貴美子	1	女性	1976-10-04	42	東京都渋谷区代々木**********	S13031	20150529	D-20100325-C	139.6896500000000000	35.6737400000000000
245-0016	CS028811000001	堀井かおり	1	女性	1933-03-27	86	神奈川県横浜市泉区和泉町**********	S14028	20160115	0-00000000-0	139.4836000000000000	35.3912500000000000
144-0055	CS001215000145	田崎美紀	1	女性	1995-03-29	24	東京都大田区仲六郷**********	S13001	20170605	6-20090929-2	139.7077500000000000	35.5408400000000000
136-0073	CS015414000103	奥野陽子	1	女性	1977-08-09	41	東京都江東区北砂**********	S13015	20150722	B-20100609-B	139.8360100000000000	35.6781800000000000
136-0073	CS015804000004	松谷米蔵	0	男性	1931-05-02	87	東京都江東区北砂**********	S13015	20150607	0-00000000-0	139.8360100000000000	35.6781800000000000
276-0022	CS007403000016	依田満	0	男性	1975-08-18	43	千葉県八千代市上高野**********	S12007	20150914	0-00000000-0	140.1326000000000000	35.7326400000000000
154-0015	CS035614000014	板倉菜々美	1	女性	1954-07-16	64	東京都世田谷区桜新町**********	S13035	20150804	0-00000000-0	139.6429700000000000	35.6288900000000000
223-0062	CS011215000048	芦田沙耶	1	女性	1992-02-01	27	神奈川県横浜市港北区日吉本町**********	S14011	20150228	C-20100421-9	139.6294600000000000	35.5537000000000000
226-0021	CS040412000191	川井郁恵	1	女性	1977-01-05	42	神奈川県横浜市緑区北八朔町**********	S14040	20151101	1-20091025-4	139.5396400000000000	35.5276300000000000

Commentary : 

This code is written in SQL and performs a simple SELECT statement on the customer_1 table.

The SELECT statement retrieves all the columns and rows from the customer_1 table using the * wildcard character.

The LIMIT clause limits the number of rows returned by the query to the first 10 rows.

In summary, this code retrieves the first 10 rows of the customer_1 table, which contains all the columns from the customer table along with two additional columns for the average longitude and latitude of each customer's postal code.

P-086: For the customer data with latitude(latitude) and longitude(longitude) created in 085, combine it with shop data (df_store) using the member application shop code (application_store_cd) as key. Calculate the distance (in km) between the application shop and the customer address using the latitude(latitude) and longitude information (longitude) of the application shop and the latitude and longitude of the customer address (address), and display it together with the customer ID (customer_id), customer address and shop address (address). The following simple formula should be used as the calculation formula, but libraries using other highly accurate methods are also acceptable. Display 10 results.

Latitude（radians）：φLongitude（radians）：λDistance L=6371∗arccos⁡(sin⁡φ1∗sinφ2+cos⁡φ1∗cos⁡φ2∗cos⁡(λ1−λ2))

In [ ]:

%%sql

SELECT
  c.customer_id,
  c.address AS customer_address,
  s.address AS store_address,
  6371 * ACOS
    ( 
      SIN(RADIANS(c.m_latitude)) 
      * SIN(RADIANS(s.latitude))
      + COS(RADIANS(c.m_latitude)) 
      * COS(RADIANS(s.latitude)) 
      * COS(RADIANS(c.m_longitude) - RADIANS(s.longitude))
    ) AS distance 
FROM
  customer_1 c
JOIN
  store s
ON
  c.application_store_cd = s.store_cd
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	customer_address	store_address	distance
CS037613000071	東京都江東区南砂**********	東京都江東区南砂一丁目	1.4511822099658445
CS031415000172	東京都渋谷区代々木**********	東京都渋谷区初台二丁目	0.4117334789298223
CS028811000001	神奈川県横浜市泉区和泉町**********	神奈川県横浜市瀬谷区二ツ橋町	8.065196026704987
CS001215000145	東京都大田区仲六郷**********	東京都大田区仲六郷二丁目	1.2684209720729687
CS015414000103	東京都江東区北砂**********	東京都江東区南砂二丁目	1.449673414532165
CS015804000004	東京都江東区北砂**********	東京都江東区南砂二丁目	1.449673414532165
CS007403000016	千葉県八千代市上高野**********	千葉県佐倉市上志津	1.9208032538419055
CS035614000014	東京都世田谷区桜新町**********	東京都世田谷区用賀四丁目	1.0214681484997588
CS011215000048	神奈川県横浜市港北区日吉本町**********	神奈川県横浜市港北区日吉本町四丁目	0.8182767808775093
CS040412000191	神奈川県横浜市緑区北八朔町**********	神奈川県横浜市緑区長津田みなみ台五丁目	3.6641233580823287

Commentary :

This SQL code retrieves data from two tables, customer_1 and store, and calculates the distance between each customer's address and the corresponding store address.

The SELECT statement retrieves the following columns:

customer_id from the customer_1 table.
address from the customer_1 table, renamed to customer_address.
address from the store table, renamed to store_address.

A calculated value for the distance between the customer and the store, using the Haversine formula. The ACOS, SIN, and COS functions are used to calculate the arc-cosine of the sine and cosine of the latitudes and longitudes of both addresses, respectively. The result is then multiplied by the radius of the Earth (6371 km) to obtain the distance in kilometers.

The JOIN statement combines the rows from the customer_1 and store tables based on the matching application_store_cd and store_cd columns, respectively. The LIMIT clause limits the number of rows returned to 10.

S-087: In the customer data (df_customer), the same customer is registered more than once, e.g. due to applications at different shops. Create the named customer data, in which customers with the same name (customer_name) and postcode (postal_cd) are regarded as the same customer and are collated so that there is one record per customer, and calculate the number of customer data, the number of named customer data and the number of duplicates. However, for the same customer, the one with the highest total sales amount shall be retained, and for customers with the same total sales amount or no sales record, the one with the lowest customer ID (customer_id) number shall be retained.

In [ ]:

%%sql

DROP TABLE IF EXISTS customer_u;
CREATE TABLE customer_u AS (
    WITH sales_amount AS
    (
       SELECT
         c.customer_id,
         c.customer_name,
         c.postal_cd, 
         COALESCE(SUM(r.amount), 0) AS sum_amount
       FROM
         customer c
       LEFT JOIN
         receipt r        
       ON 
         c.customer_id = r.customer_id
       GROUP BY
         c.customer_id, 
         c.customer_name, 
         c.postal_cd
    ),
    sales_ranking AS
    (
      SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY customer_name, postal_cd ORDER BY sum_amount desc, customer_id ) AS ranking
      FROM 
        sales_amount
    )
    SELECT 
      c.*
    FROM
      customer c
    JOIN
      sales_ranking r
    ON
      c.customer_id = r.customer_id
    AND 
      r.ranking = 1
);

 * postgresql://padawan:***@db:5432/dsdojo_db
Done.
21941 rows affected.

Out[ ]:

[]

Commentary :

This SQL code creates a new table named "customer_u" by joining data from two tables "customer" and "receipt". It first defines a Common Table Expression (CTE) named "sales_amount" which computes the total sales amount for each customer by joining the "customer" table with the "receipt" table using a left join. The CTE also uses the COALESCE function to return 0 for customers who have no sales.

Then another CTE named "sales_ranking" is defined by computing the sales ranking for each customer based on the total sales amount, within each customer_name and postal_cd group. The ROW_NUMBER() function is used to assign a rank to each customer based on the sum_amount in descending order.

Finally, the SELECT statement joins the "customer" table with the "sales_ranking" CTE, filtering only the rows where the ranking is 1. This effectively selects only the top-ranking customer for each customer_name and postal_cd group. The resulting data is then inserted into a new table named "customer_u".

In [ ]:

%%sql

SELECT 
  customer_cnt, 
  customer_u_cnt, 
  customer_cnt - customer_u_cnt AS diff 
FROM
  (
   SELECT 
     COUNT(1) AS customer_cnt 
   FROM 
     customer
  ) customer
CROSS JOIN 
  (SELECT 
     COUNT(1) AS customer_u_cnt 
   FROM 
     customer_u
  ) customer_u
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

customer_cnt	customer_u_cnt	diff
21971	21941	30

Commentary :

This SQL code queries the number of customers in two different tables and calculates the difference between them.

The first part of the code is a subquery that selects the total count of customers from the "customer" table and aliases it as "customer_cnt".

The second part of the code is another subquery that selects the total count of customers from the "customer_u" table and aliases it as "customer_u_cnt".

Finally, the outer query selects the "customer_cnt" and "customer_u_cnt" from the two subqueries, and calculates the difference between them as "diff". This is done by subtracting the "customer_u_cnt" from the "customer_cnt".

The result is a single row with three columns - "customer_cnt", "customer_u_cnt", and "diff". The "customer_cnt" column displays the total number of customers in the "customer" table, "customer_u_cnt" column displays the total number of customers in the "customer_u" table, and "diff" column displays the difference between these two counts.

P-088: Based on the data created in 087, create data with integrated name IDs for the customer data. However, the integrated name ID shall be assigned according to the following specifications.

Non-duplicated customers: set customer ID (customer_id)

Duplicate customers: set the customer ID of the record extracted in the previous question

The difference between the number of unique customer IDs and the number of unique integration name-drop IDs shall also be checked.

In [ ]:

%%sql

DROP TABLE IF EXISTS customer_n;
CREATE TABLE customer_n AS 
(
  SELECT 
    c.*, 
    u.customer_id AS integration_id
  FROM 
    customer c
  JOIN
    customer_u u
  ON 
    c.customer_name = u.customer_name
  AND 
    c.postal_cd = u.postal_cd
);

 * postgresql://padawan:***@db:5432/dsdojo_db
Done.
21971 rows affected.

Out[ ]:

[]

Commentary :

This SQL code creates a new table called customer_n by joining the existing customer and customer_u tables.

The JOIN condition matches rows between customer and customer_u tables with the same values for customer_name and postal_cd columns. For these matching rows, the integration_id column of the resulting table customer_n is set to the customer_id from the customer_u table.

Therefore, the customer_n table includes all the rows from the customer table, along with an additional integration_id column that contains the customer_id values from the customer_u table. Rows that do not have matching customer_name and postal_cd values in the customer_u table will have a NULL value in the integration_id column.

In [ ]:

%%sql

SELECT 
  COUNT(1) AS ID difference in numbers 
FROM 
  customer_n
WHERE 
  customer_id != integration_id;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

Difference in the number of id
30

Commentary :

This SQL code queries the number of rows in the "customer_n" table where the "customer_id" column is not equal to the "integration_id" column.

The "customer_n" table was created in the previous code block by joining the "customer" table with the "customer_u" table using the customer's name and postal code as the join criteria. This effectively integrated the two tables by adding a new "integration_id" column to the "customer" table, which contains the corresponding "customer_id" value from the "customer_u" table.

The current query counts the number of rows in the "customer_n" table where the "customer_id" value is not equal to the "integration_id" value. This means that there are records in the "customer" table that do not have a corresponding record in the "customer_u" table or that have a different "customer_id" value in the "customer_u" table. This can be used to identify data integration issues and investigate the discrepancies further.

S-089: You want to split customers with sales records into training data and test data for building a forecasting model. Split the data randomly in the ratio of 8:2 respectively.

In [ ]:

%%sql

SELECT SETSEED(0.1);

CREATE TEMP TABLE IF NOT EXISTS sales_customer AS 
(
  SELECT
    customer_id ,
    ROW_NUMBER() OVER(ORDER BY RANDOM()) AS row
  FROM
    customer
  JOIN
    receipt
    USING(customer_id)
    GROUP BY customer_id
    HAVING SUM(AMOUNT) > 0
);

DROP TABLE IF EXISTS customer_train;
CREATE TABLE customer_train AS
    SELECT
      customer.*
    FROM
      sales_customer
    JOIN
      customer 
    USING(customer_id)
    WHERE
      sales_customer.row <= (SELECT
                               COUNT(1) 
                             FROM 
                               sales_customer) * 0.8
;

DROP TABLE IF EXISTS customer_test;
CREATE TABLE customer_test AS
    SELECT
      customer.* 
    FROM
      sales_customer 
    JOIN
      customer
    USING(customer_id)
    EXCEPT
      SELECT
        * 
      FROM 
        customer_train
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.
8306 rows affected.
Done.
6644 rows affected.
Done.
1662 rows affected.

Out[ ]:

[]

Commentary :

This code is using SQL to create a training and testing dataset for a machine learning model based on customer sales data. Here is a breakdown of what each line does:

%%sql SELECT SETSEED(0.1);
This sets the seed for the random number generator used in the query. This is done to ensure that the same results are obtained each time the query is run.

CREATE TEMP TABLE IF NOT EXISTS sales_customer AS (
  SELECT 
    customer_id,
    ROW_NUMBER() OVER(ORDER BY RANDOM()) AS row 
  FROM 
    customer 
    JOIN receipt USING(customer_id) 
  GROUP BY 
    customer_id 
  HAVING 
    SUM(AMOUNT) > 0 
);

This creates a temporary table called sales_customer that contains each customer's customer_id and a randomly assigned row number. It selects data from two tables: customer and receipt using JOIN. It groups data by customer_id and filters the result to only include customers who have made a purchase (SUM(AMOUNT) > 0).

DROP TABLE IF EXISTS customer_train;
CREATE TABLE customer_train AS 
  SELECT customer.* 
  FROM sales_customer 
  JOIN customer USING(customer_id) 
  WHERE sales_customer.row <= (SELECT COUNT(1) FROM sales_customer) * 0.8 ;

This creates a table called customer_train that contains 80% of the customers in the sales_customer table. It selects all columns from the customer table for customers whose row number in the sales_customer table is less than or equal to 80% of the total number of rows in the sales_customer table.

DROP TABLE IF EXISTS customer_test;
CREATE TABLE customer_test AS 
  SELECT customer.* 
  FROM sales_customer 
  JOIN customer USING(customer_id) 
  EXCEPT 
  SELECT * FROM customer_train ;

This creates a table called customer_test that contains the remaining 20% of the customers from the sales_customer table. It selects all columns from the customer table for customers whose customer_id is in the sales_customer table, but not in the customer_train table. This ensures that the test set contains customers that are not in the training set.

In [ ]:

%%sql

SELECT
  train_cnt * 1.0 / all_cnt as Percentage of training data,
  test_cnt * 1.0 / all_cnt as Test Data Percentage
FROM
  (SELECT 
    COUNT(1) AS all_cnt 
   FROM 
    sales_customer) all_data
CROSS JOIN
    (SELECT 
      COUNT(1) AS train_cnt 
     FROM 
       customer_train) train_data
CROSS JOIN
    (SELECT 
      COUNT(1) AS test_cnt 
     FROM customer_test) test_data
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

Percentage of training data	Percentage of test data
0.79990368408379484710	0.20009631591620515290

Commentary :

This code is calculating the percentage of data in the training set and the test set, based on the tables created in the previous SQL code. Here is a breakdown of what each line does:

%%sql SELECT train_cnt * 1.0 / all_cnt as Percentage of training data, test_cnt * 1.0 / all_cnt as Test Data Percentage

This is the main query, which selects the percentage of data in the training set and the test set. It uses train_cnt, test_cnt, and all_cnt variables, which are calculated in the following subqueries:

FROM 
  (SELECT COUNT(1) AS all_cnt FROM sales_customer) all_data 
  CROSS JOIN (SELECT COUNT(1) AS train_cnt FROM customer_train) train_data 
  CROSS JOIN (SELECT COUNT(1) AS test_cnt FROM customer_test) test_data ;

These subqueries calculate the total number of rows in the sales_customer table (all_cnt), the number of rows in the customer_train table (train_cnt), and the number of rows in the customer_test table (test_cnt).

The CROSS JOIN operator is used to combine these subqueries into a single table. Since each subquery returns only one row, the result of the cross join is a single row with three columns: all_cnt, train_cnt, and test_cnt.

The main query then calculates the percentage of data in the training set and the test set by dividing train_cnt and test_cnt by all_cnt, respectively. The * 1.0 expression is used to convert the result to a floating-point number, which ensures that the division is done with floating-point precision.

S-090: The receipt details data (df_receipt) has data from 1 Jan 2017 to 31 Oct 2019. Create three sets of data for building time-series models, 12 months for training and 6 months for testing, by aggregating the sales amount (amount) on a monthly basis.

In [ ]:

%%sql

-- Because it is not SQL-oriented, it is described somewhat aggressively (when the number of divisions is large, it is not realistic because SQL becomes long).
-- Also, be careful when the time series is detailed, such as data in seconds, and the data is long term, because the data is huge (in such cases, a language that can train models in a loop process is desirable).
-- (In such cases, it is desirable to use a language that enables model learning in a loop process.

-- As a preliminary step, total the sales amount by year and month, and assign sequential numbers.
CREATE TEMP TABLE IF NOT EXISTS ts_amount AS 
(
  SELECT 
    SUBSTR(CAST(sales_ymd AS VARCHAR), 1, 6) AS sales_ym, 
    SUM(amount) AS sum_amount,
    ROW_NUMBER() OVER(ORDER BY SUBSTR(CAST(sales_ymd AS VARCHAR), 1, 6)) AS rn
  FROM
    receipt
  GROUP BY 
    sales_ym
);

-- SQL has its limitations, but should be as reusable as possible as the number of datasets created increases.
-- LAG functions in WITH clauses are described in such a way that they can be used by changing the lag period.
DROP TABLE IF EXISTS series_data_1 ;
CREATE TABLE series_data_1 AS (
    WITH lag_amount AS 
    (
      SELECT
        sales_ym,
        sum_amount,
        LAG(rn, 0) OVER (ORDER BY rn) AS rn
      FROM 
        ts_amount
    )
    SELECT 
      sales_ym,
      sum_amount, 
      CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
    FROM 
      lag_amount 
    WHERE 
      rn BETWEEN 1 AND 18);


DROP TABLE IF EXISTS series_data_2 ;
CREATE TABLE series_data_2 AS (
    WITH lag_amount AS 
    (
      SELECT 
        sales_ym, 
        sum_amount, 
        LAG(rn, 6) OVER (ORDER BY rn) AS rn
      FROM 
        ts_amount
    )
    SELECT
      sales_ym, 
      sum_amount, 
      CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
    FROM 
      lag_amount
    WHERE 
      rn BETWEEN 1 AND 18);

DROP TABLE IF EXISTS series_data_3 ;
CREATE TABLE series_data_3 AS (
    WITH lag_amount AS 
    (
      SELECT
        sales_ym,
        sum_amount,
        LAG(rn, 12) OVER (ORDER BY rn) AS rn
      FROM 
        ts_amount
    )
    SELECT 
      sales_ym, 
      sum_amount, 
      CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
    FROM 
      lag_amount
    WHERE 
      rn BETWEEN 1 AND 18);

 * postgresql://padawan:***@db:5432/dsdojo_db
34 rows affected.
Done.
18 rows affected.
Done.
18 rows affected.
Done.
18 rows affected.

Out[ ]:

[]

Commentary :

This code creates three new tables (series_data_1, series_data_2, and series_data_3) by applying a window function to a summary table (ts_amount) of sales data. Here is a breakdown of what each line does:

%%sql
CREATE TEMP TABLE IF NOT EXISTS ts_amount AS 
(
  SELECT 
    SUBSTR(CAST(sales_ymd AS VARCHAR), 1, 6) AS sales_ym, 
    SUM(amount) AS sum_amount,
    ROW_NUMBER() OVER(ORDER BY SUBSTR(CAST(sales_ymd AS VARCHAR), 1, 6)) AS rn
  FROM
    receipt
  GROUP BY 
    sales_ym
);

This code creates a temporary table (ts_amount) that summarizes the sales data by month (using the SUBSTR function to extract the year and month from the sales_ymd column), calculates the total amount of sales (SUM(amount)), and assigns a row number to each row (ROW_NUMBER() OVER(...) AS rn). The row numbers are used in the subsequent queries to join the data across multiple time periods.

DROP TABLE IF EXISTS series_data_1 ;
CREATE TABLE series_data_1 AS (
    WITH lag_amount AS 
    (
      SELECT
        sales_ym,
        sum_amount,
        LAG(rn, 0) OVER (ORDER BY rn) AS rn
      FROM 
        ts_amount
    )
    SELECT 
      sales_ym,
      sum_amount, 
      CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
    FROM 
      lag_amount 
    WHERE 
      rn BETWEEN 1 AND 18);

This code creates a new table (series_data_1) by joining the sales data from the current period with the sales data from the previous period (using a window function and the LAG function), and assigning a binary flag (test_flg) to each row based on whether it belongs to the training set (test_flg = 0) or the test set (test_flg = 1). This is done by checking the row number (rn) against a threshold value (12), which splits the data into two parts: the first 12 rows are used for training, and the remaining 6 rows are used for testing. The resulting table has 18 rows in total. The code then repeats this process two more times, with different time periods:

DROP TABLE IF EXISTS series_data_2 ;
CREATE TABLE series_data_2 AS (
    WITH lag_amount AS (
        SELECT 
            sales_ym, 
            sum_amount, 
            LAG(rn, 6) OVER (ORDER BY rn) AS rn
        FROM ts_amount
    )
    SELECT
        sales_ym, 
        sum_amount, 
        CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
    FROM lag_amount
    WHERE rn BETWEEN 1 AND 18);

DROP TABLE IF EXISTS series_data_3 ;
CREATE TABLE series_data_3 AS (
    WITH lag_amount AS (
        SELECT
            sales_ym,
            sum_amount,
            LAG(rn, 12) OVER (ORDER BY rn) AS rn
        FROM ts_amount
    )
    SELECT 
        sales_ym, 
        sum_amount, 
        CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END as test_flg
    FROM lag_amount
    WHERE rn BETWEEN 1 AND 18);

series_data_2 uses sales data from six months ago as the reference period, while series_data_3 uses sales data from one year ago.

In [ ]:

%%sql

-- series_data_2 and series_data_3 are omitted.
SELECT 
  * 
FROM 
  series_data_1
;

 * postgresql://padawan:***@db:5432/dsdojo_db
18 rows affected.

Out[ ]:

sales_ym	sum_amount	test_flg
201701	902056	0
201702	764413	0
201703	962945	0
201704	847566	0
201705	884010	0
201706	894242	0
201707	959205	0
201708	954836	0
201709	902037	0
201710	905739	0
201711	932157	0
201712	939654	0
201801	944509	1
201802	864128	1
201803	946588	1
201804	937099	1
201805	1004438	1
201806	1012329	1

Commentary :

This code is a SQL query that selects all columns and rows from the table series_data_1. The table series_data_1 is created in the previous SQL code block using the CREATE TABLE statement.

In the creation of series_data_1, the query creates a temporary table ts_amount which calculates the sum of amounts for each month and also assigns a row number based on the month. Using the row number and the LAG() window function, the query creates a new table series_data_1 that contains the sales amount for each month, and a flag indicating whether the row is part of the training data or test data.

Finally, in this code block, the query selects all columns and rows from series_data_1, which allows you to view the results of the previous SQL statements.

S-091: For each customer in the customer data (df_customer), under-sample the number of customers with sales performance and the number of customers without sales performance so that the ratio is 1:1.

In [ ]:

%%sql

SELECT SETSEED(0.1); 

CREATE TEMP TABLE IF NOT EXISTS down_sampling AS (
    WITH pre_table_1 AS
     (
        SELECT
           c.*,
           COALESCE(r.sum_amount,0) AS sum_amount
        FROM
          customer c
        LEFT JOIN 
        (
          SELECT
            customer_id,
            SUM(amount) AS sum_amount
          FROM
            receipt
          GROUP BY
            customer_id
        ) r 
        ON
          c.customer_id=r.customer_id
    )
    ,pre_table_2 AS(
        SELECT
            *
            ,CASE WHEN sum_amount > 0 THEN 1 ELSE 0 END  AS is_buy_flag
            ,CASE WHEN sum_amount = 0 THEN 1 ELSE 0 END  AS is_not_buy_flag
        FROM
            pre_table_1
    )
    ,pre_table_3 AS
    (
      SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY is_buy_flag ORDER BY RANDOM())
      FROM
        pre_table_2
      CROSS JOIN
        (SELECT SUM(is_buy_flag) AS buying FROM pre_table_2) AS t1
      CROSS JOIN
        (SELECT SUM(is_not_buy_flag) AS not_buying FROM pre_table_2) AS t2
    )
    SELECT
      *
    FROM
      pre_table_3
    WHERE
      row_number <= buying
      AND row_number <= not_buying
);

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.
16612 rows affected.

Out[ ]:

[]

Commentary :

This code is a SQL query that performs downsampling of customer data. Downsampling is a process of randomly removing some data points from a dataset to balance the classes or reduce the dataset size.

In the query, the first statement sets the random seed value. The next statement creates a temporary table called down_sampling that includes the customer data and the sum of amounts for each customer. The LEFT JOIN clause in the first CTE combines the customer table with the receipt table, calculating the sum of amounts for each customer.

In the second CTE, two additional columns are created using the CASE statement. These columns are flags that indicate whether a customer has made a purchase or not. Then, in the third CTE, the ROW_NUMBER() window function is used to assign a row number to each customer record based on the is_buy_flag column.

Finally, in the last statement of the query, the temporary table down_sampling is selected, but only the rows where the row number is less than or equal to the number of buying customers and the number of non-buying customers. This ensures that the number of rows for each class is balanced in the resulting table.

In [ ]:

%%sql

SELECT 
  is_buy_flag, 
  COUNT(1)
FROM 
  down_sampling 
GROUP BY 
  is_buy_flag
;

 * postgresql://padawan:***@db:5432/dsdojo_db
2 rows affected.

Out[ ]:

is_buy_flag	count
0	8306
1	8306

Commentary :

This code runs a SQL query that groups the records in the down_sampling temporary table by their is_buy_flag column and counts the number of records in each group.

The down_sampling table was created in the previous code cell by performing a left join between the customer table and a subquery that aggregates the amount column from the receipt table by customer_id. The resulting table has columns for customer information as well as the total sum of amounts purchased by each customer.

The pre_table_2 subquery creates two new columns based on the sum_amount column. The is_buy_flag column is set to 1 for customers who have made at least one purchase and 0 for customers who have not made any purchases. The is_not_buy_flag column is set to 1 for customers who have not made any purchases and 0 for customers who have made at least one purchase.

The pre_table_3 subquery generates a row number for each row in the pre_table_2 table based on the is_buy_flag column and a random order. It also calculates the total number of rows in the pre_table_2 table where is_buy_flag is 1 and where it is 0 using cross joins with two separate subqueries.

Finally, the main SELECT statement selects all rows from pre_table_3 where the row number is less than or equal to the total number of rows where is_buy_flag is 1 and where it is 0. This ensures that there are an equal number of records for each value of is_buy_flag.

Therefore, the output of the SELECT statement shows the count of records in the down_sampling table where is_buy_flag is 0 and where it is 1.

S-092: Normalise to third normal form for gender in customer data (df_customer).

In [ ]:

%%sql

DROP TABLE IF EXISTS customer_std;
CREATE TABLE customer_std AS 
(
  SELECT
    customer_id,
    customer_name,
    gender_cd,
    birth_day,
    age,
    postal_cd,
    application_store_cd,
    application_date,
    status_cd
  FROM
    customer
);

DROP TABLE IF EXISTS gender_std;
CREATE TABLE gender_std AS 
(
  SELECT DISTINCT         
    gender_cd, 
    gender
  FROM
    customer
);

 * postgresql://padawan:***@db:5432/dsdojo_db
Done.
21971 rows affected.
Done.
3 rows affected.

Out[ ]:

[]

Commentary :

This SQL code creates two new tables from an existing table named customer.

The first part of the code drops a table named customer_std if it exists and then creates a new table named customer_std. The new table has the same columns as the customer table, which are customer_id, customer_name, gender_cd, birth_day, age, postal_cd, application_store_cd, application_date, and status_cd. The purpose of creating this new table is not clear from the code provided. It might be created to standardize or preprocess the data.

The second part of the code drops a table named gender_std if it exists and then creates a new table named gender_std. The new table has two columns named gender_cd and gender. The gender_cd column is a distinct list of all gender codes found in the customer table, while the gender column is the corresponding gender description. This table might be used for mapping the gender codes to their respective gender descriptions.

In [ ]:

%%sql

-- Data content checks
SELECT 
  * 
FROM 
  customer_std 
LIMIT 
  3
;

 * postgresql://padawan:***@db:5432/dsdojo_db
3 rows affected.

Out[ ]:

customer_id	customer_name	gender_cd	birth_day	age	postal_cd	application_store_cd	application_date	status_cd
CS021313000114	大野あや子	1	1981-04-29	37	259-1113	S14021	20150905	0-00000000-0
CS037613000071	六角雅彦	9	1952-04-01	66	136-0076	S13037	20150414	0-00000000-0
CS031415000172	宇多田貴美子	1	1976-10-04	42	151-0053	S13031	20150529	D-20100325-C

Commentary : 

This code selects the first 3 rows from the customer_std table, which was created earlier in the code. The customer_std table is a copy of the customer table with a subset of columns, so this code is simply displaying the values of those columns for the first three customers in the table. This is a basic check to confirm that the data has been correctly imported into the new table.

In [ ]:

%%sql

-- Data content checks
SELECT 
  * 
FROM 
  gender_std 
LIMIT 
  3
;

 * postgresql://padawan:***@db:5432/dsdojo_db
3 rows affected.

Out[ ]:

gender_cd	gender
0	男性
9	不明
1	女性

Commentary :

This code executes an SQL query to select the first 3 rows from the gender_std table that was created in the previous code block.

The LIMIT clause is used to restrict the number of rows returned by the query to 3. Since the gender_std table was created using a SELECT DISTINCT statement, it will only contain unique values of gender_cd and their corresponding gender values from the customer table. Therefore, this query will return the first 3 unique rows from the gender_std table.

The output of this query will show the gender_cd and gender values for the first 3 rows of the gender_std table.

S-093: Product data (df_product) only holds the code values for each category, but not the category names. Combine with category data (df_category) and denormalise to create new product data that holds category names.

In [ ]:

%%sql

DROP TABLE IF EXISTS product_full;
CREATE TABLE product_full AS 
(
  SELECT
    p.product_cd,
    p.category_major_cd,
    c.category_major_name,
    p.category_medium_cd,
    c.category_medium_name,
    p.category_small_cd,
    c.category_small_name,
    p.unit_price,
    p.unit_cost
  FROM
    product p
  JOIN
    category c
  USING(category_small_cd)
);

 * postgresql://padawan:***@db:5432/dsdojo_db
Done.
10030 rows affected.

Out[ ]:

[]

Commentary :

This code creates a new table called "product_full" by joining two tables "product" and "category".

The "product" table contains information about each product, such as its product code, category codes, unit price, and unit cost.

The "category" table contains category codes and category names for each category level.

The "JOIN" clause connects the two tables based on the category_small_cd column. By using the "USING" keyword, the query matches rows where the values in category_small_cd are equal in both tables.

The new table "product_full" includes all the columns from the "product" table, plus additional columns for category names, which are obtained by looking up the category codes in the "category" table.

In [ ]:

%%sql
-- Data content checks
SELECT 
  * 
FROM 
  product_full 
LIMIT 
  3
;

 * postgresql://padawan:***@db:5432/dsdojo_db
3 rows affected.

Out[ ]:

product_cd	category_major_cd	category_major_name	category_medium_cd	category_medium_name	category_small_cd	category_small_name	unit_price	unit_cost
P040101001	04	惣菜	0401	御飯類	040101	弁当類	198	149
P040101002	04	惣菜	0401	御飯類	040101	弁当類	218	164
P040101003	04	惣菜	0401	御飯類	040101	弁当類	230	173

Commentary :

This code selects data from two tables, product and category, and creates a new table called product_full.

The JOIN clause is used to join the two tables based on a common column, category_small_cd. This means that the resulting table will have only those rows where the value in category_small_cd is the same in both tables.

The columns selected from the two tables are: product_cd, category_major_cd, category_major_name, category_medium_cd, category_medium_name, category_small_cd, category_small_name, unit_price, and unit_cost.

The resulting table product_full will contain these columns for every row that meets the join criteria.

The LIMIT clause is used to restrict the number of rows displayed in the output to just 3.

S-094: File out the product data with category names created in 093 with the following specifications.

|file format|with/without header|character encoding| |:–:|:–:|:–:| |CSV (comma-separated value)|With|UTF-8|

The path to the file output destination should be as follows (COPY command authorisation granted).

|Output destination| |:–:| |/tmp/data|

By specifying “/tmp/data”, the data is shared with Jupyter’s “/work/data”.

In [ ]:

%%sql

COPY product_full TO '/tmp/data/S_product_full_UTF-8_header.csv' 
WITH CSV HEADER ENCODING 'UTF-8';

 * postgresql://padawan:***@db:5432/dsdojo_db
10030 rows affected.

Out[ ]:

[]

Commentary : 

This code exports the contents of the product_full table to a CSV file named "S_product_full_UTF-8_header.csv" located in the "/tmp/data" directory. The COPY command is a PostgreSQL command that is used to copy data between files and tables. In this case, the TO keyword is used to specify the file to which the data is copied. The CSV keyword specifies the file format as comma-separated values, and the HEADER keyword specifies that the first row of the file should contain the column headers. Finally, the ENCODING keyword is used to specify the character encoding of the output file as UTF-8.

S-095: File out the product data with category names created in 093 with the following specifications.

|file format|with/without header|character encoding| |:–:|:–:|:–:| |CSV (comma-separated value)|With|CP932|

In PostgreSQL, specifying Shift_JIS is equivalent to CP932. The path to the file output destination should be as follows (COPY command permissions have been granted).

|Output destination| |:–:| |/tmp/data|

By specifying “/tmp/data”, the data is shared with Jupyter’s “/work/data”.

In [ ]:

%%sql

COPY product_full TO '/tmp/data/S_product_full_SJIS_header.csv' 
WITH CSV HEADER ENCODING 'SJIS';

 * postgresql://padawan:***@db:5432/dsdojo_db
10030 rows affected.

Out[ ]:

[]

Commentary :

This code is using the COPY command in PostgreSQL to export the contents of the product_full table to a CSV file named S_product_full_SJIS_header.csv. The WITH CSV HEADER option specifies that the output should include a header row with column names, and the ENCODING 'SJIS' option specifies that the output file should be encoded using Shift-JIS character encoding.

Shift-JIS is a character encoding commonly used in Japan, and is one of several possible character encodings that can be used for CSV files. By exporting the data to a CSV file with Shift-JIS encoding, it allows the file to be easily imported into other applications or databases that require this specific encoding.

S-096: File out the product data with category names created in 093 with the following specifications.

|file format|with/without header|character encoding| |:–:|:–:|:–:| |CSV (comma-separated value)|Without|UTF-8|

The path to the file output destination should be as follows (COPY command authorisation granted).

|Output destination| |:–:| |/tmp/data|

By specifying “/tmp/data”, the data is shared with Jupyter’s “/work/data”.

In [ ]:

%%sql

COPY product_full TO '/tmp/data/S_product_full_UTF-8_noh.csv' 
WITH CSV ENCODING 'UTF-8';

 * postgresql://padawan:***@db:5432/dsdojo_db
10030 rows affected.

Out[ ]:

[]

Commentary :

This code exports the contents of the product_full table to a CSV file named S_product_full_UTF-8_noh.csv in the /tmp/data/ directory without including the header row in the output file.

The COPY command is used in Postgres to copy data from/to a file. In this case, the product_full table is being copied to a file. The WITH CSV option specifies that the file is in comma-separated value (CSV) format. The ENCODING option specifies the character encoding of the file. The UTF-8 value indicates that the file uses UTF-8 encoding, which supports a wide range of characters from different languages.

Since the HEADER option is not specified, the output file does not include the header row. Therefore, the output file will only contain the rows of data from the product_full table in CSV format.

S-097: Load the file in the following format created in 094, display three data items and check that they have been imported correctly.

|File format|With/without header|Character encoding| |:–:|:–:|:–:| |CSV (comma separated)|With|UTF-8|

In [ ]:

%%sql

DROP TABLE IF EXISTS product_full;
CREATE TABLE product_full 
(
  product_cd           VARCHAR(10),
  category_major_cd    VARCHAR(2),
  category_major_name  VARCHAR(20),
  category_medium_cd   VARCHAR(4),
  category_medium_name VARCHAR(20),
  category_small_cd    VARCHAR(6),
  category_small_name  VARCHAR(20),
  unit_price           INTEGER,
  unit_cost            INTEGER
);

 * postgresql://padawan:***@db:5432/dsdojo_db
Done.
Done.

Out[ ]:

[]

Commentary :

This SQL code drops the table product_full if it exists and creates a new table with the same name. The new table has eight columns: product_cd, category_major_cd, category_major_name, category_medium_cd, category_medium_name, category_small_cd, category_small_name, unit_price, and unit_cost. The first seven columns are of type VARCHAR with varying lengths, while the last two columns are of type INTEGER.

By specifying the data types and lengths of the columns, the code is defining the schema of the product_full table. This ensures that each column has a specific data type and length, which can improve the performance of queries on this table, as well as prevent errors that may occur if data of the wrong type or length is inserted.

In [ ]:

%%sql

COPY product_full FROM '/tmp/data/S_product_full_UTF-8_header.csv' 
WITH CSV HEADER ENCODING 'UTF-8';

 * postgresql://padawan:***@db:5432/dsdojo_db
10030 rows affected.

Out[ ]:

[]

Commentary :

This code imports data from a CSV file located at '/tmp/data/S_product_full_UTF-8_header.csv' into a new table called 'product_full' that has already been created in the database.

The COPY command in PostgreSQL allows you to copy data from a file to a table or vice versa. The FROM keyword is used to specify the path to the CSV file, and the WITH keyword is used to provide options for the COPY command. In this case, CSV specifies that the file is in Comma-Separated Value format, HEADER specifies that the first row of the file contains column names, and ENCODING 'UTF-8' specifies the character encoding used in the file.

Since the table 'product_full' has already been created in the database, the COPY command inserts the data from the CSV file into the existing table.

In [ ]:

%%sql

SELECT 
  * 
FROM 
  product_full 
LIMIT 
  3
;

 * postgresql://padawan:***@db:5432/dsdojo_db
3 rows affected.

Out[ ]:

product_cd	category_major_cd	category_major_name	category_medium_cd	category_medium_name	category_small_cd	category_small_name	unit_price	unit_cost
P040101001	04	惣菜	0401	御飯類	040101	弁当類	198	149
P040101002	04	惣菜	0401	御飯類	040101	弁当類	218	164
P040101003	04	惣菜	0401	御飯類	040101	弁当類	230	173

Commentary :

This SQL code is a simple SELECT statement that retrieves the first three rows of data from the product_full table. The data was previously loaded into the table using the COPY command from a CSV file. The query is using the LIMIT clause to limit the number of rows returned to 3.

The SELECT statement retrieves all columns from the product_full table. The table contains information about products, including their codes, categories, and prices. The data is organized into several columns, including product_cd, category_major_cd, category_major_name, category_medium_cd, category_medium_name, category_small_cd, category_small_name, unit_price, and unit_cost.

S-098: Load the file in the following format created in 096, display three data items and check that they are imported correctly.

|File format|With/without header|Character encoding| |:–:|:–:|:–:| |CSV (comma separated)|Without|UTF-8|

In [ ]:

%%sql

DROP TABLE IF EXISTS product_full;
CREATE TABLE product_full 
(
  product_cd           VARCHAR(10),
  category_major_cd    VARCHAR(2),
  category_major_name  VARCHAR(20),
  category_medium_cd   VARCHAR(4),
  category_medium_name VARCHAR(20),
  category_small_cd    VARCHAR(6),
  category_small_name  VARCHAR(20),
  unit_price           INTEGER,
  unit_cost            INTEGER
);

 * postgresql://padawan:***@db:5432/dsdojo_db
Done.
Done.

Out[ ]:

[]

Commentary :

This code drops the table named product_full if it exists, and creates a new table with the same name and schema. The new table has 9 columns:

product_cd column is of type VARCHAR(10)
category_major_cd column is of type VARCHAR(2)
category_major_name column is of type VARCHAR(20)
category_medium_cd column is of type VARCHAR(4)
category_medium_name column is of type VARCHAR(20)
category_small_cd column is of type VARCHAR(6)
category_small_name column is of type VARCHAR(20)
unit_price column is of type INTEGER
unit_cost column is of type INTEGER

The VARCHAR data type is used to store variable-length character strings with the specified maximum length. The INTEGER data type is used to store whole numbers.

This code creates an empty table with the specified schema, and no data is added to the table.

In [ ]:

%%sql

COPY product_full FROM '/tmp/data/S_product_full_UTF-8_noh.csv' 
WITH CSV ENCODING 'UTF-8';

 * postgresql://padawan:***@db:5432/dsdojo_db
10030 rows affected.

Out[ ]:

[]

Commentary :

This code is importing data from a CSV file named 'S_product_full_UTF-8_noh.csv' into the table 'product_full' using the COPY command in PostgreSQL.

The WITH CSV option specifies that the data is in CSV format. The ENCODING option specifies the character encoding of the file, which is UTF-8 in this case.

Since the file doesn't have a header row, the column names and types are specified explicitly in a preceding SQL statement which creates the 'product_full' table.

Once the data is imported into the table, the SELECT statement is used to retrieve the first 3 rows of the 'product_full' table to confirm that the data has been imported correctly.

In [ ]:

%%sql

SELECT 
  * 
FROM 
  product_full 
LIMIT 
  3
;

 * postgresql://padawan:***@db:5432/dsdojo_db
3 rows affected.

Out[ ]:

product_cd	category_major_cd	category_major_name	category_medium_cd	category_medium_name	category_small_cd	category_small_name	unit_price	unit_cost
P040101001	04	惣菜	0401	御飯類	040101	弁当類	198	149
P040101002	04	惣菜	0401	御飯類	040101	弁当類	218	164
P040101003	04	惣菜	0401	御飯類	040101	弁当類	230	173

Commentary : 

This code selects the first 3 rows from the "product_full" table using SQL SELECT statement. The SELECT statement retrieves all columns (product_cd, category_major_cd, category_major_name, category_medium_cd, category_medium_name, category_small_cd, category_small_name, unit_price, unit_cost) from the "product_full" table and limits the output to the first 3 rows using the LIMIT keyword. The result shows the first 3 rows of the "product_full" table.

S-099: File out the product data with category names created in 093 with the following specifications.

|file format|with/without header|character encoding| |:–:|:–:|:–:| |TSV (Tab Separated Values)|With|UTF-8|

The path to the file output destination should be as follows (COPY command authorisation has been granted).

|Output destination| |:–:| |/tmp/data|

By specifying “/tmp/data”, the data is shared with Jupyter’s “/work/data”.

In [ ]:

%%sql

COPY product_full TO '/tmp/data/S_product_full_UTF-8_header.tsv' 
WITH CSV HEADER DELIMITER E'\t' ENCODING 'UTF-8';

 * postgresql://padawan:***@db:5432/dsdojo_db
10030 rows affected.

Out[ ]:

[]

Commentary :

This code exports the contents of the product_full table in CSV format, with a tab (\t) used as a delimiter instead of a comma. The exported file is saved in the /tmp/data directory and named S_product_full_UTF-8_header.tsv. The HEADER option is included, so the first row of the file will contain the column headers.

Here's a breakdown of the individual components of the code:

COPY product_full: specifies that the data from the product_full table will be copied to a file.

TO '/tmp/data/S_product_full_UTF-8_header.tsv': specifies the file path and name of the exported file.

WITH CSV: specifies that the exported file will be in CSV format.
HEADER: specifies that the first row of the file will contain the column headers.
DELIMITER E'\t': specifies that the tab character will be used as a delimiter.
ENCODING 'UTF-8': specifies the character encoding to use when exporting the file.

S-100: Load the file in the following format created in 099, display three items of data and check that they have been imported correctly.

|File format|With/without header|Character encoding| |:–:|:–:|:–:| |TSV (Tab Separated)|With|UTF-8|

In [ ]:

%%sql

DROP TABLE IF EXISTS product_full;
CREATE TABLE product_full 
(
 product_cd           VARCHAR(10),
 category_major_cd    VARCHAR(2),
 category_major_name  VARCHAR(20),
 category_medium_cd   VARCHAR(4),
 category_medium_name VARCHAR(20),
 category_small_cd    VARCHAR(6),
 category_small_name  VARCHAR(20),
 unit_price           INTEGER,
 unit_cost            INTEGER
);

 * postgresql://padawan:***@db:5432/dsdojo_db
Done.
Done.

Out[ ]:

[]

Commentary :

This SQL code block first drops the table named product_full if it exists, and then creates a new table with the same name and schema.

The schema of the product_full table has 9 columns with the following data types:

product_cd: VARCHAR(10) (i.e., a variable-length character string with a maximum length of 10)
category_major_cd: VARCHAR(2)
category_major_name: VARCHAR(20)
category_medium_cd: VARCHAR(4)
category_medium_name: VARCHAR(20)
category_small_cd: VARCHAR(6)
category_small_name: VARCHAR(20)
unit_price: INTEGER
unit_cost: INTEGER

The purpose of this code block is to define the structure of the product_full table before data is loaded into it.

In [ ]:

%%sql

COPY product_full FROM '/tmp/data/S_product_full_UTF-8_header.tsv' 
WITH CSV HEADER DELIMITER E'\t' ENCODING 'UTF-8';

 * postgresql://padawan:***@db:5432/dsdojo_db
10030 rows affected.

Out[ ]:

[]

Commentary :

This code copies data from a TSV (tab-separated values) file located at '/tmp/data/S_product_full_UTF-8_header.tsv' and inserts it into the 'product_full' table in the current database.

The COPY command specifies that the input file is a CSV file, with headers included in the first row of the file, and that the delimiter used is a tab character ('\t'). The ENCODING parameter specifies that the file is encoded in UTF-8.

This command assumes that the 'product_full' table already exists and has the same structure as the data in the input file (product_cd, category_major_cd, category_major_name, category_medium_cd, category_medium_name, category_small_cd, category_small_name, unit_price, unit_cost). The command will insert the data from the file into the table, creating new rows in the table for each row of data in the file.

In [ ]:

%%sql

SELECT 
  * 
FROM 
  product_full 
LIMIT 
3
;

 * postgresql://padawan:***@db:5432/dsdojo_db
3 rows affected.

Out[ ]:

product_cd	category_major_cd	category_major_name	category_medium_cd	category_medium_name	category_small_cd	category_small_name	unit_price	unit_cost
P040101001	04	惣菜	0401	御飯類	040101	弁当類	198	149
P040101002	04	惣菜	0401	御飯類	040101	弁当類	218	164
P040101003	04	惣菜	0401	御飯類	040101	弁当類	230	173

Commentary :

This code executes a SQL query to select the first three rows from the product_full table. The SELECT statement is followed by *, which means to select all columns from the table. LIMIT 3 limits the result set to the first three rows of the table.

The result will display the data in the product_full table according to the columns specified in the table. The exact output will depend on the data stored in the table.

This is the end of the 100. Well done!

Data Science 100 Knocks (Structured Data Processing) - SQL

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

Data Science 100 Knocks (Structured Data Processing) - SQL part1 (Q1 to Q20)

Data Science 100 Knocks (Structured Data Processing) - SQL part2 (Q21 to Q40)

Data Science 100 Knocks (Structured Data Processing) - SQL part3 (Q41 to Q60)

Data Science 100 Knocks (Structured Data Processing) - SQL part4 (Q61 to Q80)

Data Science 100 Knocks (Structured Data Processing) - SQL part5 (Q81 to Q100)

Data Science 100 Knocks (Structured Data Processing)

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, SQL, R, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.