参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック（構造化データ加工編）」

The Data Scientist Society Github :

GitHub - The-Japan-DataScientist-Society/100knocks-preprocess: データサイエンス100本ノック（構造化データ加工編）

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Data Science 100 Knocks (Structured Data Processing) URL :

100knocks-preprocess/docker/work/answer/ans_preprocess_knock_SQL.ipynb at master · The-Japan-DataScientist-Society/100knocks-preprocess

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers.
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

You can get preprocess_knock files to practice python, SQL, R from my github account

ChanhiYasutomi - Repositories

ChanhiYasutomi has 20 repositories available. Follow their code on GitHub.

Introduction
How to use
100 data processing knocks

Introduction

The database is PostgreSQL13
First, execute the following cell
You can issue SQL by writing %%sql in the cell
You cannot check the table structure with the describe command from jupyter, so if you want to check the table structure, use a SELECT with a limit, for example.
You can use any familiar SQL client (connection information is as follows)
- IP address: localhost for Docker Desktop, 192.168.99.100 for Docker toolbox
- Port:5432
- Database name: dsdojo_db
- User name: padawan
- Password: padawan12345
It is recommended to limit the number of outputs, as Jupyter may freeze if a large amount of outputs are performed (the number of outputs is also mentioned in the question).
- Data processing also requires the ability to control the amount of results to be displayed for confirmation and to make light work of them.
If a large number of results are output, the file may become so heavy that it cannot be opened afterwards.
- In this case, the results of your work will be lost, but you can retrieve the file from GitHub.
- You can also delete large output ranges using a vim editor, for example.
Names, addresses etc. are dummy data and do not exist.

In [ ]:

%load_ext sql
import os

pgconfig = {
    'host': 'db',
    'port': os.environ['PG_PORT'],
    'database': os.environ['PG_DATABASE'],
    'user': os.environ['PG_USER'],
    'password': os.environ['PG_PASSWORD'],
}
dsl = 'postgresql://{user}:{password}@{host}:{port}/{database}'.format(**pgconfig)

# Set up Magic commands to write SQL
%sql $dsl

Out[ ]:

'Connected: padawan@dsdojo_db'

How to use

You can execute SQL from Jupyter against PostgreSQL by writing %%sql at the beginning of the cell and SQL in the second and subsequent lines.

In [ ]:

%%sql
SELECT 'You can run it like this' AS sample;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

sample
You can run it like this

100 data processing knocks

S-021: Count the number of cases against the receipt details data (df_receipt).

In [ ]:

%%sql
-- Code Example 1
SELECT 
  COUNT(1) 
FROM 
  receipt
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

count
104681

Commentary :

The code is a SQL query that selects the count of rows from a table called "receipt".

The "SELECT" keyword is used to indicate that we want to retrieve data from the database.

The "COUNT" function is used to count the number of rows in the "receipt" table. In this case, the "COUNT(1)" statement is used to count the number of rows in the table where the value in the first column of each row is not null. This is equivalent to using "COUNT(*)", which counts all rows in the table regardless of whether any columns are null.

The "FROM" keyword is used to indicate the table we want to retrieve data from, in this case, the "receipt" table.

So, in summary, the query selects and returns the count of rows in the "receipt" table.

In [ ]:

%%sql
-- Code example 2 (* is also acceptable).
SELECT 
  COUNT(*) 
FROM 
  receipt
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

count
104681

Commentary :

The code is a SQL query that selects the count of rows from a table called "receipt".

The "SELECT" keyword is used to indicate that we want to retrieve data from the database.

The "COUNT" function is used to count the number of rows in the "receipt" table. In this case, the "COUNT(*)" statement is used to count all rows in the table, regardless of whether any columns are null or not.

The "FROM" keyword is used to indicate the table we want to retrieve data from, in this case, the "receipt" table.

So, in summary, the query selects and returns the count of all rows in the "receipt" table, including those with null values in any columns.

S-022: Count the number of unique cases against the customer ID (customer_id) of the receipt details data (df_receipt).

In [ ]:

%%sql

SELECT
  COUNT(DISTINCT customer_id)
FROM 
  receipt
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

count
8307

Commentary :

This code is a SQL query that selects the count of distinct customer IDs from a table called "receipt".

The "%%sql" is a Jupyter Notebook cell magic command to indicate that this code is SQL code.

The "SELECT" keyword is used to indicate that we want to retrieve data from the database.

The "COUNT" function is used to count the number of distinct customer IDs in the "receipt" table. In this case, the "COUNT(DISTINCT customer_id)" statement is used to count the number of unique customer IDs in the table.

The "DISTINCT" keyword is used to specify that only unique values of the "customer_id" column should be counted.

The "FROM" keyword is used to indicate the table we want to retrieve data from, in this case, the "receipt" table.

So, in summary, the query selects and returns the count of unique customer IDs in the "receipt" table.

S-023: Sum the sales amount (amount) and sales quantity (quantity) for each shop code (store_cd) in the receipt details data (df_receipt).

In [ ]:

%%sql

SELECT 
  store_cd, 
  SUM(amount) AS amount, 
  SUM(quantity) AS quantity
FROM 
  receipt
Group By
  store_cd
;

 * postgresql://padawan:***@db:5432/dsdojo_db
52 rows affected.

Out[ ]:

store_cd	amount	quantity
S12007	638761	2099
S13017	748221	2376
S13043	587895	1881
S13052	100314	250
S13016	793773	2432
S14027	714550	2303
S13009	808870	2486
S14022	651328	2047
S13019	827833	2541
S13039	611888	1981
S14046	412646	1354
S13003	764294	2197
S14028	786145	2458
S14045	458484	1398
S13002	727821	2340
S14042	534689	1935
S13004	779373	2390
S13038	708884	2337
S12014	725167	2358
S14021	699511	2231
S14006	712839	2284
S13001	811936	2347
S14023	727630	2258
S14025	755581	2394
S13035	715869	2219
S14048	234276	769
S14012	720600	2412
S14024	736323	2417
S13041	728266	2233
S14026	824537	2503
S14010	790361	2290
S13015	780873	2248
S13008	809288	2491
S14049	230808	788
S14050	167090	580
S13031	705968	2336
S12013	787513	2425
S13044	520764	1729
S14033	725318	2282
S12029	794741	2555
S13037	693087	2344
S12030	684402	2403
S13032	790501	2491
S14034	653681	2024
S14040	701858	2233
S13005	629876	2004
S14047	338329	1041
S14011	805724	2434
S13018	790535	2562
S13020	796383	2383
S13051	107452	354
S14036	203694	635

Commentary :

SQL code is repeated twice.

The code selects three columns, "store_cd", "amount", and "quantity" from a table called "receipt". The table contains sales data for a store.

The "SUM" function is used to calculate the sum of the "amount" and "quantity" columns for each store.

The "GROUP BY" clause is used to group the sales data by store. This means that the query will return the sum of "amount" and "quantity" for each unique store code in the "store_cd" column.

The first occurrence of the code might be a duplication or a mistake, as it is the exact same as the second occurrence. Nonetheless, the query selects and returns the total amount and quantity of products sold for each store in the "receipt" table.

S-024: Find the most recent sales date (sales_ymd) for each customer ID (customer_id) for the receipt details data (df_receipt) and display the 10 most recent sales dates (sales_ymd).

In [ ]:

%%sql

SELECT
  customer_id,
  MAX(sales_ymd)
FROM 
  receipt
GROUP BY 
  customer_id
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	max
CS001311000059	20180211
CS004614000122	20181228
CS003512000043	20180106
CS011615000061	20190503
CS029212000033	20180621
CS007515000119	20190511
CS034515000123	20190708
CS004315000058	20170517
CS026414000014	20190720
CS001615000099	20170729

Commentary :

This code is a SQL query that selects the maximum sales date (sales_ymd) for each customer from a table called "receipt".

The "%%sql" is a Jupyter Notebook cell magic command to indicate that this code is SQL code.

The "SELECT" keyword is used to indicate that we want to retrieve data from the database.

The "customer_id" and "MAX(sales_ymd)" are the columns that we want to select from the "receipt" table. The "MAX" function is used to select the maximum value of the "sales_ymd" column for each unique customer ID in the "customer_id" column.

The "FROM" keyword is used to indicate the table we want to retrieve data from, in this case, the "receipt" table.

The "GROUP BY" clause is used to group the sales data by customer ID. This means that the query will return the maximum sales date for each unique customer ID in the "customer_id" column.

The "LIMIT" keyword is used to limit the number of results returned by the query. In this case, the "LIMIT 10" statement limits the results to the top 10 rows.

So, in summary, the query selects and returns the maximum sales date for each customer from the "receipt" table and limits the results to the top 10 customers.

S-025: For receipt details data (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display the 10 results.

In [ ]:

%%sql

SELECT
  customer_id,
  MIN(sales_ymd)
FROM 
  receipt
GROUP BY 
  customer_id
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	min
CS001311000059	20180211
CS004614000122	20181228
CS003512000043	20180106
CS011615000061	20190503
CS029212000033	20170318
CS007515000119	20170201
CS034515000123	20170527
CS004315000058	20170517
CS026414000014	20170718
CS001615000099	20170729

Commentary :

This code is a SQL query that selects the minimum sales date (sales_ymd) for each customer from a table called "receipt".

The "%%sql" is a Jupyter Notebook cell magic command to indicate that this code is SQL code.

The "SELECT" keyword is used to indicate that we want to retrieve data from the database.

The "customer_id" and "MIN(sales_ymd)" are the columns that we want to select from the "receipt" table. The "MIN" function is used to select the minimum value of the "sales_ymd" column for each unique customer ID in the "customer_id" column.

The "FROM" keyword is used to indicate the table we want to retrieve data from, in this case, the "receipt" table.

The "GROUP BY" clause is used to group the sales data by customer ID. This means that the query will return the minimum sales date for each unique customer ID in the "customer_id" column.

The "LIMIT" keyword is used to limit the number of results returned by the query. In this case, the "LIMIT 10" statement limits the results to the top 10 rows.

So, in summary, the query selects and returns the minimum sales date for each customer from the "receipt" table and limits the results to the top 10 customers.

S-026: For receipt details data (df_receipt), find the newest sales date (sales_ymd) and the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 cases where both are different.

In [ ]:

%%sql

SELECT
  customer_id,
  MAX(sales_ymd),
  MIN(sales_ymd)
FROM 
  receipt
GROUP BY 
  customer_id
HAVING 
  MAX(sales_ymd) != MIN(sales_ymd)
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	max	min
CS029212000033	20180621	20170318
CS007515000119	20190511	20170201
CS034515000123	20190708	20170527
CS026414000014	20190720	20170718
CS010515000082	20181204	20180518
CS019315000045	20170920	20170423
CS008513000099	20190308	20170722
CS007615000070	20191025	20170929
CS025415000155	20191026	20170314
CS016414000063	20190617	20170109

Commentary :

This code is a SQL query that selects the customer IDs, maximum sales date, and minimum sales date from a table called "receipt".

The "%%sql" is a Jupyter Notebook cell magic command to indicate that this code is SQL code.

The "SELECT" keyword is used to indicate that we want to retrieve data from the database.

The "customer_id", "MAX(sales_ymd)", and "MIN(sales_ymd)" are the columns that we want to select from the "receipt" table. The "MAX" function is used to select the maximum value of the "sales_ymd" column for each unique customer ID in the "customer_id" column, and the "MIN" function is used to select the minimum value of the "sales_ymd" column for each unique customer ID in the "customer_id" column.

The "FROM" keyword is used to indicate the table we want to retrieve data from, in this case, the "receipt" table.

The "GROUP BY" clause is used to group the sales data by customer ID. This means that the query will return the maximum and minimum sales dates for each unique customer ID in the "customer_id" column.

The "HAVING" clause is used to filter the results based on a condition. In this case, the condition is that the maximum sales date and minimum sales date should not be equal for a given customer ID. This means that the query will only return customers who have made purchases on different dates.

The "LIMIT" keyword is used to limit the number of results returned by the query. In this case, the "LIMIT 10" statement limits the results to the top 10 rows.

So, in summary, the query selects and returns the customer IDs, maximum sales date, and minimum sales date from the "receipt" table for customers who have made purchases on different dates and limits the results to the top 10 customers.

S-027: For the receipt details data (df_receipt), calculate the average of the sales amount (amount) for each shop code (store_cd) and display the top 5 in descending order.

In [ ]:

%%sql

SELECT
  store_cd,
  AVG(amount) AS avg_amount
FROM 
  receipt
GROUP BY 
  store_cd
ORDER BY 
  avg_amount DESC
LIMIT 
  5
;

 * postgresql://padawan:***@db:5432/dsdojo_db
5 rows affected.

Out[ ]:

store_cd	avg_amount
S13052	402.8674698795180723
S13015	351.1119604316546763
S13003	350.9155188246097337
S14010	348.7912621359223301
S13001	348.4703862660944206

Commentary :

This code is a SQL query that selects the store codes and their average sales amount from a table called "receipt".

The "%%sql" is a Jupyter Notebook cell magic command to indicate that this code is SQL code.

The "SELECT" keyword is used to indicate that we want to retrieve data from the database.

The "store_cd" and "AVG(amount) AS avg_amount" are the columns that we want to select from the "receipt" table. The "AVG" function is used to calculate the average sales amount for each unique store code in the "store_cd" column. The "AS" keyword is used to give the column an alias, "avg_amount", which is the name that will be displayed in the query results.

The "FROM" keyword is used to indicate the table we want to retrieve data from, in this case, the "receipt" table.

The "GROUP BY" clause is used to group the sales data by store code. This means that the query will calculate the average sales amount for each unique store code in the "store_cd" column.

The "ORDER BY" clause is used to sort the results in descending order based on the average sales amount. This means that the stores with the highest average sales amount will be displayed at the top of the results.

The "LIMIT" keyword is used to limit the number of results returned by the query. In this case, the "LIMIT 5" statement limits the results to the top 5 stores.

So, in summary, the query selects and returns the store codes and their average sales amount from the "receipt" table, sorted in descending order based on the average sales amount, and limits the results to the top 5 stores.

S-028: Calculate the median sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display the TOP 5 in descending order.

In [ ]:

%%sql

SELECT 
  store_cd, 
  PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY amount) AS amount_50per
FROM 
  receipt
GROUP BY 
  store_cd
ORDER BY 
  amount_50per DESC
LIMIT 
  5
;

 * postgresql://padawan:***@db:5432/dsdojo_db
5 rows affected.

Out[ ]:

store_cd	amount_50per
S13052	190.0
S14010	188.0
S14050	185.0
S13003	180.0
S13018	180.0

Commentary :

This is a SQL query that retrieves data from a table called "receipt".

The query selects two columns from the table: "store_cd" and "amount_50per". The "store_cd" column represents the store code, and the "amount_50per" column calculates the median amount of money spent per receipt within each store.

The function used to calculate the median is "PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY amount)". This function orders the "amount" column in ascending order and then calculates the value at the 50th percentile. In other words, it finds the middle value of the ordered list of amounts.

The query groups the results by store code using the "GROUP BY" clause, and orders the results in descending order of the median amount using the "ORDER BY" clause. It then limits the output to the top 5 results using the "LIMIT" clause.

So, the query returns the top 5 stores with the highest median amount spent per receipt.

S-029: For receipt details data (df_receipt), calculate the mode value of the product code (product_cd) for each shop code (store_cd) and display 10 cases.

In [ ]:

%%sql

-- Code example 1: tabulating mode with window and analytic functions
WITH product_cnt AS (
    SELECT
      store_cd,
      product_cd,
      COUNT(1) AS mode_cnt
    FROM 
      receipt
    GROUP BY
      store_cd,
      product_cd
),
product_mode AS (
    SELECT
      store_cd,
      product_cd,
      mode_cnt,
      RANK() OVER(PARTITION BY store_cd ORDER BY mode_cnt DESC) AS rnk
    FROM 
      product_cnt
)
SELECT
  store_cd,
  product_cd,
  mode_cnt
FROM 
  product_mode
WHERE
  rnk = 1
ORDER BY
  store_cd,
  product_cd
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

store_cd	product_cd	mode_cnt
S12007	P060303001	72
S12013	P060303001	107
S12014	P060303001	65
S12029	P060303001	92
S12030	P060303001	115
S13001	P060303001	67
S13002	P060303001	78
S13003	P071401001	65
S13004	P060303001	88
S13005	P040503001	36

Commentary :

This is a SQL query that retrieves data from a table called "receipt".

The query uses a Common Table Expression (CTE) called "product_cnt" to calculate the number of times each product has been sold at each store. It groups the data by "store_cd" and "product_cd" using the "GROUP BY" clause and counts the number of rows in each group using the "COUNT(1)" function. The results are stored in the "mode_cnt" column.

The query then uses another CTE called "product_mode" to rank the "mode_cnt" values for each store's products using the "RANK() OVER(PARTITION BY store_cd ORDER BY mode_cnt DESC)" function. This function assigns a rank to each product based on its "mode_cnt" value, within each store. The products with the highest mode count are assigned a rank of 1.

Finally, the query selects the top-selling product for each store using the "WHERE rnk = 1" clause to filter the results, and orders them by store and product code using the "ORDER BY" clause. It returns the store code, product code, and mode count for the top-selling product at each store, and limits the output to the first 10 rows using the "LIMIT" clause.

In summary, this query finds the top-selling product at each store, based on the number of times it has been sold, and returns the store code, product code, and mode count for each of the top 10 products.

In [ ]:

%%sql

-- Code example 2: simple case of using MODE() (fast, but only one mode is selected if there is more than one mode)
SELECT
  store_cd,
  MODE() WITHIN GROUP(ORDER BY product_cd)
FROM 
  receipt
GROUP BY 
  store_cd
ORDER BY
  store_cd
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

store_cd	mode
S12007	P060303001
S12013	P060303001
S12014	P060303001
S12029	P060303001
S12030	P060303001
S13001	P060303001
S13002	P060303001
S13003	P071401001
S13004	P060303001
S13005	P040503001

Commentary :

This is a SQL query that retrieves data from a table called "receipt".

The query selects two columns: "store_cd" and the mode of the "product_cd" column for each store.

The "MODE() WITHIN GROUP(ORDER BY product_cd)" function calculates the mode of the "product_cd" values within each group of "store_cd". The mode is the most frequently occurring value in a set of data. The "ORDER BY" clause specifies that the "product_cd" values should be ordered in ascending order before calculating the mode.

The query groups the data by "store_cd" using the "GROUP BY" clause and orders the results by "store_cd" using the "ORDER BY" clause. It limits the output to the first 10 rows using the "LIMIT" clause.

So, the query returns the mode of the "product_cd" values for each store, which represents the most commonly sold product at each store. The results are ordered by store code and limited to the top 10 stores.

S-030: Calculate the variance of the sales amount (amount) for each shop code (store_cd) against the receipt details data (df_receipt) and display five cases in descending order.

In [ ]:

%%sql

SELECT
  store_cd,
  VAR_POP(amount) AS vars_amount
FROM 
  receipt
GROUP BY 
  store_cd
ORDER BY 
  vars_amount DESC 
LIMIT 
  5
;

 * postgresql://padawan:***@db:5432/dsdojo_db
5 rows affected.

Out[ ]:

store_cd	vars_amount
S13052	440088.701311269173
S14011	306314.558163888889
S14034	296920.081011283873
S13001	295431.993329035348
S13015	295294.361115940880

Commentary :

This is a SQL query that retrieves data from a table called "receipt".

The query selects two columns: "store_cd" and the population variance of the "amount" column for each store.

The "VAR_POP(amount)" function calculates the variance of the "amount" values within each group of "store_cd". The variance is a measure of the spread or dispersion of a set of data. The "POP" in "VAR_POP" stands for "population", which means that the entire dataset is used to calculate the variance.

The query groups the data by "store_cd" using the "GROUP BY" clause and orders the results by the "vars_amount" column, which contains the variance of the "amount" values for each store, in descending order using the "ORDER BY" clause. It limits the output to the first 5 rows using the "LIMIT" clause.

So, the query returns the top 5 stores with the highest population variance in the "amount" values, which means that these stores have the highest spread or dispersion of the amount spent per receipt.

S-031: Calculate the standard deviation of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display 5 cases in descending order.

In [ ]:

%%sql

SELECT
  store_cd,
  STDDEV_POP(amount) as stds_amount
FROM 
  receipt
GROUP BY 
  store_cd
ORDER BY 
  stds_amount DESC
LIMIT 
  5
;

 * postgresql://padawan:***@db:5432/dsdojo_db
5 rows affected.

Out[ ]:

store_cd	stds_amount
S13052	663.391815830787
S14011	553.456916267101
S14034	544.903735545357
S13001	543.536561170484
S13015	543.409938366921

Commentary :

This is a SQL query that retrieves data from a table called "receipt".

The query selects two columns: "store_cd" and the population standard deviation of the "amount" column for each store.

The "STDDEV_POP(amount)" function calculates the standard deviation of the "amount" values within each group of "store_cd". The standard deviation is a measure of the spread or dispersion of a set of data. The "POP" in "STDDEV_POP" stands for "population", which means that the entire dataset is used to calculate the standard deviation.

The query groups the data by "store_cd" using the "GROUP BY" clause and orders the results by the "stds_amount" column, which contains the standard deviation of the "amount" values for each store, in descending order using the "ORDER BY" clause. It limits the output to the first 5 rows using the "LIMIT" clause.

So, the query returns the top 5 stores with the highest population standard deviation in the "amount" values, which means that these stores have the highest spread or dispersion of the amount spent per receipt.

S-032: Find the percentile values for the sales amount (amount) in the receipt details data (df_receipt) in 25% increments.

In [ ]:

%%sql

SELECT
  PERCENTILE_CONT(0.25) WITHIN GROUP(ORDER BY amount) AS amount_25per,
  PERCENTILE_CONT(0.50) WITHIN GROUP(ORDER BY amount) AS amount_50per,
  PERCENTILE_CONT(0.75) WITHIN GROUP(ORDER BY amount) AS amount_75per,
  PERCENTILE_CONT(1.0) WITHIN GROUP(ORDER BY amount) AS amount_100per
FROM 
  receipt
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

amount_25per	amount_50per	amount_75per	amount_100per
102.0	170.0	288.0	10925.0

Commentary :

This is SQL code used to calculate percentile values for the "amount" column in the "receipt" table. The code uses the PERCENTILE_CONT function to compute continuous percentiles, which means that the percentile values will correspond to actual values in the data set rather than interpolated values between data points.

The code specifies four percentile values to be computed: the 25th percentile (amount_25per), the 50th percentile (amount_50per), the 75th percentile (amount_75per), and the 100th percentile (amount_100per). These percentiles are calculated using the WITHIN GROUP clause to order the "amount" column in ascending order before computing the percentiles.

The SELECT statement at the end of the code retrieves the calculated percentile values from the query and returns them as a single row of output. This code is useful for analyzing the distribution of the "amount" column in the "receipt" table and identifying any outliers or unusual patterns in the data.

S-033: Calculate the average of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and extract those that are 330 or more.

In [ ]:

%%sql

SELECT
  store_cd,
  AVG(amount) AS avg_amount
FROM 
  receipt
GROUP BY 
  store_cd
HAVING
  AVG(amount) >= 330
;

 * postgresql://padawan:***@db:5432/dsdojo_db
13 rows affected.

Out[ ]:

store_cd	avg_amount
S13052	402.8674698795180723
S13019	330.2086158755484643
S13003	350.9155188246097337
S14045	330.0820734341252700
S13004	330.9439490445859873
S13001	348.4703862660944206
S14026	332.3405884723901653
S14010	348.7912621359223301
S13015	351.1119604316546763
S12013	330.1941299790356394
S14047	330.0770731707317073
S14011	335.7183333333333333
S13020	337.8799321170980059

Commentary :

This is SQL code used to retrieve average transaction amounts for each store in the "receipt" table and filter the results based on a condition.

The code begins with a SELECT statement that specifies two columns to be retrieved: the "store_cd" column and the average of the "amount" column for each store. The AVG function is used to calculate the average of the "amount" column for each store, and the AS keyword is used to assign the alias "avg_amount" to this calculated column.

The next line of code uses the GROUP BY clause to group the data by the "store_cd" column, which means that the average transaction amount will be calculated separately for each store.

The code then uses the HAVING clause to filter the results based on a condition. Specifically, the condition requires that the average transaction amount for each store be greater than or equal to 330. Any stores that do not meet this condition will not be included in the output.

The resulting output will be a table with two columns: "store_cd" and "avg_amount". Each row will correspond to a store in the "receipt" table that meets the condition specified in the HAVING clause, and the "avg_amount" column will contain the average transaction amount for that store. This code is useful for analyzing the performance of different stores based on the average transaction amount.

S-034: For the receipt details data (df_receipt), sum the amount of sales (amount) for each customer ID (customer_id) and find the average for all customers. However, exclude customer IDs starting with “Z” as they represent non-members.

In [ ]:

%%sql

WITH customer_amount AS (
    SELECT
      customer_id,
      SUM(amount) AS sum_amount
    FROM 
      receipt
    WHERE
      customer_id NOT LIKE 'Z%'
    GROUP BY 
      customer_id
)
SELECT
  AVG(sum_amount)
FROM 
  customer_amount
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

avg
2547.7422345292559595

Commentary :

This is SQL code that calculates the average of the total amount spent by each customer in the "receipt" table, excluding customers whose IDs start with the letter "Z".

The code starts with a common table expression (CTE) defined using the WITH clause. The CTE is named "customer_amount" and it contains a SELECT statement that retrieves the "customer_id" column and the sum of the "amount" column for each customer in the "receipt" table. The WHERE clause filters out any customers whose IDs start with the letter "Z" using the NOT LIKE operator. The results are grouped by the "customer_id" column, and the resulting table has two columns: "customer_id" and "sum_amount", which is the total amount spent by each customer.

The next line of code retrieves the average of the "sum_amount" column from the "customer_amount" CTE. The SELECT statement uses the AVG function to calculate the average of the "sum_amount" column, which represents the average total amount spent by each customer whose ID does not start with "Z".

The resulting output will be a single value representing the average of the "sum_amount" column from the "customer_amount" CTE. This code is useful for analyzing the average spending patterns of customers and identifying any outliers or unusual patterns in the data.

S-035: For the receipt details data (df_receipt), sum the sales amount (amount) for each customer ID (customer_id) to obtain the average of all customers, extract the customers who spend more than the average and display 10 items. However, exclude customer IDs starting with “Z” as they represent non-members.

In [ ]:

%%sql

WITH customer_amount AS (
    SELECT
      customer_id,
      SUM(amount) AS sum_amount
    FROM 
      receipt
    WHERE
      customer_id NOT LIKE 'Z%'
    GROUP BY 
      customer_id
)
SELECT
  customer_id,
  sum_amount
FROM 
  customer_amount
WHERE
  sum_amount >= (
      SELECT
        AVG(sum_amount)
      FROM 
        customer_amount
    )
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	sum_amount
CS029212000033	3604
CS007515000119	7157
CS034515000123	3699
CS026414000014	6671
CS007615000070	2975
CS016414000063	6207
CS012514000018	2562
CS029515000142	3420
CS015215000021	3090
CS039814000011	8031

Commentary :

This is SQL code that retrieves the "customer_id" and "sum_amount" columns for the top 10 customers who have spent more than the average total amount spent by customers in the "receipt" table, excluding customers whose IDs start with the letter "Z".

The code starts with a common table expression (CTE) defined using the WITH clause. The CTE is named "customer_amount" and it contains a SELECT statement that retrieves the "customer_id" column and the sum of the "amount" column for each customer in the "receipt" table. The WHERE clause filters out any customers whose IDs start with the letter "Z" using the NOT LIKE operator. The results are grouped by the "customer_id" column, and the resulting table has two columns: "customer_id" and "sum_amount", which is the total amount spent by each customer.

The next line of code retrieves the "customer_id" and "sum_amount" columns from the "customer_amount" CTE, but only for customers whose "sum_amount" is greater than or equal to the average "sum_amount" calculated over all customers in the CTE. This is achieved using a subquery that calculates the average "sum_amount" using the AVG function and then filters the "customer_amount" CTE using the WHERE clause.

The results of this query are then limited to the top 10 rows using the LIMIT clause.

The resulting output will be a table with two columns: "customer_id" and "sum_amount". Each row will correspond to a customer who has spent more than the average total amount spent by customers in the "receipt" table, excluding customers whose IDs start with the letter "Z". The table will be sorted in descending order based on the "sum_amount" column, and only the top 10 rows will be included in the output. This code is useful for identifying high-spending customers and analyzing their spending patterns.

S-036: Combine receipt details data (df_receipt) and shop data (df_store) internally and display all items of receipt details data and store name (store_name) of shop data for 10 items.

In [ ]:

%%sql

SELECT
  r.*,
  s.store_name
FROM 
  receipt r
JOIN 
  store s
ON
  r.store_cd = s.store_cd
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

sales_ymd	sales_epoch	store_cd	receipt_no	receipt_sub_no	customer_id	product_cd	quantity	amount	store_name
20181103	1541203200	S14006	112	1	CS006214000001	P070305012	1	158	葛が谷店
20181118	1542499200	S13008	1132	2	CS008415000097	P070701017	1	81	成城店
20170712	1499817600	S14028	1102	1	CS028414000014	P060101005	1	170	二ツ橋店
20190205	1549324800	S14042	1132	1	ZZ000000000000	P050301001	1	25	新山下店
20180821	1534809600	S14025	1102	2	CS025415000050	P060102007	1	90	大和店
20190605	1559692800	S13003	1112	1	CS003515000195	P050102002	1	138	狛江店
20181205	1543968000	S14024	1102	2	CS024514000042	P080101005	1	30	三田店
20190922	1569110400	S14040	1102	1	CS040415000178	P070501004	1	128	長津田店
20170504	1493856000	S13020	1112	2	ZZ000000000000	P071302010	1	770	十条仲原店
20191010	1570665600	S14027	1102	1	CS027514000015	P071101003	1	680	南藤沢店

Commentary :

This is SQL code that retrieves the first 10 rows from a joined table containing data from the "receipt" and "store" tables. The resulting table will contain all columns from the "receipt" table, plus an additional column "store_name" from the "store" table.

The code starts with a SELECT statement that specifies the columns to retrieve. In this case, the SELECT statement uses the wildcard (*) to retrieve all columns from the "receipt" table. The SELECT statement also specifies the "store_name" column from the "store" table, which will be included in the output table.

The FROM clause specifies that the data is being retrieved from the "receipt" table, and the JOIN clause specifies that the "store" table is being joined to the "receipt" table using the "store_cd" column as the join condition. This means that rows from the "receipt" table and the "store" table will be combined into a single table based on matching values in the "store_cd" column.

The resulting output will be a table containing all columns from the "receipt" table, plus an additional column "store_name" from the "store" table. The table will be limited to the first 10 rows using the LIMIT clause. This code is useful for analyzing data from two related tables and combining the data into a single table for analysis.

S-037: Join product data (df_product) and category data (df_category) internally and display all items of the product data and 10 category sub-category names (category_small_name) of the category data.

In [ ]:

%%sql

SELECT
  p.*,
  c.category_small_name
FROM 
  product p
JOIN 
  category c
ON
  p.category_small_cd = c.category_small_cd
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

product_cd	category_major_cd	category_medium_cd	category_small_cd	unit_price	unit_cost	category_small_name
P040101001	04	0401	040101	198	149	弁当類
P040101002	04	0401	040101	218	164	弁当類
P040101003	04	0401	040101	230	173	弁当類
P040101004	04	0401	040101	248	186	弁当類
P040101005	04	0401	040101	268	201	弁当類
P040101006	04	0401	040101	298	224	弁当類
P040101007	04	0401	040101	338	254	弁当類
P040101008	04	0401	040101	420	315	弁当類
P040101009	04	0401	040101	498	374	弁当類
P040101010	04	0401	040101	580	435	弁当類

Commentary :

This code is written in SQL, a programming language used for managing and querying relational databases.

The code selects data from two tables: "product" and "category". The "JOIN" keyword is used to combine the rows from both tables where the "category_small_cd" column values match. This creates a new table that contains columns from both tables.

The "SELECT" keyword is used to choose which columns to include in the output. In this case, it selects all columns from the "product" table (specified by the "p.*" syntax) and the "category_small_name" column from the "category" table.

The "LIMIT" keyword limits the output to the first 10 rows. This is useful when working with large datasets to quickly test queries or get a sample of the data.

So, this code will return the first 10 rows of the combined "product" and "category" tables, where the "category_small_cd" values match, and will include all columns from the "product" table and the "category_small_name" column from the "category" table.

S-038: Calculate the total sales amount for each customer from the customer data (df_customer) and receipt details data (df_receipt) and display 10 items. However, for customers with no sales records, the sales amount should be displayed as 0. Customers whose gender code (gender_cd) is female (1) should be included, and non-members (whose customer ID starts with “Z”) should be excluded.

In [ ]:

%%sql

WITH customer_amount AS (
  SELECT
    customer_id,
    SUM(amount) AS sum_amount
  FROM 
    receipt
  GROUP BY
    customer_id
),
customer_data AS (
    SELECT
      customer_id
    FROM 
      customer
    WHERE
      gender_cd = '1'
    AND 
      customer_id NOT LIKE 'Z%'
)
SELECT
  c.customer_id,
  COALESCE(a.sum_amount, 0)
FROM 
  customer_data c
LEFT OUTER JOIN 
  customer_amount a
ON
  c.customer_id = a.customer_id
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	coalesce
CS021313000114	0
CS031415000172	5088
CS028811000001	0
CS001215000145	875
CS015414000103	3122
CS033513000180	868
CS035614000014	0
CS011215000048	3444
CS009413000079	0
CS040412000191	210

Commentary :

This code is also written in SQL and is a bit more complex than the previous one. It selects data from three tables: "receipt", "customer" and "customer_amount".

The code uses a "WITH" clause to define two subqueries: "customer_amount" and "customer_data".

The "customer_amount" subquery aggregates the "amount" column from the "receipt" table by "customer_id" using the "SUM" function and gives an alias "sum_amount" to the result. This subquery will create a new table that contains two columns: "customer_id" and "sum_amount".

The "customer_data" subquery selects all rows from the "customer" table where "gender_cd" is equal to '1' and the "customer_id" does not start with the letter 'Z'. This subquery will create a new table that contains a single column: "customer_id".

The final query joins the "customer_data" table with the "customer_amount" table using a "LEFT OUTER JOIN" operation. This will return all rows from the "customer_data" table and matching rows from the "customer_amount" table, if any exist. If there is no match in the "customer_amount" table, the "COALESCE" function is used to replace the "sum_amount" value with 0.

The "SELECT" clause chooses two columns to include in the output: "customer_id" from the "customer_data" table and "COALESCE(a.sum_amount, 0)" from the "customer_amount" table (or 0 if there is no matching row in "customer_amount"). The "LIMIT" clause limits the output to the first 10 rows.

So, this code will return the first 10 rows of the "customer_data" table, along with the sum of their receipts (or 0 if they have no receipts), for male customers whose "customer_id" does not start with the letter 'Z'.

S-039: From the receipt details data (df_receipt), create a data set containing the top 20 customers with the highest number of days of sales and a data set containing the top 20 customers with the highest total sales value, respectively, and furthermore, merge these two completely externally. However, exclude non-members (whose customer ID starts with “Z”).

In [ ]:

%%sql

WITH customer_data AS (
    SELECT
      customer_id,
      sales_ymd,
      amount
    FROM 
      receipt
    WHERE
      customer_id NOT LIKE 'Z%'
),
customer_days AS (
    SELECT
      customer_id,
      COUNT(DISTINCT sales_ymd) come_days
    FROM 
      customer_data
    GROUP BY
      customer_id
    ORDER BY
      come_days DESC
    LIMIT 
      20
),
customer_amount AS (
    SELECT
      customer_id,
      SUM(amount) buy_amount
    FROM 
      customer_data
    GROUP BY
      customer_id
    ORDER BY
      buy_amount DESC
    LIMIT 
      20
)
SELECT
  COALESCE(d.customer_id, a.customer_id) customer_id,
  d.come_days,
  a.buy_amount
FROM 
  customer_days d
FULL OUTER JOIN 
  customer_amount a
ON
  d.customer_id = a.customer_id
;

 * postgresql://padawan:***@db:5432/dsdojo_db
34 rows affected.

Out[ ]:

customer_id	come_days	buy_amount
CS040214000008	23	None
CS015415000185	22	20153
CS010214000010	22	18585
CS028415000007	21	19127
CS010214000002	21	None
CS017415000097	20	23086
CS016415000141	20	18372
CS021514000045	19	None
CS022515000226	19	None
CS031414000051	19	19202
CS039414000052	19	None
CS014214000023	19	None
CS021515000172	19	None
CS031414000073	18	None
CS007515000107	18	None
CS014415000077	18	None
CS021515000056	18	None
CS032415000209	18	None
CS021515000211	18	None
CS022515000028	18	None
CS011415000006	None	16094
CS016415000101	None	16348
CS030415000034	None	15468
CS021515000089	None	17580
CS034415000047	None	16083
CS006515000023	None	18372
CS038415000104	None	17847
CS015515000034	None	15300
CS032414000072	None	16563
CS011414000106	None	18338
CS001605000009	None	18925
CS009414000059	None	15492
CS035414000024	None	17615
CS007514000094	None	15735

Commentary :

This code is also written in SQL and is a bit more complex than the previous ones. It selects data from the "receipt" table and creates three subqueries: "customer_data", "customer_days", and "customer_amount".

The "customer_data" subquery selects rows from the "receipt" table where the "customer_id" does not start with the letter 'Z' and gives an alias to three columns: "customer_id", "sales_ymd", and "amount". This subquery will create a new table that contains three columns: "customer_id", "sales_ymd", and "amount".

The "customer_days" subquery aggregates the "sales_ymd" column from the "customer_data" subquery by "customer_id" using the "COUNT(DISTINCT sales_ymd)" function and gives an alias "come_days" to the result. It then orders the result by "come_days" in descending order and limits the output to the top 20 rows. This subquery will create a new table that contains two columns: "customer_id" and "come_days".

The "customer_amount" subquery aggregates the "amount" column from the "customer_data" subquery by "customer_id" using the "SUM" function and gives an alias "buy_amount" to the result. It then orders the result by "buy_amount" in descending order and limits the output to the top 20 rows. This subquery will create a new table that contains two columns: "customer_id" and "buy_amount".

The final query joins the "customer_days" table with the "customer_amount" table using a "FULL OUTER JOIN" operation. This will return all rows from both tables, matching rows from both tables, and NULL values where there is no match. The "COALESCE" function is used to replace the NULL values in "customer_id" with the corresponding value from the other table.

The "SELECT" clause chooses three columns to include in the output: "COALESCE(d.customer_id, a.customer_id)" to get the customer_id, "d.come_days" from the "customer_days" table, and "a.buy_amount" from the "customer_amount" table.

So, this code will return the top 20 customers with the highest "come_days" (i.e., the number of days they visited the store) and the top 20 customers with the highest "buy_amount" (i.e., the total amount they spent in the store), along with their customer_id. If a customer appears in both tables, it will only appear once in the output.

S-040: You want to create data combining all shops and all products. Direct product the shop data (df_store) and the product data (df_product) and calculate the number of cases.

In [ ]:

%%sql

SELECT
  COUNT(1)
FROM 
  store
CROSS JOIN 
  product
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

count
531590

Commentary :

This SQL code retrieves a count of the number of rows in a result set generated by performing a Cartesian product between the "store" and "product" tables.

The code starts with a SELECT statement that specifies the COUNT function and the value to be counted is "1", which means to count the number of rows.

The FROM clause specifies two tables to be used in the query - "store" and "product". However, instead of joining the two tables together based on some matching criteria, the CROSS JOIN keyword is used. This means that every row from the "store" table will be paired with every row from the "product" table, resulting in a Cartesian product.

The resulting output will be a single row with a single column containing the count of the number of rows in the Cartesian product. This count will be equal to the number of rows in the "store" table multiplied by the number of rows in the "product" table. This type of query is generally used to calculate the total number of possible combinations of data from two or more tables.

Data Science 100 Knocks (Structured Data Processing) - SQL

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

Data Science 100 Knocks (Structured Data Processing) - SQL part1 (Q1 to Q20)

Data Science 100 Knocks (Structured Data Processing) - SQL part2 (Q21 to Q40)

Data Science 100 Knocks (Structured Data Processing) - SQL part3 (Q41 to Q60)

Data Science 100 Knocks (Structured Data Processing) - SQL part4 (Q61 to Q80)

Data Science 100 Knocks (Structured Data Processing) - SQL part5 (Q81 to Q100)

Data Science 100 Knocks (Structured Data Processing)

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, SQL, R, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.