参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック(構造化データ加工編)」
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
-
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers.
-
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
You can get preprocess_knock files to practice python, SQL, R from my github account
Introduction
- The database is PostgreSQL13
- First, execute the following cell
- You can issue SQL by writing %%sql in the cell
- You cannot check the table structure with the describe command from jupyter, so if you want to check the table structure, use a SELECT with a limit, for example.
- You can use any familiar SQL client (connection information is as follows)
- IP address: localhost for Docker Desktop, 192.168.99.100 for Docker toolbox
- Port:5432
- Database name: dsdojo_db
- User name: padawan
- Password: padawan12345
- It is recommended to limit the number of outputs, as Jupyter may freeze if a large amount of outputs are performed (the number of outputs is also mentioned in the question).
- Data processing also requires the ability to control the amount of results to be displayed for confirmation and to make light work of them.
- If a large number of results are output, the file may become so heavy that it cannot be opened afterwards.
- In this case, the results of your work will be lost, but you can retrieve the file from GitHub.
- You can also delete large output ranges using a vim editor, for example.
- Names, addresses etc. are dummy data and do not exist.
%load_ext sql
import os
pgconfig = {
'host': 'db',
'port': os.environ['PG_PORT'],
'database': os.environ['PG_DATABASE'],
'user': os.environ['PG_USER'],
'password': os.environ['PG_PASSWORD'],
}
dsl = 'postgresql://{user}:{password}@{host}:{port}/{database}'.format(**pgconfig)
# Set up Magic commands to write SQL
%sql $dsl
'Connected: padawan@dsdojo_db'
How to use
- You can execute SQL from Jupyter against PostgreSQL by writing %%sql at the beginning of the cell and SQL in the second and subsequent lines.
%%sql
SELECT 'You can run it like this' AS sample;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
sample |
---|
You can run it like this |
100 data processing knocks
S-021: Count the number of cases against the receipt details data (df_receipt).
%%sql
-- Code Example 1
SELECT
COUNT(1)
FROM
receipt
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
count |
---|
104681 |
Commentary :
The code is a SQL query that selects the count of rows from a table called "receipt".
The "SELECT" keyword is used to indicate that we want to retrieve data from the database.
The "COUNT" function is used to count the number of rows in the "receipt" table. In this case, the "COUNT(1)" statement is used to count the number of rows in the table where the value in the first column of each row is not null. This is equivalent to using "COUNT(*)", which counts all rows in the table regardless of whether any columns are null.
The "FROM" keyword is used to indicate the table we want to retrieve data from, in this case, the "receipt" table.
So, in summary, the query selects and returns the count of rows in the "receipt" table.
%%sql
-- Code example 2 (* is also acceptable).
SELECT
COUNT(*)
FROM
receipt
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
count |
---|
104681 |
Commentary :
The code is a SQL query that selects the count of rows from a table called "receipt".
The "SELECT" keyword is used to indicate that we want to retrieve data from the database.
The "COUNT" function is used to count the number of rows in the "receipt" table. In this case, the "COUNT(*)" statement is used to count all rows in the table, regardless of whether any columns are null or not.
The "FROM" keyword is used to indicate the table we want to retrieve data from, in this case, the "receipt" table.
So, in summary, the query selects and returns the count of all rows in the "receipt" table, including those with null values in any columns.
S-022: Count the number of unique cases against the customer ID (customer_id) of the receipt details data (df_receipt).
%%sql
SELECT
COUNT(DISTINCT customer_id)
FROM
receipt
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
count |
---|
8307 |
Commentary :
This code is a SQL query that selects the count of distinct customer IDs from a table called "receipt".
The "%%sql" is a Jupyter Notebook cell magic command to indicate that this code is SQL code.
The "SELECT" keyword is used to indicate that we want to retrieve data from the database.
The "COUNT" function is used to count the number of distinct customer IDs in the "receipt" table. In this case, the "COUNT(DISTINCT customer_id)" statement is used to count the number of unique customer IDs in the table.
The "DISTINCT" keyword is used to specify that only unique values of the "customer_id" column should be counted.
The "FROM" keyword is used to indicate the table we want to retrieve data from, in this case, the "receipt" table.
So, in summary, the query selects and returns the count of unique customer IDs in the "receipt" table.
S-023: Sum the sales amount (amount) and sales quantity (quantity) for each shop code (store_cd) in the receipt details data (df_receipt).
%%sql
SELECT
store_cd,
SUM(amount) AS amount,
SUM(quantity) AS quantity
FROM
receipt
Group By
store_cd
;
* postgresql://padawan:***@db:5432/dsdojo_db 52 rows affected.
store_cd | amount | quantity |
---|---|---|
S12007 | 638761 | 2099 |
S13017 | 748221 | 2376 |
S13043 | 587895 | 1881 |
S13052 | 100314 | 250 |
S13016 | 793773 | 2432 |
S14027 | 714550 | 2303 |
S13009 | 808870 | 2486 |
S14022 | 651328 | 2047 |
S13019 | 827833 | 2541 |
S13039 | 611888 | 1981 |
S14046 | 412646 | 1354 |
S13003 | 764294 | 2197 |
S14028 | 786145 | 2458 |
S14045 | 458484 | 1398 |
S13002 | 727821 | 2340 |
S14042 | 534689 | 1935 |
S13004 | 779373 | 2390 |
S13038 | 708884 | 2337 |
S12014 | 725167 | 2358 |
S14021 | 699511 | 2231 |
S14006 | 712839 | 2284 |
S13001 | 811936 | 2347 |
S14023 | 727630 | 2258 |
S14025 | 755581 | 2394 |
S13035 | 715869 | 2219 |
S14048 | 234276 | 769 |
S14012 | 720600 | 2412 |
S14024 | 736323 | 2417 |
S13041 | 728266 | 2233 |
S14026 | 824537 | 2503 |
S14010 | 790361 | 2290 |
S13015 | 780873 | 2248 |
S13008 | 809288 | 2491 |
S14049 | 230808 | 788 |
S14050 | 167090 | 580 |
S13031 | 705968 | 2336 |
S12013 | 787513 | 2425 |
S13044 | 520764 | 1729 |
S14033 | 725318 | 2282 |
S12029 | 794741 | 2555 |
S13037 | 693087 | 2344 |
S12030 | 684402 | 2403 |
S13032 | 790501 | 2491 |
S14034 | 653681 | 2024 |
S14040 | 701858 | 2233 |
S13005 | 629876 | 2004 |
S14047 | 338329 | 1041 |
S14011 | 805724 | 2434 |
S13018 | 790535 | 2562 |
S13020 | 796383 | 2383 |
S13051 | 107452 | 354 |
S14036 | 203694 | 635 |
Commentary :
SQL code is repeated twice.
The code selects three columns, "store_cd", "amount", and "quantity" from a table called "receipt". The table contains sales data for a store.
The "SUM" function is used to calculate the sum of the "amount" and "quantity" columns for each store.
The "GROUP BY" clause is used to group the sales data by store. This means that the query will return the sum of "amount" and "quantity" for each unique store code in the "store_cd" column.
The first occurrence of the code might be a duplication or a mistake, as it is the exact same as the second occurrence. Nonetheless, the query selects and returns the total amount and quantity of products sold for each store in the "receipt" table.
S-024: Find the most recent sales date (sales_ymd) for each customer ID (customer_id) for the receipt details data (df_receipt) and display the 10 most recent sales dates (sales_ymd).
%%sql
SELECT
customer_id,
MAX(sales_ymd)
FROM
receipt
GROUP BY
customer_id
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | max |
---|---|
CS001311000059 | 20180211 |
CS004614000122 | 20181228 |
CS003512000043 | 20180106 |
CS011615000061 | 20190503 |
CS029212000033 | 20180621 |
CS007515000119 | 20190511 |
CS034515000123 | 20190708 |
CS004315000058 | 20170517 |
CS026414000014 | 20190720 |
CS001615000099 | 20170729 |
Commentary :
This code is a SQL query that selects the maximum sales date (sales_ymd) for each customer from a table called "receipt".
The "%%sql" is a Jupyter Notebook cell magic command to indicate that this code is SQL code.
The "SELECT" keyword is used to indicate that we want to retrieve data from the database.
The "customer_id" and "MAX(sales_ymd)" are the columns that we want to select from the "receipt" table. The "MAX" function is used to select the maximum value of the "sales_ymd" column for each unique customer ID in the "customer_id" column.
The "FROM" keyword is used to indicate the table we want to retrieve data from, in this case, the "receipt" table.
The "GROUP BY" clause is used to group the sales data by customer ID. This means that the query will return the maximum sales date for each unique customer ID in the "customer_id" column.
The "LIMIT" keyword is used to limit the number of results returned by the query. In this case, the "LIMIT 10" statement limits the results to the top 10 rows.
So, in summary, the query selects and returns the maximum sales date for each customer from the "receipt" table and limits the results to the top 10 customers.
S-025: For receipt details data (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display the 10 results.
%%sql
SELECT
customer_id,
MIN(sales_ymd)
FROM
receipt
GROUP BY
customer_id
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | min |
---|---|
CS001311000059 | 20180211 |
CS004614000122 | 20181228 |
CS003512000043 | 20180106 |
CS011615000061 | 20190503 |
CS029212000033 | 20170318 |
CS007515000119 | 20170201 |
CS034515000123 | 20170527 |
CS004315000058 | 20170517 |
CS026414000014 | 20170718 |
CS001615000099 | 20170729 |
Commentary :
This code is a SQL query that selects the minimum sales date (sales_ymd) for each customer from a table called "receipt".
The "%%sql" is a Jupyter Notebook cell magic command to indicate that this code is SQL code.
The "SELECT" keyword is used to indicate that we want to retrieve data from the database.
The "customer_id" and "MIN(sales_ymd)" are the columns that we want to select from the "receipt" table. The "MIN" function is used to select the minimum value of the "sales_ymd" column for each unique customer ID in the "customer_id" column.
The "FROM" keyword is used to indicate the table we want to retrieve data from, in this case, the "receipt" table.
The "GROUP BY" clause is used to group the sales data by customer ID. This means that the query will return the minimum sales date for each unique customer ID in the "customer_id" column.
The "LIMIT" keyword is used to limit the number of results returned by the query. In this case, the "LIMIT 10" statement limits the results to the top 10 rows.
So, in summary, the query selects and returns the minimum sales date for each customer from the "receipt" table and limits the results to the top 10 customers.
S-026: For receipt details data (df_receipt), find the newest sales date (sales_ymd) and the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 cases where both are different.
%%sql
SELECT
customer_id,
MAX(sales_ymd),
MIN(sales_ymd)
FROM
receipt
GROUP BY
customer_id
HAVING
MAX(sales_ymd) != MIN(sales_ymd)
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | max | min |
---|---|---|
CS029212000033 | 20180621 | 20170318 |
CS007515000119 | 20190511 | 20170201 |
CS034515000123 | 20190708 | 20170527 |
CS026414000014 | 20190720 | 20170718 |
CS010515000082 | 20181204 | 20180518 |
CS019315000045 | 20170920 | 20170423 |
CS008513000099 | 20190308 | 20170722 |
CS007615000070 | 20191025 | 20170929 |
CS025415000155 | 20191026 | 20170314 |
CS016414000063 | 20190617 | 20170109 |
Commentary :
This code is a SQL query that selects the customer IDs, maximum sales date, and minimum sales date from a table called "receipt".
The "%%sql" is a Jupyter Notebook cell magic command to indicate that this code is SQL code.
The "SELECT" keyword is used to indicate that we want to retrieve data from the database.
The "customer_id", "MAX(sales_ymd)", and "MIN(sales_ymd)" are the columns that we want to select from the "receipt" table. The "MAX" function is used to select the maximum value of the "sales_ymd" column for each unique customer ID in the "customer_id" column, and the "MIN" function is used to select the minimum value of the "sales_ymd" column for each unique customer ID in the "customer_id" column.
The "FROM" keyword is used to indicate the table we want to retrieve data from, in this case, the "receipt" table.
The "GROUP BY" clause is used to group the sales data by customer ID. This means that the query will return the maximum and minimum sales dates for each unique customer ID in the "customer_id" column.
The "HAVING" clause is used to filter the results based on a condition. In this case, the condition is that the maximum sales date and minimum sales date should not be equal for a given customer ID. This means that the query will only return customers who have made purchases on different dates.
The "LIMIT" keyword is used to limit the number of results returned by the query. In this case, the "LIMIT 10" statement limits the results to the top 10 rows.
So, in summary, the query selects and returns the customer IDs, maximum sales date, and minimum sales date from the "receipt" table for customers who have made purchases on different dates and limits the results to the top 10 customers.
S-027: For the receipt details data (df_receipt), calculate the average of the sales amount (amount) for each shop code (store_cd) and display the top 5 in descending order.
%%sql
SELECT
store_cd,
AVG(amount) AS avg_amount
FROM
receipt
GROUP BY
store_cd
ORDER BY
avg_amount DESC
LIMIT
5
;
* postgresql://padawan:***@db:5432/dsdojo_db 5 rows affected.
store_cd | avg_amount |
---|---|
S13052 | 402.8674698795180723 |
S13015 | 351.1119604316546763 |
S13003 | 350.9155188246097337 |
S14010 | 348.7912621359223301 |
S13001 | 348.4703862660944206 |
Commentary :
This code is a SQL query that selects the store codes and their average sales amount from a table called "receipt".
The "%%sql" is a Jupyter Notebook cell magic command to indicate that this code is SQL code.
The "SELECT" keyword is used to indicate that we want to retrieve data from the database.
The "store_cd" and "AVG(amount) AS avg_amount" are the columns that we want to select from the "receipt" table. The "AVG" function is used to calculate the average sales amount for each unique store code in the "store_cd" column. The "AS" keyword is used to give the column an alias, "avg_amount", which is the name that will be displayed in the query results.
The "FROM" keyword is used to indicate the table we want to retrieve data from, in this case, the "receipt" table.
The "GROUP BY" clause is used to group the sales data by store code. This means that the query will calculate the average sales amount for each unique store code in the "store_cd" column.
The "ORDER BY" clause is used to sort the results in descending order based on the average sales amount. This means that the stores with the highest average sales amount will be displayed at the top of the results.
The "LIMIT" keyword is used to limit the number of results returned by the query. In this case, the "LIMIT 5" statement limits the results to the top 5 stores.
So, in summary, the query selects and returns the store codes and their average sales amount from the "receipt" table, sorted in descending order based on the average sales amount, and limits the results to the top 5 stores.
S-028: Calculate the median sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display the TOP 5 in descending order.
%%sql
SELECT
store_cd,
PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY amount) AS amount_50per
FROM
receipt
GROUP BY
store_cd
ORDER BY
amount_50per DESC
LIMIT
5
;
* postgresql://padawan:***@db:5432/dsdojo_db 5 rows affected.
store_cd | amount_50per |
---|---|
S13052 | 190.0 |
S14010 | 188.0 |
S14050 | 185.0 |
S13003 | 180.0 |
S13018 | 180.0 |
Commentary :
This is a SQL query that retrieves data from a table called "receipt".
The query selects two columns from the table: "store_cd" and "amount_50per". The "store_cd" column represents the store code, and the "amount_50per" column calculates the median amount of money spent per receipt within each store.
The function used to calculate the median is "PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY amount)". This function orders the "amount" column in ascending order and then calculates the value at the 50th percentile. In other words, it finds the middle value of the ordered list of amounts.
The query groups the results by store code using the "GROUP BY" clause, and orders the results in descending order of the median amount using the "ORDER BY" clause. It then limits the output to the top 5 results using the "LIMIT" clause.
So, the query returns the top 5 stores with the highest median amount spent per receipt.
S-029: For receipt details data (df_receipt), calculate the mode value of the product code (product_cd) for each shop code (store_cd) and display 10 cases.
%%sql
-- Code example 1: tabulating mode with window and analytic functions
WITH product_cnt AS (
SELECT
store_cd,
product_cd,
COUNT(1) AS mode_cnt
FROM
receipt
GROUP BY
store_cd,
product_cd
),
product_mode AS (
SELECT
store_cd,
product_cd,
mode_cnt,
RANK() OVER(PARTITION BY store_cd ORDER BY mode_cnt DESC) AS rnk
FROM
product_cnt
)
SELECT
store_cd,
product_cd,
mode_cnt
FROM
product_mode
WHERE
rnk = 1
ORDER BY
store_cd,
product_cd
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
store_cd | product_cd | mode_cnt |
---|---|---|
S12007 | P060303001 | 72 |
S12013 | P060303001 | 107 |
S12014 | P060303001 | 65 |
S12029 | P060303001 | 92 |
S12030 | P060303001 | 115 |
S13001 | P060303001 | 67 |
S13002 | P060303001 | 78 |
S13003 | P071401001 | 65 |
S13004 | P060303001 | 88 |
S13005 | P040503001 | 36 |
Commentary :
This is a SQL query that retrieves data from a table called "receipt".
The query uses a Common Table Expression (CTE) called "product_cnt" to calculate the number of times each product has been sold at each store. It groups the data by "store_cd" and "product_cd" using the "GROUP BY" clause and counts the number of rows in each group using the "COUNT(1)" function. The results are stored in the "mode_cnt" column.
The query then uses another CTE called "product_mode" to rank the "mode_cnt" values for each store's products using the "RANK() OVER(PARTITION BY store_cd ORDER BY mode_cnt DESC)" function. This function assigns a rank to each product based on its "mode_cnt" value, within each store. The products with the highest mode count are assigned a rank of 1.
Finally, the query selects the top-selling product for each store using the "WHERE rnk = 1" clause to filter the results, and orders them by store and product code using the "ORDER BY" clause. It returns the store code, product code, and mode count for the top-selling product at each store, and limits the output to the first 10 rows using the "LIMIT" clause.
In summary, this query finds the top-selling product at each store, based on the number of times it has been sold, and returns the store code, product code, and mode count for each of the top 10 products.
%%sql
-- Code example 2: simple case of using MODE() (fast, but only one mode is selected if there is more than one mode)
SELECT
store_cd,
MODE() WITHIN GROUP(ORDER BY product_cd)
FROM
receipt
GROUP BY
store_cd
ORDER BY
store_cd
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
store_cd | mode |
---|---|
S12007 | P060303001 |
S12013 | P060303001 |
S12014 | P060303001 |
S12029 | P060303001 |
S12030 | P060303001 |
S13001 | P060303001 |
S13002 | P060303001 |
S13003 | P071401001 |
S13004 | P060303001 |
S13005 | P040503001 |
Commentary :
This is a SQL query that retrieves data from a table called "receipt".
The query selects two columns: "store_cd" and the mode of the "product_cd" column for each store.
The "MODE() WITHIN GROUP(ORDER BY product_cd)" function calculates the mode of the "product_cd" values within each group of "store_cd". The mode is the most frequently occurring value in a set of data. The "ORDER BY" clause specifies that the "product_cd" values should be ordered in ascending order before calculating the mode.
The query groups the data by "store_cd" using the "GROUP BY" clause and orders the results by "store_cd" using the "ORDER BY" clause. It limits the output to the first 10 rows using the "LIMIT" clause.
So, the query returns the mode of the "product_cd" values for each store, which represents the most commonly sold product at each store. The results are ordered by store code and limited to the top 10 stores.
S-030: Calculate the variance of the sales amount (amount) for each shop code (store_cd) against the receipt details data (df_receipt) and display five cases in descending order.
%%sql
SELECT
store_cd,
VAR_POP(amount) AS vars_amount
FROM
receipt
GROUP BY
store_cd
ORDER BY
vars_amount DESC
LIMIT
5
;
* postgresql://padawan:***@db:5432/dsdojo_db 5 rows affected.
store_cd | vars_amount |
---|---|
S13052 | 440088.701311269173 |
S14011 | 306314.558163888889 |
S14034 | 296920.081011283873 |
S13001 | 295431.993329035348 |
S13015 | 295294.361115940880 |
Commentary :
This is a SQL query that retrieves data from a table called "receipt".
The query selects two columns: "store_cd" and the population variance of the "amount" column for each store.
The "VAR_POP(amount)" function calculates the variance of the "amount" values within each group of "store_cd". The variance is a measure of the spread or dispersion of a set of data. The "POP" in "VAR_POP" stands for "population", which means that the entire dataset is used to calculate the variance.
The query groups the data by "store_cd" using the "GROUP BY" clause and orders the results by the "vars_amount" column, which contains the variance of the "amount" values for each store, in descending order using the "ORDER BY" clause. It limits the output to the first 5 rows using the "LIMIT" clause.
So, the query returns the top 5 stores with the highest population variance in the "amount" values, which means that these stores have the highest spread or dispersion of the amount spent per receipt.
S-031: Calculate the standard deviation of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display 5 cases in descending order.
%%sql
SELECT
store_cd,
STDDEV_POP(amount) as stds_amount
FROM
receipt
GROUP BY
store_cd
ORDER BY
stds_amount DESC
LIMIT
5
;
* postgresql://padawan:***@db:5432/dsdojo_db 5 rows affected.
store_cd | stds_amount |
---|---|
S13052 | 663.391815830787 |
S14011 | 553.456916267101 |
S14034 | 544.903735545357 |
S13001 | 543.536561170484 |
S13015 | 543.409938366921 |
Commentary :
This is a SQL query that retrieves data from a table called "receipt".
The query selects two columns: "store_cd" and the population standard deviation of the "amount" column for each store.
The "STDDEV_POP(amount)" function calculates the standard deviation of the "amount" values within each group of "store_cd". The standard deviation is a measure of the spread or dispersion of a set of data. The "POP" in "STDDEV_POP" stands for "population", which means that the entire dataset is used to calculate the standard deviation.
The query groups the data by "store_cd" using the "GROUP BY" clause and orders the results by the "stds_amount" column, which contains the standard deviation of the "amount" values for each store, in descending order using the "ORDER BY" clause. It limits the output to the first 5 rows using the "LIMIT" clause.
So, the query returns the top 5 stores with the highest population standard deviation in the "amount" values, which means that these stores have the highest spread or dispersion of the amount spent per receipt.
S-032: Find the percentile values for the sales amount (amount) in the receipt details data (df_receipt) in 25% increments.
%%sql
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP(ORDER BY amount) AS amount_25per,
PERCENTILE_CONT(0.50) WITHIN GROUP(ORDER BY amount) AS amount_50per,
PERCENTILE_CONT(0.75) WITHIN GROUP(ORDER BY amount) AS amount_75per,
PERCENTILE_CONT(1.0) WITHIN GROUP(ORDER BY amount) AS amount_100per
FROM
receipt
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
amount_25per | amount_50per | amount_75per | amount_100per |
---|---|---|---|
102.0 | 170.0 | 288.0 | 10925.0 |
Commentary :
This is SQL code used to calculate percentile values for the "amount" column in the "receipt" table. The code uses the PERCENTILE_CONT function to compute continuous percentiles, which means that the percentile values will correspond to actual values in the data set rather than interpolated values between data points.
The code specifies four percentile values to be computed: the 25th percentile (amount_25per), the 50th percentile (amount_50per), the 75th percentile (amount_75per), and the 100th percentile (amount_100per). These percentiles are calculated using the WITHIN GROUP clause to order the "amount" column in ascending order before computing the percentiles.
The SELECT statement at the end of the code retrieves the calculated percentile values from the query and returns them as a single row of output. This code is useful for analyzing the distribution of the "amount" column in the "receipt" table and identifying any outliers or unusual patterns in the data.
S-033: Calculate the average of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and extract those that are 330 or more.
%%sql
SELECT
store_cd,
AVG(amount) AS avg_amount
FROM
receipt
GROUP BY
store_cd
HAVING
AVG(amount) >= 330
;
* postgresql://padawan:***@db:5432/dsdojo_db 13 rows affected.
store_cd | avg_amount |
---|---|
S13052 | 402.8674698795180723 |
S13019 | 330.2086158755484643 |
S13003 | 350.9155188246097337 |
S14045 | 330.0820734341252700 |
S13004 | 330.9439490445859873 |
S13001 | 348.4703862660944206 |
S14026 | 332.3405884723901653 |
S14010 | 348.7912621359223301 |
S13015 | 351.1119604316546763 |
S12013 | 330.1941299790356394 |
S14047 | 330.0770731707317073 |
S14011 | 335.7183333333333333 |
S13020 | 337.8799321170980059 |
Commentary :
This is SQL code used to retrieve average transaction amounts for each store in the "receipt" table and filter the results based on a condition.
The code begins with a SELECT statement that specifies two columns to be retrieved: the "store_cd" column and the average of the "amount" column for each store. The AVG function is used to calculate the average of the "amount" column for each store, and the AS keyword is used to assign the alias "avg_amount" to this calculated column.
The next line of code uses the GROUP BY clause to group the data by the "store_cd" column, which means that the average transaction amount will be calculated separately for each store.
The code then uses the HAVING clause to filter the results based on a condition. Specifically, the condition requires that the average transaction amount for each store be greater than or equal to 330. Any stores that do not meet this condition will not be included in the output.
The resulting output will be a table with two columns: "store_cd" and "avg_amount". Each row will correspond to a store in the "receipt" table that meets the condition specified in the HAVING clause, and the "avg_amount" column will contain the average transaction amount for that store. This code is useful for analyzing the performance of different stores based on the average transaction amount.
S-034: For the receipt details data (df_receipt), sum the amount of sales (amount) for each customer ID (customer_id) and find the average for all customers. However, exclude customer IDs starting with “Z” as they represent non-members.
%%sql
WITH customer_amount AS (
SELECT
customer_id,
SUM(amount) AS sum_amount
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
GROUP BY
customer_id
)
SELECT
AVG(sum_amount)
FROM
customer_amount
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
avg |
---|
2547.7422345292559595 |
Commentary :
This is SQL code that calculates the average of the total amount spent by each customer in the "receipt" table, excluding customers whose IDs start with the letter "Z".
The code starts with a common table expression (CTE) defined using the WITH clause. The CTE is named "customer_amount" and it contains a SELECT statement that retrieves the "customer_id" column and the sum of the "amount" column for each customer in the "receipt" table. The WHERE clause filters out any customers whose IDs start with the letter "Z" using the NOT LIKE operator. The results are grouped by the "customer_id" column, and the resulting table has two columns: "customer_id" and "sum_amount", which is the total amount spent by each customer.
The next line of code retrieves the average of the "sum_amount" column from the "customer_amount" CTE. The SELECT statement uses the AVG function to calculate the average of the "sum_amount" column, which represents the average total amount spent by each customer whose ID does not start with "Z".
The resulting output will be a single value representing the average of the "sum_amount" column from the "customer_amount" CTE. This code is useful for analyzing the average spending patterns of customers and identifying any outliers or unusual patterns in the data.
S-035: For the receipt details data (df_receipt), sum the sales amount (amount) for each customer ID (customer_id) to obtain the average of all customers, extract the customers who spend more than the average and display 10 items. However, exclude customer IDs starting with “Z” as they represent non-members.
%%sql
WITH customer_amount AS (
SELECT
customer_id,
SUM(amount) AS sum_amount
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
GROUP BY
customer_id
)
SELECT
customer_id,
sum_amount
FROM
customer_amount
WHERE
sum_amount >= (
SELECT
AVG(sum_amount)
FROM
customer_amount
)
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | sum_amount |
---|---|
CS029212000033 | 3604 |
CS007515000119 | 7157 |
CS034515000123 | 3699 |
CS026414000014 | 6671 |
CS007615000070 | 2975 |
CS016414000063 | 6207 |
CS012514000018 | 2562 |
CS029515000142 | 3420 |
CS015215000021 | 3090 |
CS039814000011 | 8031 |
Commentary :
This is SQL code that retrieves the "customer_id" and "sum_amount" columns for the top 10 customers who have spent more than the average total amount spent by customers in the "receipt" table, excluding customers whose IDs start with the letter "Z".
The code starts with a common table expression (CTE) defined using the WITH clause. The CTE is named "customer_amount" and it contains a SELECT statement that retrieves the "customer_id" column and the sum of the "amount" column for each customer in the "receipt" table. The WHERE clause filters out any customers whose IDs start with the letter "Z" using the NOT LIKE operator. The results are grouped by the "customer_id" column, and the resulting table has two columns: "customer_id" and "sum_amount", which is the total amount spent by each customer.
The next line of code retrieves the "customer_id" and "sum_amount" columns from the "customer_amount" CTE, but only for customers whose "sum_amount" is greater than or equal to the average "sum_amount" calculated over all customers in the CTE. This is achieved using a subquery that calculates the average "sum_amount" using the AVG function and then filters the "customer_amount" CTE using the WHERE clause.
The results of this query are then limited to the top 10 rows using the LIMIT clause.
The resulting output will be a table with two columns: "customer_id" and "sum_amount". Each row will correspond to a customer who has spent more than the average total amount spent by customers in the "receipt" table, excluding customers whose IDs start with the letter "Z". The table will be sorted in descending order based on the "sum_amount" column, and only the top 10 rows will be included in the output. This code is useful for identifying high-spending customers and analyzing their spending patterns.
S-036: Combine receipt details data (df_receipt) and shop data (df_store) internally and display all items of receipt details data and store name (store_name) of shop data for 10 items.
%%sql
SELECT
r.*,
s.store_name
FROM
receipt r
JOIN
store s
ON
r.store_cd = s.store_cd
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
sales_ymd | sales_epoch | store_cd | receipt_no | receipt_sub_no | customer_id | product_cd | quantity | amount | store_name |
---|---|---|---|---|---|---|---|---|---|
20181103 | 1541203200 | S14006 | 112 | 1 | CS006214000001 | P070305012 | 1 | 158 | 葛が谷店 |
20181118 | 1542499200 | S13008 | 1132 | 2 | CS008415000097 | P070701017 | 1 | 81 | 成城店 |
20170712 | 1499817600 | S14028 | 1102 | 1 | CS028414000014 | P060101005 | 1 | 170 | 二ツ橋店 |
20190205 | 1549324800 | S14042 | 1132 | 1 | ZZ000000000000 | P050301001 | 1 | 25 | 新山下店 |
20180821 | 1534809600 | S14025 | 1102 | 2 | CS025415000050 | P060102007 | 1 | 90 | 大和店 |
20190605 | 1559692800 | S13003 | 1112 | 1 | CS003515000195 | P050102002 | 1 | 138 | 狛江店 |
20181205 | 1543968000 | S14024 | 1102 | 2 | CS024514000042 | P080101005 | 1 | 30 | 三田店 |
20190922 | 1569110400 | S14040 | 1102 | 1 | CS040415000178 | P070501004 | 1 | 128 | 長津田店 |
20170504 | 1493856000 | S13020 | 1112 | 2 | ZZ000000000000 | P071302010 | 1 | 770 | 十条仲原店 |
20191010 | 1570665600 | S14027 | 1102 | 1 | CS027514000015 | P071101003 | 1 | 680 | 南藤沢店 |
Commentary :
This is SQL code that retrieves the first 10 rows from a joined table containing data from the "receipt" and "store" tables. The resulting table will contain all columns from the "receipt" table, plus an additional column "store_name" from the "store" table.
The code starts with a SELECT statement that specifies the columns to retrieve. In this case, the SELECT statement uses the wildcard (*) to retrieve all columns from the "receipt" table. The SELECT statement also specifies the "store_name" column from the "store" table, which will be included in the output table.
The FROM clause specifies that the data is being retrieved from the "receipt" table, and the JOIN clause specifies that the "store" table is being joined to the "receipt" table using the "store_cd" column as the join condition. This means that rows from the "receipt" table and the "store" table will be combined into a single table based on matching values in the "store_cd" column.
The resulting output will be a table containing all columns from the "receipt" table, plus an additional column "store_name" from the "store" table. The table will be limited to the first 10 rows using the LIMIT clause. This code is useful for analyzing data from two related tables and combining the data into a single table for analysis.
S-037: Join product data (df_product) and category data (df_category) internally and display all items of the product data and 10 category sub-category names (category_small_name) of the category data.
%%sql
SELECT
p.*,
c.category_small_name
FROM
product p
JOIN
category c
ON
p.category_small_cd = c.category_small_cd
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_small_name |
---|---|---|---|---|---|---|
P040101001 | 04 | 0401 | 040101 | 198 | 149 | 弁当類 |
P040101002 | 04 | 0401 | 040101 | 218 | 164 | 弁当類 |
P040101003 | 04 | 0401 | 040101 | 230 | 173 | 弁当類 |
P040101004 | 04 | 0401 | 040101 | 248 | 186 | 弁当類 |
P040101005 | 04 | 0401 | 040101 | 268 | 201 | 弁当類 |
P040101006 | 04 | 0401 | 040101 | 298 | 224 | 弁当類 |
P040101007 | 04 | 0401 | 040101 | 338 | 254 | 弁当類 |
P040101008 | 04 | 0401 | 040101 | 420 | 315 | 弁当類 |
P040101009 | 04 | 0401 | 040101 | 498 | 374 | 弁当類 |
P040101010 | 04 | 0401 | 040101 | 580 | 435 | 弁当類 |
Commentary :
This code is written in SQL, a programming language used for managing and querying relational databases.
The code selects data from two tables: "product" and "category". The "JOIN" keyword is used to combine the rows from both tables where the "category_small_cd" column values match. This creates a new table that contains columns from both tables.
The "SELECT" keyword is used to choose which columns to include in the output. In this case, it selects all columns from the "product" table (specified by the "p.*" syntax) and the "category_small_name" column from the "category" table.
The "LIMIT" keyword limits the output to the first 10 rows. This is useful when working with large datasets to quickly test queries or get a sample of the data.
So, this code will return the first 10 rows of the combined "product" and "category" tables, where the "category_small_cd" values match, and will include all columns from the "product" table and the "category_small_name" column from the "category" table.
S-038: Calculate the total sales amount for each customer from the customer data (df_customer) and receipt details data (df_receipt) and display 10 items. However, for customers with no sales records, the sales amount should be displayed as 0. Customers whose gender code (gender_cd) is female (1) should be included, and non-members (whose customer ID starts with “Z”) should be excluded.
%%sql
WITH customer_amount AS (
SELECT
customer_id,
SUM(amount) AS sum_amount
FROM
receipt
GROUP BY
customer_id
),
customer_data AS (
SELECT
customer_id
FROM
customer
WHERE
gender_cd = '1'
AND
customer_id NOT LIKE 'Z%'
)
SELECT
c.customer_id,
COALESCE(a.sum_amount, 0)
FROM
customer_data c
LEFT OUTER JOIN
customer_amount a
ON
c.customer_id = a.customer_id
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | coalesce |
---|---|
CS021313000114 | 0 |
CS031415000172 | 5088 |
CS028811000001 | 0 |
CS001215000145 | 875 |
CS015414000103 | 3122 |
CS033513000180 | 868 |
CS035614000014 | 0 |
CS011215000048 | 3444 |
CS009413000079 | 0 |
CS040412000191 | 210 |
Commentary :
This code is also written in SQL and is a bit more complex than the previous one. It selects data from three tables: "receipt", "customer" and "customer_amount".
The code uses a "WITH" clause to define two subqueries: "customer_amount" and "customer_data".
The "customer_amount" subquery aggregates the "amount" column from the "receipt" table by "customer_id" using the "SUM" function and gives an alias "sum_amount" to the result. This subquery will create a new table that contains two columns: "customer_id" and "sum_amount".
The "customer_data" subquery selects all rows from the "customer" table where "gender_cd" is equal to '1' and the "customer_id" does not start with the letter 'Z'. This subquery will create a new table that contains a single column: "customer_id".
The final query joins the "customer_data" table with the "customer_amount" table using a "LEFT OUTER JOIN" operation. This will return all rows from the "customer_data" table and matching rows from the "customer_amount" table, if any exist. If there is no match in the "customer_amount" table, the "COALESCE" function is used to replace the "sum_amount" value with 0.
The "SELECT" clause chooses two columns to include in the output: "customer_id" from the "customer_data" table and "COALESCE(a.sum_amount, 0)" from the "customer_amount" table (or 0 if there is no matching row in "customer_amount"). The "LIMIT" clause limits the output to the first 10 rows.
So, this code will return the first 10 rows of the "customer_data" table, along with the sum of their receipts (or 0 if they have no receipts), for male customers whose "customer_id" does not start with the letter 'Z'.
S-039: From the receipt details data (df_receipt), create a data set containing the top 20 customers with the highest number of days of sales and a data set containing the top 20 customers with the highest total sales value, respectively, and furthermore, merge these two completely externally. However, exclude non-members (whose customer ID starts with “Z”).
%%sql
WITH customer_data AS (
SELECT
customer_id,
sales_ymd,
amount
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
),
customer_days AS (
SELECT
customer_id,
COUNT(DISTINCT sales_ymd) come_days
FROM
customer_data
GROUP BY
customer_id
ORDER BY
come_days DESC
LIMIT
20
),
customer_amount AS (
SELECT
customer_id,
SUM(amount) buy_amount
FROM
customer_data
GROUP BY
customer_id
ORDER BY
buy_amount DESC
LIMIT
20
)
SELECT
COALESCE(d.customer_id, a.customer_id) customer_id,
d.come_days,
a.buy_amount
FROM
customer_days d
FULL OUTER JOIN
customer_amount a
ON
d.customer_id = a.customer_id
;
* postgresql://padawan:***@db:5432/dsdojo_db 34 rows affected.
customer_id | come_days | buy_amount |
---|---|---|
CS040214000008 | 23 | None |
CS015415000185 | 22 | 20153 |
CS010214000010 | 22 | 18585 |
CS028415000007 | 21 | 19127 |
CS010214000002 | 21 | None |
CS017415000097 | 20 | 23086 |
CS016415000141 | 20 | 18372 |
CS021514000045 | 19 | None |
CS022515000226 | 19 | None |
CS031414000051 | 19 | 19202 |
CS039414000052 | 19 | None |
CS014214000023 | 19 | None |
CS021515000172 | 19 | None |
CS031414000073 | 18 | None |
CS007515000107 | 18 | None |
CS014415000077 | 18 | None |
CS021515000056 | 18 | None |
CS032415000209 | 18 | None |
CS021515000211 | 18 | None |
CS022515000028 | 18 | None |
CS011415000006 | None | 16094 |
CS016415000101 | None | 16348 |
CS030415000034 | None | 15468 |
CS021515000089 | None | 17580 |
CS034415000047 | None | 16083 |
CS006515000023 | None | 18372 |
CS038415000104 | None | 17847 |
CS015515000034 | None | 15300 |
CS032414000072 | None | 16563 |
CS011414000106 | None | 18338 |
CS001605000009 | None | 18925 |
CS009414000059 | None | 15492 |
CS035414000024 | None | 17615 |
CS007514000094 | None | 15735 |
Commentary :
This code is also written in SQL and is a bit more complex than the previous ones. It selects data from the "receipt" table and creates three subqueries: "customer_data", "customer_days", and "customer_amount".
The "customer_data" subquery selects rows from the "receipt" table where the "customer_id" does not start with the letter 'Z' and gives an alias to three columns: "customer_id", "sales_ymd", and "amount". This subquery will create a new table that contains three columns: "customer_id", "sales_ymd", and "amount".
The "customer_days" subquery aggregates the "sales_ymd" column from the "customer_data" subquery by "customer_id" using the "COUNT(DISTINCT sales_ymd)" function and gives an alias "come_days" to the result. It then orders the result by "come_days" in descending order and limits the output to the top 20 rows. This subquery will create a new table that contains two columns: "customer_id" and "come_days".
The "customer_amount" subquery aggregates the "amount" column from the "customer_data" subquery by "customer_id" using the "SUM" function and gives an alias "buy_amount" to the result. It then orders the result by "buy_amount" in descending order and limits the output to the top 20 rows. This subquery will create a new table that contains two columns: "customer_id" and "buy_amount".
The final query joins the "customer_days" table with the "customer_amount" table using a "FULL OUTER JOIN" operation. This will return all rows from both tables, matching rows from both tables, and NULL values where there is no match. The "COALESCE" function is used to replace the NULL values in "customer_id" with the corresponding value from the other table.
The "SELECT" clause chooses three columns to include in the output: "COALESCE(d.customer_id, a.customer_id)" to get the customer_id, "d.come_days" from the "customer_days" table, and "a.buy_amount" from the "customer_amount" table.
So, this code will return the top 20 customers with the highest "come_days" (i.e., the number of days they visited the store) and the top 20 customers with the highest "buy_amount" (i.e., the total amount they spent in the store), along with their customer_id. If a customer appears in both tables, it will only appear once in the output.
S-040: You want to create data combining all shops and all products. Direct product the shop data (df_store) and the product data (df_product) and calculate the number of cases.
%%sql
SELECT
COUNT(1)
FROM
store
CROSS JOIN
product
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
count |
---|
531590 |
Commentary :
This SQL code retrieves a count of the number of rows in a result set generated by performing a Cartesian product between the "store" and "product" tables.
The code starts with a SELECT statement that specifies the COUNT function and the value to be counted is "1", which means to count the number of rows.
The FROM clause specifies two tables to be used in the query - "store" and "product". However, instead of joining the two tables together based on some matching criteria, the CROSS JOIN keyword is used. This means that every row from the "store" table will be paired with every row from the "product" table, resulting in a Cartesian product.
The resulting output will be a single row with a single column containing the count of the number of rows in the Cartesian product. This count will be equal to the number of rows in the "store" table multiplied by the number of rows in the "product" table. This type of query is generally used to calculate the total number of possible combinations of data from two or more tables.
Comment