参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック（構造化データ加工編）」

The Data Scientist Society Github :

GitHub - The-Japan-DataScientist-Society/100knocks-preprocess: データサイエンス100本ノック（構造化データ加工編）

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Data Science 100 Knocks (Structured Data Processing) URL :

100knocks-preprocess/docker/work/answer/ans_preprocess_knock_SQL.ipynb at master · The-Japan-DataScientist-Society/100knocks-preprocess

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers.
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

You can get preprocess_knock files to practice python, SQL, R from my github account

ChanhiYasutomi - Repositories

ChanhiYasutomi has 20 repositories available. Follow their code on GitHub.

Introduction
How to use
100 data processing knocks

Introduction

The database is PostgreSQL13
First, execute the following cell
You can issue SQL by writing %%sql in the cell
You cannot check the table structure with the describe command from jupyter, so if you want to check the table structure, use a SELECT with a limit, for example.
You can use any familiar SQL client (connection information is as follows)
- IP address: localhost for Docker Desktop, 192.168.99.100 for Docker toolbox
- Port:5432
- Database name: dsdojo_db
- User name: padawan
- Password: padawan12345
It is recommended to limit the number of outputs, as Jupyter may freeze if a large amount of outputs are performed (the number of outputs is also mentioned in the question).
- Data processing also requires the ability to control the amount of results to be displayed for confirmation and to make light work of them.
If a large number of results are output, the file may become so heavy that it cannot be opened afterwards.
- In this case, the results of your work will be lost, but you can retrieve the file from GitHub.
- You can also delete large output ranges using a vim editor, for example.
Names, addresses etc. are dummy data and do not exist.

In [ ]:

%load_ext sql
import os

pgconfig = {
    'host': 'db',
    'port': os.environ['PG_PORT'],
    'database': os.environ['PG_DATABASE'],
    'user': os.environ['PG_USER'],
    'password': os.environ['PG_PASSWORD'],
}
dsl = 'postgresql://{user}:{password}@{host}:{port}/{database}'.format(**pgconfig)

# Set up Magic commands to write SQL
%sql $dsl

Out[ ]:

'Connected: padawan@dsdojo_db'

How to use

You can execute SQL from Jupyter against PostgreSQL by writing %%sql at the beginning of the cell and SQL in the second and subsequent lines.

In [ ]:

%%sql
SELECT 'You can run it like this' AS sample;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

sample
You can run it like this

100 data processing knocks

S-041: Sum up the sales amount (amount) of the receipt details data (df_receipt) by date (sales_ymd) and calculate the increase or decrease in sales amount from the date of the last sale. And display 10 results.

In [ ]:

%%sql

WITH sales_amount_by_date AS (
    SELECT
      sales_ymd,
      SUM(amount) AS amount
    FROM 
      receipt
    GROUP BY
      sales_ymd
),
sales_amount_by_date_with_lag as (
    SELECT
      sales_ymd,
      LAG(sales_ymd, 1) OVER(ORDER BY sales_ymd) lag_ymd,
      amount,
      LAG(amount, 1) OVER(ORDER BY sales_ymd) AS lag_amount
    FROM 
      sales_amount_by_date
)
SELECT
  sales_ymd,
  amount,
  lag_ymd,
  lag_amount,
  amount - lag_amount AS diff_amount
FROM 
  sales_amount_by_date_with_lag
ORDER BY
  sales_ymd
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

sales_ymd	amount	lag_ymd	lag_amount	diff_amount
20170101	33723	None	None	None
20170102	24165	20170101	33723	-9558
20170103	27503	20170102	24165	3338
20170104	36165	20170103	27503	8662
20170105	37830	20170104	36165	1665
20170106	32387	20170105	37830	-5443
20170107	23415	20170106	32387	-8972
20170108	24737	20170107	23415	1322
20170109	26718	20170108	24737	1981
20170110	20143	20170109	26718	-6575

Commentary :

This code is written in SQL language and is using the "WITH" clause to create two temporary tables named "sales_amount_by_date" and "sales_amount_by_date_with_lag". The code is then selecting specific columns from the "sales_amount_by_date_with_lag" table and calculating the difference in sales amount between consecutive dates using the "LAG" function.

More specifically, the code is performing the following operations:

The first "WITH" clause creates a temporary table called "sales_amount_by_date" that calculates the total sales amount for each date in the "receipt" table.

The second "WITH" clause creates another temporary table called "sales_amount_by_date_with_lag" that uses the "LAG" function to calculate the previous date's sales amount and the difference in sales amount between consecutive dates.

The final "SELECT" statement selects columns from the "sales_amount_by_date_with_lag" table, including the sales date, sales amount, previous date, previous sales amount, and the difference in sales amount between consecutive dates.

The results are then sorted in ascending order by sales date and limited to the first 10 rows.

In summary, the code calculates the difference in sales amount between consecutive dates in the "receipt" table using the "LAG" function and displays the results for the first 10 rows.

S-042: Tabulate the sales amount (amount) of the receipt details data (df_receipt) by date (sales_ymd), and for each date, combine the data of the last, previous and three previous sales dates. Then display 10 results.

In [ ]:

%%sql

-- Code example 1: Vertical holding case
WITH sales_amount_by_date AS (
    SELECT
      sales_ymd,
      SUM(amount) AS amount
    FROM 
      receipt
    GROUP BY
      sales_ymd
),
sales_amount_lag_date AS (
    SELECT
      sales_ymd,
      LAG(sales_ymd, 3) OVER (ORDER BY sales_ymd) AS lag_date_3,
      amount
    FROM 
      sales_amount_by_date
)
SELECT
  a.sales_ymd,
  a.amount,
  b.sales_ymd AS lag_ymd,
  b.amount AS lag_amount
FROM 
  sales_amount_lag_date a
JOIN 
  sales_amount_lag_date b
ON
    (
        a.lag_date_3 IS NULL
        OR a.lag_date_3 <= b.sales_ymd
    )
    AND b.sales_ymd < a.sales_ymd
ORDER BY
  sales_ymd,
  lag_ymd
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

sales_ymd	amount	lag_ymd	lag_amount
20170102	24165	20170101	33723
20170103	27503	20170101	33723
20170103	27503	20170102	24165
20170104	36165	20170101	33723
20170104	36165	20170102	24165
20170104	36165	20170103	27503
20170105	37830	20170102	24165
20170105	37830	20170103	27503
20170105	37830	20170104	36165
20170106	32387	20170103	27503

Commentary :

This code is also written in SQL language and is using the "WITH" clause to create two temporary tables named "sales_amount_by_date" and "sales_amount_lag_date". The code is then performing a self-join on the "sales_amount_lag_date" table and filtering the results based on certain conditions.

More specifically, the code is performing the following operations:

The first "WITH" clause creates a temporary table called "sales_amount_by_date" that calculates the total sales amount for each date in the "receipt" table.

The second "WITH" clause creates another temporary table called "sales_amount_lag_date" that uses the "LAG" function to calculate the sales date three days prior and the sales amount for each date in the "sales_amount_by_date" table.

The final "SELECT" statement selects specific columns from the "sales_amount_lag_date" table and performs a self-join with itself. The join condition filters the results to include only rows where the sales date of one row is less than the sales date of another row and where the sales date of the second row is less than or equal to the sales date of the first row three days prior.

The results are then sorted in ascending order by sales date and previous sales date and limited to the first 10 rows.

In summary, the code is performing a self-join on a table that includes the sales date and sales amount for each date in the "receipt" table. The join condition filters the results to include only rows where the sales date of one row is less than the sales date of another row and where the sales date of the second row is less than or equal to the sales date of the first row three days prior. The results display the sales date, sales amount, previous sales date, and previous sales amount for the first 10 rows.

In [ ]:

%%sql

-- Code example 2: horizontal holding case
WITH sales_amount_by_date AS (
    SELECT
      sales_ymd,
      SUM(amount) AS amount
    FROM 
      receipt
    GROUP BY
      sales_ymd
),
sales_amount_with_lag AS (
    SELECT
      sales_ymd,
      amount, 
      LAG(sales_ymd, 1) OVER (ORDER BY sales_ymd) AS lag_ymd_1,
      LAG(amount, 1) OVER (ORDER BY sales_ymd) AS lag_amount_1,
      LAG(sales_ymd, 2) OVER (ORDER BY sales_ymd) AS lag_ymd_2,
      LAG(amount, 2) OVER (ORDER BY sales_ymd) AS lag_amount_2,
      LAG(sales_ymd, 3) OVER (ORDER BY sales_ymd) AS lag_ymd_3,
      LAG(amount, 3) OVER (ORDER BY sales_ymd) AS lag_amount_3
    FROM 
  sales_amount_by_date
)
SELECT
  sales_ymd,
  amount, 
  lag_ymd_1,
  lag_amount_1,
  lag_ymd_2,
  lag_amount_2,
  lag_ymd_3,
  lag_amount_3
FROM 
  sales_amount_with_lag
WHERE
  lag_ymd_3 IS NOT NULL
ORDER BY
  sales_ymd
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

sales_ymd	amount	lag_ymd_1	lag_amount_1	lag_ymd_2	lag_amount_2	lag_ymd_3	lag_amount_3
20170104	36165	20170103	27503	20170102	24165	20170101	33723
20170105	37830	20170104	36165	20170103	27503	20170102	24165
20170106	32387	20170105	37830	20170104	36165	20170103	27503
20170107	23415	20170106	32387	20170105	37830	20170104	36165
20170108	24737	20170107	23415	20170106	32387	20170105	37830
20170109	26718	20170108	24737	20170107	23415	20170106	32387
20170110	20143	20170109	26718	20170108	24737	20170107	23415
20170111	24287	20170110	20143	20170109	26718	20170108	24737
20170112	23526	20170111	24287	20170110	20143	20170109	26718
20170113	28004	20170112	23526	20170111	24287	20170110	20143

Commentary :

This code is also written in SQL language and is using the "WITH" clause to create a temporary table named "sales_amount_by_date". The code is then using the "LAG" function to create additional columns in another temporary table named "sales_amount_with_lag". Finally, the code is selecting specific columns from the "sales_amount_with_lag" table and filtering the results based on a certain condition.

More specifically, the code is performing the following operations:

The first "WITH" clause creates a temporary table called "sales_amount_by_date" that calculates the total sales amount for each date in the "receipt" table.

The second "WITH" clause creates another temporary table called "sales_amount_with_lag" that uses the "LAG" function to calculate the sales date and sales amount for the previous three days for each date in the "sales_amount_by_date" table.

The final "SELECT" statement selects specific columns from the "sales_amount_with_lag" table and filters the results to include only rows where the third previous date is not null. This means that the first two rows of the table will be excluded since they do not have a third previous date.

The results are then sorted in ascending order by sales date and limited to the first 10 rows.

In summary, the code is calculating the total sales amount for each date in the "receipt" table and then using the "LAG" function to calculate the sales date and sales amount for the previous three days for each date. The results display the sales date and sales amount for each date and the sales date and sales amount for the previous three days. The results are filtered to exclude the first two rows since they do not have a third previous date. The first 10 rows of the results are displayed in ascending order by sales date.

S-043: Combine receipt details data (df_receipt) and customer data (df_customer) to create sales summary data, summing the sales amount (amount) by gender code (gender_cd) and age (calculated from age). Gender_cd should be 0 for male, 1 for female and 9 for unknown.

However, the item structure should consist of four items: age, female sales amount, male sales amount and sales amount of unknown gender (vertical cross tabulation of age and horizontal cross tabulation of gender). In addition, the age shall be in deciles of 10 years.

In [ ]:

%%sql

DROP TABLE IF EXISTS sales_summary;

CREATE TABLE sales_summary AS
    WITH gender_era_amount AS (
        SELECT
          TRUNC(age / 10) * 10 AS era,
          c.gender_cd,
          SUM(r.amount) AS amount
        FROM 
          customer c
        JOIN 
          receipt r
        ON
          c.customer_id = r.customer_id
        GROUP BY
          era,
          c.gender_cd
    )
    SELECT
      era,
      SUM(CASE WHEN gender_cd = '0' THEN amount END) AS male,
      SUM(CASE WHEN gender_cd = '1' THEN amount END) AS female,
      SUM(CASE WHEN gender_cd = '9' THEN amount END) AS unknown
    FROM 
      gender_era_amount
    GROUP BY
      era
    ORDER BY
      era
;

SELECT
  *
FROM 
  sales_summary
;

 * postgresql://padawan:***@db:5432/dsdojo_db
Done.
9 rows affected.
9 rows affected.

Out[ ]:

era	male	female	unknown
10.0	1591	149836	4317
20.0	72940	1363724	44328
30.0	177322	693047	50441
40.0	19355	9320791	483512
50.0	54320	6685192	342923
60.0	272469	987741	71418
70.0	13435	29764	2427
80.0	46360	262923	5111
90.0	None	6260	None

Commentary :

This SQL code creates a table named sales_summary that summarizes the total sales amount by age group (era) and gender.

First, a Common Table Expression (CTE) named gender_era_amount is created using a join between the customer and receipt tables. The CTE calculates the total sales amount by grouping the records based on age group (era) and gender.

Next, the sales_summary table is created by aggregating the data from the gender_era_amount CTE. The table has four columns: era, male, female, and unknown. The SUM(CASE...) statements within the SELECT clause are used to pivot the gender-specific sales amounts into separate columns.

Finally, the SELECT statement is used to retrieve all the rows from the newly created sales_summary table.

S-044: The sales summary data (df_sales_summary) created in 043 had sales by gender carried horizontally. From this data, convert the gender into three items: age, gender code and sales amount, by having the gender carried vertically. The gender code should be “00” for male, “01” for female and “99” for unknown.

In [ ]:

%%sql

-- somewhat aggressive description as it is not SQL-oriented (note that it can be very long SQL when the number of categories is large).
SELECT 
  era, 
  '00' AS gender_cd, 
  male AS amount 
FROM 
  sales_summary
UNION ALL
SELECT 
  era, 
  '01' AS gender_cd, 
  female AS amount 
FROM 
  sales_summary
UNION ALL
SELECT 
  era, 
  '99' AS gender_cd, 
  unknown AS amount 
FROM 
  sales_summary
;

 * postgresql://padawan:***@db:5432/dsdojo_db
27 rows affected.

Out[ ]:

era	gender_cd	amount
10.0	00	1591
20.0	00	72940
30.0	00	177322
40.0	00	19355
50.0	00	54320
60.0	00	272469
70.0	00	13435
80.0	00	46360
90.0	00	None
10.0	01	149836
20.0	01	1363724
30.0	01	693047
40.0	01	9320791
50.0	01	6685192
60.0	01	987741
70.0	01	29764
80.0	01	262923
90.0	01	6260
10.0	99	4317
20.0	99	44328
30.0	99	50441
40.0	99	483512
50.0	99	342923
60.0	99	71418
70.0	99	2427
80.0	99	5111
90.0	99	None

Commentary :

This code is written in SQL and retrieves data from a table called sales_summary. The purpose of this code is to transform the data in the sales_summary table to a format that can be more easily analyzed.

The SELECT statement is used to retrieve data from the sales_summary table, and the UNION ALL operator is used to combine the results of multiple SELECT statements into a single result set.

The first SELECT statement retrieves data from the sales_summary table and selects the era and male columns. It also includes a new column called gender_cd, which has a value of 00. This column is added using the string literal '00' AS gender_cd. The AS keyword is used to rename the column to gender_cd. The result set of this SELECT statement contains three columns: era, gender_cd, and amount.

The second SELECT statement is similar to the first one, but it selects the era and female columns and sets the gender_cd column to 01. The result set of this SELECT statement also contains three columns: era, gender_cd, and amount.

The third SELECT statement is also similar to the first two, but it selects the era and unknown columns and sets the gender_cd column to 99. The result set of this SELECT statement also contains three columns: era, gender_cd, and amount.

Finally, the UNION ALL operator is used to combine the result sets of the three SELECT statements into a single result set. The resulting table will have one row for each combination of era and gender_cd, and the amount column will contain the value of either male, female, or unknown, depending on the value of gender_cd.

S-045: The date of birth (birth_day) in the customer data (df_customer) is held in date type. Convert this to a string in YYYYMMDD format and display 10 cases together with the customer ID (customer_id).

In [ ]:

%%sql

SELECT
  customer_id, 
  TO_CHAR(birth_day, 'YYYYMMDD') AS birth_day
FROM 
  customer 
LIMIT
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	birth_day
CS021313000114	19810429
CS037613000071	19520401
CS031415000172	19761004
CS028811000001	19330327
CS001215000145	19950329
CS020401000016	19740915
CS015414000103	19770809
CS029403000008	19730817
CS015804000004	19310502
CS033513000180	19620711

Commentary :

This code is written in SQL and retrieves data from a table called customer. The purpose of this code is to retrieve the customer_id and birth_day columns from the customer table and format the birth_day column in a specific way.

The SELECT statement is used to retrieve data from the customer table, and the LIMIT clause is used to limit the number of rows returned to 10.

The TO_CHAR function is used to convert the birth_day column to a string with the specified format. In this case, the format is 'YYYYMMDD', which represents the year, month, and day of the birth date. The resulting string will have a length of 8 characters, with the year represented by 4 characters, the month by 2 characters, and the day by 2 characters.

The AS keyword is used to rename the resulting column to birth_day. The result set of this SELECT statement contains two columns: customer_id and birth_day. The birth_day column contains the birth date of each customer in the specified format.

Overall, this code is useful when you want to retrieve the customer_id and birth_day columns from the customer table and format the birth_day column in a specific way. This formatted data can be useful for analysis or reporting purposes.

S-046: The application date (application_date) of the customer data (df_customer) holds data in a string type in YYYYMMDD format. Convert this to a date type and display 10 cases together with the customer ID (customer_id).

In [ ]:

%%sql

SELECT
  customer_id,
  TO_DATE(application_date, 'YYYYMMDD') AS application_date
FROM 
  customer
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	application_date
CS021313000114	2015-09-05
CS037613000071	2015-04-14
CS031415000172	2015-05-29
CS028811000001	2016-01-15
CS001215000145	2017-06-05
CS020401000016	2015-02-25
CS015414000103	2015-07-22
CS029403000008	2015-05-15
CS015804000004	2015-06-07
CS033513000180	2015-07-28

Commentary :

This code is written in SQL and retrieves data from a table called customer. The purpose of this code is to retrieve the customer_id and application_date columns from the customer table and convert the application_date column to a date format.

The SELECT statement is used to retrieve data from the customer table, and the LIMIT clause is used to limit the number of rows returned to 10.

The TO_DATE function is used to convert the application_date column from a string to a date format. In this case, the format is 'YYYYMMDD', which represents the year, month, and day of the application date. The resulting date value will be in the format yyyy-mm-dd.

The AS keyword is used to rename the resulting column to application_date. The result set of this SELECT statement contains two columns: customer_id and application_date. The application_date column contains the application date of each customer in the date format.

Overall, this code is useful when you want to retrieve the customer_id and application_date columns from the customer table and convert the application_date column to a date format. This formatted data can be useful for analysis or reporting purposes, especially when date calculations or comparisons are involved.

S-047: The sales date (sales_ymd) of the receipt details data (df_receipt) holds data in numeric type in YYYYMMDD format. Convert this to a date type and display 10 items together with the receipt number (receipt_no) and receipt sub-number (receipt_sub_no).

In [ ]:

%%sql

SELECT  
  receipt_no, 
  receipt_sub_no,
  TO_DATE(CAST(sales_ymd AS VARCHAR), 'YYYYMMDD') AS sales_ymd
FROM 
  receipt 
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

receipt_no	receipt_sub_no	sales_ymd
112	1	2018-11-03
1132	2	2018-11-18
1102	1	2017-07-12
1132	1	2019-02-05
1102	2	2018-08-21
1112	1	2019-06-05
1102	2	2018-12-05
1102	1	2019-09-22
1112	2	2017-05-04
1102	1	2019-10-10

Commentary :

This code is written in SQL and retrieves data from a table called receipt. The purpose of this code is to retrieve the receipt_no, receipt_sub_no, and sales_ymd columns from the receipt table and convert the sales_ymd column to a date format.

The SELECT statement is used to retrieve data from the receipt table, and the LIMIT clause is used to limit the number of rows returned to 10.

The CAST function is used to convert the sales_ymd column from an integer to a string. The resulting string will contain the year, month, and day of the sales date in the format 'YYYYMMDD'.

The TO_DATE function is used to convert the sales_ymd string to a date format. In this case, the format is 'YYYYMMDD', which represents the year, month, and day of the sales date. The resulting date value will be in the format yyyy-mm-dd.

The AS keyword is used to rename the resulting column to sales_ymd. The result set of this SELECT statement contains three columns: receipt_no, receipt_sub_no, and sales_ymd. The sales_ymd column contains the sales date of each receipt in the date format.

Overall, this code is useful when you want to retrieve the receipt_no, receipt_sub_no, and sales_ymd columns from the receipt table and convert the sales_ymd column to a date format. This formatted data can be useful for analysis or reporting purposes, especially when date calculations or comparisons are involved.

S-048: The sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) holds data in UNIX seconds of numeric type. Convert it to date type and display 10 cases together with receipt number (receipt_no) and receipt sub-number (receipt_sub_no).

In [ ]:

%%sql

SELECT 
  receipt_no,
  receipt_sub_no,
  CAST(TO_TIMESTAMP(sales_epoch) AS DATE) AS sales_ymd
FROM 
  receipt 
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

receipt_no	receipt_sub_no	sales_ymd
112	1	2018-11-03
1132	2	2018-11-18
1102	1	2017-07-12
1132	1	2019-02-05
1102	2	2018-08-21
1112	1	2019-06-05
1102	2	2018-12-05
1102	1	2019-09-22
1112	2	2017-05-04
1102	1	2019-10-10

Commentary :

This code is written in SQL and retrieves data from a table called receipt. The purpose of this code is to retrieve the receipt_no, receipt_sub_no, and sales_epoch columns from the receipt table and convert the sales_epoch column to a date format.

The SELECT statement is used to retrieve data from the receipt table, and the LIMIT clause is used to limit the number of rows returned to 10.

The TO_TIMESTAMP function is used to convert the sales_epoch column from an integer to a timestamp format. The resulting timestamp will represent the number of seconds since January 1, 1970, 00:00:00 UTC.

The CAST function is used to convert the timestamp to a date format. In this case, the resulting date value will be in the format yyyy-mm-dd.

The AS keyword is used to rename the resulting column to sales_ymd. The result set of this SELECT statement contains three columns: receipt_no, receipt_sub_no, and sales_ymd. The sales_ymd column contains the sales date of each receipt in the date format.

Overall, this code is useful when you want to retrieve the receipt_no, receipt_sub_no, and sales_epoch columns from the receipt table and convert the sales_epoch column to a date format. This formatted data can be useful for analysis or reporting purposes, especially when date calculations or comparisons are involved.

S-049: Convert the sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) to date type, extract only the “年(year)” and display 10 items with receipt number (receipt_no) and receipt sub-number (receipt_sub_no).

In [ ]:

%%sql

SELECT 
  receipt_no, 
  receipt_sub_no,
  EXTRACT(YEAR FROM TO_TIMESTAMP(sales_epoch)) AS sales_year
FROM 
  receipt 
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

receipt_no	receipt_sub_no	sales_year
112	1	2018
1132	2	2018
1102	1	2017
1132	1	2019
1102	2	2018
1112	1	2019
1102	2	2018
1102	1	2019
1112	2	2017
1102	1	2019

Commentary :

This code is written in SQL and retrieves data from a table called receipt. The purpose of this code is to retrieve the receipt_no, receipt_sub_no, and sales_epoch columns from the receipt table and extract the year from the sales_epoch timestamp column.

The SELECT statement is used to retrieve data from the receipt table, and the LIMIT clause is used to limit the number of rows returned to 10.

The TO_TIMESTAMP function is used to convert the sales_epoch column from an integer to a timestamp format. The resulting timestamp will represent the number of seconds since January 1, 1970, 00:00:00 UTC.

The EXTRACT function is used to extract the year from the sales_epoch timestamp. The resulting value will be an integer representing the year.

The AS keyword is used to rename the resulting column to sales_year. The result set of this SELECT statement contains three columns: receipt_no, receipt_sub_no, and sales_year. The sales_year column contains the year of the sales date of each receipt.

Overall, this code is useful when you want to retrieve the receipt_no, receipt_sub_no, and sales_epoch columns from the receipt table and extract the year from the sales_epoch timestamp column. This extracted data can be useful for grouping or filtering data based on the year of the sales date.

S-050: Convert the sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) to date type, extract only the “月(month)” and display 10 items with receipt number (receipt_no) and receipt sub number (receipt_sub_no). Note that the “月(month)” should be retrieved with two digits filled with zero.

In [ ]:

%%sql

SELECT 
  receipt_no, 
  receipt_sub_no,
  TO_CHAR(
            EXTRACT(MONTH FROM TO_TIMESTAMP(sales_epoch)), 
            'FM00'
         ) AS sales_month
FROM 
  receipt
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

receipt_no	receipt_sub_no	sales_month
112	1	11
1132	2	11
1102	1	07
1132	1	02
1102	2	08
1112	1	06
1102	2	12
1102	1	09
1112	2	05
1102	1	10

Commentary :

This code is written in SQL and retrieves data from a table called receipt. The purpose of this code is to retrieve the receipt_no, receipt_sub_no, and sales_epoch columns from the receipt table and extract the month from the sales_epoch timestamp column, then format the resulting month as a two-digit string with a leading zero.

The SELECT statement is used to retrieve data from the receipt table, and the LIMIT clause is used to limit the number of rows returned to 10.

The TO_TIMESTAMP function is used to convert the sales_epoch column from an integer to a timestamp format. The resulting timestamp will represent the number of seconds since January 1, 1970, 00:00:00 UTC.

The EXTRACT function is used to extract the month from the sales_epoch timestamp. The resulting value will be an integer representing the month.

The TO_CHAR function is used to convert the extracted month to a string with a leading zero if necessary. The 'FM00' format string specifies that the resulting string should have a fixed width of two characters, with any leading zeros removed.

The AS keyword is used to rename the resulting column to sales_month. The result set of this SELECT statement contains three columns: receipt_no, receipt_sub_no, and sales_month. The sales_month column contains the month of the sales date of each receipt as a two-digit string with a leading zero.

Overall, this code is useful when you want to retrieve the receipt_no, receipt_sub_no, and sales_epoch columns from the receipt table and extract the month from the sales_epoch timestamp column, then format the resulting month as a two-digit string with a leading zero. This formatted data can be useful for grouping or filtering data based on the month of the sales date.

S-051: Convert the sales epoch seconds of the receipt details data (df_receipt) to date type, extract only the “日(day)” and display 10 items with receipt number (receipt_no) and receipt sub-number (receipt_sub_no). Note that the “日(day)” should be extracted with two digits filled with zero.

In [ ]:

%%sql

SELECT 
  receipt_no, 
  receipt_sub_no,
  TO_CHAR(EXTRACT(DAY FROM TO_TIMESTAMP(sales_epoch)), 'FM00') AS sales_day
FROM 
  receipt 
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

receipt_no	receipt_sub_no	sales_day
112	1	03
1132	2	18
1102	1	12
1132	1	05
1102	2	21
1112	1	05
1102	2	05
1102	1	22
1112	2	04
1102	1	10

Commentary :

This code is written in SQL and retrieves data from a table called receipt. The purpose of this code is to retrieve the receipt_no, receipt_sub_no, and sales_epoch columns from the receipt table and extract the day from the sales_epoch timestamp column, then format the resulting day as a two-digit string with a leading zero.

The SELECT statement is used to retrieve data from the receipt table, and the LIMIT clause is used to limit the number of rows returned to 10.

The TO_TIMESTAMP function is used to convert the sales_epoch column from an integer to a timestamp format. The resulting timestamp will represent the number of seconds since January 1, 1970, 00:00:00 UTC.

The EXTRACT function is used to extract the day from the sales_epoch timestamp. The resulting value will be an integer representing the day.

The TO_CHAR function is used to convert the extracted day to a string with a leading zero if necessary. The 'FM00' format string specifies that the resulting string should have a fixed width of two characters, with any leading zeros removed.

The AS keyword is used to rename the resulting column to sales_day. The result set of this SELECT statement contains three columns: receipt_no, receipt_sub_no, and sales_day. The sales_day column contains the day of the sales date of each receipt as a two-digit string with a leading zero.

Overall, this code is useful when you want to retrieve the receipt_no, receipt_sub_no, and sales_epoch columns from the receipt table and extract the day from the sales_epoch timestamp column, then format the resulting day as a two-digit string with a leading zero. This formatted data can be useful for grouping or filtering data based on the day of the sales date.

S-052: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), binaryise the total sales amount to 0 for amounts less than 2,000 yen and 1 for amounts greater than 2,000 yen, and display 10 items with the customer ID and total sales amount. Display 10 items together with the customer ID and the total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.

In [ ]:

%%sql

SELECT
  customer_id,
  SUM(amount) AS sum_amount,
  CASE WHEN SUM(amount) > 2000 THEN 1 ELSE 0 END AS sales_flg
FROM 
  receipt
WHERE 
  customer_id NOT LIKE 'Z%'
GROUP BY 
  customer_id
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	sum_amount	sales_flg
CS001311000059	2302	1
CS004614000122	248	0
CS003512000043	298	0
CS011615000061	246	0
CS029212000033	3604	1
CS007515000119	7157	1
CS034515000123	3699	1
CS004315000058	490	0
CS026414000014	6671	1
CS001615000099	768	0

Commentary :

This SQL code is selecting data from a table called receipt. The purpose of this code is to calculate the total sales amount for each customer in the receipt table, and also to create a flag indicating if the total sales amount for each customer is greater than 2000 or not.

The SELECT statement retrieves the customer_id column, the total sales amount for each customer calculated using the SUM function, and a CASE expression that returns 1 if the total sales amount is greater than 2000 and 0 otherwise. The resulting column that contains the flag is named sales_flg.

The WHERE clause filters out any records where the customer ID starts with the letter 'Z'. This could be used to exclude certain customers from the calculation or flagging.

The GROUP BY clause groups the results by customer_id so that the total sales amount and flag are calculated for each customer.

The LIMIT clause limits the output to the first 10 results.

Overall, this code is useful when you want to calculate the total sales amount for each customer in a receipt table and create a flag indicating if the total sales amount is greater than a certain threshold (in this case, 2000) or not. This could be useful for identifying high-spending customers or for other types of analysis where customer sales amounts need to be aggregated and analyzed.

S-053: Binarise the postcode (postal_cd) of the customer data (df_customer) to 1 for Tokyo (the first three digits are 100 to 209) and 0 for all others. Further, combine it with the receipt details data (df_receipt) and count the number of customers with sales results for the entire period, for each binary value created.

In [ ]:

%%sql

WITH cust AS (
    SELECT
      customer_id,
      postal_cd,
      CASE WHEN CAST(SUBSTR(postal_cd, 1, 3) AS INTEGER) BETWEEN 100 AND 209 THEN 1 ELSE 0  END AS postal_flg
    FROM
      customer
),
rect AS(
    SELECT DISTINCT
      customer_id
    FROM
      receipt
)
SELECT 
  c.postal_flg, 
  COUNT(DISTINCT c.customer_id) AS customer_cnt
FROM
  cust c
JOIN
  rect r
USING 
  (customer_id)
GROUP BY
  c.postal_flg
;

 * postgresql://padawan:***@db:5432/dsdojo_db
2 rows affected.

Out[ ]:

postal_flg	customer_cnt
0	3906
1	4400

Commentary :

This SQL code contains a Common Table Expression (CTE) named cust, which selects data from the customer table and calculates a postal_flg flag for each customer based on their postal_cd.

The postal_flg flag is calculated using a CASE expression, which checks if the first three characters of the postal_cd are between 100 and 209 (inclusive), and assigns a value of 1 if true and 0 if false. This flag could be used to group customers by postal code ranges, for example.

The second CTE named rect selects distinct customer_ids from the receipt table.

The final SELECT statement joins the cust and rect CTEs on the customer_id column and groups the results by the postal_flg column. The COUNT(DISTINCT c.customer_id) function counts the number of distinct customers for each postal_flg value.

Overall, this code is useful when you want to group customers by their postal code ranges and count the number of distinct customers for each range, based on data stored in the customer and receipt tables.

S-054: The address of the customer data (df_customer) is one of 埼玉県(Saitama-prefecture), 千葉県(Chiba-prefecture), 東京都(Tokyo-Metropolitan) or 神奈川県(Kanagawa-prefecture). Create a code value for each prefecture, and display 10 cases together with the customer ID and address. The values should be 11 for 埼玉県(Saitama-prefecture), 12 for 千葉県(Chiba-prefecture), 13 for 東京都(Tokyo-Metropolitan) and 14 for 神奈川県(Kanagawa-prefecture).

In [ ]:

%%sql

-- somewhat aggressive description as it is not SQL oriented (note that it becomes very long SQL when the number of categories is large).
-- Code example 1 (fixed and cut out)
SELECT
  customer_id,
  address,
  CASE SUBSTR(address,1, 3)
      WHEN '埼玉県' THEN '11'
      WHEN '千葉県' THEN '12'
      WHEN '東京都' THEN '13'
      WHEN '神奈川' THEN '14'
  END AS prefecture_cd
FROM
  customer
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	address	prefecture_cd
CS021313000114	神奈川県伊勢原市粟窪**********	14
CS037613000071	東京都江東区南砂**********	13
CS031415000172	東京都渋谷区代々木**********	13
CS028811000001	神奈川県横浜市泉区和泉町**********	14
CS001215000145	東京都大田区仲六郷**********	13
CS020401000016	東京都板橋区若木**********	13
CS015414000103	東京都江東区北砂**********	13
CS029403000008	千葉県浦安市海楽**********	12
CS015804000004	東京都江東区北砂**********	13
CS033513000180	神奈川県横浜市旭区善部町**********	14

Commentary :

This SQL code selects the customer_id, address, and a prefecture_cd code for each customer from the customer table.

The CASE expression in this code checks the first three characters of the address column and assigns a corresponding prefecture_cd value based on the following conditions:

If the first three characters of the address column are '埼玉県', the prefecture_cd value is set to '11'.

If the first three characters of the address column are '千葉県', the prefecture_cd value is set to '12'.

If the first three characters of the address column are '東京都', the prefecture_cd value is set to '13'.

If the first three characters of the address column are '神奈川', the prefecture_cd value is set to '14'.

This code is useful when you want to categorize customers by their prefecture (region) based on their address information stored in the customer table. The resulting prefecture_cd column can be used for further analysis or grouping.

In [ ]:

%%sql

-- Code example 2 (using regular expressions)
SELECT
  customer_id,
  address,
  CASE SUBSTRING(address, '^.*?[都道府県]')
      WHEN '埼玉県' THEN '11'
      WHEN '千葉県' THEN '12'
      WHEN '東京都' THEN '13'
      WHEN '神奈川県' THEN '14'
  END AS prefecture_cd
FROM
  customer
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	address	prefecture_cd
CS021313000114	神奈川県伊勢原市粟窪**********	14
CS037613000071	東京都江東区南砂**********	13
CS031415000172	東京都渋谷区代々木**********	13
CS028811000001	神奈川県横浜市泉区和泉町**********	14
CS001215000145	東京都大田区仲六郷**********	13
CS020401000016	東京都板橋区若木**********	13
CS015414000103	東京都江東区北砂**********	13
CS029403000008	千葉県浦安市海楽**********	12
CS015804000004	東京都江東区北砂**********	13
CS033513000180	神奈川県横浜市旭区善部町**********	14

Commentary :

This SQL code selects the customer_id, address, and a prefecture_cd code for each customer from the customer table.

The CASE expression in this code extracts the prefecture name from the address column using a regular expression pattern '^.*?[都道府県]' and assigns a corresponding prefecture_cd value based on the following conditions:

If the extracted prefecture name is '埼玉県', the prefecture_cd value is set to '11'.
If the extracted prefecture name is '千葉県', the prefecture_cd value is set to '12'.
If the extracted prefecture name is '東京都', the prefecture_cd value is set to '13'.
If the extracted prefecture name is '神奈川県', the prefecture_cd value is set to '14'.

This code is similar to the previous example, but it uses the SUBSTRING function to extract the prefecture name from the address column instead of the SUBSTR function. The regular expression pattern '^.*?[都道府県]' matches any string that starts with any characters (^.*?) and ends with one of the characters 都, 道, 府, or 県. This code is useful when the address information in the customer table is not standardized and includes variations in the format of prefecture names. The resulting prefecture_cd column can be used for further analysis or grouping.

In [ ]:

%%sql

-- Code example 3 (using LIKE)
SELECT
  customer_id,
  address,
  CASE
      WHEN address LIKE '埼玉県%' THEN '11'
      WHEN address LIKE '千葉県%' THEN '12'
      WHEN address LIKE '東京都%' THEN '13'
      WHEN address LIKE '神奈川県%' THEN '14'
  END AS prefecture_cd
FROM
    customer
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	address	prefecture_cd
CS021313000114	神奈川県伊勢原市粟窪**********	14
CS037613000071	東京都江東区南砂**********	13
CS031415000172	東京都渋谷区代々木**********	13
CS028811000001	神奈川県横浜市泉区和泉町**********	14
CS001215000145	東京都大田区仲六郷**********	13
CS020401000016	東京都板橋区若木**********	13
CS015414000103	東京都江東区北砂**********	13
CS029403000008	千葉県浦安市海楽**********	12
CS015804000004	東京都江東区北砂**********	13
CS033513000180	神奈川県横浜市旭区善部町**********	14

Commentary :

This SQL code is performing a query on a table named customer and selecting the customer_id, address, and prefecture_cd columns for the first 10 rows. The prefecture_cd column is created using a CASE statement to match the address column using the LIKE operator.

Specifically, the CASE statement checks the address column using the LIKE operator with a string pattern that matches the prefectures of Japan (埼玉県%, 千葉県%, 東京都%, 神奈川県%) followed by any string characters. If the address column matches any of the patterns, the corresponding prefecture code (11, 12, 13, 14) is assigned to the prefecture_cd column.

This code is an alternative way of extracting the prefecture code from the address column, compared to the other two examples shown previously.

S-055: Sum the sales amount (amount) of the receipt details (df_receipt) data for each customer ID (customer_id) and find the quartile point of the total amount. Then, create category values for the total sales amount for each customer based on the following criteria, and display the 10 results together with the customer ID and the total sales amount. The category values shall be 1-4 in order.

Minimum value and above and below the first quartile – Assign 1.

First quartile and above but below the second quartile – Assign 2.

Second quartile and above but below the third quartile – Assign 3.

Third quartile and above – Assign 4.

In [ ]:

%%sql

WITH sales_amount AS(
    SELECT
      customer_id,
      SUM(amount) AS sum_amount
    FROM
      receipt
    GROUP BY
      customer_id
),
sales_pct AS (
    SELECT
      PERCENTILE_CONT(0.25) WITHIN GROUP(ORDER BY sum_amount) AS pct25,
      PERCENTILE_CONT(0.50) WITHIN GROUP(ORDER BY sum_amount) AS pct50,
      PERCENTILE_CONT(0.75) WITHIN GROUP(ORDER BY sum_amount) AS pct75
    FROM
      sales_amount
)
SELECT
  a.customer_id,
  a.sum_amount,
  CASE
      WHEN a.sum_amount < pct25 THEN 1
      WHEN pct25 <= a.sum_amount AND a.sum_amount < pct50 THEN 2
      WHEN pct50 <= a.sum_amount AND a.sum_amount < pct75 THEN 3
      WHEN pct75 <= a.sum_amount THEN 4
  END AS pct_group
FROM 
  sales_amount a
CROSS JOIN 
  sales_pct p
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	sum_amount	pct_group
CS001311000059	2302	3
CS004614000122	248	1
CS003512000043	298	1
CS011615000061	246	1
CS029212000033	3604	3
CS007515000119	7157	4
CS034515000123	3699	4
CS004315000058	490	1
CS026414000014	6671	4
CS001615000099	768	2

Commentary :

This SQL code performs the following tasks:

It calculates the total sales amount for each customer by summing the amount column from the receipt table for each customer_id using the SUM function and group by the customer_id. The resulting table is saved as a subquery named sales_amount.

It calculates the quartiles (25th, 50th, and 75th) of the sum_amount column from the sales_amount subquery using the PERCENTILE_CONT function, which is a window function.

It joins the sales_amount subquery with the sales_pct subquery using the CROSS JOIN operation to obtain the quartile values for each row of the sales_amount table.

Finally, it calculates a new column pct_group that assigns a percentile group to each customer based on their total sales amount, as follows:

1 if the sum_amount is less than the 25th percentile (pct25)
2 if the sum_amount is greater than or equal to the 25th percentile and less than the 50th percentile (pct50)
3 if the sum_amount is greater than or equal to the 50th percentile and less than the 75th percentile (pct75)
4 if the sum_amount is greater than or equal to the 75th percentile.

The final result is limited to the first 10 rows.

S-056: Calculate the age in 10-year age increments based on the age (age) of the customer data (df_customer) and display 10 cases together with the customer ID (customer_id) and date of birth (birth_day). However, all age groups above 60 years should be assumed to be in the 60s. The name of the category representing the age is optional.

In [ ]:

%%sql

SELECT
  customer_id,
  birth_day,
    -- Items for confirmation
    -- age,
  LEAST(CAST(TRUNC(age / 10) * 10 AS INTEGER), 60) AS era
FROM
  customer
GROUP BY   customer_id,
  birth_day
-- Terms and conditions for verification
-- HAVING LEAST(CAST(TRUNC(age / 10) * 10 AS INTEGER), 60) >= 60
  LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	birth_day	era
CS001105000001	2000-01-14	10
CS001112000009	2006-08-24	10
CS001112000019	2001-01-31	10
CS001112000021	2001-12-15	10
CS001112000023	2004-01-26	10
CS001112000024	2001-01-16	10
CS001112000029	2005-01-24	10
CS001112000030	2003-03-02	10
CS001113000004	2003-02-22	10
CS001113000010	2005-05-09	10

Commentary :

This code is selecting the customer_id, birth_day, and era (a calculated field) from the customer table. The era field is calculated by determining the customer's age (which is not explicitly selected, but is used in the calculation) and then taking the lower value between the age rounded down to the nearest decade and 60.

The results are grouped by customer_id and birth_day. The HAVING clause is then used to filter out customers whose era is less than 60.

In essence, the code is selecting information about customers from the customer table, calculating their age and era, grouping the results by customer_id and birth_day, and filtering out customers whose era is less than 60.

S-057: Using the extraction results of 056 and the gender code (gender_cd), create new category data representing the combination of gender x age and display 10 cases. The value of the category representing the combination shall be arbitrary.

In [ ]:

%%sql

-- Generate a sex and age code by concatenating a one-digit sex code and a two-digit age code.
SELECT
  customer_id,
  birth_day,
  gender_cd || TO_CHAR(LEAST(CAST(TRUNC(age / 10) * 10 AS INTEGER), 60), 'FM00') AS gender_era
FROM
  customer
GROUP BY
  customer_id,
  birth_day
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	birth_day	gender_era
CS001105000001	2000-01-14	010
CS001112000009	2006-08-24	110
CS001112000019	2001-01-31	110
CS001112000021	2001-12-15	110
CS001112000023	2004-01-26	110
CS001112000024	2001-01-16	110
CS001112000029	2005-01-24	110
CS001112000030	2003-03-02	110
CS001113000004	2003-02-22	110
CS001113000010	2005-05-09	110

Commentary :

This SQL code selects customer_id, birth_day, and a concatenated string composed of gender_cd and era. The era is calculated by taking the age of the customer and rounding it down to the nearest decade, with a maximum era of 60. The gender_era string is generated by concatenating the gender_cd and era.

The SQL code groups the result by customer_id and birth_day, and returns only the first 10 rows of the result.

Overall, this code creates a summary of customer data that includes age and gender, with an additional era dimension. The resulting table could potentially be used for further analysis and segmentation of customers based on age and gender.

S-058: Create a dummy variable for the gender code (gender_cd) of the customer data (df_customer) and display 10 cases together with the customer ID (customer_id).

In [ ]:

%%sql

-- Note that if there are many categories, the SQL can be very long.
-- If you want to reduce one category, simply delete one of the CASE statements.
SELECT
  customer_id,
  CASE WHEN gender_cd = '0' THEN '1' ELSE '0' END AS gender_cd_0,
  CASE WHEN gender_cd = '1' THEN '1' ELSE '0' END AS gender_cd_1,
  CASE WHEN gender_cd = '9' THEN '1' ELSE '0' END AS gender_cd_9
FROM
  customer
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	gender_cd_0	gender_cd_1	gender_cd_9
CS021313000114	0	1	0
CS037613000071	0	0	1
CS031415000172	0	1	0
CS028811000001	0	1	0
CS001215000145	0	1	0
CS020401000016	1	0	0
CS015414000103	0	1	0
CS029403000008	1	0	0
CS015804000004	1	0	0
CS033513000180	0	1	0

Commentary :

This SQL code is selecting customer_id, and creating three new columns: gender_cd_0, gender_cd_1, and gender_cd_9. These new columns are created using the CASE statement.

The CASE statement is a conditional statement that evaluates a set of conditions and returns a corresponding result. In this case, for each customer, the code checks the value of the gender_cd column. If gender_cd is equal to '0', the code returns '1' in the gender_cd_0 column and '0' in the other columns. If gender_cd is equal to '1', the code returns '1' in the gender_cd_1 column and '0' in the other columns. If gender_cd is equal to '9', the code returns '1' in the gender_cd_9 column and '0' in the other columns.

The resulting output will have customer_id, gender_cd_0, gender_cd_1, and gender_cd_9 columns, where for each customer, only one of the new columns will have a value of '1', indicating the gender of the customer.

S-059: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), standardise the total sales amount to mean 0 and standard deviation 1 and display 10 cases with the customer ID and total sales amount. The standard deviation used for standardisation may be either the square root of the variance or the square root of the unbiased variance. However, exclude customer IDs starting with “Z”, as they represent non-members.

In [ ]:

%%sql

-- Code example 1 (standardised with STDDEV_POP)
WITH sales_amount AS(
    SELECT
      customer_id,
      SUM(amount) AS sum_amount
    FROM
      receipt
    WHERE
      customer_id NOT LIKE 'Z%'
    GROUP BY
      customer_id
),
stats_amount AS (
    SELECT
      AVG(sum_amount) AS avg_amount,
      STDDEV_POP(sum_amount) AS stddev_amount
    FROM
      sales_amount
)
SELECT
  customer_id,
  sum_amount,
  (sum_amount - avg_amount) / stddev_amount AS std_amount
FROM 
  sales_amount
CROSS JOIN 
  stats_amount
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	sum_amount	std_amount
CS001311000059	2302	-0.09032946448390523062
CS004614000122	248	-0.84533488878695858131
CS003512000043	298	-0.82695598361891930792
CS011615000061	246	-0.84607004499368015224
CS029212000033	3604	0.38825722609183744835
CS007515000119	7157	1.6942622273327082
CS034515000123	3699	0.42317714591111206778
CS004315000058	490	-0.75638098777364849812
CS026414000014	6671	1.5156192690993665
CS001615000099	768	-0.65419427503935013810

Commentary :

This code is written in SQL language and is using a database management system that supports SQL.

The code is performing a data analysis task on a table called receipt. The table contains transactional data with columns such as customer_id and amount.

The code is using a common table expression (CTE) to create two temporary tables sales_amount and stats_amount.

The first CTE named sales_amount is calculating the total amount spent by each customer who has a customer_id that does not start with the letter 'Z'. It is grouping the transaction data by customer_id and calculating the sum of amount for each customer_id.

The second CTE named stats_amount is calculating the average and standard deviation of the total amount spent by each customer.

The final SELECT statement is selecting columns from the sales_amount table, and joining it with the stats_amount table using a cross join. It is also calculating the standardized amount spent by each customer by subtracting the average amount from the total amount and then dividing it by the standard deviation.

The LIMIT 10 at the end limits the output to the first 10 rows.

Overall, this code is performing a statistical analysis on the transactional data to find the standardized amount spent by each customer who doesn't have a customer_id starting with the letter 'Z'.

In [ ]:

%%sql

-- Code example 2 (standardised with STDDEV_SAMP, values slightly different from code example 2)
WITH sales_amount AS(
    SELECT
      customer_id,
      SUM(amount) AS sum_amount
    FROM
      receipt
    WHERE
      customer_id NOT LIKE 'Z%'
    GROUP BY
      customer_id
),
stats_amount AS (
    SELECT
      AVG(sum_amount) AS avg_amount,
      STDDEV_SAMP(sum_amount) AS stddev_amount
    FROM
      sales_amount
)
SELECT
  customer_id,
  sum_amount,
  (sum_amount - avg_amount) / stddev_amount AS std_amount
FROM 
  sales_amount
CROSS JOIN 
  stats_amount
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	sum_amount	std_amount
CS001311000059	2302	-0.09032402671702291270
CS004614000122	248	-0.84528400025253654164
CS003512000043	298	-0.82690620148098070841
CS011615000061	246	-0.84601911220339877497
CS029212000033	3604	0.38823385329429098451
CS007515000119	7157	1.6941602340010485
CS034515000123	3699	0.42315167096024706764
CS004315000058	490	-0.75633545419820630882
CS026414000014	6671	1.5155280299415258
CS001615000099	768	-0.65415489302835587608

Commentary :

This code is very similar to the previous one, but there is one important difference in the stats_amount CTE.

Instead of using the STDDEV_POP function to calculate the standard deviation, it is using the STDDEV_SAMP function.

STDDEV_SAMP calculates the sample standard deviation, which is an estimate of the population standard deviation based on a sample of data.

In contrast, STDDEV_POP calculates the exact population standard deviation, assuming that the entire population is available for analysis.

So, the output of this code will be similar to the previous one, but the standard deviation calculation will be based on a sample of the data rather than the entire population.

In general, if the dataset is very large, the difference between the sample and population standard deviation will be small. However, if the dataset is small, the difference can be significant.

S-060: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), normalise the total sales amount to a minimum value of 0 and a maximum value of 1 and display 10 items with the customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.

In [ ]:

%%sql

WITH sales_amount AS(
    SELECT
      customer_id,
      SUM(amount) AS sum_amount
    FROM
      receipt
    WHERE
      customer_id NOT LIKE 'Z%'
    GROUP BY
      customer_id
),
stats_amount AS (
    SELECT
      MAX(sum_amount) AS max_amount,
      MIN(sum_amount) AS min_amount
    FROM
      sales_amount
)
SELECT
  customer_id,
  sum_amount,
  1.0 * (sum_amount - min_amount) / (max_amount -  min_amount) AS scale_amount
FROM 
  sales_amount
CROSS JOIN 
  stats_amount
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	sum_amount	scale_amount
CS001311000059	2302	0.09697601668404588113
CS004614000122	248	0.00773375043448036149
CS003512000043	298	0.00990615224191866528
CS011615000061	246	0.00764685436218282934
CS029212000033	3604	0.15354535974973931178
CS007515000119	7157	0.30791623218630517901
CS034515000123	3699	0.15767292318387208898
CS004315000058	490	0.01824817518248175182
CS026414000014	6671	0.28680048661800486618
CS001615000099	768	0.03032672923183872089

Commentary :

This code is also analyzing the receipt table, but instead of calculating the standardized amount, it is calculating the scaled amount spent by each customer.

The code is using two CTEs, sales_amount and stats_amount, which are similar to the previous code.

The first CTE calculates the total amount spent by each customer who has a customer_id that does not start with the letter 'Z', and the second CTE calculates the maximum and minimum total amount spent by any customer.

The final SELECT statement selects columns from the sales_amount table and joins it with the stats_amount table using a cross join.

It also calculates the scaled amount spent by each customer, which is calculated by subtracting the minimum amount spent from the total amount spent and then dividing it by the range of total amount spent, which is the difference between the maximum and minimum amount spent.

The 1.0 * before the expression is used to convert the result to a floating-point number, which is necessary if any of the values used in the expression are floating-point numbers.

The LIMIT 10 at the end limits the output to the first 10 rows.

Overall, this code is performing a data analysis task to find the scaled amount spent by each customer who doesn't have a customer_id starting with the letter 'Z'. The scaled amount is a value between 0 and 1 that represents the proportion of the total amount spent by each customer relative to the range of total amount spent by all customers.

Data Science 100 Knocks (Structured Data Processing) - SQL

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

Data Science 100 Knocks (Structured Data Processing) - SQL part1 (Q1 to Q20)

Data Science 100 Knocks (Structured Data Processing) - SQL part2 (Q21 to Q40)

Data Science 100 Knocks (Structured Data Processing) - SQL part3 (Q41 to Q60)

Data Science 100 Knocks (Structured Data Processing) - SQL part4 (Q61 to Q80)

Data Science 100 Knocks (Structured Data Processing) - SQL part5 (Q81 to Q100)

Data Science 100 Knocks (Structured Data Processing)

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, SQL, R, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.