参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック(構造化データ加工編)」
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
-
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers.
-
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
You can get preprocess_knock files to practice python, SQL, R from my github account
Introduction
- The database is PostgreSQL13
- First, execute the following cell
- You can issue SQL by writing %%sql in the cell
- You cannot check the table structure with the describe command from jupyter, so if you want to check the table structure, use a SELECT with a limit, for example.
- You can use any familiar SQL client (connection information is as follows)
- IP address: localhost for Docker Desktop, 192.168.99.100 for Docker toolbox
- Port:5432
- Database name: dsdojo_db
- User name: padawan
- Password: padawan12345
- It is recommended to limit the number of outputs, as Jupyter may freeze if a large amount of outputs are performed (the number of outputs is also mentioned in the question).
- Data processing also requires the ability to control the amount of results to be displayed for confirmation and to make light work of them.
- If a large number of results are output, the file may become so heavy that it cannot be opened afterwards.
- In this case, the results of your work will be lost, but you can retrieve the file from GitHub.
- You can also delete large output ranges using a vim editor, for example.
- Names, addresses etc. are dummy data and do not exist.
%load_ext sql
import os
pgconfig = {
'host': 'db',
'port': os.environ['PG_PORT'],
'database': os.environ['PG_DATABASE'],
'user': os.environ['PG_USER'],
'password': os.environ['PG_PASSWORD'],
}
dsl = 'postgresql://{user}:{password}@{host}:{port}/{database}'.format(**pgconfig)
# Set up Magic commands to write SQL
%sql $dsl
'Connected: padawan@dsdojo_db'
How to use
- You can execute SQL from Jupyter against PostgreSQL by writing %%sql at the beginning of the cell and SQL in the second and subsequent lines.
%%sql
SELECT 'You can run it like this' AS sample;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
sample |
---|
You can run it like this |
100 data processing knocks
S-041: Sum up the sales amount (amount) of the receipt details data (df_receipt) by date (sales_ymd) and calculate the increase or decrease in sales amount from the date of the last sale. And display 10 results.
%%sql
WITH sales_amount_by_date AS (
SELECT
sales_ymd,
SUM(amount) AS amount
FROM
receipt
GROUP BY
sales_ymd
),
sales_amount_by_date_with_lag as (
SELECT
sales_ymd,
LAG(sales_ymd, 1) OVER(ORDER BY sales_ymd) lag_ymd,
amount,
LAG(amount, 1) OVER(ORDER BY sales_ymd) AS lag_amount
FROM
sales_amount_by_date
)
SELECT
sales_ymd,
amount,
lag_ymd,
lag_amount,
amount - lag_amount AS diff_amount
FROM
sales_amount_by_date_with_lag
ORDER BY
sales_ymd
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
sales_ymd | amount | lag_ymd | lag_amount | diff_amount |
---|---|---|---|---|
20170101 | 33723 | None | None | None |
20170102 | 24165 | 20170101 | 33723 | -9558 |
20170103 | 27503 | 20170102 | 24165 | 3338 |
20170104 | 36165 | 20170103 | 27503 | 8662 |
20170105 | 37830 | 20170104 | 36165 | 1665 |
20170106 | 32387 | 20170105 | 37830 | -5443 |
20170107 | 23415 | 20170106 | 32387 | -8972 |
20170108 | 24737 | 20170107 | 23415 | 1322 |
20170109 | 26718 | 20170108 | 24737 | 1981 |
20170110 | 20143 | 20170109 | 26718 | -6575 |
Commentary :
This code is written in SQL language and is using the "WITH" clause to create two temporary tables named "sales_amount_by_date" and "sales_amount_by_date_with_lag". The code is then selecting specific columns from the "sales_amount_by_date_with_lag" table and calculating the difference in sales amount between consecutive dates using the "LAG" function.
More specifically, the code is performing the following operations:
The first "WITH" clause creates a temporary table called "sales_amount_by_date" that calculates the total sales amount for each date in the "receipt" table.
The second "WITH" clause creates another temporary table called "sales_amount_by_date_with_lag" that uses the "LAG" function to calculate the previous date's sales amount and the difference in sales amount between consecutive dates.
The final "SELECT" statement selects columns from the "sales_amount_by_date_with_lag" table, including the sales date, sales amount, previous date, previous sales amount, and the difference in sales amount between consecutive dates.
The results are then sorted in ascending order by sales date and limited to the first 10 rows.
In summary, the code calculates the difference in sales amount between consecutive dates in the "receipt" table using the "LAG" function and displays the results for the first 10 rows.
S-042: Tabulate the sales amount (amount) of the receipt details data (df_receipt) by date (sales_ymd), and for each date, combine the data of the last, previous and three previous sales dates. Then display 10 results.
%%sql
-- Code example 1: Vertical holding case
WITH sales_amount_by_date AS (
SELECT
sales_ymd,
SUM(amount) AS amount
FROM
receipt
GROUP BY
sales_ymd
),
sales_amount_lag_date AS (
SELECT
sales_ymd,
LAG(sales_ymd, 3) OVER (ORDER BY sales_ymd) AS lag_date_3,
amount
FROM
sales_amount_by_date
)
SELECT
a.sales_ymd,
a.amount,
b.sales_ymd AS lag_ymd,
b.amount AS lag_amount
FROM
sales_amount_lag_date a
JOIN
sales_amount_lag_date b
ON
(
a.lag_date_3 IS NULL
OR a.lag_date_3 <= b.sales_ymd
)
AND b.sales_ymd < a.sales_ymd
ORDER BY
sales_ymd,
lag_ymd
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
sales_ymd | amount | lag_ymd | lag_amount |
---|---|---|---|
20170102 | 24165 | 20170101 | 33723 |
20170103 | 27503 | 20170101 | 33723 |
20170103 | 27503 | 20170102 | 24165 |
20170104 | 36165 | 20170101 | 33723 |
20170104 | 36165 | 20170102 | 24165 |
20170104 | 36165 | 20170103 | 27503 |
20170105 | 37830 | 20170102 | 24165 |
20170105 | 37830 | 20170103 | 27503 |
20170105 | 37830 | 20170104 | 36165 |
20170106 | 32387 | 20170103 | 27503 |
Commentary :
This code is also written in SQL language and is using the "WITH" clause to create two temporary tables named "sales_amount_by_date" and "sales_amount_lag_date". The code is then performing a self-join on the "sales_amount_lag_date" table and filtering the results based on certain conditions.
More specifically, the code is performing the following operations:
The first "WITH" clause creates a temporary table called "sales_amount_by_date" that calculates the total sales amount for each date in the "receipt" table.
The second "WITH" clause creates another temporary table called "sales_amount_lag_date" that uses the "LAG" function to calculate the sales date three days prior and the sales amount for each date in the "sales_amount_by_date" table.
The final "SELECT" statement selects specific columns from the "sales_amount_lag_date" table and performs a self-join with itself. The join condition filters the results to include only rows where the sales date of one row is less than the sales date of another row and where the sales date of the second row is less than or equal to the sales date of the first row three days prior.
The results are then sorted in ascending order by sales date and previous sales date and limited to the first 10 rows.
In summary, the code is performing a self-join on a table that includes the sales date and sales amount for each date in the "receipt" table. The join condition filters the results to include only rows where the sales date of one row is less than the sales date of another row and where the sales date of the second row is less than or equal to the sales date of the first row three days prior. The results display the sales date, sales amount, previous sales date, and previous sales amount for the first 10 rows.
%%sql
-- Code example 2: horizontal holding case
WITH sales_amount_by_date AS (
SELECT
sales_ymd,
SUM(amount) AS amount
FROM
receipt
GROUP BY
sales_ymd
),
sales_amount_with_lag AS (
SELECT
sales_ymd,
amount,
LAG(sales_ymd, 1) OVER (ORDER BY sales_ymd) AS lag_ymd_1,
LAG(amount, 1) OVER (ORDER BY sales_ymd) AS lag_amount_1,
LAG(sales_ymd, 2) OVER (ORDER BY sales_ymd) AS lag_ymd_2,
LAG(amount, 2) OVER (ORDER BY sales_ymd) AS lag_amount_2,
LAG(sales_ymd, 3) OVER (ORDER BY sales_ymd) AS lag_ymd_3,
LAG(amount, 3) OVER (ORDER BY sales_ymd) AS lag_amount_3
FROM
sales_amount_by_date
)
SELECT
sales_ymd,
amount,
lag_ymd_1,
lag_amount_1,
lag_ymd_2,
lag_amount_2,
lag_ymd_3,
lag_amount_3
FROM
sales_amount_with_lag
WHERE
lag_ymd_3 IS NOT NULL
ORDER BY
sales_ymd
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
sales_ymd | amount | lag_ymd_1 | lag_amount_1 | lag_ymd_2 | lag_amount_2 | lag_ymd_3 | lag_amount_3 |
---|---|---|---|---|---|---|---|
20170104 | 36165 | 20170103 | 27503 | 20170102 | 24165 | 20170101 | 33723 |
20170105 | 37830 | 20170104 | 36165 | 20170103 | 27503 | 20170102 | 24165 |
20170106 | 32387 | 20170105 | 37830 | 20170104 | 36165 | 20170103 | 27503 |
20170107 | 23415 | 20170106 | 32387 | 20170105 | 37830 | 20170104 | 36165 |
20170108 | 24737 | 20170107 | 23415 | 20170106 | 32387 | 20170105 | 37830 |
20170109 | 26718 | 20170108 | 24737 | 20170107 | 23415 | 20170106 | 32387 |
20170110 | 20143 | 20170109 | 26718 | 20170108 | 24737 | 20170107 | 23415 |
20170111 | 24287 | 20170110 | 20143 | 20170109 | 26718 | 20170108 | 24737 |
20170112 | 23526 | 20170111 | 24287 | 20170110 | 20143 | 20170109 | 26718 |
20170113 | 28004 | 20170112 | 23526 | 20170111 | 24287 | 20170110 | 20143 |
Commentary :
This code is also written in SQL language and is using the "WITH" clause to create a temporary table named "sales_amount_by_date". The code is then using the "LAG" function to create additional columns in another temporary table named "sales_amount_with_lag". Finally, the code is selecting specific columns from the "sales_amount_with_lag" table and filtering the results based on a certain condition.
More specifically, the code is performing the following operations:
The first "WITH" clause creates a temporary table called "sales_amount_by_date" that calculates the total sales amount for each date in the "receipt" table.
The second "WITH" clause creates another temporary table called "sales_amount_with_lag" that uses the "LAG" function to calculate the sales date and sales amount for the previous three days for each date in the "sales_amount_by_date" table.
The final "SELECT" statement selects specific columns from the "sales_amount_with_lag" table and filters the results to include only rows where the third previous date is not null. This means that the first two rows of the table will be excluded since they do not have a third previous date.
The results are then sorted in ascending order by sales date and limited to the first 10 rows.
In summary, the code is calculating the total sales amount for each date in the "receipt" table and then using the "LAG" function to calculate the sales date and sales amount for the previous three days for each date. The results display the sales date and sales amount for each date and the sales date and sales amount for the previous three days. The results are filtered to exclude the first two rows since they do not have a third previous date. The first 10 rows of the results are displayed in ascending order by sales date.
S-043: Combine receipt details data (df_receipt) and customer data (df_customer) to create sales summary data, summing the sales amount (amount) by gender code (gender_cd) and age (calculated from age). Gender_cd should be 0 for male, 1 for female and 9 for unknown.
However, the item structure should consist of four items: age, female sales amount, male sales amount and sales amount of unknown gender (vertical cross tabulation of age and horizontal cross tabulation of gender). In addition, the age shall be in deciles of 10 years.
%%sql
DROP TABLE IF EXISTS sales_summary;
CREATE TABLE sales_summary AS
WITH gender_era_amount AS (
SELECT
TRUNC(age / 10) * 10 AS era,
c.gender_cd,
SUM(r.amount) AS amount
FROM
customer c
JOIN
receipt r
ON
c.customer_id = r.customer_id
GROUP BY
era,
c.gender_cd
)
SELECT
era,
SUM(CASE WHEN gender_cd = '0' THEN amount END) AS male,
SUM(CASE WHEN gender_cd = '1' THEN amount END) AS female,
SUM(CASE WHEN gender_cd = '9' THEN amount END) AS unknown
FROM
gender_era_amount
GROUP BY
era
ORDER BY
era
;
SELECT
*
FROM
sales_summary
;
* postgresql://padawan:***@db:5432/dsdojo_db Done. 9 rows affected. 9 rows affected.
era | male | female | unknown |
---|---|---|---|
10.0 | 1591 | 149836 | 4317 |
20.0 | 72940 | 1363724 | 44328 |
30.0 | 177322 | 693047 | 50441 |
40.0 | 19355 | 9320791 | 483512 |
50.0 | 54320 | 6685192 | 342923 |
60.0 | 272469 | 987741 | 71418 |
70.0 | 13435 | 29764 | 2427 |
80.0 | 46360 | 262923 | 5111 |
90.0 | None | 6260 | None |
Commentary :
This SQL code creates a table namedsales_summary
that summarizes the total sales amount by age group (era) and gender.
First, a Common Table Expression (CTE) namedgender_era_amount
is created using a join between thecustomer
andreceipt
tables. The CTE calculates the total sales amount by grouping the records based on age group (era
) and gender.
Next, thesales_summary
table is created by aggregating the data from thegender_era_amount
CTE. The table has four columns:era
,male
,female
, andunknown
. TheSUM(CASE...)
statements within the SELECT clause are used to pivot the gender-specific sales amounts into separate columns.
Finally, the SELECT statement is used to retrieve all the rows from the newly createdsales_summary
table.
S-044: The sales summary data (df_sales_summary) created in 043 had sales by gender carried horizontally. From this data, convert the gender into three items: age, gender code and sales amount, by having the gender carried vertically. The gender code should be “00” for male, “01” for female and “99” for unknown.
%%sql
-- somewhat aggressive description as it is not SQL-oriented (note that it can be very long SQL when the number of categories is large).
SELECT
era,
'00' AS gender_cd,
male AS amount
FROM
sales_summary
UNION ALL
SELECT
era,
'01' AS gender_cd,
female AS amount
FROM
sales_summary
UNION ALL
SELECT
era,
'99' AS gender_cd,
unknown AS amount
FROM
sales_summary
;
* postgresql://padawan:***@db:5432/dsdojo_db 27 rows affected.
era | gender_cd | amount |
---|---|---|
10.0 | 00 | 1591 |
20.0 | 00 | 72940 |
30.0 | 00 | 177322 |
40.0 | 00 | 19355 |
50.0 | 00 | 54320 |
60.0 | 00 | 272469 |
70.0 | 00 | 13435 |
80.0 | 00 | 46360 |
90.0 | 00 | None |
10.0 | 01 | 149836 |
20.0 | 01 | 1363724 |
30.0 | 01 | 693047 |
40.0 | 01 | 9320791 |
50.0 | 01 | 6685192 |
60.0 | 01 | 987741 |
70.0 | 01 | 29764 |
80.0 | 01 | 262923 |
90.0 | 01 | 6260 |
10.0 | 99 | 4317 |
20.0 | 99 | 44328 |
30.0 | 99 | 50441 |
40.0 | 99 | 483512 |
50.0 | 99 | 342923 |
60.0 | 99 | 71418 |
70.0 | 99 | 2427 |
80.0 | 99 | 5111 |
90.0 | 99 | None |
Commentary :
This code is written in SQL and retrieves data from a table calledsales_summary
. The purpose of this code is to transform the data in thesales_summary
table to a format that can be more easily analyzed.
TheSELECT
statement is used to retrieve data from thesales_summary
table, and theUNION ALL
operator is used to combine the results of multipleSELECT
statements into a single result set.
The firstSELECT
statement retrieves data from thesales_summary
table and selects theera
andmale
columns. It also includes a new column calledgender_cd
, which has a value of00
. This column is added using the string literal'00' AS gender_cd
. TheAS
keyword is used to rename the column togender_cd
. The result set of thisSELECT
statement contains three columns:era
,gender_cd
, andamount
.
The secondSELECT
statement is similar to the first one, but it selects theera
andfemale
columns and sets thegender_cd
column to01
. The result set of thisSELECT
statement also contains three columns:era
,gender_cd
, andamount
.
The thirdSELECT
statement is also similar to the first two, but it selects theera
andunknown
columns and sets thegender_cd
column to99
. The result set of thisSELECT
statement also contains three columns:era
,gender_cd
, andamount
.
Finally, theUNION ALL
operator is used to combine the result sets of the threeSELECT
statements into a single result set. The resulting table will have one row for each combination ofera
andgender_cd
, and theamount
column will contain the value of eithermale
,female
, orunknown
, depending on the value ofgender_cd
.
S-045: The date of birth (birth_day) in the customer data (df_customer) is held in date type. Convert this to a string in YYYYMMDD format and display 10 cases together with the customer ID (customer_id).
%%sql
SELECT
customer_id,
TO_CHAR(birth_day, 'YYYYMMDD') AS birth_day
FROM
customer
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | birth_day |
---|---|
CS021313000114 | 19810429 |
CS037613000071 | 19520401 |
CS031415000172 | 19761004 |
CS028811000001 | 19330327 |
CS001215000145 | 19950329 |
CS020401000016 | 19740915 |
CS015414000103 | 19770809 |
CS029403000008 | 19730817 |
CS015804000004 | 19310502 |
CS033513000180 | 19620711 |
Commentary :
This code is written in SQL and retrieves data from a table calledcustomer
. The purpose of this code is to retrieve thecustomer_id
andbirth_day
columns from thecustomer
table and format thebirth_day
column in a specific way.
TheSELECT
statement is used to retrieve data from thecustomer
table, and theLIMIT
clause is used to limit the number of rows returned to 10.
TheTO_CHAR
function is used to convert thebirth_day
column to a string with the specified format. In this case, the format is'YYYYMMDD'
, which represents the year, month, and day of the birth date. The resulting string will have a length of 8 characters, with the year represented by 4 characters, the month by 2 characters, and the day by 2 characters.
TheAS
keyword is used to rename the resulting column tobirth_day
. The result set of thisSELECT
statement contains two columns:customer_id
andbirth_day
. Thebirth_day
column contains the birth date of each customer in the specified format.
Overall, this code is useful when you want to retrieve thecustomer_id
andbirth_day
columns from thecustomer
table and format thebirth_day
column in a specific way. This formatted data can be useful for analysis or reporting purposes.
S-046: The application date (application_date) of the customer data (df_customer) holds data in a string type in YYYYMMDD format. Convert this to a date type and display 10 cases together with the customer ID (customer_id).
%%sql
SELECT
customer_id,
TO_DATE(application_date, 'YYYYMMDD') AS application_date
FROM
customer
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | application_date |
---|---|
CS021313000114 | 2015-09-05 |
CS037613000071 | 2015-04-14 |
CS031415000172 | 2015-05-29 |
CS028811000001 | 2016-01-15 |
CS001215000145 | 2017-06-05 |
CS020401000016 | 2015-02-25 |
CS015414000103 | 2015-07-22 |
CS029403000008 | 2015-05-15 |
CS015804000004 | 2015-06-07 |
CS033513000180 | 2015-07-28 |
Commentary :
This code is written in SQL and retrieves data from a table calledcustomer
. The purpose of this code is to retrieve thecustomer_id
andapplication_date
columns from thecustomer
table and convert theapplication_date
column to a date format.
TheSELECT
statement is used to retrieve data from thecustomer
table, and theLIMIT
clause is used to limit the number of rows returned to 10.
TheTO_DATE
function is used to convert theapplication_date
column from a string to a date format. In this case, the format is'YYYYMMDD'
, which represents the year, month, and day of the application date. The resulting date value will be in the formatyyyy-mm-dd
.
TheAS
keyword is used to rename the resulting column toapplication_date
. The result set of thisSELECT
statement contains two columns:customer_id
andapplication_date
. Theapplication_date
column contains the application date of each customer in the date format.
Overall, this code is useful when you want to retrieve thecustomer_id
andapplication_date
columns from thecustomer
table and convert theapplication_date
column to a date format. This formatted data can be useful for analysis or reporting purposes, especially when date calculations or comparisons are involved.
S-047: The sales date (sales_ymd) of the receipt details data (df_receipt) holds data in numeric type in YYYYMMDD format. Convert this to a date type and display 10 items together with the receipt number (receipt_no) and receipt sub-number (receipt_sub_no).
%%sql
SELECT
receipt_no,
receipt_sub_no,
TO_DATE(CAST(sales_ymd AS VARCHAR), 'YYYYMMDD') AS sales_ymd
FROM
receipt
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
receipt_no | receipt_sub_no | sales_ymd |
---|---|---|
112 | 1 | 2018-11-03 |
1132 | 2 | 2018-11-18 |
1102 | 1 | 2017-07-12 |
1132 | 1 | 2019-02-05 |
1102 | 2 | 2018-08-21 |
1112 | 1 | 2019-06-05 |
1102 | 2 | 2018-12-05 |
1102 | 1 | 2019-09-22 |
1112 | 2 | 2017-05-04 |
1102 | 1 | 2019-10-10 |
Commentary :
This code is written in SQL and retrieves data from a table calledreceipt
. The purpose of this code is to retrieve thereceipt_no
,receipt_sub_no
, andsales_ymd
columns from thereceipt
table and convert thesales_ymd
column to a date format.
TheSELECT
statement is used to retrieve data from thereceipt
table, and theLIMIT
clause is used to limit the number of rows returned to 10.
TheCAST
function is used to convert thesales_ymd
column from an integer to a string. The resulting string will contain the year, month, and day of the sales date in the format'YYYYMMDD'
.
TheTO_DATE
function is used to convert thesales_ymd
string to a date format. In this case, the format is'YYYYMMDD'
, which represents the year, month, and day of the sales date. The resulting date value will be in the formatyyyy-mm-dd
.
TheAS
keyword is used to rename the resulting column tosales_ymd
. The result set of thisSELECT
statement contains three columns:receipt_no
,receipt_sub_no
, andsales_ymd
. Thesales_ymd
column contains the sales date of each receipt in the date format.
Overall, this code is useful when you want to retrieve thereceipt_no
,receipt_sub_no
, andsales_ymd
columns from thereceipt
table and convert thesales_ymd
column to a date format. This formatted data can be useful for analysis or reporting purposes, especially when date calculations or comparisons are involved.
S-048: The sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) holds data in UNIX seconds of numeric type. Convert it to date type and display 10 cases together with receipt number (receipt_no) and receipt sub-number (receipt_sub_no).
%%sql
SELECT
receipt_no,
receipt_sub_no,
CAST(TO_TIMESTAMP(sales_epoch) AS DATE) AS sales_ymd
FROM
receipt
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
receipt_no | receipt_sub_no | sales_ymd |
---|---|---|
112 | 1 | 2018-11-03 |
1132 | 2 | 2018-11-18 |
1102 | 1 | 2017-07-12 |
1132 | 1 | 2019-02-05 |
1102 | 2 | 2018-08-21 |
1112 | 1 | 2019-06-05 |
1102 | 2 | 2018-12-05 |
1102 | 1 | 2019-09-22 |
1112 | 2 | 2017-05-04 |
1102 | 1 | 2019-10-10 |
Commentary :
This code is written in SQL and retrieves data from a table calledreceipt
. The purpose of this code is to retrieve thereceipt_no
,receipt_sub_no
, andsales_epoch
columns from thereceipt
table and convert thesales_epoch
column to a date format.
TheSELECT
statement is used to retrieve data from thereceipt
table, and theLIMIT
clause is used to limit the number of rows returned to 10.
TheTO_TIMESTAMP
function is used to convert thesales_epoch
column from an integer to a timestamp format. The resulting timestamp will represent the number of seconds since January 1, 1970, 00:00:00 UTC.
TheCAST
function is used to convert the timestamp to a date format. In this case, the resulting date value will be in the formatyyyy-mm-dd
.
TheAS
keyword is used to rename the resulting column tosales_ymd
. The result set of thisSELECT
statement contains three columns:receipt_no
,receipt_sub_no
, andsales_ymd
. Thesales_ymd
column contains the sales date of each receipt in the date format.
Overall, this code is useful when you want to retrieve thereceipt_no
,receipt_sub_no
, andsales_epoch
columns from thereceipt
table and convert thesales_epoch
column to a date format. This formatted data can be useful for analysis or reporting purposes, especially when date calculations or comparisons are involved.
S-049: Convert the sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) to date type, extract only the “年(year)” and display 10 items with receipt number (receipt_no) and receipt sub-number (receipt_sub_no).
%%sql
SELECT
receipt_no,
receipt_sub_no,
EXTRACT(YEAR FROM TO_TIMESTAMP(sales_epoch)) AS sales_year
FROM
receipt
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
receipt_no | receipt_sub_no | sales_year |
---|---|---|
112 | 1 | 2018 |
1132 | 2 | 2018 |
1102 | 1 | 2017 |
1132 | 1 | 2019 |
1102 | 2 | 2018 |
1112 | 1 | 2019 |
1102 | 2 | 2018 |
1102 | 1 | 2019 |
1112 | 2 | 2017 |
1102 | 1 | 2019 |
Commentary :
This code is written in SQL and retrieves data from a table calledreceipt
. The purpose of this code is to retrieve thereceipt_no
,receipt_sub_no
, andsales_epoch
columns from thereceipt
table and extract the year from thesales_epoch
timestamp column.
TheSELECT
statement is used to retrieve data from thereceipt
table, and theLIMIT
clause is used to limit the number of rows returned to 10.
TheTO_TIMESTAMP
function is used to convert thesales_epoch
column from an integer to a timestamp format. The resulting timestamp will represent the number of seconds since January 1, 1970, 00:00:00 UTC.
TheEXTRACT
function is used to extract the year from thesales_epoch
timestamp. The resulting value will be an integer representing the year.
TheAS
keyword is used to rename the resulting column tosales_year
. The result set of thisSELECT
statement contains three columns:receipt_no
,receipt_sub_no
, andsales_year
. Thesales_year
column contains the year of the sales date of each receipt.
Overall, this code is useful when you want to retrieve thereceipt_no
,receipt_sub_no
, andsales_epoch
columns from thereceipt
table and extract the year from thesales_epoch
timestamp column. This extracted data can be useful for grouping or filtering data based on the year of the sales date.
S-050: Convert the sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) to date type, extract only the “月(month)” and display 10 items with receipt number (receipt_no) and receipt sub number (receipt_sub_no). Note that the “月(month)” should be retrieved with two digits filled with zero.
%%sql
SELECT
receipt_no,
receipt_sub_no,
TO_CHAR(
EXTRACT(MONTH FROM TO_TIMESTAMP(sales_epoch)),
'FM00'
) AS sales_month
FROM
receipt
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
receipt_no | receipt_sub_no | sales_month |
---|---|---|
112 | 1 | 11 |
1132 | 2 | 11 |
1102 | 1 | 07 |
1132 | 1 | 02 |
1102 | 2 | 08 |
1112 | 1 | 06 |
1102 | 2 | 12 |
1102 | 1 | 09 |
1112 | 2 | 05 |
1102 | 1 | 10 |
Commentary :
This code is written in SQL and retrieves data from a table calledreceipt
. The purpose of this code is to retrieve thereceipt_no
,receipt_sub_no
, andsales_epoch
columns from thereceipt
table and extract the month from thesales_epoch
timestamp column, then format the resulting month as a two-digit string with a leading zero.
TheSELECT
statement is used to retrieve data from thereceipt
table, and theLIMIT
clause is used to limit the number of rows returned to 10.
TheTO_TIMESTAMP
function is used to convert thesales_epoch
column from an integer to a timestamp format. The resulting timestamp will represent the number of seconds since January 1, 1970, 00:00:00 UTC.
TheEXTRACT
function is used to extract the month from thesales_epoch
timestamp. The resulting value will be an integer representing the month.
TheTO_CHAR
function is used to convert the extracted month to a string with a leading zero if necessary. The 'FM00' format string specifies that the resulting string should have a fixed width of two characters, with any leading zeros removed.
TheAS
keyword is used to rename the resulting column tosales_month
. The result set of thisSELECT
statement contains three columns:receipt_no
,receipt_sub_no
, andsales_month
. Thesales_month
column contains the month of the sales date of each receipt as a two-digit string with a leading zero.
Overall, this code is useful when you want to retrieve thereceipt_no
,receipt_sub_no
, andsales_epoch
columns from thereceipt
table and extract the month from thesales_epoch
timestamp column, then format the resulting month as a two-digit string with a leading zero. This formatted data can be useful for grouping or filtering data based on the month of the sales date.
S-051: Convert the sales epoch seconds of the receipt details data (df_receipt) to date type, extract only the “日(day)” and display 10 items with receipt number (receipt_no) and receipt sub-number (receipt_sub_no). Note that the “日(day)” should be extracted with two digits filled with zero.
%%sql
SELECT
receipt_no,
receipt_sub_no,
TO_CHAR(EXTRACT(DAY FROM TO_TIMESTAMP(sales_epoch)), 'FM00') AS sales_day
FROM
receipt
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
receipt_no | receipt_sub_no | sales_day |
---|---|---|
112 | 1 | 03 |
1132 | 2 | 18 |
1102 | 1 | 12 |
1132 | 1 | 05 |
1102 | 2 | 21 |
1112 | 1 | 05 |
1102 | 2 | 05 |
1102 | 1 | 22 |
1112 | 2 | 04 |
1102 | 1 | 10 |
Commentary :
This code is written in SQL and retrieves data from a table calledreceipt
. The purpose of this code is to retrieve thereceipt_no
,receipt_sub_no
, andsales_epoch
columns from thereceipt
table and extract the day from thesales_epoch
timestamp column, then format the resulting day as a two-digit string with a leading zero.
TheSELECT
statement is used to retrieve data from thereceipt
table, and theLIMIT
clause is used to limit the number of rows returned to 10.
TheTO_TIMESTAMP
function is used to convert thesales_epoch
column from an integer to a timestamp format. The resulting timestamp will represent the number of seconds since January 1, 1970, 00:00:00 UTC.
TheEXTRACT
function is used to extract the day from thesales_epoch
timestamp. The resulting value will be an integer representing the day.
TheTO_CHAR
function is used to convert the extracted day to a string with a leading zero if necessary. The 'FM00' format string specifies that the resulting string should have a fixed width of two characters, with any leading zeros removed.
TheAS
keyword is used to rename the resulting column tosales_day
. The result set of thisSELECT
statement contains three columns:receipt_no
,receipt_sub_no
, andsales_day
. Thesales_day
column contains the day of the sales date of each receipt as a two-digit string with a leading zero.
Overall, this code is useful when you want to retrieve thereceipt_no
,receipt_sub_no
, andsales_epoch
columns from thereceipt
table and extract the day from thesales_epoch
timestamp column, then format the resulting day as a two-digit string with a leading zero. This formatted data can be useful for grouping or filtering data based on the day of the sales date.
S-052: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), binaryise the total sales amount to 0 for amounts less than 2,000 yen and 1 for amounts greater than 2,000 yen, and display 10 items with the customer ID and total sales amount. Display 10 items together with the customer ID and the total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.
%%sql
SELECT
customer_id,
SUM(amount) AS sum_amount,
CASE WHEN SUM(amount) > 2000 THEN 1 ELSE 0 END AS sales_flg
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
GROUP BY
customer_id
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | sum_amount | sales_flg |
---|---|---|
CS001311000059 | 2302 | 1 |
CS004614000122 | 248 | 0 |
CS003512000043 | 298 | 0 |
CS011615000061 | 246 | 0 |
CS029212000033 | 3604 | 1 |
CS007515000119 | 7157 | 1 |
CS034515000123 | 3699 | 1 |
CS004315000058 | 490 | 0 |
CS026414000014 | 6671 | 1 |
CS001615000099 | 768 | 0 |
Commentary :
This SQL code is selecting data from a table calledreceipt
. The purpose of this code is to calculate the total sales amount for each customer in thereceipt
table, and also to create a flag indicating if the total sales amount for each customer is greater than 2000 or not.
TheSELECT
statement retrieves thecustomer_id
column, the total sales amount for each customer calculated using theSUM
function, and aCASE
expression that returns1
if the total sales amount is greater than 2000 and0
otherwise. The resulting column that contains the flag is namedsales_flg
.
TheWHERE
clause filters out any records where the customer ID starts with the letter 'Z'. This could be used to exclude certain customers from the calculation or flagging.
TheGROUP BY
clause groups the results bycustomer_id
so that the total sales amount and flag are calculated for each customer.
TheLIMIT
clause limits the output to the first 10 results.
Overall, this code is useful when you want to calculate the total sales amount for each customer in areceipt
table and create a flag indicating if the total sales amount is greater than a certain threshold (in this case, 2000) or not. This could be useful for identifying high-spending customers or for other types of analysis where customer sales amounts need to be aggregated and analyzed.
S-053: Binarise the postcode (postal_cd) of the customer data (df_customer) to 1 for Tokyo (the first three digits are 100 to 209) and 0 for all others. Further, combine it with the receipt details data (df_receipt) and count the number of customers with sales results for the entire period, for each binary value created.
%%sql
WITH cust AS (
SELECT
customer_id,
postal_cd,
CASE WHEN CAST(SUBSTR(postal_cd, 1, 3) AS INTEGER) BETWEEN 100 AND 209 THEN 1 ELSE 0 END AS postal_flg
FROM
customer
),
rect AS(
SELECT DISTINCT
customer_id
FROM
receipt
)
SELECT
c.postal_flg,
COUNT(DISTINCT c.customer_id) AS customer_cnt
FROM
cust c
JOIN
rect r
USING
(customer_id)
GROUP BY
c.postal_flg
;
* postgresql://padawan:***@db:5432/dsdojo_db 2 rows affected.
postal_flg | customer_cnt |
---|---|
0 | 3906 |
1 | 4400 |
Commentary :
This SQL code contains a Common Table Expression (CTE) namedcust
, which selects data from thecustomer
table and calculates apostal_flg
flag for each customer based on theirpostal_cd
.
Thepostal_flg
flag is calculated using aCASE
expression, which checks if the first three characters of thepostal_cd
are between 100 and 209 (inclusive), and assigns a value of 1 if true and 0 if false. This flag could be used to group customers by postal code ranges, for example.
The second CTE namedrect
selects distinctcustomer_id
s from thereceipt
table.
The finalSELECT
statement joins thecust
andrect
CTEs on thecustomer_id
column and groups the results by thepostal_flg
column. TheCOUNT(DISTINCT c.customer_id)
function counts the number of distinct customers for eachpostal_flg
value.
Overall, this code is useful when you want to group customers by their postal code ranges and count the number of distinct customers for each range, based on data stored in thecustomer
andreceipt
tables.
S-054: The address of the customer data (df_customer) is one of 埼玉県(Saitama-prefecture), 千葉県(Chiba-prefecture), 東京都(Tokyo-Metropolitan) or 神奈川県(Kanagawa-prefecture). Create a code value for each prefecture, and display 10 cases together with the customer ID and address. The values should be 11 for 埼玉県(Saitama-prefecture), 12 for 千葉県(Chiba-prefecture), 13 for 東京都(Tokyo-Metropolitan) and 14 for 神奈川県(Kanagawa-prefecture).
%%sql
-- somewhat aggressive description as it is not SQL oriented (note that it becomes very long SQL when the number of categories is large).
-- Code example 1 (fixed and cut out)
SELECT
customer_id,
address,
CASE SUBSTR(address,1, 3)
WHEN '埼玉県' THEN '11'
WHEN '千葉県' THEN '12'
WHEN '東京都' THEN '13'
WHEN '神奈川' THEN '14'
END AS prefecture_cd
FROM
customer
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | address | prefecture_cd |
---|---|---|
CS021313000114 | 神奈川県伊勢原市粟窪********** | 14 |
CS037613000071 | 東京都江東区南砂********** | 13 |
CS031415000172 | 東京都渋谷区代々木********** | 13 |
CS028811000001 | 神奈川県横浜市泉区和泉町********** | 14 |
CS001215000145 | 東京都大田区仲六郷********** | 13 |
CS020401000016 | 東京都板橋区若木********** | 13 |
CS015414000103 | 東京都江東区北砂********** | 13 |
CS029403000008 | 千葉県浦安市海楽********** | 12 |
CS015804000004 | 東京都江東区北砂********** | 13 |
CS033513000180 | 神奈川県横浜市旭区善部町********** | 14 |
Commentary :
This SQL code selects thecustomer_id
,address
, and aprefecture_cd
code for each customer from thecustomer
table.
TheCASE
expression in this code checks the first three characters of theaddress
column and assigns a correspondingprefecture_cd
value based on the following conditions:
If the first three characters of theaddress
column are '埼玉県', theprefecture_cd
value is set to '11'.
If the first three characters of theaddress
column are '千葉県', theprefecture_cd
value is set to '12'.
If the first three characters of theaddress
column are '東京都', theprefecture_cd
value is set to '13'.
If the first three characters of theaddress
column are '神奈川', theprefecture_cd
value is set to '14'.
This code is useful when you want to categorize customers by their prefecture (region) based on their address information stored in thecustomer
table. The resultingprefecture_cd
column can be used for further analysis or grouping.
%%sql
-- Code example 2 (using regular expressions)
SELECT
customer_id,
address,
CASE SUBSTRING(address, '^.*?[都道府県]')
WHEN '埼玉県' THEN '11'
WHEN '千葉県' THEN '12'
WHEN '東京都' THEN '13'
WHEN '神奈川県' THEN '14'
END AS prefecture_cd
FROM
customer
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | address | prefecture_cd |
---|---|---|
CS021313000114 | 神奈川県伊勢原市粟窪********** | 14 |
CS037613000071 | 東京都江東区南砂********** | 13 |
CS031415000172 | 東京都渋谷区代々木********** | 13 |
CS028811000001 | 神奈川県横浜市泉区和泉町********** | 14 |
CS001215000145 | 東京都大田区仲六郷********** | 13 |
CS020401000016 | 東京都板橋区若木********** | 13 |
CS015414000103 | 東京都江東区北砂********** | 13 |
CS029403000008 | 千葉県浦安市海楽********** | 12 |
CS015804000004 | 東京都江東区北砂********** | 13 |
CS033513000180 | 神奈川県横浜市旭区善部町********** | 14 |
Commentary :
This SQL code selects thecustomer_id
,address
, and aprefecture_cd
code for each customer from thecustomer
table.
TheCASE
expression in this code extracts the prefecture name from theaddress
column using a regular expression pattern'^.*?[都道府県]'
and assigns a correspondingprefecture_cd
value based on the following conditions:
If the extracted prefecture name is '埼玉県', theprefecture_cd
value is set to '11'.
If the extracted prefecture name is '千葉県', theprefecture_cd
value is set to '12'.
If the extracted prefecture name is '東京都', theprefecture_cd
value is set to '13'.
If the extracted prefecture name is '神奈川県', theprefecture_cd
value is set to '14'.
This code is similar to the previous example, but it uses theSUBSTRING
function to extract the prefecture name from theaddress
column instead of theSUBSTR
function. The regular expression pattern'^.*?[都道府県]'
matches any string that starts with any characters (^.*?
) and ends with one of the characters都
,道
,府
, or県
. This code is useful when the address information in thecustomer
table is not standardized and includes variations in the format of prefecture names. The resultingprefecture_cd
column can be used for further analysis or grouping.
%%sql
-- Code example 3 (using LIKE)
SELECT
customer_id,
address,
CASE
WHEN address LIKE '埼玉県%' THEN '11'
WHEN address LIKE '千葉県%' THEN '12'
WHEN address LIKE '東京都%' THEN '13'
WHEN address LIKE '神奈川県%' THEN '14'
END AS prefecture_cd
FROM
customer
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | address | prefecture_cd |
---|---|---|
CS021313000114 | 神奈川県伊勢原市粟窪********** | 14 |
CS037613000071 | 東京都江東区南砂********** | 13 |
CS031415000172 | 東京都渋谷区代々木********** | 13 |
CS028811000001 | 神奈川県横浜市泉区和泉町********** | 14 |
CS001215000145 | 東京都大田区仲六郷********** | 13 |
CS020401000016 | 東京都板橋区若木********** | 13 |
CS015414000103 | 東京都江東区北砂********** | 13 |
CS029403000008 | 千葉県浦安市海楽********** | 12 |
CS015804000004 | 東京都江東区北砂********** | 13 |
CS033513000180 | 神奈川県横浜市旭区善部町********** | 14 |
Commentary :
This SQL code is performing a query on a table namedcustomer
and selecting thecustomer_id
,address
, andprefecture_cd
columns for the first 10 rows. Theprefecture_cd
column is created using aCASE
statement to match theaddress
column using theLIKE
operator.
Specifically, theCASE
statement checks theaddress
column using theLIKE
operator with a string pattern that matches the prefectures of Japan (埼玉県%
,千葉県%
,東京都%
,神奈川県%
) followed by any string characters. If theaddress
column matches any of the patterns, the corresponding prefecture code (11
,12
,13
,14
) is assigned to theprefecture_cd
column.
This code is an alternative way of extracting the prefecture code from theaddress
column, compared to the other two examples shown previously.
S-055: Sum the sales amount (amount) of the receipt details (df_receipt) data for each customer ID (customer_id) and find the quartile point of the total amount. Then, create category values for the total sales amount for each customer based on the following criteria, and display the 10 results together with the customer ID and the total sales amount. The category values shall be 1-4 in order.
- Minimum value and above and below the first quartile – Assign 1.
- First quartile and above but below the second quartile – Assign 2.
- Second quartile and above but below the third quartile – Assign 3.
- Third quartile and above – Assign 4.
%%sql
WITH sales_amount AS(
SELECT
customer_id,
SUM(amount) AS sum_amount
FROM
receipt
GROUP BY
customer_id
),
sales_pct AS (
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP(ORDER BY sum_amount) AS pct25,
PERCENTILE_CONT(0.50) WITHIN GROUP(ORDER BY sum_amount) AS pct50,
PERCENTILE_CONT(0.75) WITHIN GROUP(ORDER BY sum_amount) AS pct75
FROM
sales_amount
)
SELECT
a.customer_id,
a.sum_amount,
CASE
WHEN a.sum_amount < pct25 THEN 1
WHEN pct25 <= a.sum_amount AND a.sum_amount < pct50 THEN 2
WHEN pct50 <= a.sum_amount AND a.sum_amount < pct75 THEN 3
WHEN pct75 <= a.sum_amount THEN 4
END AS pct_group
FROM
sales_amount a
CROSS JOIN
sales_pct p
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | sum_amount | pct_group |
---|---|---|
CS001311000059 | 2302 | 3 |
CS004614000122 | 248 | 1 |
CS003512000043 | 298 | 1 |
CS011615000061 | 246 | 1 |
CS029212000033 | 3604 | 3 |
CS007515000119 | 7157 | 4 |
CS034515000123 | 3699 | 4 |
CS004315000058 | 490 | 1 |
CS026414000014 | 6671 | 4 |
CS001615000099 | 768 | 2 |
Commentary :
This SQL code performs the following tasks:
It calculates the total sales amount for each customer by summing the amount column from the receipt table for each customer_id using theSUM
function and group by the customer_id. The resulting table is saved as a subquery namedsales_amount
.
It calculates the quartiles (25th, 50th, and 75th) of the sum_amount column from thesales_amount
subquery using thePERCENTILE_CONT
function, which is a window function.
It joins thesales_amount
subquery with thesales_pct
subquery using theCROSS JOIN
operation to obtain the quartile values for each row of thesales_amount
table.
Finally, it calculates a new columnpct_group
that assigns a percentile group to each customer based on their total sales amount, as follows:
1 if the sum_amount is less than the 25th percentile (pct25
)
2 if the sum_amount is greater than or equal to the 25th percentile and less than the 50th percentile (pct50
)
3 if the sum_amount is greater than or equal to the 50th percentile and less than the 75th percentile (pct75
)
4 if the sum_amount is greater than or equal to the 75th percentile.
The final result is limited to the first 10 rows.
S-056: Calculate the age in 10-year age increments based on the age (age) of the customer data (df_customer) and display 10 cases together with the customer ID (customer_id) and date of birth (birth_day). However, all age groups above 60 years should be assumed to be in the 60s. The name of the category representing the age is optional.
%%sql
SELECT
customer_id,
birth_day,
-- Items for confirmation
-- age,
LEAST(CAST(TRUNC(age / 10) * 10 AS INTEGER), 60) AS era
FROM
customer
GROUP BY customer_id,
birth_day
-- Terms and conditions for verification
-- HAVING LEAST(CAST(TRUNC(age / 10) * 10 AS INTEGER), 60) >= 60
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | birth_day | era |
---|---|---|
CS001105000001 | 2000-01-14 | 10 |
CS001112000009 | 2006-08-24 | 10 |
CS001112000019 | 2001-01-31 | 10 |
CS001112000021 | 2001-12-15 | 10 |
CS001112000023 | 2004-01-26 | 10 |
CS001112000024 | 2001-01-16 | 10 |
CS001112000029 | 2005-01-24 | 10 |
CS001112000030 | 2003-03-02 | 10 |
CS001113000004 | 2003-02-22 | 10 |
CS001113000010 | 2005-05-09 | 10 |
Commentary :
This code is selecting thecustomer_id
,birth_day
, andera
(a calculated field) from thecustomer
table. Theera
field is calculated by determining the customer's age (which is not explicitly selected, but is used in the calculation) and then taking the lower value between the age rounded down to the nearest decade and 60.
The results are grouped bycustomer_id
andbirth_day
. TheHAVING
clause is then used to filter out customers whose era is less than 60.
In essence, the code is selecting information about customers from thecustomer
table, calculating their age and era, grouping the results bycustomer_id
andbirth_day
, and filtering out customers whose era is less than 60.
S-057: Using the extraction results of 056 and the gender code (gender_cd), create new category data representing the combination of gender x age and display 10 cases. The value of the category representing the combination shall be arbitrary.
%%sql
-- Generate a sex and age code by concatenating a one-digit sex code and a two-digit age code.
SELECT
customer_id,
birth_day,
gender_cd || TO_CHAR(LEAST(CAST(TRUNC(age / 10) * 10 AS INTEGER), 60), 'FM00') AS gender_era
FROM
customer
GROUP BY
customer_id,
birth_day
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | birth_day | gender_era |
---|---|---|
CS001105000001 | 2000-01-14 | 010 |
CS001112000009 | 2006-08-24 | 110 |
CS001112000019 | 2001-01-31 | 110 |
CS001112000021 | 2001-12-15 | 110 |
CS001112000023 | 2004-01-26 | 110 |
CS001112000024 | 2001-01-16 | 110 |
CS001112000029 | 2005-01-24 | 110 |
CS001112000030 | 2003-03-02 | 110 |
CS001113000004 | 2003-02-22 | 110 |
CS001113000010 | 2005-05-09 | 110 |
Commentary :
This SQL code selects customer_id, birth_day, and a concatenated string composed of gender_cd and era. The era is calculated by taking the age of the customer and rounding it down to the nearest decade, with a maximum era of 60. The gender_era string is generated by concatenating the gender_cd and era.
The SQL code groups the result by customer_id and birth_day, and returns only the first 10 rows of the result.
Overall, this code creates a summary of customer data that includes age and gender, with an additional era dimension. The resulting table could potentially be used for further analysis and segmentation of customers based on age and gender.
S-058: Create a dummy variable for the gender code (gender_cd) of the customer data (df_customer) and display 10 cases together with the customer ID (customer_id).
%%sql
-- Note that if there are many categories, the SQL can be very long.
-- If you want to reduce one category, simply delete one of the CASE statements.
SELECT
customer_id,
CASE WHEN gender_cd = '0' THEN '1' ELSE '0' END AS gender_cd_0,
CASE WHEN gender_cd = '1' THEN '1' ELSE '0' END AS gender_cd_1,
CASE WHEN gender_cd = '9' THEN '1' ELSE '0' END AS gender_cd_9
FROM
customer
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | gender_cd_0 | gender_cd_1 | gender_cd_9 |
---|---|---|---|
CS021313000114 | 0 | 1 | 0 |
CS037613000071 | 0 | 0 | 1 |
CS031415000172 | 0 | 1 | 0 |
CS028811000001 | 0 | 1 | 0 |
CS001215000145 | 0 | 1 | 0 |
CS020401000016 | 1 | 0 | 0 |
CS015414000103 | 0 | 1 | 0 |
CS029403000008 | 1 | 0 | 0 |
CS015804000004 | 1 | 0 | 0 |
CS033513000180 | 0 | 1 | 0 |
Commentary :
This SQL code is selecting customer_id, and creating three new columns:gender_cd_0
,gender_cd_1
, andgender_cd_9
. These new columns are created using theCASE
statement.
TheCASE
statement is a conditional statement that evaluates a set of conditions and returns a corresponding result. In this case, for each customer, the code checks the value of thegender_cd
column. Ifgender_cd
is equal to '0', the code returns '1' in thegender_cd_0
column and '0' in the other columns. Ifgender_cd
is equal to '1', the code returns '1' in thegender_cd_1
column and '0' in the other columns. Ifgender_cd
is equal to '9', the code returns '1' in thegender_cd_9
column and '0' in the other columns.
The resulting output will have customer_id, gender_cd_0, gender_cd_1, and gender_cd_9 columns, where for each customer, only one of the new columns will have a value of '1', indicating the gender of the customer.
S-059: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), standardise the total sales amount to mean 0 and standard deviation 1 and display 10 cases with the customer ID and total sales amount. The standard deviation used for standardisation may be either the square root of the variance or the square root of the unbiased variance. However, exclude customer IDs starting with “Z”, as they represent non-members.
%%sql
-- Code example 1 (standardised with STDDEV_POP)
WITH sales_amount AS(
SELECT
customer_id,
SUM(amount) AS sum_amount
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
GROUP BY
customer_id
),
stats_amount AS (
SELECT
AVG(sum_amount) AS avg_amount,
STDDEV_POP(sum_amount) AS stddev_amount
FROM
sales_amount
)
SELECT
customer_id,
sum_amount,
(sum_amount - avg_amount) / stddev_amount AS std_amount
FROM
sales_amount
CROSS JOIN
stats_amount
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | sum_amount | std_amount |
---|---|---|
CS001311000059 | 2302 | -0.09032946448390523062 |
CS004614000122 | 248 | -0.84533488878695858131 |
CS003512000043 | 298 | -0.82695598361891930792 |
CS011615000061 | 246 | -0.84607004499368015224 |
CS029212000033 | 3604 | 0.38825722609183744835 |
CS007515000119 | 7157 | 1.6942622273327082 |
CS034515000123 | 3699 | 0.42317714591111206778 |
CS004315000058 | 490 | -0.75638098777364849812 |
CS026414000014 | 6671 | 1.5156192690993665 |
CS001615000099 | 768 | -0.65419427503935013810 |
Commentary :
This code is written in SQL language and is using a database management system that supports SQL.
The code is performing a data analysis task on a table calledreceipt
. The table contains transactional data with columns such ascustomer_id
andamount
.
The code is using a common table expression (CTE) to create two temporary tablessales_amount
andstats_amount
.
The first CTE namedsales_amount
is calculating the total amount spent by each customer who has acustomer_id
that does not start with the letter 'Z'. It is grouping the transaction data bycustomer_id
and calculating the sum ofamount
for eachcustomer_id
.
The second CTE namedstats_amount
is calculating the average and standard deviation of the total amount spent by each customer.
The final SELECT statement is selecting columns from thesales_amount
table, and joining it with thestats_amount
table using a cross join. It is also calculating the standardized amount spent by each customer by subtracting the average amount from the total amount and then dividing it by the standard deviation.
TheLIMIT 10
at the end limits the output to the first 10 rows.
Overall, this code is performing a statistical analysis on the transactional data to find the standardized amount spent by each customer who doesn't have acustomer_id
starting with the letter 'Z'.
%%sql
-- Code example 2 (standardised with STDDEV_SAMP, values slightly different from code example 2)
WITH sales_amount AS(
SELECT
customer_id,
SUM(amount) AS sum_amount
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
GROUP BY
customer_id
),
stats_amount AS (
SELECT
AVG(sum_amount) AS avg_amount,
STDDEV_SAMP(sum_amount) AS stddev_amount
FROM
sales_amount
)
SELECT
customer_id,
sum_amount,
(sum_amount - avg_amount) / stddev_amount AS std_amount
FROM
sales_amount
CROSS JOIN
stats_amount
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | sum_amount | std_amount |
---|---|---|
CS001311000059 | 2302 | -0.09032402671702291270 |
CS004614000122 | 248 | -0.84528400025253654164 |
CS003512000043 | 298 | -0.82690620148098070841 |
CS011615000061 | 246 | -0.84601911220339877497 |
CS029212000033 | 3604 | 0.38823385329429098451 |
CS007515000119 | 7157 | 1.6941602340010485 |
CS034515000123 | 3699 | 0.42315167096024706764 |
CS004315000058 | 490 | -0.75633545419820630882 |
CS026414000014 | 6671 | 1.5155280299415258 |
CS001615000099 | 768 | -0.65415489302835587608 |
Commentary :
This code is very similar to the previous one, but there is one important difference in thestats_amount
CTE.
Instead of using theSTDDEV_POP
function to calculate the standard deviation, it is using theSTDDEV_SAMP
function.STDDEV_SAMP
calculates the sample standard deviation, which is an estimate of the population standard deviation based on a sample of data.
In contrast,STDDEV_POP
calculates the exact population standard deviation, assuming that the entire population is available for analysis.
So, the output of this code will be similar to the previous one, but the standard deviation calculation will be based on a sample of the data rather than the entire population.
In general, if the dataset is very large, the difference between the sample and population standard deviation will be small. However, if the dataset is small, the difference can be significant.
S-060: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), normalise the total sales amount to a minimum value of 0 and a maximum value of 1 and display 10 items with the customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.
%%sql
WITH sales_amount AS(
SELECT
customer_id,
SUM(amount) AS sum_amount
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
GROUP BY
customer_id
),
stats_amount AS (
SELECT
MAX(sum_amount) AS max_amount,
MIN(sum_amount) AS min_amount
FROM
sales_amount
)
SELECT
customer_id,
sum_amount,
1.0 * (sum_amount - min_amount) / (max_amount - min_amount) AS scale_amount
FROM
sales_amount
CROSS JOIN
stats_amount
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | sum_amount | scale_amount |
---|---|---|
CS001311000059 | 2302 | 0.09697601668404588113 |
CS004614000122 | 248 | 0.00773375043448036149 |
CS003512000043 | 298 | 0.00990615224191866528 |
CS011615000061 | 246 | 0.00764685436218282934 |
CS029212000033 | 3604 | 0.15354535974973931178 |
CS007515000119 | 7157 | 0.30791623218630517901 |
CS034515000123 | 3699 | 0.15767292318387208898 |
CS004315000058 | 490 | 0.01824817518248175182 |
CS026414000014 | 6671 | 0.28680048661800486618 |
CS001615000099 | 768 | 0.03032672923183872089 |
Commentary :
This code is also analyzing thereceipt
table, but instead of calculating the standardized amount, it is calculating the scaled amount spent by each customer.
The code is using two CTEs,sales_amount
andstats_amount
, which are similar to the previous code.
The first CTE calculates the total amount spent by each customer who has acustomer_id
that does not start with the letter 'Z', and the second CTE calculates the maximum and minimum total amount spent by any customer.
The final SELECT statement selects columns from thesales_amount
table and joins it with thestats_amount
table using a cross join.
It also calculates the scaled amount spent by each customer, which is calculated by subtracting the minimum amount spent from the total amount spent and then dividing it by the range of total amount spent, which is the difference between the maximum and minimum amount spent.
The1.0 *
before the expression is used to convert the result to a floating-point number, which is necessary if any of the values used in the expression are floating-point numbers.
TheLIMIT 10
at the end limits the output to the first 10 rows.
Overall, this code is performing a data analysis task to find the scaled amount spent by each customer who doesn't have acustomer_id
starting with the letter 'Z'. The scaled amount is a value between 0 and 1 that represents the proportion of the total amount spent by each customer relative to the range of total amount spent by all customers.
Comment