参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック(構造化データ加工編)」
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
-
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers.
-
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
You can get preprocess_knock files to practice python, SQL, R from my github account
Introduction
- The database is PostgreSQL13
- First, execute the following cell
- You can issue SQL by writing %%sql in the cell
- You cannot check the table structure with the describe command from jupyter, so if you want to check the table structure, use a SELECT with a limit, for example.
- You can use any familiar SQL client (connection information is as follows)
- IP address: localhost for Docker Desktop, 192.168.99.100 for Docker toolbox
- Port:5432
- Database name: dsdojo_db
- User name: padawan
- Password: padawan12345
- It is recommended to limit the number of outputs, as Jupyter may freeze if a large amount of outputs are performed (the number of outputs is also mentioned in the question).
- Data processing also requires the ability to control the amount of results to be displayed for confirmation and to make light work of them.
- If a large number of results are output, the file may become so heavy that it cannot be opened afterwards.
- In this case, the results of your work will be lost, but you can retrieve the file from GitHub.
- You can also delete large output ranges using a vim editor, for example.
- Names, addresses etc. are dummy data and do not exist.
%load_ext sql
import os
pgconfig = {
'host': 'db',
'port': os.environ['PG_PORT'],
'database': os.environ['PG_DATABASE'],
'user': os.environ['PG_USER'],
'password': os.environ['PG_PASSWORD'],
}
dsl = 'postgresql://{user}:{password}@{host}:{port}/{database}'.format(**pgconfig)
# Set up Magic commands to write SQL
%sql $dsl
'Connected: padawan@dsdojo_db'
How to use
- You can execute SQL from Jupyter against PostgreSQL by writing %%sql at the beginning of the cell and SQL in the second and subsequent lines.
%%sql
SELECT 'You can run it like this' AS sample;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
| sample |
|---|
| You can run it like this |
100 data processing knocks
S-041: Sum up the sales amount (amount) of the receipt details data (df_receipt) by date (sales_ymd) and calculate the increase or decrease in sales amount from the date of the last sale. And display 10 results.
%%sql
WITH sales_amount_by_date AS (
SELECT
sales_ymd,
SUM(amount) AS amount
FROM
receipt
GROUP BY
sales_ymd
),
sales_amount_by_date_with_lag as (
SELECT
sales_ymd,
LAG(sales_ymd, 1) OVER(ORDER BY sales_ymd) lag_ymd,
amount,
LAG(amount, 1) OVER(ORDER BY sales_ymd) AS lag_amount
FROM
sales_amount_by_date
)
SELECT
sales_ymd,
amount,
lag_ymd,
lag_amount,
amount - lag_amount AS diff_amount
FROM
sales_amount_by_date_with_lag
ORDER BY
sales_ymd
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| sales_ymd | amount | lag_ymd | lag_amount | diff_amount |
|---|---|---|---|---|
| 20170101 | 33723 | None | None | None |
| 20170102 | 24165 | 20170101 | 33723 | -9558 |
| 20170103 | 27503 | 20170102 | 24165 | 3338 |
| 20170104 | 36165 | 20170103 | 27503 | 8662 |
| 20170105 | 37830 | 20170104 | 36165 | 1665 |
| 20170106 | 32387 | 20170105 | 37830 | -5443 |
| 20170107 | 23415 | 20170106 | 32387 | -8972 |
| 20170108 | 24737 | 20170107 | 23415 | 1322 |
| 20170109 | 26718 | 20170108 | 24737 | 1981 |
| 20170110 | 20143 | 20170109 | 26718 | -6575 |
Commentary :
This code is written in SQL language and is using the "WITH" clause to create two temporary tables named "sales_amount_by_date" and "sales_amount_by_date_with_lag". The code is then selecting specific columns from the "sales_amount_by_date_with_lag" table and calculating the difference in sales amount between consecutive dates using the "LAG" function.
More specifically, the code is performing the following operations:
The first "WITH" clause creates a temporary table called "sales_amount_by_date" that calculates the total sales amount for each date in the "receipt" table.
The second "WITH" clause creates another temporary table called "sales_amount_by_date_with_lag" that uses the "LAG" function to calculate the previous date's sales amount and the difference in sales amount between consecutive dates.
The final "SELECT" statement selects columns from the "sales_amount_by_date_with_lag" table, including the sales date, sales amount, previous date, previous sales amount, and the difference in sales amount between consecutive dates.
The results are then sorted in ascending order by sales date and limited to the first 10 rows.
In summary, the code calculates the difference in sales amount between consecutive dates in the "receipt" table using the "LAG" function and displays the results for the first 10 rows.
S-042: Tabulate the sales amount (amount) of the receipt details data (df_receipt) by date (sales_ymd), and for each date, combine the data of the last, previous and three previous sales dates. Then display 10 results.
%%sql
-- Code example 1: Vertical holding case
WITH sales_amount_by_date AS (
SELECT
sales_ymd,
SUM(amount) AS amount
FROM
receipt
GROUP BY
sales_ymd
),
sales_amount_lag_date AS (
SELECT
sales_ymd,
LAG(sales_ymd, 3) OVER (ORDER BY sales_ymd) AS lag_date_3,
amount
FROM
sales_amount_by_date
)
SELECT
a.sales_ymd,
a.amount,
b.sales_ymd AS lag_ymd,
b.amount AS lag_amount
FROM
sales_amount_lag_date a
JOIN
sales_amount_lag_date b
ON
(
a.lag_date_3 IS NULL
OR a.lag_date_3 <= b.sales_ymd
)
AND b.sales_ymd < a.sales_ymd
ORDER BY
sales_ymd,
lag_ymd
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| sales_ymd | amount | lag_ymd | lag_amount |
|---|---|---|---|
| 20170102 | 24165 | 20170101 | 33723 |
| 20170103 | 27503 | 20170101 | 33723 |
| 20170103 | 27503 | 20170102 | 24165 |
| 20170104 | 36165 | 20170101 | 33723 |
| 20170104 | 36165 | 20170102 | 24165 |
| 20170104 | 36165 | 20170103 | 27503 |
| 20170105 | 37830 | 20170102 | 24165 |
| 20170105 | 37830 | 20170103 | 27503 |
| 20170105 | 37830 | 20170104 | 36165 |
| 20170106 | 32387 | 20170103 | 27503 |
Commentary :
This code is also written in SQL language and is using the "WITH" clause to create two temporary tables named "sales_amount_by_date" and "sales_amount_lag_date". The code is then performing a self-join on the "sales_amount_lag_date" table and filtering the results based on certain conditions.
More specifically, the code is performing the following operations:
The first "WITH" clause creates a temporary table called "sales_amount_by_date" that calculates the total sales amount for each date in the "receipt" table.
The second "WITH" clause creates another temporary table called "sales_amount_lag_date" that uses the "LAG" function to calculate the sales date three days prior and the sales amount for each date in the "sales_amount_by_date" table.
The final "SELECT" statement selects specific columns from the "sales_amount_lag_date" table and performs a self-join with itself. The join condition filters the results to include only rows where the sales date of one row is less than the sales date of another row and where the sales date of the second row is less than or equal to the sales date of the first row three days prior.
The results are then sorted in ascending order by sales date and previous sales date and limited to the first 10 rows.
In summary, the code is performing a self-join on a table that includes the sales date and sales amount for each date in the "receipt" table. The join condition filters the results to include only rows where the sales date of one row is less than the sales date of another row and where the sales date of the second row is less than or equal to the sales date of the first row three days prior. The results display the sales date, sales amount, previous sales date, and previous sales amount for the first 10 rows.
%%sql
-- Code example 2: horizontal holding case
WITH sales_amount_by_date AS (
SELECT
sales_ymd,
SUM(amount) AS amount
FROM
receipt
GROUP BY
sales_ymd
),
sales_amount_with_lag AS (
SELECT
sales_ymd,
amount,
LAG(sales_ymd, 1) OVER (ORDER BY sales_ymd) AS lag_ymd_1,
LAG(amount, 1) OVER (ORDER BY sales_ymd) AS lag_amount_1,
LAG(sales_ymd, 2) OVER (ORDER BY sales_ymd) AS lag_ymd_2,
LAG(amount, 2) OVER (ORDER BY sales_ymd) AS lag_amount_2,
LAG(sales_ymd, 3) OVER (ORDER BY sales_ymd) AS lag_ymd_3,
LAG(amount, 3) OVER (ORDER BY sales_ymd) AS lag_amount_3
FROM
sales_amount_by_date
)
SELECT
sales_ymd,
amount,
lag_ymd_1,
lag_amount_1,
lag_ymd_2,
lag_amount_2,
lag_ymd_3,
lag_amount_3
FROM
sales_amount_with_lag
WHERE
lag_ymd_3 IS NOT NULL
ORDER BY
sales_ymd
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| sales_ymd | amount | lag_ymd_1 | lag_amount_1 | lag_ymd_2 | lag_amount_2 | lag_ymd_3 | lag_amount_3 |
|---|---|---|---|---|---|---|---|
| 20170104 | 36165 | 20170103 | 27503 | 20170102 | 24165 | 20170101 | 33723 |
| 20170105 | 37830 | 20170104 | 36165 | 20170103 | 27503 | 20170102 | 24165 |
| 20170106 | 32387 | 20170105 | 37830 | 20170104 | 36165 | 20170103 | 27503 |
| 20170107 | 23415 | 20170106 | 32387 | 20170105 | 37830 | 20170104 | 36165 |
| 20170108 | 24737 | 20170107 | 23415 | 20170106 | 32387 | 20170105 | 37830 |
| 20170109 | 26718 | 20170108 | 24737 | 20170107 | 23415 | 20170106 | 32387 |
| 20170110 | 20143 | 20170109 | 26718 | 20170108 | 24737 | 20170107 | 23415 |
| 20170111 | 24287 | 20170110 | 20143 | 20170109 | 26718 | 20170108 | 24737 |
| 20170112 | 23526 | 20170111 | 24287 | 20170110 | 20143 | 20170109 | 26718 |
| 20170113 | 28004 | 20170112 | 23526 | 20170111 | 24287 | 20170110 | 20143 |
Commentary :
This code is also written in SQL language and is using the "WITH" clause to create a temporary table named "sales_amount_by_date". The code is then using the "LAG" function to create additional columns in another temporary table named "sales_amount_with_lag". Finally, the code is selecting specific columns from the "sales_amount_with_lag" table and filtering the results based on a certain condition.
More specifically, the code is performing the following operations:
The first "WITH" clause creates a temporary table called "sales_amount_by_date" that calculates the total sales amount for each date in the "receipt" table.
The second "WITH" clause creates another temporary table called "sales_amount_with_lag" that uses the "LAG" function to calculate the sales date and sales amount for the previous three days for each date in the "sales_amount_by_date" table.
The final "SELECT" statement selects specific columns from the "sales_amount_with_lag" table and filters the results to include only rows where the third previous date is not null. This means that the first two rows of the table will be excluded since they do not have a third previous date.
The results are then sorted in ascending order by sales date and limited to the first 10 rows.
In summary, the code is calculating the total sales amount for each date in the "receipt" table and then using the "LAG" function to calculate the sales date and sales amount for the previous three days for each date. The results display the sales date and sales amount for each date and the sales date and sales amount for the previous three days. The results are filtered to exclude the first two rows since they do not have a third previous date. The first 10 rows of the results are displayed in ascending order by sales date.
S-043: Combine receipt details data (df_receipt) and customer data (df_customer) to create sales summary data, summing the sales amount (amount) by gender code (gender_cd) and age (calculated from age). Gender_cd should be 0 for male, 1 for female and 9 for unknown.
However, the item structure should consist of four items: age, female sales amount, male sales amount and sales amount of unknown gender (vertical cross tabulation of age and horizontal cross tabulation of gender). In addition, the age shall be in deciles of 10 years.
%%sql
DROP TABLE IF EXISTS sales_summary;
CREATE TABLE sales_summary AS
WITH gender_era_amount AS (
SELECT
TRUNC(age / 10) * 10 AS era,
c.gender_cd,
SUM(r.amount) AS amount
FROM
customer c
JOIN
receipt r
ON
c.customer_id = r.customer_id
GROUP BY
era,
c.gender_cd
)
SELECT
era,
SUM(CASE WHEN gender_cd = '0' THEN amount END) AS male,
SUM(CASE WHEN gender_cd = '1' THEN amount END) AS female,
SUM(CASE WHEN gender_cd = '9' THEN amount END) AS unknown
FROM
gender_era_amount
GROUP BY
era
ORDER BY
era
;
SELECT
*
FROM
sales_summary
;
* postgresql://padawan:***@db:5432/dsdojo_db Done. 9 rows affected. 9 rows affected.
| era | male | female | unknown |
|---|---|---|---|
| 10.0 | 1591 | 149836 | 4317 |
| 20.0 | 72940 | 1363724 | 44328 |
| 30.0 | 177322 | 693047 | 50441 |
| 40.0 | 19355 | 9320791 | 483512 |
| 50.0 | 54320 | 6685192 | 342923 |
| 60.0 | 272469 | 987741 | 71418 |
| 70.0 | 13435 | 29764 | 2427 |
| 80.0 | 46360 | 262923 | 5111 |
| 90.0 | None | 6260 | None |
Commentary :
This SQL code creates a table namedsales_summarythat summarizes the total sales amount by age group (era) and gender.
First, a Common Table Expression (CTE) namedgender_era_amountis created using a join between thecustomerandreceipttables. The CTE calculates the total sales amount by grouping the records based on age group (era) and gender.
Next, thesales_summarytable is created by aggregating the data from thegender_era_amountCTE. The table has four columns:era,male,female, andunknown. TheSUM(CASE...)statements within the SELECT clause are used to pivot the gender-specific sales amounts into separate columns.
Finally, the SELECT statement is used to retrieve all the rows from the newly createdsales_summarytable.
S-044: The sales summary data (df_sales_summary) created in 043 had sales by gender carried horizontally. From this data, convert the gender into three items: age, gender code and sales amount, by having the gender carried vertically. The gender code should be “00” for male, “01” for female and “99” for unknown.
%%sql
-- somewhat aggressive description as it is not SQL-oriented (note that it can be very long SQL when the number of categories is large).
SELECT
era,
'00' AS gender_cd,
male AS amount
FROM
sales_summary
UNION ALL
SELECT
era,
'01' AS gender_cd,
female AS amount
FROM
sales_summary
UNION ALL
SELECT
era,
'99' AS gender_cd,
unknown AS amount
FROM
sales_summary
;
* postgresql://padawan:***@db:5432/dsdojo_db 27 rows affected.
| era | gender_cd | amount |
|---|---|---|
| 10.0 | 00 | 1591 |
| 20.0 | 00 | 72940 |
| 30.0 | 00 | 177322 |
| 40.0 | 00 | 19355 |
| 50.0 | 00 | 54320 |
| 60.0 | 00 | 272469 |
| 70.0 | 00 | 13435 |
| 80.0 | 00 | 46360 |
| 90.0 | 00 | None |
| 10.0 | 01 | 149836 |
| 20.0 | 01 | 1363724 |
| 30.0 | 01 | 693047 |
| 40.0 | 01 | 9320791 |
| 50.0 | 01 | 6685192 |
| 60.0 | 01 | 987741 |
| 70.0 | 01 | 29764 |
| 80.0 | 01 | 262923 |
| 90.0 | 01 | 6260 |
| 10.0 | 99 | 4317 |
| 20.0 | 99 | 44328 |
| 30.0 | 99 | 50441 |
| 40.0 | 99 | 483512 |
| 50.0 | 99 | 342923 |
| 60.0 | 99 | 71418 |
| 70.0 | 99 | 2427 |
| 80.0 | 99 | 5111 |
| 90.0 | 99 | None |
Commentary :
This code is written in SQL and retrieves data from a table calledsales_summary. The purpose of this code is to transform the data in thesales_summarytable to a format that can be more easily analyzed.
TheSELECTstatement is used to retrieve data from thesales_summarytable, and theUNION ALLoperator is used to combine the results of multipleSELECTstatements into a single result set.
The firstSELECTstatement retrieves data from thesales_summarytable and selects theeraandmalecolumns. It also includes a new column calledgender_cd, which has a value of00. This column is added using the string literal'00' AS gender_cd. TheASkeyword is used to rename the column togender_cd. The result set of thisSELECTstatement contains three columns:era,gender_cd, andamount.
The secondSELECTstatement is similar to the first one, but it selects theeraandfemalecolumns and sets thegender_cdcolumn to01. The result set of thisSELECTstatement also contains three columns:era,gender_cd, andamount.
The thirdSELECTstatement is also similar to the first two, but it selects theeraandunknowncolumns and sets thegender_cdcolumn to99. The result set of thisSELECTstatement also contains three columns:era,gender_cd, andamount.
Finally, theUNION ALLoperator is used to combine the result sets of the threeSELECTstatements into a single result set. The resulting table will have one row for each combination oferaandgender_cd, and theamountcolumn will contain the value of eithermale,female, orunknown, depending on the value ofgender_cd.
S-045: The date of birth (birth_day) in the customer data (df_customer) is held in date type. Convert this to a string in YYYYMMDD format and display 10 cases together with the customer ID (customer_id).
%%sql
SELECT
customer_id,
TO_CHAR(birth_day, 'YYYYMMDD') AS birth_day
FROM
customer
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| customer_id | birth_day |
|---|---|
| CS021313000114 | 19810429 |
| CS037613000071 | 19520401 |
| CS031415000172 | 19761004 |
| CS028811000001 | 19330327 |
| CS001215000145 | 19950329 |
| CS020401000016 | 19740915 |
| CS015414000103 | 19770809 |
| CS029403000008 | 19730817 |
| CS015804000004 | 19310502 |
| CS033513000180 | 19620711 |
Commentary :
This code is written in SQL and retrieves data from a table calledcustomer. The purpose of this code is to retrieve thecustomer_idandbirth_daycolumns from thecustomertable and format thebirth_daycolumn in a specific way.
TheSELECTstatement is used to retrieve data from thecustomertable, and theLIMITclause is used to limit the number of rows returned to 10.
TheTO_CHARfunction is used to convert thebirth_daycolumn to a string with the specified format. In this case, the format is'YYYYMMDD', which represents the year, month, and day of the birth date. The resulting string will have a length of 8 characters, with the year represented by 4 characters, the month by 2 characters, and the day by 2 characters.
TheASkeyword is used to rename the resulting column tobirth_day. The result set of thisSELECTstatement contains two columns:customer_idandbirth_day. Thebirth_daycolumn contains the birth date of each customer in the specified format.
Overall, this code is useful when you want to retrieve thecustomer_idandbirth_daycolumns from thecustomertable and format thebirth_daycolumn in a specific way. This formatted data can be useful for analysis or reporting purposes.
S-046: The application date (application_date) of the customer data (df_customer) holds data in a string type in YYYYMMDD format. Convert this to a date type and display 10 cases together with the customer ID (customer_id).
%%sql
SELECT
customer_id,
TO_DATE(application_date, 'YYYYMMDD') AS application_date
FROM
customer
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| customer_id | application_date |
|---|---|
| CS021313000114 | 2015-09-05 |
| CS037613000071 | 2015-04-14 |
| CS031415000172 | 2015-05-29 |
| CS028811000001 | 2016-01-15 |
| CS001215000145 | 2017-06-05 |
| CS020401000016 | 2015-02-25 |
| CS015414000103 | 2015-07-22 |
| CS029403000008 | 2015-05-15 |
| CS015804000004 | 2015-06-07 |
| CS033513000180 | 2015-07-28 |
Commentary :
This code is written in SQL and retrieves data from a table calledcustomer. The purpose of this code is to retrieve thecustomer_idandapplication_datecolumns from thecustomertable and convert theapplication_datecolumn to a date format.
TheSELECTstatement is used to retrieve data from thecustomertable, and theLIMITclause is used to limit the number of rows returned to 10.
TheTO_DATEfunction is used to convert theapplication_datecolumn from a string to a date format. In this case, the format is'YYYYMMDD', which represents the year, month, and day of the application date. The resulting date value will be in the formatyyyy-mm-dd.
TheASkeyword is used to rename the resulting column toapplication_date. The result set of thisSELECTstatement contains two columns:customer_idandapplication_date. Theapplication_datecolumn contains the application date of each customer in the date format.
Overall, this code is useful when you want to retrieve thecustomer_idandapplication_datecolumns from thecustomertable and convert theapplication_datecolumn to a date format. This formatted data can be useful for analysis or reporting purposes, especially when date calculations or comparisons are involved.
S-047: The sales date (sales_ymd) of the receipt details data (df_receipt) holds data in numeric type in YYYYMMDD format. Convert this to a date type and display 10 items together with the receipt number (receipt_no) and receipt sub-number (receipt_sub_no).
%%sql
SELECT
receipt_no,
receipt_sub_no,
TO_DATE(CAST(sales_ymd AS VARCHAR), 'YYYYMMDD') AS sales_ymd
FROM
receipt
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| receipt_no | receipt_sub_no | sales_ymd |
|---|---|---|
| 112 | 1 | 2018-11-03 |
| 1132 | 2 | 2018-11-18 |
| 1102 | 1 | 2017-07-12 |
| 1132 | 1 | 2019-02-05 |
| 1102 | 2 | 2018-08-21 |
| 1112 | 1 | 2019-06-05 |
| 1102 | 2 | 2018-12-05 |
| 1102 | 1 | 2019-09-22 |
| 1112 | 2 | 2017-05-04 |
| 1102 | 1 | 2019-10-10 |
Commentary :
This code is written in SQL and retrieves data from a table calledreceipt. The purpose of this code is to retrieve thereceipt_no,receipt_sub_no, andsales_ymdcolumns from thereceipttable and convert thesales_ymdcolumn to a date format.
TheSELECTstatement is used to retrieve data from thereceipttable, and theLIMITclause is used to limit the number of rows returned to 10.
TheCASTfunction is used to convert thesales_ymdcolumn from an integer to a string. The resulting string will contain the year, month, and day of the sales date in the format'YYYYMMDD'.
TheTO_DATEfunction is used to convert thesales_ymdstring to a date format. In this case, the format is'YYYYMMDD', which represents the year, month, and day of the sales date. The resulting date value will be in the formatyyyy-mm-dd.
TheASkeyword is used to rename the resulting column tosales_ymd. The result set of thisSELECTstatement contains three columns:receipt_no,receipt_sub_no, andsales_ymd. Thesales_ymdcolumn contains the sales date of each receipt in the date format.
Overall, this code is useful when you want to retrieve thereceipt_no,receipt_sub_no, andsales_ymdcolumns from thereceipttable and convert thesales_ymdcolumn to a date format. This formatted data can be useful for analysis or reporting purposes, especially when date calculations or comparisons are involved.
S-048: The sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) holds data in UNIX seconds of numeric type. Convert it to date type and display 10 cases together with receipt number (receipt_no) and receipt sub-number (receipt_sub_no).
%%sql
SELECT
receipt_no,
receipt_sub_no,
CAST(TO_TIMESTAMP(sales_epoch) AS DATE) AS sales_ymd
FROM
receipt
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| receipt_no | receipt_sub_no | sales_ymd |
|---|---|---|
| 112 | 1 | 2018-11-03 |
| 1132 | 2 | 2018-11-18 |
| 1102 | 1 | 2017-07-12 |
| 1132 | 1 | 2019-02-05 |
| 1102 | 2 | 2018-08-21 |
| 1112 | 1 | 2019-06-05 |
| 1102 | 2 | 2018-12-05 |
| 1102 | 1 | 2019-09-22 |
| 1112 | 2 | 2017-05-04 |
| 1102 | 1 | 2019-10-10 |
Commentary :
This code is written in SQL and retrieves data from a table calledreceipt. The purpose of this code is to retrieve thereceipt_no,receipt_sub_no, andsales_epochcolumns from thereceipttable and convert thesales_epochcolumn to a date format.
TheSELECTstatement is used to retrieve data from thereceipttable, and theLIMITclause is used to limit the number of rows returned to 10.
TheTO_TIMESTAMPfunction is used to convert thesales_epochcolumn from an integer to a timestamp format. The resulting timestamp will represent the number of seconds since January 1, 1970, 00:00:00 UTC.
TheCASTfunction is used to convert the timestamp to a date format. In this case, the resulting date value will be in the formatyyyy-mm-dd.
TheASkeyword is used to rename the resulting column tosales_ymd. The result set of thisSELECTstatement contains three columns:receipt_no,receipt_sub_no, andsales_ymd. Thesales_ymdcolumn contains the sales date of each receipt in the date format.
Overall, this code is useful when you want to retrieve thereceipt_no,receipt_sub_no, andsales_epochcolumns from thereceipttable and convert thesales_epochcolumn to a date format. This formatted data can be useful for analysis or reporting purposes, especially when date calculations or comparisons are involved.
S-049: Convert the sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) to date type, extract only the “年(year)” and display 10 items with receipt number (receipt_no) and receipt sub-number (receipt_sub_no).
%%sql
SELECT
receipt_no,
receipt_sub_no,
EXTRACT(YEAR FROM TO_TIMESTAMP(sales_epoch)) AS sales_year
FROM
receipt
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| receipt_no | receipt_sub_no | sales_year |
|---|---|---|
| 112 | 1 | 2018 |
| 1132 | 2 | 2018 |
| 1102 | 1 | 2017 |
| 1132 | 1 | 2019 |
| 1102 | 2 | 2018 |
| 1112 | 1 | 2019 |
| 1102 | 2 | 2018 |
| 1102 | 1 | 2019 |
| 1112 | 2 | 2017 |
| 1102 | 1 | 2019 |
Commentary :
This code is written in SQL and retrieves data from a table calledreceipt. The purpose of this code is to retrieve thereceipt_no,receipt_sub_no, andsales_epochcolumns from thereceipttable and extract the year from thesales_epochtimestamp column.
TheSELECTstatement is used to retrieve data from thereceipttable, and theLIMITclause is used to limit the number of rows returned to 10.
TheTO_TIMESTAMPfunction is used to convert thesales_epochcolumn from an integer to a timestamp format. The resulting timestamp will represent the number of seconds since January 1, 1970, 00:00:00 UTC.
TheEXTRACTfunction is used to extract the year from thesales_epochtimestamp. The resulting value will be an integer representing the year.
TheASkeyword is used to rename the resulting column tosales_year. The result set of thisSELECTstatement contains three columns:receipt_no,receipt_sub_no, andsales_year. Thesales_yearcolumn contains the year of the sales date of each receipt.
Overall, this code is useful when you want to retrieve thereceipt_no,receipt_sub_no, andsales_epochcolumns from thereceipttable and extract the year from thesales_epochtimestamp column. This extracted data can be useful for grouping or filtering data based on the year of the sales date.
S-050: Convert the sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) to date type, extract only the “月(month)” and display 10 items with receipt number (receipt_no) and receipt sub number (receipt_sub_no). Note that the “月(month)” should be retrieved with two digits filled with zero.
%%sql
SELECT
receipt_no,
receipt_sub_no,
TO_CHAR(
EXTRACT(MONTH FROM TO_TIMESTAMP(sales_epoch)),
'FM00'
) AS sales_month
FROM
receipt
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| receipt_no | receipt_sub_no | sales_month |
|---|---|---|
| 112 | 1 | 11 |
| 1132 | 2 | 11 |
| 1102 | 1 | 07 |
| 1132 | 1 | 02 |
| 1102 | 2 | 08 |
| 1112 | 1 | 06 |
| 1102 | 2 | 12 |
| 1102 | 1 | 09 |
| 1112 | 2 | 05 |
| 1102 | 1 | 10 |
Commentary :
This code is written in SQL and retrieves data from a table calledreceipt. The purpose of this code is to retrieve thereceipt_no,receipt_sub_no, andsales_epochcolumns from thereceipttable and extract the month from thesales_epochtimestamp column, then format the resulting month as a two-digit string with a leading zero.
TheSELECTstatement is used to retrieve data from thereceipttable, and theLIMITclause is used to limit the number of rows returned to 10.
TheTO_TIMESTAMPfunction is used to convert thesales_epochcolumn from an integer to a timestamp format. The resulting timestamp will represent the number of seconds since January 1, 1970, 00:00:00 UTC.
TheEXTRACTfunction is used to extract the month from thesales_epochtimestamp. The resulting value will be an integer representing the month.
TheTO_CHARfunction is used to convert the extracted month to a string with a leading zero if necessary. The 'FM00' format string specifies that the resulting string should have a fixed width of two characters, with any leading zeros removed.
TheASkeyword is used to rename the resulting column tosales_month. The result set of thisSELECTstatement contains three columns:receipt_no,receipt_sub_no, andsales_month. Thesales_monthcolumn contains the month of the sales date of each receipt as a two-digit string with a leading zero.
Overall, this code is useful when you want to retrieve thereceipt_no,receipt_sub_no, andsales_epochcolumns from thereceipttable and extract the month from thesales_epochtimestamp column, then format the resulting month as a two-digit string with a leading zero. This formatted data can be useful for grouping or filtering data based on the month of the sales date.
S-051: Convert the sales epoch seconds of the receipt details data (df_receipt) to date type, extract only the “日(day)” and display 10 items with receipt number (receipt_no) and receipt sub-number (receipt_sub_no). Note that the “日(day)” should be extracted with two digits filled with zero.
%%sql
SELECT
receipt_no,
receipt_sub_no,
TO_CHAR(EXTRACT(DAY FROM TO_TIMESTAMP(sales_epoch)), 'FM00') AS sales_day
FROM
receipt
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| receipt_no | receipt_sub_no | sales_day |
|---|---|---|
| 112 | 1 | 03 |
| 1132 | 2 | 18 |
| 1102 | 1 | 12 |
| 1132 | 1 | 05 |
| 1102 | 2 | 21 |
| 1112 | 1 | 05 |
| 1102 | 2 | 05 |
| 1102 | 1 | 22 |
| 1112 | 2 | 04 |
| 1102 | 1 | 10 |
Commentary :
This code is written in SQL and retrieves data from a table calledreceipt. The purpose of this code is to retrieve thereceipt_no,receipt_sub_no, andsales_epochcolumns from thereceipttable and extract the day from thesales_epochtimestamp column, then format the resulting day as a two-digit string with a leading zero.
TheSELECTstatement is used to retrieve data from thereceipttable, and theLIMITclause is used to limit the number of rows returned to 10.
TheTO_TIMESTAMPfunction is used to convert thesales_epochcolumn from an integer to a timestamp format. The resulting timestamp will represent the number of seconds since January 1, 1970, 00:00:00 UTC.
TheEXTRACTfunction is used to extract the day from thesales_epochtimestamp. The resulting value will be an integer representing the day.
TheTO_CHARfunction is used to convert the extracted day to a string with a leading zero if necessary. The 'FM00' format string specifies that the resulting string should have a fixed width of two characters, with any leading zeros removed.
TheASkeyword is used to rename the resulting column tosales_day. The result set of thisSELECTstatement contains three columns:receipt_no,receipt_sub_no, andsales_day. Thesales_daycolumn contains the day of the sales date of each receipt as a two-digit string with a leading zero.
Overall, this code is useful when you want to retrieve thereceipt_no,receipt_sub_no, andsales_epochcolumns from thereceipttable and extract the day from thesales_epochtimestamp column, then format the resulting day as a two-digit string with a leading zero. This formatted data can be useful for grouping or filtering data based on the day of the sales date.
S-052: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), binaryise the total sales amount to 0 for amounts less than 2,000 yen and 1 for amounts greater than 2,000 yen, and display 10 items with the customer ID and total sales amount. Display 10 items together with the customer ID and the total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.
%%sql
SELECT
customer_id,
SUM(amount) AS sum_amount,
CASE WHEN SUM(amount) > 2000 THEN 1 ELSE 0 END AS sales_flg
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
GROUP BY
customer_id
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| customer_id | sum_amount | sales_flg |
|---|---|---|
| CS001311000059 | 2302 | 1 |
| CS004614000122 | 248 | 0 |
| CS003512000043 | 298 | 0 |
| CS011615000061 | 246 | 0 |
| CS029212000033 | 3604 | 1 |
| CS007515000119 | 7157 | 1 |
| CS034515000123 | 3699 | 1 |
| CS004315000058 | 490 | 0 |
| CS026414000014 | 6671 | 1 |
| CS001615000099 | 768 | 0 |
Commentary :
This SQL code is selecting data from a table calledreceipt. The purpose of this code is to calculate the total sales amount for each customer in thereceipttable, and also to create a flag indicating if the total sales amount for each customer is greater than 2000 or not.
TheSELECTstatement retrieves thecustomer_idcolumn, the total sales amount for each customer calculated using theSUMfunction, and aCASEexpression that returns1if the total sales amount is greater than 2000 and0otherwise. The resulting column that contains the flag is namedsales_flg.
TheWHEREclause filters out any records where the customer ID starts with the letter 'Z'. This could be used to exclude certain customers from the calculation or flagging.
TheGROUP BYclause groups the results bycustomer_idso that the total sales amount and flag are calculated for each customer.
TheLIMITclause limits the output to the first 10 results.
Overall, this code is useful when you want to calculate the total sales amount for each customer in areceipttable and create a flag indicating if the total sales amount is greater than a certain threshold (in this case, 2000) or not. This could be useful for identifying high-spending customers or for other types of analysis where customer sales amounts need to be aggregated and analyzed.
S-053: Binarise the postcode (postal_cd) of the customer data (df_customer) to 1 for Tokyo (the first three digits are 100 to 209) and 0 for all others. Further, combine it with the receipt details data (df_receipt) and count the number of customers with sales results for the entire period, for each binary value created.
%%sql
WITH cust AS (
SELECT
customer_id,
postal_cd,
CASE WHEN CAST(SUBSTR(postal_cd, 1, 3) AS INTEGER) BETWEEN 100 AND 209 THEN 1 ELSE 0 END AS postal_flg
FROM
customer
),
rect AS(
SELECT DISTINCT
customer_id
FROM
receipt
)
SELECT
c.postal_flg,
COUNT(DISTINCT c.customer_id) AS customer_cnt
FROM
cust c
JOIN
rect r
USING
(customer_id)
GROUP BY
c.postal_flg
;
* postgresql://padawan:***@db:5432/dsdojo_db 2 rows affected.
| postal_flg | customer_cnt |
|---|---|
| 0 | 3906 |
| 1 | 4400 |
Commentary :
This SQL code contains a Common Table Expression (CTE) namedcust, which selects data from thecustomertable and calculates apostal_flgflag for each customer based on theirpostal_cd.
Thepostal_flgflag is calculated using aCASEexpression, which checks if the first three characters of thepostal_cdare between 100 and 209 (inclusive), and assigns a value of 1 if true and 0 if false. This flag could be used to group customers by postal code ranges, for example.
The second CTE namedrectselects distinctcustomer_ids from thereceipttable.
The finalSELECTstatement joins thecustandrectCTEs on thecustomer_idcolumn and groups the results by thepostal_flgcolumn. TheCOUNT(DISTINCT c.customer_id)function counts the number of distinct customers for eachpostal_flgvalue.
Overall, this code is useful when you want to group customers by their postal code ranges and count the number of distinct customers for each range, based on data stored in thecustomerandreceipttables.
S-054: The address of the customer data (df_customer) is one of 埼玉県(Saitama-prefecture), 千葉県(Chiba-prefecture), 東京都(Tokyo-Metropolitan) or 神奈川県(Kanagawa-prefecture). Create a code value for each prefecture, and display 10 cases together with the customer ID and address. The values should be 11 for 埼玉県(Saitama-prefecture), 12 for 千葉県(Chiba-prefecture), 13 for 東京都(Tokyo-Metropolitan) and 14 for 神奈川県(Kanagawa-prefecture).
%%sql
-- somewhat aggressive description as it is not SQL oriented (note that it becomes very long SQL when the number of categories is large).
-- Code example 1 (fixed and cut out)
SELECT
customer_id,
address,
CASE SUBSTR(address,1, 3)
WHEN '埼玉県' THEN '11'
WHEN '千葉県' THEN '12'
WHEN '東京都' THEN '13'
WHEN '神奈川' THEN '14'
END AS prefecture_cd
FROM
customer
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| customer_id | address | prefecture_cd |
|---|---|---|
| CS021313000114 | 神奈川県伊勢原市粟窪********** | 14 |
| CS037613000071 | 東京都江東区南砂********** | 13 |
| CS031415000172 | 東京都渋谷区代々木********** | 13 |
| CS028811000001 | 神奈川県横浜市泉区和泉町********** | 14 |
| CS001215000145 | 東京都大田区仲六郷********** | 13 |
| CS020401000016 | 東京都板橋区若木********** | 13 |
| CS015414000103 | 東京都江東区北砂********** | 13 |
| CS029403000008 | 千葉県浦安市海楽********** | 12 |
| CS015804000004 | 東京都江東区北砂********** | 13 |
| CS033513000180 | 神奈川県横浜市旭区善部町********** | 14 |
Commentary :
This SQL code selects thecustomer_id,address, and aprefecture_cdcode for each customer from thecustomertable.
TheCASEexpression in this code checks the first three characters of theaddresscolumn and assigns a correspondingprefecture_cdvalue based on the following conditions:
If the first three characters of theaddresscolumn are '埼玉県', theprefecture_cdvalue is set to '11'.
If the first three characters of theaddresscolumn are '千葉県', theprefecture_cdvalue is set to '12'.
If the first three characters of theaddresscolumn are '東京都', theprefecture_cdvalue is set to '13'.
If the first three characters of theaddresscolumn are '神奈川', theprefecture_cdvalue is set to '14'.
This code is useful when you want to categorize customers by their prefecture (region) based on their address information stored in thecustomertable. The resultingprefecture_cdcolumn can be used for further analysis or grouping.
%%sql
-- Code example 2 (using regular expressions)
SELECT
customer_id,
address,
CASE SUBSTRING(address, '^.*?[都道府県]')
WHEN '埼玉県' THEN '11'
WHEN '千葉県' THEN '12'
WHEN '東京都' THEN '13'
WHEN '神奈川県' THEN '14'
END AS prefecture_cd
FROM
customer
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| customer_id | address | prefecture_cd |
|---|---|---|
| CS021313000114 | 神奈川県伊勢原市粟窪********** | 14 |
| CS037613000071 | 東京都江東区南砂********** | 13 |
| CS031415000172 | 東京都渋谷区代々木********** | 13 |
| CS028811000001 | 神奈川県横浜市泉区和泉町********** | 14 |
| CS001215000145 | 東京都大田区仲六郷********** | 13 |
| CS020401000016 | 東京都板橋区若木********** | 13 |
| CS015414000103 | 東京都江東区北砂********** | 13 |
| CS029403000008 | 千葉県浦安市海楽********** | 12 |
| CS015804000004 | 東京都江東区北砂********** | 13 |
| CS033513000180 | 神奈川県横浜市旭区善部町********** | 14 |
Commentary :
This SQL code selects thecustomer_id,address, and aprefecture_cdcode for each customer from thecustomertable.
TheCASEexpression in this code extracts the prefecture name from theaddresscolumn using a regular expression pattern'^.*?[都道府県]'and assigns a correspondingprefecture_cdvalue based on the following conditions:
If the extracted prefecture name is '埼玉県', theprefecture_cdvalue is set to '11'.
If the extracted prefecture name is '千葉県', theprefecture_cdvalue is set to '12'.
If the extracted prefecture name is '東京都', theprefecture_cdvalue is set to '13'.
If the extracted prefecture name is '神奈川県', theprefecture_cdvalue is set to '14'.
This code is similar to the previous example, but it uses theSUBSTRINGfunction to extract the prefecture name from theaddresscolumn instead of theSUBSTRfunction. The regular expression pattern'^.*?[都道府県]'matches any string that starts with any characters (^.*?) and ends with one of the characters都,道,府, or県. This code is useful when the address information in thecustomertable is not standardized and includes variations in the format of prefecture names. The resultingprefecture_cdcolumn can be used for further analysis or grouping.
%%sql
-- Code example 3 (using LIKE)
SELECT
customer_id,
address,
CASE
WHEN address LIKE '埼玉県%' THEN '11'
WHEN address LIKE '千葉県%' THEN '12'
WHEN address LIKE '東京都%' THEN '13'
WHEN address LIKE '神奈川県%' THEN '14'
END AS prefecture_cd
FROM
customer
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| customer_id | address | prefecture_cd |
|---|---|---|
| CS021313000114 | 神奈川県伊勢原市粟窪********** | 14 |
| CS037613000071 | 東京都江東区南砂********** | 13 |
| CS031415000172 | 東京都渋谷区代々木********** | 13 |
| CS028811000001 | 神奈川県横浜市泉区和泉町********** | 14 |
| CS001215000145 | 東京都大田区仲六郷********** | 13 |
| CS020401000016 | 東京都板橋区若木********** | 13 |
| CS015414000103 | 東京都江東区北砂********** | 13 |
| CS029403000008 | 千葉県浦安市海楽********** | 12 |
| CS015804000004 | 東京都江東区北砂********** | 13 |
| CS033513000180 | 神奈川県横浜市旭区善部町********** | 14 |
Commentary :
This SQL code is performing a query on a table namedcustomerand selecting thecustomer_id,address, andprefecture_cdcolumns for the first 10 rows. Theprefecture_cdcolumn is created using aCASEstatement to match theaddresscolumn using theLIKEoperator.
Specifically, theCASEstatement checks theaddresscolumn using theLIKEoperator with a string pattern that matches the prefectures of Japan (埼玉県%,千葉県%,東京都%,神奈川県%) followed by any string characters. If theaddresscolumn matches any of the patterns, the corresponding prefecture code (11,12,13,14) is assigned to theprefecture_cdcolumn.
This code is an alternative way of extracting the prefecture code from theaddresscolumn, compared to the other two examples shown previously.
S-055: Sum the sales amount (amount) of the receipt details (df_receipt) data for each customer ID (customer_id) and find the quartile point of the total amount. Then, create category values for the total sales amount for each customer based on the following criteria, and display the 10 results together with the customer ID and the total sales amount. The category values shall be 1-4 in order.
- Minimum value and above and below the first quartile – Assign 1.
- First quartile and above but below the second quartile – Assign 2.
- Second quartile and above but below the third quartile – Assign 3.
- Third quartile and above – Assign 4.
%%sql
WITH sales_amount AS(
SELECT
customer_id,
SUM(amount) AS sum_amount
FROM
receipt
GROUP BY
customer_id
),
sales_pct AS (
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP(ORDER BY sum_amount) AS pct25,
PERCENTILE_CONT(0.50) WITHIN GROUP(ORDER BY sum_amount) AS pct50,
PERCENTILE_CONT(0.75) WITHIN GROUP(ORDER BY sum_amount) AS pct75
FROM
sales_amount
)
SELECT
a.customer_id,
a.sum_amount,
CASE
WHEN a.sum_amount < pct25 THEN 1
WHEN pct25 <= a.sum_amount AND a.sum_amount < pct50 THEN 2
WHEN pct50 <= a.sum_amount AND a.sum_amount < pct75 THEN 3
WHEN pct75 <= a.sum_amount THEN 4
END AS pct_group
FROM
sales_amount a
CROSS JOIN
sales_pct p
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| customer_id | sum_amount | pct_group |
|---|---|---|
| CS001311000059 | 2302 | 3 |
| CS004614000122 | 248 | 1 |
| CS003512000043 | 298 | 1 |
| CS011615000061 | 246 | 1 |
| CS029212000033 | 3604 | 3 |
| CS007515000119 | 7157 | 4 |
| CS034515000123 | 3699 | 4 |
| CS004315000058 | 490 | 1 |
| CS026414000014 | 6671 | 4 |
| CS001615000099 | 768 | 2 |
Commentary :
This SQL code performs the following tasks:
It calculates the total sales amount for each customer by summing the amount column from the receipt table for each customer_id using theSUMfunction and group by the customer_id. The resulting table is saved as a subquery namedsales_amount.
It calculates the quartiles (25th, 50th, and 75th) of the sum_amount column from thesales_amountsubquery using thePERCENTILE_CONTfunction, which is a window function.
It joins thesales_amountsubquery with thesales_pctsubquery using theCROSS JOINoperation to obtain the quartile values for each row of thesales_amounttable.
Finally, it calculates a new columnpct_groupthat assigns a percentile group to each customer based on their total sales amount, as follows:
1 if the sum_amount is less than the 25th percentile (pct25)
2 if the sum_amount is greater than or equal to the 25th percentile and less than the 50th percentile (pct50)
3 if the sum_amount is greater than or equal to the 50th percentile and less than the 75th percentile (pct75)
4 if the sum_amount is greater than or equal to the 75th percentile.
The final result is limited to the first 10 rows.
S-056: Calculate the age in 10-year age increments based on the age (age) of the customer data (df_customer) and display 10 cases together with the customer ID (customer_id) and date of birth (birth_day). However, all age groups above 60 years should be assumed to be in the 60s. The name of the category representing the age is optional.
%%sql
SELECT
customer_id,
birth_day,
-- Items for confirmation
-- age,
LEAST(CAST(TRUNC(age / 10) * 10 AS INTEGER), 60) AS era
FROM
customer
GROUP BY customer_id,
birth_day
-- Terms and conditions for verification
-- HAVING LEAST(CAST(TRUNC(age / 10) * 10 AS INTEGER), 60) >= 60
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| customer_id | birth_day | era |
|---|---|---|
| CS001105000001 | 2000-01-14 | 10 |
| CS001112000009 | 2006-08-24 | 10 |
| CS001112000019 | 2001-01-31 | 10 |
| CS001112000021 | 2001-12-15 | 10 |
| CS001112000023 | 2004-01-26 | 10 |
| CS001112000024 | 2001-01-16 | 10 |
| CS001112000029 | 2005-01-24 | 10 |
| CS001112000030 | 2003-03-02 | 10 |
| CS001113000004 | 2003-02-22 | 10 |
| CS001113000010 | 2005-05-09 | 10 |
Commentary :
This code is selecting thecustomer_id,birth_day, andera(a calculated field) from thecustomertable. Theerafield is calculated by determining the customer's age (which is not explicitly selected, but is used in the calculation) and then taking the lower value between the age rounded down to the nearest decade and 60.
The results are grouped bycustomer_idandbirth_day. TheHAVINGclause is then used to filter out customers whose era is less than 60.
In essence, the code is selecting information about customers from thecustomertable, calculating their age and era, grouping the results bycustomer_idandbirth_day, and filtering out customers whose era is less than 60.
S-057: Using the extraction results of 056 and the gender code (gender_cd), create new category data representing the combination of gender x age and display 10 cases. The value of the category representing the combination shall be arbitrary.
%%sql
-- Generate a sex and age code by concatenating a one-digit sex code and a two-digit age code.
SELECT
customer_id,
birth_day,
gender_cd || TO_CHAR(LEAST(CAST(TRUNC(age / 10) * 10 AS INTEGER), 60), 'FM00') AS gender_era
FROM
customer
GROUP BY
customer_id,
birth_day
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| customer_id | birth_day | gender_era |
|---|---|---|
| CS001105000001 | 2000-01-14 | 010 |
| CS001112000009 | 2006-08-24 | 110 |
| CS001112000019 | 2001-01-31 | 110 |
| CS001112000021 | 2001-12-15 | 110 |
| CS001112000023 | 2004-01-26 | 110 |
| CS001112000024 | 2001-01-16 | 110 |
| CS001112000029 | 2005-01-24 | 110 |
| CS001112000030 | 2003-03-02 | 110 |
| CS001113000004 | 2003-02-22 | 110 |
| CS001113000010 | 2005-05-09 | 110 |
Commentary :
This SQL code selects customer_id, birth_day, and a concatenated string composed of gender_cd and era. The era is calculated by taking the age of the customer and rounding it down to the nearest decade, with a maximum era of 60. The gender_era string is generated by concatenating the gender_cd and era.
The SQL code groups the result by customer_id and birth_day, and returns only the first 10 rows of the result.
Overall, this code creates a summary of customer data that includes age and gender, with an additional era dimension. The resulting table could potentially be used for further analysis and segmentation of customers based on age and gender.
S-058: Create a dummy variable for the gender code (gender_cd) of the customer data (df_customer) and display 10 cases together with the customer ID (customer_id).
%%sql
-- Note that if there are many categories, the SQL can be very long.
-- If you want to reduce one category, simply delete one of the CASE statements.
SELECT
customer_id,
CASE WHEN gender_cd = '0' THEN '1' ELSE '0' END AS gender_cd_0,
CASE WHEN gender_cd = '1' THEN '1' ELSE '0' END AS gender_cd_1,
CASE WHEN gender_cd = '9' THEN '1' ELSE '0' END AS gender_cd_9
FROM
customer
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| customer_id | gender_cd_0 | gender_cd_1 | gender_cd_9 |
|---|---|---|---|
| CS021313000114 | 0 | 1 | 0 |
| CS037613000071 | 0 | 0 | 1 |
| CS031415000172 | 0 | 1 | 0 |
| CS028811000001 | 0 | 1 | 0 |
| CS001215000145 | 0 | 1 | 0 |
| CS020401000016 | 1 | 0 | 0 |
| CS015414000103 | 0 | 1 | 0 |
| CS029403000008 | 1 | 0 | 0 |
| CS015804000004 | 1 | 0 | 0 |
| CS033513000180 | 0 | 1 | 0 |
Commentary :
This SQL code is selecting customer_id, and creating three new columns:gender_cd_0,gender_cd_1, andgender_cd_9. These new columns are created using theCASEstatement.
TheCASEstatement is a conditional statement that evaluates a set of conditions and returns a corresponding result. In this case, for each customer, the code checks the value of thegender_cdcolumn. Ifgender_cdis equal to '0', the code returns '1' in thegender_cd_0column and '0' in the other columns. Ifgender_cdis equal to '1', the code returns '1' in thegender_cd_1column and '0' in the other columns. Ifgender_cdis equal to '9', the code returns '1' in thegender_cd_9column and '0' in the other columns.
The resulting output will have customer_id, gender_cd_0, gender_cd_1, and gender_cd_9 columns, where for each customer, only one of the new columns will have a value of '1', indicating the gender of the customer.
S-059: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), standardise the total sales amount to mean 0 and standard deviation 1 and display 10 cases with the customer ID and total sales amount. The standard deviation used for standardisation may be either the square root of the variance or the square root of the unbiased variance. However, exclude customer IDs starting with “Z”, as they represent non-members.
%%sql
-- Code example 1 (standardised with STDDEV_POP)
WITH sales_amount AS(
SELECT
customer_id,
SUM(amount) AS sum_amount
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
GROUP BY
customer_id
),
stats_amount AS (
SELECT
AVG(sum_amount) AS avg_amount,
STDDEV_POP(sum_amount) AS stddev_amount
FROM
sales_amount
)
SELECT
customer_id,
sum_amount,
(sum_amount - avg_amount) / stddev_amount AS std_amount
FROM
sales_amount
CROSS JOIN
stats_amount
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| customer_id | sum_amount | std_amount |
|---|---|---|
| CS001311000059 | 2302 | -0.09032946448390523062 |
| CS004614000122 | 248 | -0.84533488878695858131 |
| CS003512000043 | 298 | -0.82695598361891930792 |
| CS011615000061 | 246 | -0.84607004499368015224 |
| CS029212000033 | 3604 | 0.38825722609183744835 |
| CS007515000119 | 7157 | 1.6942622273327082 |
| CS034515000123 | 3699 | 0.42317714591111206778 |
| CS004315000058 | 490 | -0.75638098777364849812 |
| CS026414000014 | 6671 | 1.5156192690993665 |
| CS001615000099 | 768 | -0.65419427503935013810 |
Commentary :
This code is written in SQL language and is using a database management system that supports SQL.
The code is performing a data analysis task on a table calledreceipt. The table contains transactional data with columns such ascustomer_idandamount.
The code is using a common table expression (CTE) to create two temporary tablessales_amountandstats_amount.
The first CTE namedsales_amountis calculating the total amount spent by each customer who has acustomer_idthat does not start with the letter 'Z'. It is grouping the transaction data bycustomer_idand calculating the sum ofamountfor eachcustomer_id.
The second CTE namedstats_amountis calculating the average and standard deviation of the total amount spent by each customer.
The final SELECT statement is selecting columns from thesales_amounttable, and joining it with thestats_amounttable using a cross join. It is also calculating the standardized amount spent by each customer by subtracting the average amount from the total amount and then dividing it by the standard deviation.
TheLIMIT 10at the end limits the output to the first 10 rows.
Overall, this code is performing a statistical analysis on the transactional data to find the standardized amount spent by each customer who doesn't have acustomer_idstarting with the letter 'Z'.
%%sql
-- Code example 2 (standardised with STDDEV_SAMP, values slightly different from code example 2)
WITH sales_amount AS(
SELECT
customer_id,
SUM(amount) AS sum_amount
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
GROUP BY
customer_id
),
stats_amount AS (
SELECT
AVG(sum_amount) AS avg_amount,
STDDEV_SAMP(sum_amount) AS stddev_amount
FROM
sales_amount
)
SELECT
customer_id,
sum_amount,
(sum_amount - avg_amount) / stddev_amount AS std_amount
FROM
sales_amount
CROSS JOIN
stats_amount
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| customer_id | sum_amount | std_amount |
|---|---|---|
| CS001311000059 | 2302 | -0.09032402671702291270 |
| CS004614000122 | 248 | -0.84528400025253654164 |
| CS003512000043 | 298 | -0.82690620148098070841 |
| CS011615000061 | 246 | -0.84601911220339877497 |
| CS029212000033 | 3604 | 0.38823385329429098451 |
| CS007515000119 | 7157 | 1.6941602340010485 |
| CS034515000123 | 3699 | 0.42315167096024706764 |
| CS004315000058 | 490 | -0.75633545419820630882 |
| CS026414000014 | 6671 | 1.5155280299415258 |
| CS001615000099 | 768 | -0.65415489302835587608 |
Commentary :
This code is very similar to the previous one, but there is one important difference in thestats_amountCTE.
Instead of using theSTDDEV_POPfunction to calculate the standard deviation, it is using theSTDDEV_SAMPfunction.STDDEV_SAMPcalculates the sample standard deviation, which is an estimate of the population standard deviation based on a sample of data.
In contrast,STDDEV_POPcalculates the exact population standard deviation, assuming that the entire population is available for analysis.
So, the output of this code will be similar to the previous one, but the standard deviation calculation will be based on a sample of the data rather than the entire population.
In general, if the dataset is very large, the difference between the sample and population standard deviation will be small. However, if the dataset is small, the difference can be significant.
S-060: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), normalise the total sales amount to a minimum value of 0 and a maximum value of 1 and display 10 items with the customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.
%%sql
WITH sales_amount AS(
SELECT
customer_id,
SUM(amount) AS sum_amount
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
GROUP BY
customer_id
),
stats_amount AS (
SELECT
MAX(sum_amount) AS max_amount,
MIN(sum_amount) AS min_amount
FROM
sales_amount
)
SELECT
customer_id,
sum_amount,
1.0 * (sum_amount - min_amount) / (max_amount - min_amount) AS scale_amount
FROM
sales_amount
CROSS JOIN
stats_amount
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
| customer_id | sum_amount | scale_amount |
|---|---|---|
| CS001311000059 | 2302 | 0.09697601668404588113 |
| CS004614000122 | 248 | 0.00773375043448036149 |
| CS003512000043 | 298 | 0.00990615224191866528 |
| CS011615000061 | 246 | 0.00764685436218282934 |
| CS029212000033 | 3604 | 0.15354535974973931178 |
| CS007515000119 | 7157 | 0.30791623218630517901 |
| CS034515000123 | 3699 | 0.15767292318387208898 |
| CS004315000058 | 490 | 0.01824817518248175182 |
| CS026414000014 | 6671 | 0.28680048661800486618 |
| CS001615000099 | 768 | 0.03032672923183872089 |
Commentary :
This code is also analyzing thereceipttable, but instead of calculating the standardized amount, it is calculating the scaled amount spent by each customer.
The code is using two CTEs,sales_amountandstats_amount, which are similar to the previous code.
The first CTE calculates the total amount spent by each customer who has acustomer_idthat does not start with the letter 'Z', and the second CTE calculates the maximum and minimum total amount spent by any customer.
The final SELECT statement selects columns from thesales_amounttable and joins it with thestats_amounttable using a cross join.
It also calculates the scaled amount spent by each customer, which is calculated by subtracting the minimum amount spent from the total amount spent and then dividing it by the range of total amount spent, which is the difference between the maximum and minimum amount spent.
The1.0 *before the expression is used to convert the result to a floating-point number, which is necessary if any of the values used in the expression are floating-point numbers.
TheLIMIT 10at the end limits the output to the first 10 rows.
Overall, this code is performing a data analysis task to find the scaled amount spent by each customer who doesn't have acustomer_idstarting with the letter 'Z'. The scaled amount is a value between 0 and 1 that represents the proportion of the total amount spent by each customer relative to the range of total amount spent by all customers.









Comment