参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック(構造化データ加工編)」
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
-
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers.
-
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
You can get preprocess_knock files to practice python, SQL, R from my github account
Introduction
- The database is PostgreSQL13
- First, execute the following cell
- You can issue SQL by writing %%sql in the cell
- You cannot check the table structure with the describe command from jupyter, so if you want to check the table structure, use a SELECT with a limit, for example.
- You can use any familiar SQL client (connection information is as follows)
- IP address: localhost for Docker Desktop, 192.168.99.100 for Docker toolbox
- Port:5432
- Database name: dsdojo_db
- User name: padawan
- Password: padawan12345
- It is recommended to limit the number of outputs, as Jupyter may freeze if a large amount of outputs are performed (the number of outputs is also mentioned in the question).
- Data processing also requires the ability to control the amount of results to be displayed for confirmation and to make light work of them.
- If a large number of results are output, the file may become so heavy that it cannot be opened afterwards.
- In this case, the results of your work will be lost, but you can retrieve the file from GitHub.
- You can also delete large output ranges using a vim editor, for example.
- Names, addresses etc. are dummy data and do not exist.
%load_ext sql
import os
pgconfig = {
'host': 'db',
'port': os.environ['PG_PORT'],
'database': os.environ['PG_DATABASE'],
'user': os.environ['PG_USER'],
'password': os.environ['PG_PASSWORD'],
}
dsl = 'postgresql://{user}:{password}@{host}:{port}/{database}'.format(**pgconfig)
# Set up Magic commands to write SQL
%sql $dsl
'Connected: padawan@dsdojo_db'
How to use
- You can execute SQL from Jupyter against PostgreSQL by writing %%sql at the beginning of the cell and SQL in the second and subsequent lines.
%%sql
SELECT 'You can run it like this' AS sample;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
sample |
---|
You can run it like this |
100 data processing knocks
S-061: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), convert the total sales amount to ordinary logarithm (bottom 10) and display 10 items with customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.
%%sql
SELECT
customer_id,
sum_amount,
LOG(sum_amount + 0.5) AS log_amount
FROM
(
SELECT
customer_id,
SUM(amount) AS sum_amount
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
GROUP BY
customer_id
) AS sum_amount_tbl
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | sum_amount | log_amount |
---|---|---|
CS001311000059 | 2302 | 3.3621996388688865 |
CS004614000122 | 248 | 2.3953263930693509 |
CS003512000043 | 298 | 2.4749443354653879 |
CS011615000061 | 246 | 2.3918169236132488 |
CS029212000033 | 3604 | 3.5568450298595813 |
CS007515000119 | 7157 | 3.8547613566936362 |
CS034515000123 | 3699 | 3.5681430316577019 |
CS004315000058 | 490 | 2.6906390117159673 |
CS026414000014 | 6671 | 3.8242234903608168 |
CS001615000099 | 768 | 2.8856438718357639 |
Commentary :
This is a SQL query that retrieves data from a database. Let's break it down step by step:
The query begins with the%%sql
magic command, which is used in Jupyter notebooks to specify that the following code is SQL code.
The main query selects three columns:customer_id
,sum_amount
, andlog_amount
. These columns will be derived from the subquery in parentheses.
The subquery in parentheses first filters out all the records where thecustomer_id
starts with the letter "Z". This is done with theNOT LIKE 'Z%'
condition.
The filtered records are then grouped bycustomer_id
using theGROUP BY
clause.
TheSUM
function is used to calculate the totalamount
for eachcustomer_id
.
TheAS sum_amount
clause renames the calculated column tosum_amount
.
The subquery is then aliased assum_amount_tbl
.
Finally, the outer query selects thecustomer_id
,sum_amount
, and the logarithm (with base e) of thesum_amount
plus 0.5. This is calculated using theLOG
function and aliased aslog_amount
.
The results are limited to 10 rows with theLIMIT
clause.
S-062: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), convert the total sales amount to the natural logarithm (bottom e) and display 10 items with the customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.
%%sql
SELECT
customer_id,
sum_amount,
LN(sum_amount + 0.5) AS log_amount
FROM
(
SELECT
customer_id,
SUM(amount) AS sum_amount
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
GROUP BY
customer_id
) AS sum_amount_tbl
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | sum_amount | log_amount |
---|---|---|
CS001311000059 | 2302 | 7.7417507681294619 |
CS004614000122 | 248 | 5.5154428455366834 |
CS003512000043 | 298 | 5.6987699328326568 |
CS011615000061 | 246 | 5.5073619934827448 |
CS029212000033 | 3604 | 8.1899383438446333 |
CS007515000119 | 7157 | 8.8759160369722701 |
CS034515000123 | 3699 | 8.2159529543656056 |
CS004315000058 | 490 | 6.1954252790054178 |
CS026414000014 | 6671 | 8.8056000011824754 |
CS001615000099 | 768 | 6.6444405629786506 |
Commentary :
This is a SQL query that retrieves data from a database. Let's break it down step by step:
The query begins with the%%sql
magic command, which is used in Jupyter notebooks to specify that the following code is SQL code.
The main query selects three columns:customer_id
,sum_amount
, andlog_amount
. These columns will be derived from the subquery in parentheses.
The subquery in parentheses first filters out all the records where thecustomer_id
starts with the letter "Z". This is done with theNOT LIKE 'Z%'
condition.
The filtered records are then grouped bycustomer_id
using theGROUP BY
clause.
TheSUM
function is used to calculate the totalamount
for eachcustomer_id
.
TheAS sum_amount
clause renames the calculated column tosum_amount
.
The subquery is then aliased assum_amount_tbl
.
Finally, the outer query selects thecustomer_id
,sum_amount
, and the natural logarithm (with base e) of thesum_amount
plus 0.5. This is calculated using theLN
function and aliased aslog_amount
.
The results are limited to 10 rows with theLIMIT
clause.
S-063: Calculate the profit amount of each product from the unit price(unit price) and cost of the product(unit_cost) data (df_product) and display 10 results.
%%sql
SELECT
product_cd,
unit_price,
unit_cost,
unit_price - unit_cost AS unit_profit
FROM
product
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
product_cd | unit_price | unit_cost | unit_profit |
---|---|---|---|
P040101001 | 198 | 149 | 49 |
P040101002 | 218 | 164 | 54 |
P040101003 | 230 | 173 | 57 |
P040101004 | 248 | 186 | 62 |
P040101005 | 268 | 201 | 67 |
P040101006 | 298 | 224 | 74 |
P040101007 | 338 | 254 | 84 |
P040101008 | 420 | 315 | 105 |
P040101009 | 498 | 374 | 124 |
P040101010 | 580 | 435 | 145 |
Commentary :
This is a SQL query that retrieves data from a database. Let's break it down step by step:
The query begins with the%%sql
magic command, which is used in Jupyter notebooks to specify that the following code is SQL code.
The main query selects four columns:product_cd
,unit_price
,unit_cost
, andunit_profit
. These columns will be derived from theproduct
table.
TheFROM
clause specifies theproduct
table.
TheLIMIT
clause limits the number of rows returned by the query to 10.
The query calculates theunit_profit
by subtracting theunit_cost
from theunit_price
. This is done using the-
operator.
The results of the query include theproduct_cd
,unit_price
,unit_cost
, andunit_profit
for the first 10 rows in theproduct
table.
Overall, this query retrieves information about the price, cost, and profit for a selection of products.
S-064: Calculate the overall average profit margin for each product from the unit price (unit_price) and cost (unit_cost) of the product data (df_product). Note, however, that unit_price and cost are missing.
%%sql
SELECT
AVG((unit_price * 1.0 - unit_cost) / unit_price) AS unit_profit_rate
FROM
product
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
unit_profit_rate |
---|
0.24911389885177001279 |
Commentary :
This code is using SQL to query the average unit profit rate for the top 10 products in a database table called "product". Here is an explanation of each part of the code:
"%%sql" is a Jupyter notebook magic command that tells the notebook to interpret the following code as SQL.
"SELECT" is a SQL keyword that indicates that we want to retrieve data from the database.
"AVG" is a SQL function that calculates the average of a given set of values.
"((unit_price * 1.0 - unit_cost) / unit_price)" is a formula that calculates the unit profit rate for each product. It subtracts the unit cost from the unit price and divides the result by the unit price. The multiplication by 1.0 is done to ensure that the division operation returns a floating-point value, rather than an integer.
"AS unit_profit_rate" assigns the calculated value to a column called "unit_profit_rate".
"FROM product" specifies that we want to retrieve data from the "product" table.
"LIMIT 10" limits the number of rows returned to 10, meaning we only want to see the results for the top 10 products by some criteria (which is not specified in the code snippet).
S-065: For each product in the product data (df_product), find the new unit price that gives a profit margin of 30%. However, round down to the nearest yen. Then display 10 results and confirm that the profit margin is approximately around 30%. Note that there is a deficit in unit_price(unit_price) and unit_cost(unit_cost).
%%sql
WITH new_price_tbl AS (
SELECT
product_cd,
unit_price,
unit_cost,
TRUNC(unit_cost / 0.7) AS new_price
FROM
product
)
SELECT
*,
(new_price - unit_cost) / new_price AS new_profit_rate
FROM
new_price_tbl
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
product_cd | unit_price | unit_cost | new_price | new_profit_rate |
---|---|---|---|---|
P040101001 | 198 | 149 | 212 | 0.29716981132075471698 |
P040101002 | 218 | 164 | 234 | 0.29914529914529914530 |
P040101003 | 230 | 173 | 247 | 0.29959514170040485830 |
P040101004 | 248 | 186 | 265 | 0.29811320754716981132 |
P040101005 | 268 | 201 | 287 | 0.29965156794425087108 |
P040101006 | 298 | 224 | 320 | 0.30000000000000000000 |
P040101007 | 338 | 254 | 362 | 0.29834254143646408840 |
P040101008 | 420 | 315 | 450 | 0.30000000000000000000 |
P040101009 | 498 | 374 | 534 | 0.29962546816479400749 |
P040101010 | 580 | 435 | 621 | 0.29951690821256038647 |
Commentary :
This code is using SQL to query a modified version of a database table called "product". The modification involves calculating a new price for each product based on its unit cost, and then using this new price to calculate a new profit rate. Here is an explanation of each part of the code:
"%%sql" is a Jupyter notebook magic command that tells the notebook to interpret the following code as SQL.
"WITH new_price_tbl AS (...)" sets up a Common Table Expression (CTE) called "new_price_tbl". This allows us to define a subquery that can be used in the subsequent query.
"( SELECT product_cd, unit_price, unit_cost, TRUNC(unit_cost / 0.7) AS new_price FROM product )" is a subquery that selects data from the "product" table, calculates a new price based on the unit cost, and assigns it to a new column called "new_price". The "TRUNC" function is used to round the calculated new price down to the nearest integer, based on the assumption that the company wants to price their products at a whole number value.
"SELECT *, (new_price - unit_cost) / new_price AS new_profit_rate FROM new_price_tbl LIMIT 10 ;" is the main query that selects all columns from the "new_price_tbl" CTE and calculates a new profit rate for each product. The new profit rate is calculated by subtracting the unit cost from the new price, dividing the result by the new price, and assigning it to a new column called "new_profit_rate". The "LIMIT 10" clause limits the number of rows returned to 10, meaning we only want to see the results for the top 10 products by some criteria (which is not specified in the code snippet).
S-066: For each product in the product data (df_product), find a new unit price that gives a profit margin of 30%. This time, round to the nearest yen (rounding or rounding to even numbers is fine). Then display 10 results and check that the profit margin is around 30%. Note, however, that there are deficiencies in unit_price(unit_price) and unit_cost(unit_cost).
%%sql
WITH new_price_tbl AS (
SELECT
product_cd,
unit_price,
unit_cost,
ROUND(unit_cost / 0.7) AS new_price
FROM
product
)
SELECT
*,
(new_price - unit_cost) / new_price AS new_profit_rate
FROM
new_price_tbl
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
product_cd | unit_price | unit_cost | new_price | new_profit_rate |
---|---|---|---|---|
P040101001 | 198 | 149 | 213 | 0.30046948356807511737 |
P040101002 | 218 | 164 | 234 | 0.29914529914529914530 |
P040101003 | 230 | 173 | 247 | 0.29959514170040485830 |
P040101004 | 248 | 186 | 266 | 0.30075187969924812030 |
P040101005 | 268 | 201 | 287 | 0.29965156794425087108 |
P040101006 | 298 | 224 | 320 | 0.30000000000000000000 |
P040101007 | 338 | 254 | 363 | 0.30027548209366391185 |
P040101008 | 420 | 315 | 450 | 0.30000000000000000000 |
P040101009 | 498 | 374 | 534 | 0.29962546816479400749 |
P040101010 | 580 | 435 | 621 | 0.29951690821256038647 |
Commentary :
This code is written in SQL and is used to select data from a table named 'product' using a common table expression (CTE) named 'new_price_tbl'. The code calculates the new price for each product by dividing its unit cost by 0.7, and then calculates the new profit rate by subtracting the unit cost from the new price and dividing the result by the new price.
Here is a step-by-step breakdown of the code:WITH new_price_tbl AS (...)
: This defines a common table expression (CTE) named 'new_price_tbl' that is used to temporarily store the data from the SELECT statement enclosed in parentheses.SELECT product_cd, unit_price, unit_cost, ROUND(unit_cost / 0.7) AS new_price FROM product
: This selects the product code, unit price, unit cost, and calculates the new price for each product by dividing its unit cost by 0.7 and rounding the result.SELECT *, (new_price - unit_cost) / new_price AS new_profit_rate FROM new_price_tbl LIMIT 10
: This selects all columns from the 'new_price_tbl' CTE, and calculates the new profit rate by subtracting the unit cost from the new price and dividing the result by the new price. Finally, it limits the output to the first 10 rows.
In summary, this code calculates the new price and new profit rate for each product by using a CTE to temporarily store the necessary data and performing arithmetic operations on that data.
S-067: For each product in the product data (df_product), find a new unit price that gives a profit margin of 30%. This time, round up to the nearest yen. Then display 10 results and check that the profit margin is around 30%. Note, however, that there is a deficit in the unit price (unit_price) and cost (unit_cost).
%%sql
WITH new_price_tbl AS (
SELECT
product_cd,
unit_price,
unit_cost,
CEIL(unit_cost / 0.7) AS new_price
FROM
product
)
SELECT
*,
(new_price - unit_cost) / new_price AS new_profit_rate
FROM
new_price_tbl
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
product_cd | unit_price | unit_cost | new_price | new_profit_rate |
---|---|---|---|---|
P040101001 | 198 | 149 | 213 | 0.30046948356807511737 |
P040101002 | 218 | 164 | 235 | 0.30212765957446808511 |
P040101003 | 230 | 173 | 248 | 0.30241935483870967742 |
P040101004 | 248 | 186 | 266 | 0.30075187969924812030 |
P040101005 | 268 | 201 | 288 | 0.30208333333333333333 |
P040101006 | 298 | 224 | 320 | 0.30000000000000000000 |
P040101007 | 338 | 254 | 363 | 0.30027548209366391185 |
P040101008 | 420 | 315 | 450 | 0.30000000000000000000 |
P040101009 | 498 | 374 | 535 | 0.30093457943925233645 |
P040101010 | 580 | 435 | 622 | 0.30064308681672025723 |
Commentary :
This code is similar to the previous one, but instead of rounding the new price, it is rounded up to the nearest integer using the CEIL function.
Here is a step-by-step breakdown of the code:WITH new_price_tbl AS (...)
: This defines a common table expression (CTE) named 'new_price_tbl' that is used to temporarily store the data from the SELECT statement enclosed in parentheses.SELECT product_cd, unit_price, unit_cost, CEIL(unit_cost / 0.7) AS new_price FROM product
: This selects the product code, unit price, unit cost, and calculates the new price for each product by dividing its unit cost by 0.7 and rounding up the result to the nearest integer using the CEIL function.SELECT *, (new_price - unit_cost) / new_price AS new_profit_rate FROM new_price_tbl LIMIT 10
: This selects all columns from the 'new_price_tbl' CTE, and calculates the new profit rate by subtracting the unit cost from the new price and dividing the result by the new price. Finally, it limits the output to the first 10 rows.
In summary, this code calculates the new price and new profit rate for each product by using a CTE to temporarily store the necessary data and performing arithmetic operations on that data. However, the new price is rounded up to the nearest integer using the CEIL function instead of being rounded as in the previous code.
S-068: For each product in the product data (df_product), find the amount including tax at a consumption tax rate of 10%, rounding down fractions of a yen and displaying 10 results. Note that the unit price (unit_price) is missing.
%%sql
SELECT
product_cd,
unit_price,
TRUNC(unit_price * 1.1) AS tax_price
FROM
product
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
product_cd | unit_price | tax_price |
---|---|---|
P040101001 | 198 | 217 |
P040101002 | 218 | 239 |
P040101003 | 230 | 253 |
P040101004 | 248 | 272 |
P040101005 | 268 | 294 |
P040101006 | 298 | 327 |
P040101007 | 338 | 371 |
P040101008 | 420 | 462 |
P040101009 | 498 | 547 |
P040101010 | 580 | 638 |
Commentary :
This code is written in SQL and is used to select data from a table named 'product'. The code selects the product code, unit price, and calculates the tax-inclusive price for each product by multiplying its unit price by 1.1 and truncating the result to remove any decimal places.
Here is a step-by-step breakdown of the code:SELECT product_cd, unit_price, TRUNC(unit_price * 1.1) AS tax_price FROM product
: This selects the product code, unit price, and calculates the tax-inclusive price for each product by multiplying its unit price by 1.1 and truncating the result to remove any decimal places using the TRUNC function.FROM product
: This specifies the 'product' table from which the data is being selected.LIMIT 10
: This limits the output to the first 10 rows.
In summary, this code calculates the tax-inclusive price for each product by multiplying its unit price by 1.1 and truncating the result to remove any decimal places using the TRUNC function.
S-069: Combine receipt details data (df_receipt) and product data (df_product), calculate the total sales value of all products for each customer and the total sales value of products whose category major classification code (category_major_cd) is “07” (bottled canned food), and find the ratio between the two. Only customers with a sales record for category major category code “07” (bottled canned food) should be selected, and 10 results should be displayed.
%%sql
WITH amount_all AS(
SELECT
customer_id,
SUM(amount) AS sum_all
FROM
receipt
GROUP BY
customer_id
),
amount_07 AS (
SELECT
r.customer_id,
SUM(r.amount) AS sum_07
FROM
receipt r
JOIN
product p
ON
r.product_cd = p.product_cd
WHERE
p.category_major_cd = '07'
GROUP BY
customer_id
)
SELECT
amount_all.customer_id,
sum_all,
sum_07,
sum_07 * 1.0 / sum_all AS sales_rate
FROM
amount_all
JOIN
amount_07
ON
amount_all.customer_id = amount_07.customer_id
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | sum_all | sum_07 | sales_rate |
---|---|---|---|
CS001311000059 | 2302 | 102 | 0.04430929626411815812 |
CS011615000061 | 246 | 98 | 0.39837398373983739837 |
CS029212000033 | 3604 | 3604 | 1.00000000000000000000 |
CS007515000119 | 7157 | 2832 | 0.39569652088864049183 |
CS034515000123 | 3699 | 1202 | 0.32495268991619356583 |
CS026414000014 | 6671 | 3142 | 0.47099385399490331285 |
CS001615000099 | 768 | 318 | 0.41406250000000000000 |
CS010515000082 | 1482 | 553 | 0.37314439946018893387 |
CS019315000045 | 813 | 380 | 0.46740467404674046740 |
CS008513000099 | 1322 | 210 | 0.15885022692889561271 |
Commentary :
This code is written in SQL and is used to select data from two tables named 'receipt' and 'product' using two common table expressions (CTEs) named 'amount_all' and 'amount_07'. The code calculates the sum of amounts spent by each customer and the sum of amounts spent on products with a major category code of '07'. It then joins the two CTEs and calculates the sales rate of category '07' products for each customer.
Here is a step-by-step breakdown of the code:WITH amount_all AS(...)
: This defines a common table expression (CTE) named 'amount_all' that is used to temporarily store the data from the SELECT statement enclosed in parentheses.SELECT customer_id, SUM(amount) AS sum_all FROM receipt GROUP BY customer_id
: This selects the customer ID and the sum of the amounts spent by each customer from the 'receipt' table, and groups the result by the customer ID., amount_07 AS(...)
: This defines another CTE named 'amount_07' that is used to temporarily store the data from the SELECT statement enclosed in parentheses.SELECT r.customer_id, SUM(r.amount) AS sum_07 FROM receipt r JOIN product p ON r.product_cd = p.product_cd WHERE p.category_major_cd = '07' GROUP BY customer_id
: This selects the customer ID and the sum of the amounts spent on products with a major category code of '07' from the 'receipt' and 'product' tables, joins the two tables on the product code, filters the products by the major category code of '07', and groups the result by the customer ID.SELECT amount_all.customer_id, sum_all, sum_07, sum_07 * 1.0 / sum_all AS sales_rate FROM amount_all JOIN amount_07 ON amount_all.customer_id = amount_07.customer_id LIMIT 10
: This selects the customer ID, the sum of all amounts spent by each customer, the sum of amounts spent on category '07' products by each customer, and the sales rate of category '07' products for each customer by dividing the sum of amounts spent on category '07' products by the sum of all amounts spent by the customer. Finally, it joins the two CTEs on the customer ID and limits the output to the first 10 rows.
In summary, this code calculates the sales rate of category '07' products for each customer by first calculating the sum of all amounts spent by each customer and the sum of amounts spent on category '07' products by each customer, and then dividing the latter by the former. It does so by using two CTEs to temporarily store the necessary data and performing arithmetic operations and joins on that data.
S-070: Calculate the number of days elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt) and display 10 items with the customer ID (customer_id), sales date and membership application date. (Note that sales_ymd is numeric and application_date is a string).
%%sql
WITH receipt_distinct AS (
SELECT DISTINCT
customer_id,
sales_ymd
FROM
receipt
)
SELECT
c.customer_id,
r.sales_ymd,
c.application_date,
EXTRACT(DAY FROM (TO_TIMESTAMP(CAST(r.sales_ymd AS VARCHAR), 'YYYYMMDD') - TO_TIMESTAMP(c.application_date, 'YYYYMMDD'))) AS elapsed_days
FROM
receipt_distinct r
JOIN
customer c
ON
r.customer_id = c.customer_id
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | sales_ymd | application_date | elapsed_days |
---|---|---|---|
CS017515000010 | 20171117 | 20150803 | 837 |
CS019515000097 | 20190630 | 20141124 | 1679 |
CS008515000005 | 20170714 | 20150216 | 879 |
CS026414000097 | 20170809 | 20150430 | 832 |
CS034514000008 | 20181012 | 20150807 | 1162 |
CS029415000089 | 20180409 | 20150723 | 991 |
CS019411000012 | 20190314 | 20141213 | 1552 |
CS015614000006 | 20190802 | 20150211 | 1633 |
CS007515000053 | 20170712 | 20150325 | 840 |
CS024615000041 | 20170729 | 20150918 | 680 |
Commentary :
This code is written in SQL and is used to select data from two tables named 'receipt' and 'customer' using a common table expression (CTE) named 'receipt_distinct'. The code calculates the number of days that elapsed between the customer's application date and the date of a distinct sale made by the customer.
Here is a step-by-step breakdown of the code:WITH receipt_distinct AS(...)
: This defines a common table expression (CTE) named 'receipt_distinct' that is used to temporarily store the data from the SELECT statement enclosed in parentheses.SELECT distinct customer_id, sales_ymd FROM receipt
: This selects the distinct customer ID and sales year-month-day from the 'receipt' table.SELECT c.customer_id, r.sales_ymd, c.application_date, EXTRACT(DAY FROM (TO_TIMESTAMP(CAST(r.sales_ymd AS VARCHAR), 'YYYYMMDD') - TO_TIMESTAMP(c.application_date, 'YYYYMMDD'))) AS elapsed_days FROM receipt_distinct r JOIN customer c ON r.customer_id = c.customer_id LIMIT 10
: This selects the customer ID, sales year-month-day, application date, and the elapsed number of days between the customer's application date and the date of a distinct sale made by the customer. The calculation of the elapsed days involves converting the sales year-month-day and application date to timestamps and then computing the difference between them in days. Finally, it joins the 'receipt_distinct' CTE with the 'customer' table on the customer ID and limits the output to the first 10 rows.
In summary, this code calculates the number of days that elapsed between a customer's application date and the date of a distinct sale made by the customer by first selecting the distinct customer ID and sales year-month-day from the 'receipt' table, and then joining it with the 'customer' table to obtain the customer's application date. It then performs a calculation involving the conversion of the sales year-month-day and application date to timestamps and the computation of the difference between them in days. It does so by using a CTE to temporarily store the necessary data and performing arithmetic operations and joins on that data.
S-071: Calculate the number of months elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and membership application date (Note that sales_ymd is a number and application_date is a string).
%%sql
WITH receipt_distinct AS (
SELECT DISTINCT
customer_id,
sales_ymd
FROM
receipt
),
time_age_tbl AS(
SELECT
c.customer_id,
r.sales_ymd,
c.application_date,
AGE(TO_TIMESTAMP(CAST(r.sales_ymd AS VARCHAR), 'YYYYMMDD'),
TO_TIMESTAMP(c.application_date, 'YYYYMMDD')) AS time_age
FROM
receipt_distinct r
JOIN
customer c
ON
r.customer_id = c.customer_id
)
SELECT
customer_id,
sales_ymd,
application_date,
EXTRACT(YEAR FROM time_age) * 12 + EXTRACT(MONTH FROM time_age) AS elapsed_months
FROM
time_age_tbl
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | sales_ymd | application_date | elapsed_months |
---|---|---|---|
CS017515000010 | 20171117 | 20150803 | 27 |
CS019515000097 | 20190630 | 20141124 | 55 |
CS008515000005 | 20170714 | 20150216 | 28 |
CS026414000097 | 20170809 | 20150430 | 27 |
CS034514000008 | 20181012 | 20150807 | 38 |
CS029415000089 | 20180409 | 20150723 | 32 |
CS019411000012 | 20190314 | 20141213 | 51 |
CS015614000006 | 20190802 | 20150211 | 53 |
CS007515000053 | 20170712 | 20150325 | 27 |
CS024615000041 | 20170729 | 20150918 | 22 |
Commentary :
This code is written in SQL and is used to select data from two tables named 'receipt' and 'customer' using two common table expressions (CTEs) named 'receipt_distinct' and 'time_age_tbl'. The code calculates the elapsed number of months between the customer's application date and the date of a distinct sale made by the customer.
Here is a step-by-step breakdown of the code:WITH receipt_distinct AS (...)
: This defines a common table expression (CTE) named 'receipt_distinct' that is used to temporarily store the data from the SELECT statement enclosed in parentheses.SELECT DISTINCT customer_id, sales_ymd FROM receipt
: This selects the distinct customer ID and sales year-month-day from the 'receipt' table.WITH time_age_tbl AS (...)
: This defines another CTE named 'time_age_tbl' that is used to temporarily store the data from the SELECT statement enclosed in parentheses.SELECT c.customer_id, r.sales_ymd, c.application_date, AGE(TO_TIMESTAMP(CAST(r.sales_ymd AS VARCHAR), 'YYYYMMDD'), TO_TIMESTAMP(c.application_date, 'YYYYMMDD')) AS time_age FROM receipt_distinct r JOIN customer c ON r.customer_id = c.customer_id
: This selects the customer ID, sales year-month-day, application date, and the elapsed time between the customer's application date and the date of a distinct sale made by the customer. The calculation of the elapsed time involves converting the sales year-month-day and application date to timestamps and then computing the difference between them using the AGE function. Finally, it joins the 'receipt_distinct' CTE with the 'customer' table on the customer ID.SELECT customer_id, sales_ymd, application_date, EXTRACT(YEAR FROM time_age) * 12 + EXTRACT(MONTH FROM time_age) AS elapsed_months FROM time_age_tbl LIMIT 10
: This selects the customer ID, sales year-month-day, application date, and the elapsed number of months between the customer's application date and the date of a distinct sale made by the customer. The calculation of the elapsed months involves extracting the year and month components of the elapsed time and converting them to months. Finally, it limits the output to the first 10 rows.
In summary, this code calculates the elapsed number of months between a customer's application date and the date of a distinct sale made by the customer by first selecting the distinct customer ID and sales year-month-day from the 'receipt' table, and then joining it with the 'customer' table to obtain the customer's application date. It then calculates the elapsed time between the sales year-month-day and application date using a CTE and the AGE function, and converts the result to months. It does so by using two CTEs to temporarily store the necessary data and performing arithmetic operations and joins on that data.
S-072: Calculate the number of years elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (df_customer) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and membership application date (Note that sales_ymd is a number and application_date is a string).
%%sql
WITH receipt_distinct AS (
SELECT DISTINCT
customer_id,
sales_ymd
FROM
receipt
)
SELECT
c.customer_id,
r.sales_ymd,
c.application_date,
EXTRACT(YEAR FROM AGE(TO_TIMESTAMP(CAST(r.sales_ymd AS VARCHAR), 'YYYYMMDD'),
TO_TIMESTAMP(c.application_date, 'YYYYMMDD'))) AS elapsed_years
FROM
receipt_distinct r
JOIN
customer c
ON
r.customer_id = c.customer_id
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | sales_ymd | application_date | elapsed_years |
---|---|---|---|
CS017515000010 | 20171117 | 20150803 | 2 |
CS019515000097 | 20190630 | 20141124 | 4 |
CS008515000005 | 20170714 | 20150216 | 2 |
CS026414000097 | 20170809 | 20150430 | 2 |
CS034514000008 | 20181012 | 20150807 | 3 |
CS029415000089 | 20180409 | 20150723 | 2 |
CS019411000012 | 20190314 | 20141213 | 4 |
CS015614000006 | 20190802 | 20150211 | 4 |
CS007515000053 | 20170712 | 20150325 | 2 |
CS024615000041 | 20170729 | 20150918 | 1 |
Commentary :
This SQL code retrieves the customer ID, sales date, and application date of customers who have made purchases, and calculates the number of elapsed years between the application date and the sales date.
The code starts by creating a common table expression calledreceipt_distinct
, which selects distinct combinations of customer ID and sales date from thereceipt
table. This ensures that each customer is only counted once, even if they made multiple purchases.
The main query then selects the customer ID, sales date, and application date from thereceipt_distinct
table, and uses theEXTRACT
function to calculate the number of elapsed years between the application date and the sales date. TheAGE
function is used to calculate the time difference between the two dates, and theEXTRACT
function is used to extract the year component from the resulting interval.
The result is a table that shows the customer ID, sales date, application date, and number of elapsed years for each customer. TheLIMIT
clause is used to restrict the output to the first 10 rows of the result set.
S-073: Calculate the elapsed time in epoch seconds from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and (Note that sales_ymd is held as a number and application_date as a string). Note that no time information is held, so each date should represent 0:00:00:00.
%%sql
WITH receipt_distinct AS (
SELECT DISTINCT
customer_id,
sales_ymd
FROM
receipt
)
SELECT
c.customer_id,
r.sales_ymd,
c.application_date,
EXTRACT(EPOCH FROM TO_TIMESTAMP(CAST(r.sales_ymd AS VARCHAR), 'YYYYMMDD') - TO_TIMESTAMP(c.application_date, 'YYYYMMDD')) AS elapsed_epoch
FROM
receipt_distinct r
JOIN
customer c
ON
r.customer_id = c.customer_id
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | sales_ymd | application_date | elapsed_epoch |
---|---|---|---|
CS017515000010 | 20171117 | 20150803 | 72316800.000000 |
CS019515000097 | 20190630 | 20141124 | 145065600.000000 |
CS008515000005 | 20170714 | 20150216 | 75945600.000000 |
CS026414000097 | 20170809 | 20150430 | 71884800.000000 |
CS034514000008 | 20181012 | 20150807 | 100396800.000000 |
CS029415000089 | 20180409 | 20150723 | 85622400.000000 |
CS019411000012 | 20190314 | 20141213 | 134092800.000000 |
CS015614000006 | 20190802 | 20150211 | 141091200.000000 |
CS007515000053 | 20170712 | 20150325 | 72576000.000000 |
CS024615000041 | 20170729 | 20150918 | 58752000.000000 |
Commentary :
This SQL code retrieves the customer ID, sales date, and application date of customers who have made purchases, and calculates the number of elapsed seconds between the application date and the sales date.
The code starts by creating a common table expression calledreceipt_distinct
, which selects distinct combinations of customer ID and sales date from thereceipt
table. This ensures that each customer is only counted once, even if they made multiple purchases.
The main query then selects the customer ID, sales date, and application date from thereceipt_distinct
table, and uses theEXTRACT
function along withTO_TIMESTAMP
function to calculate the number of elapsed seconds between the application date and the sales date. TheEXTRACT(EPOCH FROM <interval>)
function is used to extract the number of seconds from the resulting interval.
The result is a table that shows the customer ID, sales date, application date, and number of elapsed seconds for each customer. TheLIMIT
clause is used to restrict the output to the first 10 rows of the result set.
S-074: For the sales date (sales_ymd) of the receipt details data (df_receipt), calculate the number of days elapsed since Monday of the week in question and display 10 items together with the sales date and the previous Monday (note that sales_ymd holds data in numerical form).
%%sql
WITH elapsed_days_tbl AS (
SELECT
TO_DATE(CAST(sales_ymd AS VARCHAR), 'YYYYMMDD') AS sales_ymd,
EXTRACT(DOW FROM (TO_DATE(CAST(sales_ymd AS VARCHAR), 'YYYYMMDD') - 1)) AS elapsed_days
FROM
receipt
)
SELECT
sales_ymd,
elapsed_days,
sales_ymd - CAST(elapsed_days AS INTEGER) AS monday
FROM
elapsed_days_tbl
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
sales_ymd | elapsed_days | monday |
---|---|---|
2018-11-03 | 5 | 2018-10-29 |
2018-11-18 | 6 | 2018-11-12 |
2017-07-12 | 2 | 2017-07-10 |
2019-02-05 | 1 | 2019-02-04 |
2018-08-21 | 1 | 2018-08-20 |
2019-06-05 | 2 | 2019-06-03 |
2018-12-05 | 2 | 2018-12-03 |
2019-09-22 | 6 | 2019-09-16 |
2017-05-04 | 3 | 2017-05-01 |
2019-10-10 | 3 | 2019-10-07 |
Commentary :
This SQL code starts by defining a Common Table Expression (CTE) namedelapsed_days_tbl
, which extracts the day of the week (as a number from 0 to 6, where 0 is Sunday) for eachsales_ymd
date in thereceipt
table.
In the main query,sales_ymd
andelapsed_days
are selected from theelapsed_days_tbl
CTE. Then, the code calculates a new date by subtracting theelapsed_days
fromsales_ymd
. This is done by casting theelapsed_days
as an integer and subtracting it fromsales_ymd
. The resulting date is the Monday of the week corresponding to thesales_ymd
date.
TheLIMIT 10
clause at the end of the code limits the output to the first 10 rows.
S-075: Randomly extract 1% of the data from the customer data (df_customer) and display the first 10 items.
%%sql
-- Code example 1 (If you want to keep it simple. However, the number of cases varies around 1.0%)
SELECT
*
FROM
customer WHERE RANDOM() <= 0.01
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | customer_name | gender_cd | gender | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd |
---|---|---|---|---|---|---|---|---|---|---|
CS019415000117 | 宮脇 芽以 | 1 | 女性 | 1974-07-10 | 44 | 173-0036 | 東京都板橋区向原********** | S13019 | 20141114 | C-20100720-D |
CS040513000111 | 寺西 奈央 | 1 | 女性 | 1966-06-03 | 52 | 226-0027 | 神奈川県横浜市緑区長津田********** | S14040 | 20150728 | 6-20090622-6 |
CS029402000041 | 浅利 俊二 | 0 | 男性 | 1975-08-15 | 43 | 134-0013 | 東京都江戸川区江戸川********** | S12029 | 20150220 | 0-00000000-0 |
CS019712000008 | 梅本 ヒカル | 1 | 女性 | 1945-04-14 | 73 | 173-0037 | 東京都板橋区小茂根********** | S13019 | 20150613 | 0-00000000-0 |
CS015713000077 | 長沢 結衣 | 1 | 女性 | 1947-10-09 | 71 | 136-0075 | 東京都江東区新砂********** | S13015 | 20150308 | 0-00000000-0 |
CS008515000014 | 野沢 あさみ | 1 | 女性 | 1959-06-09 | 59 | 157-0067 | 東京都世田谷区喜多見********** | S13008 | 20150219 | 9-20091212-B |
CS031514000047 | 原口 礼子 | 1 | 女性 | 1963-07-26 | 55 | 151-0064 | 東京都渋谷区上原********** | S13031 | 20150927 | 8-20090607-6 |
CS004313000412 | 春日 瞳 | 1 | 女性 | 1984-07-26 | 34 | 176-0024 | 東京都練馬区中村********** | S13004 | 20170525 | 0-00000000-0 |
CS035513000155 | 板倉 昌代 | 1 | 女性 | 1960-06-12 | 58 | 157-0075 | 東京都世田谷区砧公園********** | S13035 | 20150625 | 0-00000000-0 |
CS003512000587 | 大山 沙知絵 | 1 | 女性 | 1959-04-01 | 59 | 214-0014 | 神奈川県川崎市多摩区登戸********** | S13003 | 20170306 | 0-00000000-0 |
Commentary :
This SQL code selects a random subset of 10 customers from thecustomer
table. TheRANDOM()
function generates a random value between 0 and 1 for each row in the table, and theWHERE
clause filters the rows based on whether the generated value is less than or equal to 0.01 (1% chance of being selected).
This means that each row has an equal chance of being selected, regardless of any other criteria or order in the table. TheLIMIT
clause is used to restrict the output to the first 10 rows that meet the condition.
%%sql
-- Code example 2 (If you want to be polite. Create a count and fix the number of outputs)
WITH customer_tmp AS(
SELECT
*,
ROW_NUMBER() OVER(ORDER BY RANDOM()) AS row,
COUNT(*) OVER() AS cnt
FROM
customer
)
SELECT
customer_id,
customer_name,
gender_cd,
gender,
birth_day,
age,
postal_cd,
address,
application_store_cd,
application_date,
status_cd
FROM
customer_tmp
WHERE
row <= cnt * 0.01
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | customer_name | gender_cd | gender | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd |
---|---|---|---|---|---|---|---|---|---|---|
CS027514000069 | 土屋 さやか | 9 | 不明 | 1967-02-10 | 52 | 251-0033 | 神奈川県藤沢市片瀬山********** | S14027 | 20141213 | A-20101018-B |
CS031504000012 | 向井 剛基 | 0 | 男性 | 1966-09-19 | 52 | 151-0062 | 東京都渋谷区元代々木町********** | S13031 | 20150310 | 0-00000000-0 |
CS028403000022 | 新垣 三郎 | 0 | 男性 | 1973-09-12 | 45 | 246-0012 | 神奈川県横浜市瀬谷区東野********** | S14028 | 20150905 | 0-00000000-0 |
CS001415000103 | 水谷 千夏 | 1 | 女性 | 1976-12-08 | 42 | 144-0051 | 東京都大田区西蒲田********** | S13001 | 20150509 | A-20100925-C |
CS019313000145 | 早美 由美子 | 1 | 女性 | 1985-02-20 | 34 | 173-0033 | 東京都板橋区大山西町********** | S13019 | 20141226 | 0-00000000-0 |
CS002412000346 | 荒川 美嘉 | 1 | 女性 | 1973-03-22 | 46 | 187-0045 | 東京都小平市学園西町********** | S13002 | 20160808 | 0-00000000-0 |
CS003415000271 | 稲垣 遥 | 1 | 女性 | 1975-12-14 | 43 | 201-0001 | 東京都狛江市西野川********** | S13003 | 20160630 | A-20090325-8 |
CS029502000052 | 岡崎 優一 | 0 | 男性 | 1963-03-28 | 56 | 134-0085 | 東京都江戸川区南葛西********** | S12029 | 20150803 | 0-00000000-0 |
CS002515000290 | 大山 みあ | 1 | 女性 | 1959-06-05 | 59 | 185-0023 | 東京都国分寺市西元町********** | S13002 | 20160627 | C-20100730-B |
CS009314000030 | 西川 奈々 | 1 | 女性 | 1983-05-15 | 35 | 158-0091 | 東京都世田谷区中町********** | S13009 | 20150519 | E-20100910-D |
Commentary :
This code is using the SQL programming language to select a random subset of customers from the "customer" table.
The code first creates a temporary table called "customer_tmp" by selecting all columns from the "customer" table and adding two additional columns using the ROW_NUMBER and COUNT functions.
The ROW_NUMBER function assigns a unique integer to each row in the table, based on the result of the ORDER BY RANDOM() clause. This effectively randomizes the order of the rows in the table.
The COUNT function with the OVER() clause counts the total number of rows in the table, and creates a new column called "cnt" that contains this count for each row.
The main SELECT statement then selects a subset of customers from the temporary table by filtering on the "row" column, which was created by the ROW_NUMBER function. Specifically, it selects the first 1% of rows, as determined by the "cnt" column created by the COUNT function.
The final clause, "LIMIT 10", limits the results to the first 10 rows of the subset, effectively returning a random sample of 10 customers from the original "customer" table.
Overall, this code is a simple way to select a random sample of customers from a large table for testing or exploratory analysis purposes. The sampling percentage (in this case, 1%) can be adjusted to create larger or smaller samples.
S-076: Extract 10% of the data stratified randomly from the customer data (df_customer) based on the percentage of gender code (gender_cd) and count the number of cases by gender code
%%sql
-- Code Example 1
WITH cusotmer_random AS (
SELECT
customer_id,
gender_cd,
cnt
FROM
(
SELECT
gender_cd,
ARRAY_AGG(customer_id ORDER BY RANDOM()) AS customer_r,
COUNT(1) AS cnt
FROM
customer
GROUP BY
gender_cd
)sample, UNNEST(customer_r) AS customer_id
),
cusotmer_rownum AS(
SELECT
* ,
ROW_NUMBER() OVER(PARTITION BY gender_cd) AS rn
FROM
cusotmer_random
)
SELECT
gender_cd,
COUNT(1) AS customer_num
FROM
cusotmer_rownum
WHERE
rn <= cnt * 0.1
GROUP BY
gender_cd
;
* postgresql://padawan:***@db:5432/dsdojo_db 3 rows affected.
gender_cd | customer_num |
---|---|
0 | 298 |
1 | 1791 |
9 | 107 |
Commentary :
This SQL code is selecting a random sample of customers from a table called "customer" and then calculating the number of customers in the sample that fall into each gender category.
Here is a breakdown of what each part of the code is doing:
WITH clause: This clause creates two temporary tables, "customer_random" and "customer_rownum," that will be used in the subsequent SELECT statement.
customer_random table: This table selects a random sample of customers from the "customer" table and assigns each customer a "customer_r" value based on their gender. The "cnt" column counts the total number of customers in each gender group.
customer_rownum table: This table adds a row number ("rn") to each customer in the "customer_random" table, partitioned by gender.
SELECT statement: This statement counts the number of customers in each gender group whose row number is less than or equal to 10% of the total number of customers in that gender group (i.e., the top 10% of customers by row number). The result is a table with two columns, "gender_cd" and "customer_num", where "gender_cd" is the gender code and "customer_num" is the number of customers in that gender group that fall into the top 10% by row number.
Overall, this code is useful for selecting a random sample of customers from a larger dataset and then analyzing their gender distribution.
%%sql
-- Code Example 2
WITH cusotmer_random AS (
SELECT
* ,
ROW_NUMBER() OVER(PARTITION BY gender_cd ORDER BY RANDOM()) AS rn,
COUNT(1) OVER(PARTITION BY gender_cd) cnt
FROM
customer
)
SELECT
gender_cd,
COUNT(1) AS customer_num
FROM
cusotmer_random
WHERE
rn <= cnt * 0.1
GROUP BY
gender_cd
;
* postgresql://padawan:***@db:5432/dsdojo_db 3 rows affected.
gender_cd | customer_num |
---|---|
9 | 107 |
0 | 298 |
1 | 1791 |
Commentary :
This code is using the SQL programming language to randomly select a subset of customers from the "customer" table, grouping them by their gender and returning the number of customers in each gender group.
The code first creates a temporary table called "cusotmer_random" by selecting all columns from the "customer" table and adding two additional columns using the ROW_NUMBER and COUNT functions.
The ROW_NUMBER function assigns a unique integer to each row in the table, based on the result of the ORDER BY RANDOM() clause. This effectively randomizes the order of the rows in the table.
The PARTITION BY gender_cd clause in the ROW_NUMBER function ensures that the random ordering is done separately for each gender group.
The COUNT function with the OVER(PARTITION BY gender_cd) clause counts the total number of rows in each gender group, and creates a new column called "cnt" that contains this count for each row.
The main SELECT statement then selects a subset of customers from the temporary table by filtering on the "rn" column, which was created by the ROW_NUMBER function. Specifically, it selects the first 10% of rows in each gender group, as determined by the "cnt" column created by the COUNT function.
The WHERE clause filters the rows in the "cusotmer_random" table to only include those with an "rn" value less than or equal to 10% of the "cnt" value for their respective gender group.
Finally, the main SELECT statement groups the rows by gender_cd and returns the count of rows in each group, effectively giving the number of customers in each gender group in the random sample.
Overall, this code is a way to randomly select a subset of customers from a large table while ensuring that the sample includes a similar proportion of each gender group as the original table. The sampling percentage (in this case, 10%) can be adjusted to create larger or smaller samples.
S-077: Sum the sales amounts in the receipt details data (df_receipt) by customer unit and extract outliers of the summed sales amounts. The outliers should be calculated by logarithmising the total sales amount, calculating the mean and standard deviation, and then deviating from the mean by more than 3σ (either the natural logarithm or the ordinary logarithm is acceptable). Display 10 results.
%%sql
WITH sales_amount AS(
SELECT
customer_id,
SUM(amount) AS sum_amount,
LN(SUM(amount) + 0.5) AS log_sum_amount
FROM
receipt
GROUP BY
customer_id
)
SELECT
customer_id,
sum_amount,
log_sum_amount
FROM
sales_amount
CROSS JOIN (
SELECT
AVG(log_sum_amount) AS avg_amount,
STDDEV_POP(log_sum_amount) AS std_amount
FROM
sales_amount
) stats_amount
WHERE
ABS(log_sum_amount - avg_amount) / std_amount > 3
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
customer_id | sum_amount | log_sum_amount |
---|---|---|
ZZ000000000000 | 12395003 | 16.332804005823312 |
Commentary :
This code is using the SQL programming language to identify customers whose total purchase amount falls significantly outside the norm, based on their logarithmically transformed purchase amount.
The code first creates a temporary table called "sales_amount" by selecting the customer_id and sum of purchase amount from the "receipt" table, grouped by customer_id. The sum of purchase amount is also transformed into a logarithmic scale using the LN function, which adds 0.5 to avoid taking the logarithm of zero.
The main SELECT statement then selects the customer_id, sum_amount, and log_sum_amount columns from the temporary table. It also uses a CROSS JOIN to join in another temporary table called "stats_amount", which is created by selecting the average and population standard deviation of the log_sum_amount column from the "sales_amount" table.
The WHERE clause filters the rows in the "sales_amount" table to only include those where the absolute difference between the log_sum_amount and the average log_sum_amount, divided by the standard deviation of the log_sum_amount, is greater than 3. This is a way to identify customers whose purchase amounts are more than three standard deviations from the mean, which is a common threshold for identifying outliers in statistical analysis.
Finally, the main SELECT statement limits the output to 10 rows.
Overall, this code is a way to identify customers who have made significantly large or small purchases, relative to the average purchase amount in the dataset, by transforming the purchase amounts into a logarithmic scale and applying a statistical outlier detection criterion.
S-078: Sum the sales amount (amount) of the receipt details data (df_receipt) in customer units and extract outliers of the summed sales amount. However, exclude those whose customer ID starts with “Z”, as they represent non-members. Outliers are defined as the difference between the first and third quartiles using IQR, and are defined as being below “first quartile – 1.5 x IQR” or above “third quartile + 1.5 x IQR”. Show 10 results.
%%sql
WITH sales_amount AS(
SELECT
customer_id,
SUM(amount) AS sum_amount
FROM
receipt
WHERE
customer_id NOT LIKE 'Z%'
GROUP BY
customer_id
)
SELECT
customer_id,
sum_amount
FROM
sales_amount
CROSS JOIN (
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP(ORDER BY sum_amount) AS amount_25per,
PERCENTILE_CONT(0.75) WITHIN GROUP(ORDER BY sum_amount) AS amount_75per
FROM
sales_amount
) stats_amount
WHERE
sum_amount < amount_25per - (amount_75per - amount_25per) * 1.5
OR
amount_75per + (amount_75per - amount_25per) * 1.5 < sum_amount
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 10 rows affected.
customer_id | sum_amount |
---|---|
CS013415000226 | 8362 |
CS011415000041 | 9454 |
CS014514000004 | 8872 |
CS021514000008 | 12839 |
CS014515000007 | 9763 |
CS040415000220 | 10158 |
CS028415000161 | 8465 |
CS034515000173 | 10074 |
CS022515000065 | 12903 |
CS007514000094 | 15735 |
Commentary :
This code is using the SQL programming language to query a database. Specifically, it is using a common table expression (CTE) to create a temporary table called "sales_amount", which calculates the total amount of sales for each customer in the "receipt" table.
The WHERE clause of the CTE filters out any customers whose ID starts with the letter "Z". The GROUP BY clause groups the sales by customer ID.
The main query then performs a CROSS JOIN between the "sales_amount" table and a subquery that calculates the 25th and 75th percentiles of the sales amounts using the PERCENTILE_CONT function. These percentiles are stored in a temporary table called "stats_amount".
Finally, the WHERE clause of the main query filters out any sales amounts that are more than 1.5 times the interquartile range (IQR) away from the 25th or 75th percentiles. This is a common method for identifying outliers in a dataset. The results are limited to the top 10 rows.
Overall, this code is querying a database to identify customers with unusually high or low sales amounts compared to their peers, based on their position in the interquartile range.
S-079: For each item of product data (df_product), check the number of missing items.
%%sql
SELECT
SUM(CASE WHEN product_cd IS NULL THEN 1 ELSE 0 END) AS product_cd,
SUM(CASE WHEN category_major_cd IS NULL THEN 1 ELSE 0 END) AS category_major_cd,
SUM(CASE WHEN category_medium_cd IS NULL THEN 1 ELSE 0 END) AS category_medium_cd,
SUM(CASE WHEN category_small_cd IS NULL THEN 1 ELSE 0 END) AS category_small_cd,
SUM(CASE WHEN unit_price IS NULL THEN 1 ELSE 0 END) AS unit_price,
SUM(CASE WHEN unit_cost IS NULL THEN 1 ELSE 0 END) AS unit_cost
FROM
product
LIMIT
10
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 7 | 7 |
Commentary :
This code is using the SQL programming language to query a database. Specifically, it is querying the "product" table and using a combination of the SUM and CASE statements to count the number of missing values (NULLs) in each column of the table.
The query selects six columns and gives them aliases that match the names of the columns in the table: "product_cd", "category_major_cd", "category_medium_cd", "category_small_cd", "unit_price", and "unit_cost".
For each column, the CASE statement checks whether the value is NULL, and returns 1 if it is, or 0 otherwise. The SUM function then adds up all the 1s and 0s to give the total number of NULL values in each column.
The LIMIT 10 clause limits the results to the first 10 rows of the table, which is not particularly meaningful in this case since we are only interested in the summary statistics.
Overall, this code is a simple way to check for missing data in the "product" table and get a quick summary of the number of missing values in each column.
S-080: Create a new product data, deleting all records where any item in the product data (df_product) is missing. Note that the number of items before and after the deletion should be displayed, and also confirm that the number of items has decreased by the number of items checked in 079.
%%sql
DROP TABLE IF EXISTS product_1;
CREATE TABLE product_1 AS (
SELECT
*
FROM
product
WHERE
unit_price IS NOT NULL
AND
unit_cost IS NOT NULL
);
* postgresql://padawan:***@db:5432/dsdojo_db Done. 10023 rows affected.
[]
%%sql
SELECT
'Before deletion',
COUNT(1)
FROM
product
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
?column? | count |
---|---|
Before deletion | 10030 |
%%sql
SELECT
'After deletion',
COUNT(1)
FROM
product_1
;
* postgresql://padawan:***@db:5432/dsdojo_db 1 rows affected.
?column? | count |
---|---|
Before deletion | 10023 |
Commentary :
This code is using the SQL programming language to create a new table called "product_1" by selecting only the rows from the "product" table where both the "unit_price" and "unit_cost" columns are not NULL.
The first line of the code, "DROP TABLE IF EXISTS product_1;", is a safety measure that checks whether the "product_1" table already exists and drops it if it does, to avoid any conflicts with the new table being created.
The second line, "CREATE TABLE product_1 AS (...);", specifies that a new table called "product_1" should be created, and that its contents should be based on the results of a SELECT statement.
The SELECT statement selects all columns from the "product" table, but includes a WHERE clause that filters out any rows where either the "unit_price" or "unit_cost" column is NULL.
Overall, this code creates a new table that contains only the rows from the original "product" table that have valid data for both the "unit_price" and "unit_cost" columns. This is a common way to clean up data and prepare it for analysis or modeling, as missing data can cause issues with some analytical techniques.
Comment