参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック（構造化データ加工編）」

The Data Scientist Society Github :

GitHub - The-Japan-DataScientist-Society/100knocks-preprocess: データサイエンス100本ノック（構造化データ加工編）

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Data Science 100 Knocks (Structured Data Processing) URL :

100knocks-preprocess/docker/work/answer/ans_preprocess_knock_SQL.ipynb at master · The-Japan-DataScientist-Society/100knocks-preprocess

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers.
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

You can get preprocess_knock files to practice python, SQL, R from my github account

ChanhiYasutomi - Repositories

ChanhiYasutomi has 20 repositories available. Follow their code on GitHub.

Introduction
How to use
100 data processing knocks

Introduction

The database is PostgreSQL13
First, execute the following cell
You can issue SQL by writing %%sql in the cell
You cannot check the table structure with the describe command from jupyter, so if you want to check the table structure, use a SELECT with a limit, for example.
You can use any familiar SQL client (connection information is as follows)
- IP address: localhost for Docker Desktop, 192.168.99.100 for Docker toolbox
- Port:5432
- Database name: dsdojo_db
- User name: padawan
- Password: padawan12345
It is recommended to limit the number of outputs, as Jupyter may freeze if a large amount of outputs are performed (the number of outputs is also mentioned in the question).
- Data processing also requires the ability to control the amount of results to be displayed for confirmation and to make light work of them.
If a large number of results are output, the file may become so heavy that it cannot be opened afterwards.
- In this case, the results of your work will be lost, but you can retrieve the file from GitHub.
- You can also delete large output ranges using a vim editor, for example.
Names, addresses etc. are dummy data and do not exist.

In [ ]:

%load_ext sql
import os

pgconfig = {
    'host': 'db',
    'port': os.environ['PG_PORT'],
    'database': os.environ['PG_DATABASE'],
    'user': os.environ['PG_USER'],
    'password': os.environ['PG_PASSWORD'],
}
dsl = 'postgresql://{user}:{password}@{host}:{port}/{database}'.format(**pgconfig)

# Set up Magic commands to write SQL
%sql $dsl

Out[ ]:

'Connected: padawan@dsdojo_db'

How to use

You can execute SQL from Jupyter against PostgreSQL by writing %%sql at the beginning of the cell and SQL in the second and subsequent lines.

In [ ]:

%%sql
SELECT 'You can run it like this' AS sample;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

sample
You can run it like this

100 data processing knocks

S-061: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), convert the total sales amount to ordinary logarithm (bottom 10) and display 10 items with customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.

In [ ]:

%%sql

SELECT
  customer_id,
  sum_amount,
  LOG(sum_amount + 0.5) AS log_amount
FROM
(
  SELECT
      customer_id,
      SUM(amount) AS sum_amount
  FROM
      receipt
  WHERE
      customer_id NOT LIKE 'Z%'
  GROUP BY
      customer_id
) AS sum_amount_tbl
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	sum_amount	log_amount
CS001311000059	2302	3.3621996388688865
CS004614000122	248	2.3953263930693509
CS003512000043	298	2.4749443354653879
CS011615000061	246	2.3918169236132488
CS029212000033	3604	3.5568450298595813
CS007515000119	7157	3.8547613566936362
CS034515000123	3699	3.5681430316577019
CS004315000058	490	2.6906390117159673
CS026414000014	6671	3.8242234903608168
CS001615000099	768	2.8856438718357639

Commentary :

This is a SQL query that retrieves data from a database. Let's break it down step by step:

The query begins with the %%sql magic command, which is used in Jupyter notebooks to specify that the following code is SQL code.

The main query selects three columns: customer_id, sum_amount, and log_amount. These columns will be derived from the subquery in parentheses.

The subquery in parentheses first filters out all the records where the customer_id starts with the letter "Z". This is done with the NOT LIKE 'Z%' condition.

The filtered records are then grouped by customer_id using the GROUP BY clause.

The SUM function is used to calculate the total amount for each customer_id.

The AS sum_amount clause renames the calculated column to sum_amount.

The subquery is then aliased as sum_amount_tbl.

Finally, the outer query selects the customer_id, sum_amount, and the logarithm (with base e) of the sum_amount plus 0.5. This is calculated using the LOG function and aliased as log_amount.

The results are limited to 10 rows with the LIMIT clause.

S-062: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), convert the total sales amount to the natural logarithm (bottom e) and display 10 items with the customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.

In [ ]:

%%sql

SELECT
  customer_id,
  sum_amount,
  LN(sum_amount + 0.5) AS log_amount
FROM
(
  SELECT
      customer_id,
      SUM(amount) AS sum_amount
  FROM
      receipt
  WHERE
      customer_id NOT LIKE 'Z%'
  GROUP BY
      customer_id
) AS sum_amount_tbl
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	sum_amount	log_amount
CS001311000059	2302	7.7417507681294619
CS004614000122	248	5.5154428455366834
CS003512000043	298	5.6987699328326568
CS011615000061	246	5.5073619934827448
CS029212000033	3604	8.1899383438446333
CS007515000119	7157	8.8759160369722701
CS034515000123	3699	8.2159529543656056
CS004315000058	490	6.1954252790054178
CS026414000014	6671	8.8056000011824754
CS001615000099	768	6.6444405629786506

Commentary :

This is a SQL query that retrieves data from a database. Let's break it down step by step:

The query begins with the %%sql magic command, which is used in Jupyter notebooks to specify that the following code is SQL code.

The main query selects three columns: customer_id, sum_amount, and log_amount. These columns will be derived from the subquery in parentheses.

The subquery in parentheses first filters out all the records where the customer_id starts with the letter "Z". This is done with the NOT LIKE 'Z%' condition.

The filtered records are then grouped by customer_id using the GROUP BY clause.

The SUM function is used to calculate the total amount for each customer_id.

The AS sum_amount clause renames the calculated column to sum_amount.

The subquery is then aliased as sum_amount_tbl.

Finally, the outer query selects the customer_id, sum_amount, and the natural logarithm (with base e) of the sum_amount plus 0.5. This is calculated using the LN function and aliased as log_amount.

The results are limited to 10 rows with the LIMIT clause.

S-063: Calculate the profit amount of each product from the unit price(unit price) and cost of the product(unit_cost) data (df_product) and display 10 results.

In [ ]:

%%sql

SELECT
  product_cd, 
  unit_price, 
  unit_cost,
  unit_price - unit_cost AS unit_profit
FROM
  product
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

product_cd	unit_price	unit_cost	unit_profit
P040101001	198	149	49
P040101002	218	164	54
P040101003	230	173	57
P040101004	248	186	62
P040101005	268	201	67
P040101006	298	224	74
P040101007	338	254	84
P040101008	420	315	105
P040101009	498	374	124
P040101010	580	435	145

Commentary :

This is a SQL query that retrieves data from a database. Let's break it down step by step:

The query begins with the %%sql magic command, which is used in Jupyter notebooks to specify that the following code is SQL code.

The main query selects four columns: product_cd, unit_price, unit_cost, and unit_profit. These columns will be derived from the product table.

The FROM clause specifies the product table.

The LIMIT clause limits the number of rows returned by the query to 10.

The query calculates the unit_profit by subtracting the unit_cost from the unit_price. This is done using the - operator.

The results of the query include the product_cd, unit_price, unit_cost, and unit_profit for the first 10 rows in the product table.

Overall, this query retrieves information about the price, cost, and profit for a selection of products.

S-064: Calculate the overall average profit margin for each product from the unit price (unit_price) and cost (unit_cost) of the product data (df_product). Note, however, that unit_price and cost are missing.

In [ ]:

%%sql

SELECT
  AVG((unit_price * 1.0 - unit_cost) / unit_price) AS unit_profit_rate
FROM
  product
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

unit_profit_rate
0.24911389885177001279

Commentary :

This code is using SQL to query the average unit profit rate for the top 10 products in a database table called "product". Here is an explanation of each part of the code:

"%%sql" is a Jupyter notebook magic command that tells the notebook to interpret the following code as SQL.

"SELECT" is a SQL keyword that indicates that we want to retrieve data from the database.

"AVG" is a SQL function that calculates the average of a given set of values.

"((unit_price * 1.0 - unit_cost) / unit_price)" is a formula that calculates the unit profit rate for each product. It subtracts the unit cost from the unit price and divides the result by the unit price. The multiplication by 1.0 is done to ensure that the division operation returns a floating-point value, rather than an integer.

"AS unit_profit_rate" assigns the calculated value to a column called "unit_profit_rate".

"FROM product" specifies that we want to retrieve data from the "product" table.

"LIMIT 10" limits the number of rows returned to 10, meaning we only want to see the results for the top 10 products by some criteria (which is not specified in the code snippet).

S-065: For each product in the product data (df_product), find the new unit price that gives a profit margin of 30%. However, round down to the nearest yen. Then display 10 results and confirm that the profit margin is approximately around 30%. Note that there is a deficit in unit_price（unit_price） and unit_cost（unit_cost）.

In [ ]:

%%sql

WITH new_price_tbl AS (
  SELECT
      product_cd, 
      unit_price, 
      unit_cost,
      TRUNC(unit_cost / 0.7) AS new_price
  FROM
      product
) 
SELECT
  *,
  (new_price - unit_cost) / new_price AS new_profit_rate
FROM
  new_price_tbl
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

product_cd	unit_price	unit_cost	new_price	new_profit_rate
P040101001	198	149	212	0.29716981132075471698
P040101002	218	164	234	0.29914529914529914530
P040101003	230	173	247	0.29959514170040485830
P040101004	248	186	265	0.29811320754716981132
P040101005	268	201	287	0.29965156794425087108
P040101006	298	224	320	0.30000000000000000000
P040101007	338	254	362	0.29834254143646408840
P040101008	420	315	450	0.30000000000000000000
P040101009	498	374	534	0.29962546816479400749
P040101010	580	435	621	0.29951690821256038647

Commentary :

This code is using SQL to query a modified version of a database table called "product". The modification involves calculating a new price for each product based on its unit cost, and then using this new price to calculate a new profit rate. Here is an explanation of each part of the code:

"%%sql" is a Jupyter notebook magic command that tells the notebook to interpret the following code as SQL.

"WITH new_price_tbl AS (...)" sets up a Common Table Expression (CTE) called "new_price_tbl". This allows us to define a subquery that can be used in the subsequent query.

"( SELECT product_cd, unit_price, unit_cost, TRUNC(unit_cost / 0.7) AS new_price FROM product )" is a subquery that selects data from the "product" table, calculates a new price based on the unit cost, and assigns it to a new column called "new_price". The "TRUNC" function is used to round the calculated new price down to the nearest integer, based on the assumption that the company wants to price their products at a whole number value.

"SELECT *, (new_price - unit_cost) / new_price AS new_profit_rate FROM new_price_tbl LIMIT 10 ;" is the main query that selects all columns from the "new_price_tbl" CTE and calculates a new profit rate for each product. The new profit rate is calculated by subtracting the unit cost from the new price, dividing the result by the new price, and assigning it to a new column called "new_profit_rate". The "LIMIT 10" clause limits the number of rows returned to 10, meaning we only want to see the results for the top 10 products by some criteria (which is not specified in the code snippet).

S-066: For each product in the product data (df_product), find a new unit price that gives a profit margin of 30%. This time, round to the nearest yen (rounding or rounding to even numbers is fine). Then display 10 results and check that the profit margin is around 30%. Note, however, that there are deficiencies in unit_price（unit_price） and unit_cost（unit_cost）.

In [ ]:

%%sql

WITH new_price_tbl AS (
  SELECT
      product_cd, 
      unit_price, 
      unit_cost,
      ROUND(unit_cost / 0.7) AS new_price
  FROM
       product
) 
SELECT
  *,
  (new_price - unit_cost) / new_price AS new_profit_rate
FROM
  new_price_tbl
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

product_cd	unit_price	unit_cost	new_price	new_profit_rate
P040101001	198	149	213	0.30046948356807511737
P040101002	218	164	234	0.29914529914529914530
P040101003	230	173	247	0.29959514170040485830
P040101004	248	186	266	0.30075187969924812030
P040101005	268	201	287	0.29965156794425087108
P040101006	298	224	320	0.30000000000000000000
P040101007	338	254	363	0.30027548209366391185
P040101008	420	315	450	0.30000000000000000000
P040101009	498	374	534	0.29962546816479400749
P040101010	580	435	621	0.29951690821256038647

Commentary :

This code is written in SQL and is used to select data from a table named 'product' using a common table expression (CTE) named 'new_price_tbl'. The code calculates the new price for each product by dividing its unit cost by 0.7, and then calculates the new profit rate by subtracting the unit cost from the new price and dividing the result by the new price.

Here is a step-by-step breakdown of the code:

WITH new_price_tbl AS (...): This defines a common table expression (CTE) named 'new_price_tbl' that is used to temporarily store the data from the SELECT statement enclosed in parentheses.

SELECT product_cd, unit_price, unit_cost, ROUND(unit_cost / 0.7) AS new_price FROM product: This selects the product code, unit price, unit cost, and calculates the new price for each product by dividing its unit cost by 0.7 and rounding the result.

SELECT *, (new_price - unit_cost) / new_price AS new_profit_rate FROM new_price_tbl LIMIT 10: This selects all columns from the 'new_price_tbl' CTE, and calculates the new profit rate by subtracting the unit cost from the new price and dividing the result by the new price. Finally, it limits the output to the first 10 rows.

In summary, this code calculates the new price and new profit rate for each product by using a CTE to temporarily store the necessary data and performing arithmetic operations on that data.

S-067: For each product in the product data (df_product), find a new unit price that gives a profit margin of 30%. This time, round up to the nearest yen. Then display 10 results and check that the profit margin is around 30%. Note, however, that there is a deficit in the unit price (unit_price) and cost (unit_cost).

In [ ]:

%%sql

WITH new_price_tbl AS (
  SELECT
      product_cd, 
      unit_price, 
      unit_cost,
      CEIL(unit_cost / 0.7) AS new_price
  FROM
      product
) 
SELECT
  *,
  (new_price - unit_cost) / new_price AS new_profit_rate
FROM
  new_price_tbl
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

product_cd	unit_price	unit_cost	new_price	new_profit_rate
P040101001	198	149	213	0.30046948356807511737
P040101002	218	164	235	0.30212765957446808511
P040101003	230	173	248	0.30241935483870967742
P040101004	248	186	266	0.30075187969924812030
P040101005	268	201	288	0.30208333333333333333
P040101006	298	224	320	0.30000000000000000000
P040101007	338	254	363	0.30027548209366391185
P040101008	420	315	450	0.30000000000000000000
P040101009	498	374	535	0.30093457943925233645
P040101010	580	435	622	0.30064308681672025723

Commentary :

This code is similar to the previous one, but instead of rounding the new price, it is rounded up to the nearest integer using the CEIL function.

Here is a step-by-step breakdown of the code:

WITH new_price_tbl AS (...): This defines a common table expression (CTE) named 'new_price_tbl' that is used to temporarily store the data from the SELECT statement enclosed in parentheses.

SELECT product_cd, unit_price, unit_cost, CEIL(unit_cost / 0.7) AS new_price FROM product: This selects the product code, unit price, unit cost, and calculates the new price for each product by dividing its unit cost by 0.7 and rounding up the result to the nearest integer using the CEIL function.

SELECT *, (new_price - unit_cost) / new_price AS new_profit_rate FROM new_price_tbl LIMIT 10: This selects all columns from the 'new_price_tbl' CTE, and calculates the new profit rate by subtracting the unit cost from the new price and dividing the result by the new price. Finally, it limits the output to the first 10 rows.

In summary, this code calculates the new price and new profit rate for each product by using a CTE to temporarily store the necessary data and performing arithmetic operations on that data. However, the new price is rounded up to the nearest integer using the CEIL function instead of being rounded as in the previous code.

S-068: For each product in the product data (df_product), find the amount including tax at a consumption tax rate of 10%, rounding down fractions of a yen and displaying 10 results. Note that the unit price (unit_price) is missing.

In [ ]:

%%sql

SELECT
  product_cd,
  unit_price,
  TRUNC(unit_price * 1.1) AS tax_price
FROM
  product
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

product_cd	unit_price	tax_price
P040101001	198	217
P040101002	218	239
P040101003	230	253
P040101004	248	272
P040101005	268	294
P040101006	298	327
P040101007	338	371
P040101008	420	462
P040101009	498	547
P040101010	580	638

Commentary :

This code is written in SQL and is used to select data from a table named 'product'. The code selects the product code, unit price, and calculates the tax-inclusive price for each product by multiplying its unit price by 1.1 and truncating the result to remove any decimal places.

Here is a step-by-step breakdown of the code:

SELECT product_cd, unit_price, TRUNC(unit_price * 1.1) AS tax_price FROM product: This selects the product code, unit price, and calculates the tax-inclusive price for each product by multiplying its unit price by 1.1 and truncating the result to remove any decimal places using the TRUNC function.

FROM product: This specifies the 'product' table from which the data is being selected.

LIMIT 10: This limits the output to the first 10 rows.

In summary, this code calculates the tax-inclusive price for each product by multiplying its unit price by 1.1 and truncating the result to remove any decimal places using the TRUNC function.

S-069: Combine receipt details data (df_receipt) and product data (df_product), calculate the total sales value of all products for each customer and the total sales value of products whose category major classification code (category_major_cd) is “07” (bottled canned food), and find the ratio between the two. Only customers with a sales record for category major category code “07” (bottled canned food) should be selected, and 10 results should be displayed.

In [ ]:

%%sql

WITH amount_all AS(
  SELECT
      customer_id,
      SUM(amount) AS sum_all
  FROM
      receipt
  GROUP BY
      customer_id
),
amount_07 AS (
  SELECT
      r.customer_id,
      SUM(r.amount) AS sum_07
  FROM
      receipt r
  JOIN
      product p
  ON
      r.product_cd = p.product_cd
  WHERE
      p.category_major_cd = '07'
  GROUP BY
      customer_id
)
SELECT
  amount_all.customer_id,
  sum_all,
  sum_07,
   sum_07 * 1.0 / sum_all AS sales_rate
FROM
  amount_all
JOIN
  amount_07
ON
  amount_all.customer_id = amount_07.customer_id
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	sum_all	sum_07	sales_rate
CS001311000059	2302	102	0.04430929626411815812
CS011615000061	246	98	0.39837398373983739837
CS029212000033	3604	3604	1.00000000000000000000
CS007515000119	7157	2832	0.39569652088864049183
CS034515000123	3699	1202	0.32495268991619356583
CS026414000014	6671	3142	0.47099385399490331285
CS001615000099	768	318	0.41406250000000000000
CS010515000082	1482	553	0.37314439946018893387
CS019315000045	813	380	0.46740467404674046740
CS008513000099	1322	210	0.15885022692889561271

Commentary :

This code is written in SQL and is used to select data from two tables named 'receipt' and 'product' using two common table expressions (CTEs) named 'amount_all' and 'amount_07'. The code calculates the sum of amounts spent by each customer and the sum of amounts spent on products with a major category code of '07'. It then joins the two CTEs and calculates the sales rate of category '07' products for each customer.

Here is a step-by-step breakdown of the code:

WITH amount_all AS(...): This defines a common table expression (CTE) named 'amount_all' that is used to temporarily store the data from the SELECT statement enclosed in parentheses.

SELECT customer_id, SUM(amount) AS sum_all FROM receipt GROUP BY customer_id: This selects the customer ID and the sum of the amounts spent by each customer from the 'receipt' table, and groups the result by the customer ID.

, amount_07 AS(...): This defines another CTE named 'amount_07' that is used to temporarily store the data from the SELECT statement enclosed in parentheses.

SELECT r.customer_id, SUM(r.amount) AS sum_07 FROM receipt r JOIN product p ON r.product_cd = p.product_cd WHERE p.category_major_cd = '07' GROUP BY customer_id: This selects the customer ID and the sum of the amounts spent on products with a major category code of '07' from the 'receipt' and 'product' tables, joins the two tables on the product code, filters the products by the major category code of '07', and groups the result by the customer ID.

SELECT amount_all.customer_id, sum_all, sum_07, sum_07 * 1.0 / sum_all AS sales_rate FROM amount_all JOIN amount_07 ON amount_all.customer_id = amount_07.customer_id LIMIT 10: This selects the customer ID, the sum of all amounts spent by each customer, the sum of amounts spent on category '07' products by each customer, and the sales rate of category '07' products for each customer by dividing the sum of amounts spent on category '07' products by the sum of all amounts spent by the customer. Finally, it joins the two CTEs on the customer ID and limits the output to the first 10 rows.

In summary, this code calculates the sales rate of category '07' products for each customer by first calculating the sum of all amounts spent by each customer and the sum of amounts spent on category '07' products by each customer, and then dividing the latter by the former. It does so by using two CTEs to temporarily store the necessary data and performing arithmetic operations and joins on that data.

S-070: Calculate the number of days elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt) and display 10 items with the customer ID (customer_id), sales date and membership application date. (Note that sales_ymd is numeric and application_date is a string).

In [ ]:

%%sql

WITH receipt_distinct AS (
  SELECT DISTINCT
      customer_id,
      sales_ymd
  FROM
      receipt
)
SELECT
  c.customer_id,
  r.sales_ymd,
  c.application_date,
  EXTRACT(DAY FROM (TO_TIMESTAMP(CAST(r.sales_ymd AS VARCHAR), 'YYYYMMDD') - TO_TIMESTAMP(c.application_date, 'YYYYMMDD'))) AS elapsed_days
FROM
  receipt_distinct r
JOIN
  customer c
ON
  r.customer_id = c.customer_id
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	sales_ymd	application_date	elapsed_days
CS017515000010	20171117	20150803	837
CS019515000097	20190630	20141124	1679
CS008515000005	20170714	20150216	879
CS026414000097	20170809	20150430	832
CS034514000008	20181012	20150807	1162
CS029415000089	20180409	20150723	991
CS019411000012	20190314	20141213	1552
CS015614000006	20190802	20150211	1633
CS007515000053	20170712	20150325	840
CS024615000041	20170729	20150918	680

Commentary :

This code is written in SQL and is used to select data from two tables named 'receipt' and 'customer' using a common table expression (CTE) named 'receipt_distinct'. The code calculates the number of days that elapsed between the customer's application date and the date of a distinct sale made by the customer.

Here is a step-by-step breakdown of the code:

WITH receipt_distinct AS(...): This defines a common table expression (CTE) named 'receipt_distinct' that is used to temporarily store the data from the SELECT statement enclosed in parentheses.

SELECT distinct customer_id, sales_ymd FROM receipt: This selects the distinct customer ID and sales year-month-day from the 'receipt' table.

SELECT c.customer_id, r.sales_ymd, c.application_date, EXTRACT(DAY FROM (TO_TIMESTAMP(CAST(r.sales_ymd AS VARCHAR), 'YYYYMMDD') - TO_TIMESTAMP(c.application_date, 'YYYYMMDD'))) AS elapsed_days FROM receipt_distinct r JOIN customer c ON r.customer_id = c.customer_id LIMIT 10: This selects the customer ID, sales year-month-day, application date, and the elapsed number of days between the customer's application date and the date of a distinct sale made by the customer. The calculation of the elapsed days involves converting the sales year-month-day and application date to timestamps and then computing the difference between them in days. Finally, it joins the 'receipt_distinct' CTE with the 'customer' table on the customer ID and limits the output to the first 10 rows.

In summary, this code calculates the number of days that elapsed between a customer's application date and the date of a distinct sale made by the customer by first selecting the distinct customer ID and sales year-month-day from the 'receipt' table, and then joining it with the 'customer' table to obtain the customer's application date. It then performs a calculation involving the conversion of the sales year-month-day and application date to timestamps and the computation of the difference between them in days. It does so by using a CTE to temporarily store the necessary data and performing arithmetic operations and joins on that data.

S-071: Calculate the number of months elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and membership application date (Note that sales_ymd is a number and application_date is a string).

In [ ]:

%%sql

WITH receipt_distinct AS (
  SELECT DISTINCT
      customer_id,
      sales_ymd
  FROM
      receipt
),
time_age_tbl AS(
  SELECT
      c.customer_id,
      r.sales_ymd,
      c.application_date,
      AGE(TO_TIMESTAMP(CAST(r.sales_ymd AS VARCHAR), 'YYYYMMDD'),
      TO_TIMESTAMP(c.application_date, 'YYYYMMDD')) AS time_age
  FROM
      receipt_distinct r
  JOIN
      customer c
  ON
      r.customer_id = c.customer_id
)
SELECT
  customer_id, 
  sales_ymd, 
  application_date,
  EXTRACT(YEAR FROM time_age) * 12  + EXTRACT(MONTH FROM time_age) AS elapsed_months
FROM
  time_age_tbl
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	sales_ymd	application_date	elapsed_months
CS017515000010	20171117	20150803	27
CS019515000097	20190630	20141124	55
CS008515000005	20170714	20150216	28
CS026414000097	20170809	20150430	27
CS034514000008	20181012	20150807	38
CS029415000089	20180409	20150723	32
CS019411000012	20190314	20141213	51
CS015614000006	20190802	20150211	53
CS007515000053	20170712	20150325	27
CS024615000041	20170729	20150918	22

Commentary :

This code is written in SQL and is used to select data from two tables named 'receipt' and 'customer' using two common table expressions (CTEs) named 'receipt_distinct' and 'time_age_tbl'. The code calculates the elapsed number of months between the customer's application date and the date of a distinct sale made by the customer.

Here is a step-by-step breakdown of the code:

WITH receipt_distinct AS (...): This defines a common table expression (CTE) named 'receipt_distinct' that is used to temporarily store the data from the SELECT statement enclosed in parentheses.

SELECT DISTINCT customer_id, sales_ymd FROM receipt: This selects the distinct customer ID and sales year-month-day from the 'receipt' table.

WITH time_age_tbl AS (...): This defines another CTE named 'time_age_tbl' that is used to temporarily store the data from the SELECT statement enclosed in parentheses.

SELECT c.customer_id, r.sales_ymd, c.application_date, AGE(TO_TIMESTAMP(CAST(r.sales_ymd AS VARCHAR), 'YYYYMMDD'), TO_TIMESTAMP(c.application_date, 'YYYYMMDD')) AS time_age FROM receipt_distinct r JOIN customer c ON r.customer_id = c.customer_id: This selects the customer ID, sales year-month-day, application date, and the elapsed time between the customer's application date and the date of a distinct sale made by the customer. The calculation of the elapsed time involves converting the sales year-month-day and application date to timestamps and then computing the difference between them using the AGE function. Finally, it joins the 'receipt_distinct' CTE with the 'customer' table on the customer ID.

SELECT customer_id, sales_ymd, application_date, EXTRACT(YEAR FROM time_age) * 12 + EXTRACT(MONTH FROM time_age) AS elapsed_months FROM time_age_tbl LIMIT 10: This selects the customer ID, sales year-month-day, application date, and the elapsed number of months between the customer's application date and the date of a distinct sale made by the customer. The calculation of the elapsed months involves extracting the year and month components of the elapsed time and converting them to months. Finally, it limits the output to the first 10 rows.

In summary, this code calculates the elapsed number of months between a customer's application date and the date of a distinct sale made by the customer by first selecting the distinct customer ID and sales year-month-day from the 'receipt' table, and then joining it with the 'customer' table to obtain the customer's application date. It then calculates the elapsed time between the sales year-month-day and application date using a CTE and the AGE function, and converts the result to months. It does so by using two CTEs to temporarily store the necessary data and performing arithmetic operations and joins on that data.

S-072: Calculate the number of years elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (df_customer) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and membership application date (Note that sales_ymd is a number and application_date is a string).

In [ ]:

%%sql

WITH receipt_distinct AS (
  SELECT DISTINCT
    customer_id,
    sales_ymd
  FROM
    receipt
)
SELECT
  c.customer_id,
  r.sales_ymd,
  c.application_date,
  EXTRACT(YEAR FROM AGE(TO_TIMESTAMP(CAST(r.sales_ymd AS VARCHAR), 'YYYYMMDD'), 
  TO_TIMESTAMP(c.application_date, 'YYYYMMDD'))) AS elapsed_years
FROM
  receipt_distinct r
JOIN
  customer c
ON
  r.customer_id = c.customer_id
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	sales_ymd	application_date	elapsed_years
CS017515000010	20171117	20150803	2
CS019515000097	20190630	20141124	4
CS008515000005	20170714	20150216	2
CS026414000097	20170809	20150430	2
CS034514000008	20181012	20150807	3
CS029415000089	20180409	20150723	2
CS019411000012	20190314	20141213	4
CS015614000006	20190802	20150211	4
CS007515000053	20170712	20150325	2
CS024615000041	20170729	20150918	1

Commentary :

This SQL code retrieves the customer ID, sales date, and application date of customers who have made purchases, and calculates the number of elapsed years between the application date and the sales date.

The code starts by creating a common table expression called receipt_distinct, which selects distinct combinations of customer ID and sales date from the receipt table. This ensures that each customer is only counted once, even if they made multiple purchases.

The main query then selects the customer ID, sales date, and application date from the receipt_distinct table, and uses the EXTRACT function to calculate the number of elapsed years between the application date and the sales date. The AGE function is used to calculate the time difference between the two dates, and the EXTRACT function is used to extract the year component from the resulting interval.

The result is a table that shows the customer ID, sales date, application date, and number of elapsed years for each customer. The LIMIT clause is used to restrict the output to the first 10 rows of the result set.

S-073: Calculate the elapsed time in epoch seconds from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and (Note that sales_ymd is held as a number and application_date as a string). Note that no time information is held, so each date should represent 0:00:00:00.

In [ ]:

%%sql

WITH receipt_distinct AS (
  SELECT DISTINCT
    customer_id,
    sales_ymd
  FROM
    receipt
)
SELECT
  c.customer_id,
  r.sales_ymd,
  c.application_date,  
  EXTRACT(EPOCH FROM TO_TIMESTAMP(CAST(r.sales_ymd AS VARCHAR), 'YYYYMMDD') -  TO_TIMESTAMP(c.application_date, 'YYYYMMDD')) AS elapsed_epoch
FROM
  receipt_distinct r
JOIN
  customer c
ON
  r.customer_id = c.customer_id
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	sales_ymd	application_date	elapsed_epoch
CS017515000010	20171117	20150803	72316800.000000
CS019515000097	20190630	20141124	145065600.000000
CS008515000005	20170714	20150216	75945600.000000
CS026414000097	20170809	20150430	71884800.000000
CS034514000008	20181012	20150807	100396800.000000
CS029415000089	20180409	20150723	85622400.000000
CS019411000012	20190314	20141213	134092800.000000
CS015614000006	20190802	20150211	141091200.000000
CS007515000053	20170712	20150325	72576000.000000
CS024615000041	20170729	20150918	58752000.000000

Commentary :

This SQL code retrieves the customer ID, sales date, and application date of customers who have made purchases, and calculates the number of elapsed seconds between the application date and the sales date.

The code starts by creating a common table expression called receipt_distinct, which selects distinct combinations of customer ID and sales date from the receipt table. This ensures that each customer is only counted once, even if they made multiple purchases.

The main query then selects the customer ID, sales date, and application date from the receipt_distinct table, and uses the EXTRACT function along with TO_TIMESTAMP function to calculate the number of elapsed seconds between the application date and the sales date. The EXTRACT(EPOCH FROM <interval>) function is used to extract the number of seconds from the resulting interval.

The result is a table that shows the customer ID, sales date, application date, and number of elapsed seconds for each customer. The LIMIT clause is used to restrict the output to the first 10 rows of the result set.

S-074: For the sales date (sales_ymd) of the receipt details data (df_receipt), calculate the number of days elapsed since Monday of the week in question and display 10 items together with the sales date and the previous Monday (note that sales_ymd holds data in numerical form).

In [ ]:

%%sql

WITH elapsed_days_tbl AS (
  SELECT
    TO_DATE(CAST(sales_ymd AS VARCHAR), 'YYYYMMDD') AS sales_ymd,
    EXTRACT(DOW FROM (TO_DATE(CAST(sales_ymd AS VARCHAR), 'YYYYMMDD') - 1)) AS elapsed_days
  FROM
    receipt
)
SELECT
  sales_ymd,
  elapsed_days,
  sales_ymd - CAST(elapsed_days AS INTEGER) AS monday
FROM 
  elapsed_days_tbl
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

sales_ymd	elapsed_days	monday
2018-11-03	5	2018-10-29
2018-11-18	6	2018-11-12
2017-07-12	2	2017-07-10
2019-02-05	1	2019-02-04
2018-08-21	1	2018-08-20
2019-06-05	2	2019-06-03
2018-12-05	2	2018-12-03
2019-09-22	6	2019-09-16
2017-05-04	3	2017-05-01
2019-10-10	3	2019-10-07

Commentary :

This SQL code starts by defining a Common Table Expression (CTE) named elapsed_days_tbl, which extracts the day of the week (as a number from 0 to 6, where 0 is Sunday) for each sales_ymd date in the receipt table.

In the main query, sales_ymd and elapsed_days are selected from the elapsed_days_tbl CTE. Then, the code calculates a new date by subtracting the elapsed_days from sales_ymd. This is done by casting the elapsed_days as an integer and subtracting it from sales_ymd. The resulting date is the Monday of the week corresponding to the sales_ymd date.

The LIMIT 10 clause at the end of the code limits the output to the first 10 rows.

S-075: Randomly extract 1% of the data from the customer data (df_customer) and display the first 10 items.

In [ ]:

%%sql

-- Code example 1 (If you want to keep it simple. However, the number of cases varies around 1.0%)
SELECT 
  * 
FROM 
  customer WHERE RANDOM() <= 0.01
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
CS019415000117	宮脇芽以	1	女性	1974-07-10	44	173-0036	東京都板橋区向原**********	S13019	20141114	C-20100720-D
CS040513000111	寺西奈央	1	女性	1966-06-03	52	226-0027	神奈川県横浜市緑区長津田**********	S14040	20150728	6-20090622-6
CS029402000041	浅利俊二	0	男性	1975-08-15	43	134-0013	東京都江戸川区江戸川**********	S12029	20150220	0-00000000-0
CS019712000008	梅本ヒカル	1	女性	1945-04-14	73	173-0037	東京都板橋区小茂根**********	S13019	20150613	0-00000000-0
CS015713000077	長沢結衣	1	女性	1947-10-09	71	136-0075	東京都江東区新砂**********	S13015	20150308	0-00000000-0
CS008515000014	野沢あさみ	1	女性	1959-06-09	59	157-0067	東京都世田谷区喜多見**********	S13008	20150219	9-20091212-B
CS031514000047	原口礼子	1	女性	1963-07-26	55	151-0064	東京都渋谷区上原**********	S13031	20150927	8-20090607-6
CS004313000412	春日瞳	1	女性	1984-07-26	34	176-0024	東京都練馬区中村**********	S13004	20170525	0-00000000-0
CS035513000155	板倉昌代	1	女性	1960-06-12	58	157-0075	東京都世田谷区砧公園**********	S13035	20150625	0-00000000-0
CS003512000587	大山沙知絵	1	女性	1959-04-01	59	214-0014	神奈川県川崎市多摩区登戸**********	S13003	20170306	0-00000000-0

Commentary : 

This SQL code selects a random subset of 10 customers from the customer table. The RANDOM() function generates a random value between 0 and 1 for each row in the table, and the WHERE clause filters the rows based on whether the generated value is less than or equal to 0.01 (1% chance of being selected). 

This means that each row has an equal chance of being selected, regardless of any other criteria or order in the table. The LIMIT clause is used to restrict the output to the first 10 rows that meet the condition.

In [ ]:

%%sql

-- Code example 2 (If you want to be polite. Create a count and fix the number of outputs)
WITH customer_tmp AS(
  SELECT
     *,  
     ROW_NUMBER() OVER(ORDER BY RANDOM()) AS row,
     COUNT(*) OVER() AS cnt
  FROM 
    customer
)
SELECT 
  customer_id,
  customer_name,
  gender_cd,
  gender,
  birth_day,
  age,
  postal_cd,
  address,
  application_store_cd,
  application_date,
  status_cd
FROM 
  customer_tmp
WHERE 
  row <= cnt * 0.01
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
CS027514000069	土屋さやか	9	不明	1967-02-10	52	251-0033	神奈川県藤沢市片瀬山**********	S14027	20141213	A-20101018-B
CS031504000012	向井剛基	0	男性	1966-09-19	52	151-0062	東京都渋谷区元代々木町**********	S13031	20150310	0-00000000-0
CS028403000022	新垣三郎	0	男性	1973-09-12	45	246-0012	神奈川県横浜市瀬谷区東野**********	S14028	20150905	0-00000000-0
CS001415000103	水谷千夏	1	女性	1976-12-08	42	144-0051	東京都大田区西蒲田**********	S13001	20150509	A-20100925-C
CS019313000145	早美由美子	1	女性	1985-02-20	34	173-0033	東京都板橋区大山西町**********	S13019	20141226	0-00000000-0
CS002412000346	荒川美嘉	1	女性	1973-03-22	46	187-0045	東京都小平市学園西町**********	S13002	20160808	0-00000000-0
CS003415000271	稲垣遥	1	女性	1975-12-14	43	201-0001	東京都狛江市西野川**********	S13003	20160630	A-20090325-8
CS029502000052	岡崎優一	0	男性	1963-03-28	56	134-0085	東京都江戸川区南葛西**********	S12029	20150803	0-00000000-0
CS002515000290	大山みあ	1	女性	1959-06-05	59	185-0023	東京都国分寺市西元町**********	S13002	20160627	C-20100730-B
CS009314000030	西川奈々	1	女性	1983-05-15	35	158-0091	東京都世田谷区中町**********	S13009	20150519	E-20100910-D

Commentary :

This code is using the SQL programming language to select a random subset of customers from the "customer" table.

The code first creates a temporary table called "customer_tmp" by selecting all columns from the "customer" table and adding two additional columns using the ROW_NUMBER and COUNT functions.

The ROW_NUMBER function assigns a unique integer to each row in the table, based on the result of the ORDER BY RANDOM() clause. This effectively randomizes the order of the rows in the table.

The COUNT function with the OVER() clause counts the total number of rows in the table, and creates a new column called "cnt" that contains this count for each row.

The main SELECT statement then selects a subset of customers from the temporary table by filtering on the "row" column, which was created by the ROW_NUMBER function. Specifically, it selects the first 1% of rows, as determined by the "cnt" column created by the COUNT function.

The final clause, "LIMIT 10", limits the results to the first 10 rows of the subset, effectively returning a random sample of 10 customers from the original "customer" table.

Overall, this code is a simple way to select a random sample of customers from a large table for testing or exploratory analysis purposes. The sampling percentage (in this case, 1%) can be adjusted to create larger or smaller samples.

S-076: Extract 10% of the data stratified randomly from the customer data (df_customer) based on the percentage of gender code (gender_cd) and count the number of cases by gender code

In [ ]:

%%sql

-- Code Example 1
WITH cusotmer_random AS (
  SELECT
    customer_id,
    gender_cd,
    cnt
  FROM 
      (
        SELECT
          gender_cd,
          ARRAY_AGG(customer_id ORDER BY RANDOM()) AS customer_r,
          COUNT(1) AS cnt
        FROM
          customer
        GROUP BY 
          gender_cd
    )sample, UNNEST(customer_r) AS customer_id
),
cusotmer_rownum AS(
  SELECT
    * ,
    ROW_NUMBER() OVER(PARTITION BY gender_cd) AS rn
  FROM
    cusotmer_random
)
SELECT
  gender_cd,
  COUNT(1) AS customer_num
FROM
  cusotmer_rownum
WHERE
  rn <= cnt * 0.1
GROUP BY
  gender_cd
;

 * postgresql://padawan:***@db:5432/dsdojo_db
3 rows affected.

Out[ ]:

gender_cd	customer_num
0	298
1	1791
9	107

Commentary :

This SQL code is selecting a random sample of customers from a table called "customer" and then calculating the number of customers in the sample that fall into each gender category.

Here is a breakdown of what each part of the code is doing:

WITH clause: This clause creates two temporary tables, "customer_random" and "customer_rownum," that will be used in the subsequent SELECT statement.

customer_random table: This table selects a random sample of customers from the "customer" table and assigns each customer a "customer_r" value based on their gender. The "cnt" column counts the total number of customers in each gender group.

customer_rownum table: This table adds a row number ("rn") to each customer in the "customer_random" table, partitioned by gender.

SELECT statement: This statement counts the number of customers in each gender group whose row number is less than or equal to 10% of the total number of customers in that gender group (i.e., the top 10% of customers by row number). The result is a table with two columns, "gender_cd" and "customer_num", where "gender_cd" is the gender code and "customer_num" is the number of customers in that gender group that fall into the top 10% by row number.

Overall, this code is useful for selecting a random sample of customers from a larger dataset and then analyzing their gender distribution.

In [ ]:

%%sql

-- Code Example 2
WITH cusotmer_random AS (
SELECT 
  * , 
  ROW_NUMBER() OVER(PARTITION BY gender_cd ORDER BY RANDOM()) AS rn,
  COUNT(1) OVER(PARTITION BY gender_cd) cnt
FROM 
  customer
)
SELECT
  gender_cd,
  COUNT(1) AS customer_num
FROM
  cusotmer_random
WHERE 
  rn <= cnt * 0.1
GROUP BY
  gender_cd
;

 * postgresql://padawan:***@db:5432/dsdojo_db
3 rows affected.

Out[ ]:

gender_cd	customer_num
9	107
0	298
1	1791

Commentary :

This code is using the SQL programming language to randomly select a subset of customers from the "customer" table, grouping them by their gender and returning the number of customers in each gender group.

The code first creates a temporary table called "cusotmer_random" by selecting all columns from the "customer" table and adding two additional columns using the ROW_NUMBER and COUNT functions.

The ROW_NUMBER function assigns a unique integer to each row in the table, based on the result of the ORDER BY RANDOM() clause. This effectively randomizes the order of the rows in the table.

The PARTITION BY gender_cd clause in the ROW_NUMBER function ensures that the random ordering is done separately for each gender group.

The COUNT function with the OVER(PARTITION BY gender_cd) clause counts the total number of rows in each gender group, and creates a new column called "cnt" that contains this count for each row.

The main SELECT statement then selects a subset of customers from the temporary table by filtering on the "rn" column, which was created by the ROW_NUMBER function. Specifically, it selects the first 10% of rows in each gender group, as determined by the "cnt" column created by the COUNT function.

The WHERE clause filters the rows in the "cusotmer_random" table to only include those with an "rn" value less than or equal to 10% of the "cnt" value for their respective gender group.

Finally, the main SELECT statement groups the rows by gender_cd and returns the count of rows in each group, effectively giving the number of customers in each gender group in the random sample.

Overall, this code is a way to randomly select a subset of customers from a large table while ensuring that the sample includes a similar proportion of each gender group as the original table. The sampling percentage (in this case, 10%) can be adjusted to create larger or smaller samples.

S-077: Sum the sales amounts in the receipt details data (df_receipt) by customer unit and extract outliers of the summed sales amounts. The outliers should be calculated by logarithmising the total sales amount, calculating the mean and standard deviation, and then deviating from the mean by more than 3σ (either the natural logarithm or the ordinary logarithm is acceptable). Display 10 results.

In [ ]:

%%sql

WITH sales_amount AS(
  SELECT
    customer_id,
    SUM(amount) AS sum_amount,
    LN(SUM(amount) + 0.5) AS log_sum_amount
  FROM
    receipt
  GROUP BY 
    customer_id
)
SELECT 
  customer_id,
  sum_amount,
  log_sum_amount
FROM
  sales_amount
CROSS JOIN (
  SELECT
    AVG(log_sum_amount) AS avg_amount,
    STDDEV_POP(log_sum_amount) AS std_amount
  FROM 
    sales_amount   
) stats_amount
WHERE
  ABS(log_sum_amount - avg_amount) / std_amount > 3
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

customer_id	sum_amount	log_sum_amount
ZZ000000000000	12395003	16.332804005823312

Commentary :

This code is using the SQL programming language to identify customers whose total purchase amount falls significantly outside the norm, based on their logarithmically transformed purchase amount.

The code first creates a temporary table called "sales_amount" by selecting the customer_id and sum of purchase amount from the "receipt" table, grouped by customer_id. The sum of purchase amount is also transformed into a logarithmic scale using the LN function, which adds 0.5 to avoid taking the logarithm of zero.

The main SELECT statement then selects the customer_id, sum_amount, and log_sum_amount columns from the temporary table. It also uses a CROSS JOIN to join in another temporary table called "stats_amount", which is created by selecting the average and population standard deviation of the log_sum_amount column from the "sales_amount" table.

The WHERE clause filters the rows in the "sales_amount" table to only include those where the absolute difference between the log_sum_amount and the average log_sum_amount, divided by the standard deviation of the log_sum_amount, is greater than 3. This is a way to identify customers whose purchase amounts are more than three standard deviations from the mean, which is a common threshold for identifying outliers in statistical analysis.

Finally, the main SELECT statement limits the output to 10 rows.

Overall, this code is a way to identify customers who have made significantly large or small purchases, relative to the average purchase amount in the dataset, by transforming the purchase amounts into a logarithmic scale and applying a statistical outlier detection criterion.

S-078: Sum the sales amount (amount) of the receipt details data (df_receipt) in customer units and extract outliers of the summed sales amount. However, exclude those whose customer ID starts with “Z”, as they represent non-members. Outliers are defined as the difference between the first and third quartiles using IQR, and are defined as being below “first quartile – 1.5 x IQR” or above “third quartile + 1.5 x IQR”. Show 10 results.

In [ ]:

%%sql

WITH sales_amount AS(
  SELECT 
    customer_id,
    SUM(amount) AS sum_amount
  FROM
    receipt
  WHERE
    customer_id NOT LIKE 'Z%'
  GROUP BY 
    customer_id
)
SELECT
  customer_id,
  sum_amount
FROM
  sales_amount
CROSS JOIN (
  SELECT
    PERCENTILE_CONT(0.25) WITHIN GROUP(ORDER BY sum_amount) AS amount_25per,
    PERCENTILE_CONT(0.75) WITHIN GROUP(ORDER BY sum_amount) AS amount_75per
  FROM 
    sales_amount   
) stats_amount
WHERE
  sum_amount < amount_25per - (amount_75per - amount_25per) * 1.5 
OR 
  amount_75per + (amount_75per - amount_25per) * 1.5 < sum_amount
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
10 rows affected.

Out[ ]:

customer_id	sum_amount
CS013415000226	8362
CS011415000041	9454
CS014514000004	8872
CS021514000008	12839
CS014515000007	9763
CS040415000220	10158
CS028415000161	8465
CS034515000173	10074
CS022515000065	12903
CS007514000094	15735

Commentary :

This code is using the SQL programming language to query a database. Specifically, it is using a common table expression (CTE) to create a temporary table called "sales_amount", which calculates the total amount of sales for each customer in the "receipt" table.

The WHERE clause of the CTE filters out any customers whose ID starts with the letter "Z". The GROUP BY clause groups the sales by customer ID.

The main query then performs a CROSS JOIN between the "sales_amount" table and a subquery that calculates the 25th and 75th percentiles of the sales amounts using the PERCENTILE_CONT function. These percentiles are stored in a temporary table called "stats_amount".

Finally, the WHERE clause of the main query filters out any sales amounts that are more than 1.5 times the interquartile range (IQR) away from the 25th or 75th percentiles. This is a common method for identifying outliers in a dataset. The results are limited to the top 10 rows.

Overall, this code is querying a database to identify customers with unusually high or low sales amounts compared to their peers, based on their position in the interquartile range.

S-079: For each item of product data (df_product), check the number of missing items.

In [ ]:

%%sql

SELECT 
  SUM(CASE WHEN product_cd IS NULL THEN 1 ELSE 0 END) AS product_cd,
  SUM(CASE WHEN category_major_cd IS NULL THEN 1 ELSE 0 END) AS category_major_cd,
  SUM(CASE WHEN category_medium_cd IS NULL THEN 1 ELSE 0 END) AS category_medium_cd,
  SUM(CASE WHEN category_small_cd IS NULL THEN 1 ELSE 0 END) AS category_small_cd,
  SUM(CASE WHEN unit_price IS NULL THEN 1 ELSE 0 END) AS unit_price,
  SUM(CASE WHEN unit_cost IS NULL THEN 1 ELSE 0 END) AS unit_cost
FROM 
  product 
LIMIT 
  10
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

product_cd	category_major_cd	category_medium_cd	category_small_cd	unit_price	unit_cost
0	0	0	0	7	7

Commentary :

This code is using the SQL programming language to query a database. Specifically, it is querying the "product" table and using a combination of the SUM and CASE statements to count the number of missing values (NULLs) in each column of the table.

The query selects six columns and gives them aliases that match the names of the columns in the table: "product_cd", "category_major_cd", "category_medium_cd", "category_small_cd", "unit_price", and "unit_cost".

For each column, the CASE statement checks whether the value is NULL, and returns 1 if it is, or 0 otherwise. The SUM function then adds up all the 1s and 0s to give the total number of NULL values in each column.

The LIMIT 10 clause limits the results to the first 10 rows of the table, which is not particularly meaningful in this case since we are only interested in the summary statistics.

Overall, this code is a simple way to check for missing data in the "product" table and get a quick summary of the number of missing values in each column.

S-080: Create a new product data, deleting all records where any item in the product data (df_product) is missing. Note that the number of items before and after the deletion should be displayed, and also confirm that the number of items has decreased by the number of items checked in 079.

In [ ]:

%%sql

DROP TABLE IF EXISTS product_1;
CREATE TABLE product_1 AS (
    SELECT 
      * 
    FROM 
      product
    WHERE 
      unit_price IS NOT NULL 
    AND 
      unit_cost IS NOT NULL
);

 * postgresql://padawan:***@db:5432/dsdojo_db
Done.
10023 rows affected.

Out[ ]:

[]

In [ ]:

%%sql

SELECT 
  'Before deletion', 
  COUNT(1) 
FROM 
  product
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

?column?	count
Before deletion	10030

In [ ]:

%%sql

SELECT 
  'After deletion', 
  COUNT(1)
FROM 
  product_1
;

 * postgresql://padawan:***@db:5432/dsdojo_db
1 rows affected.

Out[ ]:

?column?	count
Before deletion	10023

Commentary :

This code is using the SQL programming language to create a new table called "product_1" by selecting only the rows from the "product" table where both the "unit_price" and "unit_cost" columns are not NULL.

The first line of the code, "DROP TABLE IF EXISTS product_1;", is a safety measure that checks whether the "product_1" table already exists and drops it if it does, to avoid any conflicts with the new table being created.

The second line, "CREATE TABLE product_1 AS (...);", specifies that a new table called "product_1" should be created, and that its contents should be based on the results of a SELECT statement.

The SELECT statement selects all columns from the "product" table, but includes a WHERE clause that filters out any rows where either the "unit_price" or "unit_cost" column is NULL.

Overall, this code creates a new table that contains only the rows from the original "product" table that have valid data for both the "unit_price" and "unit_cost" columns. This is a common way to clean up data and prepare it for analysis or modeling, as missing data can cause issues with some analytical techniques.

Data Science 100 Knocks (Structured Data Processing) - SQL

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

Data Science 100 Knocks (Structured Data Processing) - SQL part1 (Q1 to Q20)

Data Science 100 Knocks (Structured Data Processing) - SQL part2 (Q21 to Q40)

Data Science 100 Knocks (Structured Data Processing) - SQL part3 (Q41 to Q60)

Data Science 100 Knocks (Structured Data Processing) - SQL part4 (Q61 to Q80)

Data Science 100 Knocks (Structured Data Processing) - SQL part5 (Q81 to Q100)

Data Science 100 Knocks (Structured Data Processing)

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, SQL, R, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.