Data Science 100 Knocks (Structured Data Processing) – SQL part5 (Q81 to Q100)

Commentary :

This code is written in SQL and performs the following steps:

If a table named product_2 already exists in the database, it will be dropped. This is done to ensure that we start with a clean slate.

A new table named product_2 is created using the CREATE TABLE statement.

The data for the new table is derived from the product table using a SELECT statement.

In the SELECT statement, the columns product_cd, category_major_cd, category_medium_cd, and category_small_cd are selected from the product table.

Two new columns are created in the SELECT statement using the COALESCE function. COALESCE returns the first non-null value in a list of expressions. In this case, if unit_price is null, the average unit price of all products (unit_avg) is used instead. Similarly, if unit_cost is null, the average unit cost of all products (cost_avg) is used instead.

The FROM clause in the SELECT statement performs a CROSS JOIN with a subquery that calculates the average unit price and unit cost of all products in the product table. The ROUND function is used to round the average values to the nearest integer.

The result of the SELECT statement populates the product_2 table directly through the CREATE TABLE ... AS SELECT syntax, so no separate INSERT INTO statement is needed.

In summary, the code creates a new table named product_2 that contains the same columns as the product table, but with null values in the unit_price and unit_cost columns replaced by the average unit price and unit cost values, respectively, calculated across all products.
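A minimal sketch of this pattern, assuming the column names mentioned above (the subquery alias stats_product is illustrative, and the original notebook's query may differ in detail):

DROP TABLE IF EXISTS product_2;
CREATE TABLE product_2 AS (
    SELECT
        product_cd,
        category_major_cd,
        category_medium_cd,
        category_small_cd,
        -- fall back to the all-product average when the value is NULL
        COALESCE(unit_price, unit_avg) AS unit_price,
        COALESCE(unit_cost, cost_avg) AS unit_cost
    FROM product
    CROSS JOIN (
        -- averages over all products, rounded to the nearest integer
        SELECT
            ROUND(AVG(unit_price)) AS unit_avg,
            ROUND(AVG(unit_cost)) AS cost_avg
        FROM product
    ) stats_product
);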
 
 
Commentary :

This code is written in SQL and performs the following steps:

If a table named product_3 already exists in the database, it will be dropped. This is done to ensure that we start with a clean slate.

A new table named product_3 is created using the CREATE TABLE statement.

The data for the new table is derived from the product table using a SELECT statement.

In the SELECT statement, the columns product_cd, category_major_cd, category_medium_cd, and category_small_cd are selected from the product table.

Two new columns are created in the SELECT statement using the COALESCE function. COALESCE returns the first non-null value in a list of expressions. In this case, if unit_price is null, the median unit price of all products (unit_med) is used instead. Similarly, if unit_cost is null, the median unit cost of all products (cost_med) is used instead.

The FROM clause in the SELECT statement performs a CROSS JOIN with a subquery that calculates the median unit price and unit cost of all products in the product table. The PERCENTILE_CONT function is used to calculate the median values. The WITHIN GROUP clause specifies the ordering of the rows, so that the function knows which values to use to calculate the median.

The ROUND function is used to round the median values to the nearest integer.

The result of the SELECT statement populates the product_3 table directly through the CREATE TABLE ... AS SELECT syntax, so no separate INSERT INTO statement is needed.

In summary, the code creates a new table named product_3 that contains the same columns as the product table, but with null values in the unit_price and unit_cost columns replaced by the median unit price and unit cost values, respectively, calculated across all products. The median values are calculated using the PERCENTILE_CONT function with a WITHIN GROUP clause that specifies the ordering of the rows.
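A minimal sketch of the median variant, under the same assumptions as the previous sketch (PERCENTILE_CONT(0.5) computes the median):

DROP TABLE IF EXISTS product_3;
CREATE TABLE product_3 AS (
    SELECT
        product_cd,
        category_major_cd,
        category_medium_cd,
        category_small_cd,
        COALESCE(unit_price, unit_med) AS unit_price,
        COALESCE(unit_cost, cost_med) AS unit_cost
    FROM product
    CROSS JOIN (
        -- medians over all products, rounded to the nearest integer
        SELECT
            ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY unit_price)) AS unit_med,
            ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY unit_cost)) AS cost_med
        FROM product
    ) stats_product
);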
 
 
 
Commentary :

This code is written in SQL and performs the following steps:

A DROP TABLE statement is executed to drop the product_4 table if it already exists.

A CREATE TABLE statement is executed to create the product_4 table.

A common table expression (CTE) named category_median is defined. This CTE groups the rows of the product table by the category_small_cd column (via the GROUP BY clause) and calculates the median unit_price and unit_cost for each category using the PERCENTILE_CONT function. The WITHIN GROUP (ORDER BY ...) clause tells the function which column's values to order when computing each median.

The main SELECT statement retrieves data from the product table and joins it with the category_median CTE on the category_small_cd column. The COALESCE function is used to replace null values in the unit_price and unit_cost columns with the median values for their corresponding categories.

Finally, the resulting data is inserted into the product_4 table using the CREATE TABLE AS syntax.

In summary, this code creates a new table product_4 by joining the product table with a CTE category_median. The median values of unit_price and unit_cost are calculated for each category in the product table using the PERCENTILE_CONT function. The resulting table product_4 contains data from the product table with null values in the unit_price and unit_cost columns replaced by the median values for their corresponding categories.
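A minimal sketch of the per-category version (CTE and column alias names follow the commentary; details may differ from the original):

DROP TABLE IF EXISTS product_4;
CREATE TABLE product_4 AS (
    WITH category_median AS (
        -- median price and cost within each small category
        SELECT
            category_small_cd,
            ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY unit_price)) AS unit_med,
            ROUND(PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY unit_cost)) AS cost_med
        FROM product
        GROUP BY category_small_cd
    )
    SELECT
        product_cd,
        category_major_cd,
        category_medium_cd,
        category_small_cd,
        COALESCE(unit_price, unit_med) AS unit_price,
        COALESCE(unit_cost, cost_med) AS unit_cost
    FROM product
    JOIN category_median USING (category_small_cd)
);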
 
 
Commentary :

This code is also written in SQL and performs the following steps:

A SELECT statement is executed to retrieve data from the receipt and customer tables.

Two subqueries are created to calculate the total sales amount for each customer in the year 2019 and for all years.

The SELECT statement joins the customer table with the two subqueries using the LEFT JOIN operator. This ensures that all customers in the customer table are included in the output, even if they have no sales data in the receipt table.

The COALESCE function is used to handle cases where a customer has no sales data in the receipt table. It replaces NULL values with 0.

The CASE expression calculates the sales rate for each customer. It checks whether the total sales amount for all years is 0, in which case it returns 0. Otherwise, it divides the sales amount for the year 2019 by the total sales amount for all years.

The results are inserted into a new table called sales_rate.

In summary, this code calculates the sales rate for each customer based on their sales data from the receipt table. The sales rate is the ratio of the sales amount for the year 2019 to the total sales amount for all years. The results are saved in a new table called sales_rate.
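A minimal sketch of this calculation. The sales_ymd range used to isolate 2019 and the column aliases are assumptions; multiplying by 1.0 forces non-integer division:

DROP TABLE IF EXISTS sales_rate;
CREATE TABLE sales_rate AS (
    WITH amount_2019 AS (
        SELECT customer_id, SUM(amount) AS amount_2019
        FROM receipt
        WHERE sales_ymd BETWEEN 20190101 AND 20191231
        GROUP BY customer_id
    ),
    amount_all AS (
        SELECT customer_id, SUM(amount) AS amount_all
        FROM receipt
        GROUP BY customer_id
    )
    SELECT
        c.customer_id,
        COALESCE(a19.amount_2019, 0) AS amount_2019,
        COALESCE(aall.amount_all, 0) AS amount_all,
        CASE
            WHEN COALESCE(aall.amount_all, 0) = 0 THEN 0
            ELSE COALESCE(a19.amount_2019, 0) * 1.0 / aall.amount_all
        END AS sales_rate
    FROM customer c
    LEFT JOIN amount_2019 a19 USING (customer_id)
    LEFT JOIN amount_all aall USING (customer_id)
);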
 
 
Commentary :

This code is written in SQL and performs the following steps:

A WITH clause is used to define a subquery that calculates the average longitude and latitude for each postal code using the geocode table.

The SELECT statement joins the customer table with the subquery defined in the WITH clause using the postal_cd column.

The result of the join operation is then inserted into a new table called customer_1 using the CREATE TABLE statement. If a table with the same name already exists, it is first dropped using the DROP TABLE IF EXISTS statement.

In summary, this code creates a new table called customer_1 by joining the customer table with the geocode table using the postal_cd column and appending the average longitude and latitude for each postal code to the customer table. The resulting table has all the columns from the customer table along with two additional columns for the average longitude and latitude.
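A minimal sketch of this join (the longitude/latitude column names in geocode and the m_ output prefixes are assumptions):

DROP TABLE IF EXISTS customer_1;
CREATE TABLE customer_1 AS (
    WITH geocode_avg AS (
        -- average coordinates per postal code
        SELECT
            postal_cd,
            AVG(longitude) AS m_longitude,
            AVG(latitude) AS m_latitude
        FROM geocode
        GROUP BY postal_cd
    )
    SELECT c.*, g.m_longitude, g.m_latitude
    FROM customer c
    JOIN geocode_avg g USING (postal_cd)
);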
 
 
Commentary :

This SQL code creates a new table named "customer_u" by joining data from two tables "customer" and "receipt". It first defines a Common Table Expression (CTE) named "sales_amount" which computes the total sales amount for each customer by joining the "customer" table with the "receipt" table using a left join. The CTE also uses the COALESCE function to return 0 for customers who have no sales.

Then another CTE named "sales_ranking" is defined by computing the sales ranking for each customer based on the total sales amount, within each customer_name and postal_cd group. The ROW_NUMBER() function is used to assign a rank to each customer based on the sum_amount in descending order.

Finally, the SELECT statement joins the "customer" table with the "sales_ranking" CTE, filtering only the rows where the ranking is 1. This effectively selects only the top-ranking customer for each customer_name and postal_cd group. The resulting data is then inserted into a new table named "customer_u".
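A minimal sketch of the deduplication described above (the tie-break on customer_id inside ORDER BY is an assumption, added to make the ranking deterministic):

DROP TABLE IF EXISTS customer_u;
CREATE TABLE customer_u AS (
    WITH sales_amount AS (
        -- total sales per customer, 0 when there are no receipts
        SELECT
            c.customer_id,
            c.customer_name,
            c.postal_cd,
            COALESCE(SUM(r.amount), 0) AS sum_amount
        FROM customer c
        LEFT JOIN receipt r USING (customer_id)
        GROUP BY c.customer_id, c.customer_name, c.postal_cd
    ),
    sales_ranking AS (
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY customer_name, postal_cd
                ORDER BY sum_amount DESC, customer_id
            ) AS ranking
        FROM sales_amount
    )
    SELECT c.*
    FROM customer c
    JOIN sales_ranking r USING (customer_id)
    WHERE r.ranking = 1
);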
 
 
Commentary :

This SQL code creates a new table called customer_n by joining the existing customer and customer_u tables.

The join condition matches rows between the customer and customer_u tables that have the same values in the customer_name and postal_cd columns. For these matching rows, the integration_id column of the resulting table customer_n is set to the customer_id from the customer_u table.

Therefore, the customer_n table includes all the rows from the customer table, along with an additional integration_id column that contains the customer_id values from the customer_u table. Rows that do not have matching customer_name and postal_cd values in the customer_u table will have a NULL value in the integration_id column.
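A minimal sketch of this step. A LEFT JOIN is used here to match the NULL behavior described above; the original may use a plain JOIN, which behaves the same way when every (customer_name, postal_cd) pair has a representative in customer_u:

DROP TABLE IF EXISTS customer_n;
CREATE TABLE customer_n AS (
    SELECT
        c.*,
        u.customer_id AS integration_id
    FROM customer c
    LEFT JOIN customer_u u
      ON c.customer_name = u.customer_name
     AND c.postal_cd = u.postal_cd
);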

 

 
Commentary :

This code is using SQL to create a training and testing dataset for a machine learning model based on customer sales data. Here is a breakdown of what each statement does:

%%sql
SELECT SETSEED(0.1);

This sets the seed for the random number generator used in the query. This is done to ensure that the same results are obtained each time the query is run.

CREATE TEMP TABLE IF NOT EXISTS sales_customer AS (
    SELECT
        customer_id,
        ROW_NUMBER() OVER(ORDER BY RANDOM()) AS row
    FROM customer
    JOIN receipt USING(customer_id)
    GROUP BY customer_id
    HAVING SUM(AMOUNT) > 0
);

This creates a temporary table called sales_customer that contains each customer's customer_id and a randomly assigned row number. It selects data from the customer and receipt tables using a JOIN, groups the data by customer_id, and filters the result to include only customers who have made a purchase (SUM(AMOUNT) > 0).

DROP TABLE IF EXISTS customer_train;
CREATE TABLE customer_train AS
    SELECT customer.*
    FROM sales_customer
    JOIN customer USING(customer_id)
    WHERE sales_customer.row <= (SELECT COUNT(1) FROM sales_customer) * 0.8;

This creates a table called customer_train that contains 80% of the customers in the sales_customer table. It selects all columns from the customer table for customers whose row number in the sales_customer table is less than or equal to 80% of the total number of rows in the sales_customer table.

DROP TABLE IF EXISTS customer_test;
CREATE TABLE customer_test AS
    SELECT customer.*
    FROM sales_customer
    JOIN customer USING(customer_id)
    EXCEPT
    SELECT * FROM customer_train;

This creates a table called customer_test that contains the remaining 20% of the customers from the sales_customer table. It selects all columns from the customer table for customers whose customer_id is in the sales_customer table, but not in the customer_train table. This ensures that the test set contains customers that are not in the training set.
 
 
Commentary :

This code creates three new tables (series_data_1, series_data_2, and series_data_3) by applying a window function to a summary table (ts_amount) of sales data. Here is a breakdown of what each statement does:

%%sql
CREATE TEMP TABLE IF NOT EXISTS ts_amount AS (
    SELECT
        SUBSTR(CAST(sales_ymd AS VARCHAR), 1, 6) AS sales_ym,
        SUM(amount) AS sum_amount,
        ROW_NUMBER() OVER(ORDER BY SUBSTR(CAST(sales_ymd AS VARCHAR), 1, 6)) AS rn
    FROM receipt
    GROUP BY sales_ym
);

This code creates a temporary table (ts_amount) that summarizes the sales data by month (using the SUBSTR function to extract the year and month from the sales_ymd column), calculates the total amount of sales (SUM(amount)), and assigns a row number to each row (ROW_NUMBER() OVER(...) AS rn). The row numbers are used in the subsequent queries to join the data across multiple time periods.

DROP TABLE IF EXISTS series_data_1;
CREATE TABLE series_data_1 AS (
    WITH lag_amount AS (
        SELECT
            sales_ym,
            sum_amount,
            LAG(rn, 0) OVER (ORDER BY rn) AS rn
        FROM ts_amount
    )
    SELECT
        sales_ym,
        sum_amount,
        CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END AS test_flg
    FROM lag_amount
    WHERE rn BETWEEN 1 AND 18
);

This code creates a new table (series_data_1) by shifting the monthly row numbers with the LAG window function (here with an offset of 0, so no actual shift) and assigning a binary flag (test_flg) to each row indicating whether it belongs to the training set (test_flg = 0) or the test set (test_flg = 1). This is done by checking the row number (rn) against a threshold value (12), which splits the data into two parts: the first 12 rows are used for training, and the remaining 6 rows are used for testing. The resulting table has 18 rows in total. The code then repeats this process two more times, with different LAG offsets:

DROP TABLE IF EXISTS series_data_2;
CREATE TABLE series_data_2 AS (
    WITH lag_amount AS (
        SELECT
            sales_ym,
            sum_amount,
            LAG(rn, 6) OVER (ORDER BY rn) AS rn
        FROM ts_amount
    )
    SELECT
        sales_ym,
        sum_amount,
        CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END AS test_flg
    FROM lag_amount
    WHERE rn BETWEEN 1 AND 18
);

DROP TABLE IF EXISTS series_data_3;
CREATE TABLE series_data_3 AS (
    WITH lag_amount AS (
        SELECT
            sales_ym,
            sum_amount,
            LAG(rn, 12) OVER (ORDER BY rn) AS rn
        FROM ts_amount
    )
    SELECT
        sales_ym,
        sum_amount,
        CASE WHEN rn <= 12 THEN 0 WHEN 12 < rn THEN 1 END AS test_flg
    FROM lag_amount
    WHERE rn BETWEEN 1 AND 18
);

series_data_2 uses sales data from six months ago as the reference period, while series_data_3 uses sales data from one year ago.
 
 
Commentary :

This code is a SQL query that performs downsampling of customer data. Downsampling is a process of randomly removing some data points from a dataset to balance the classes or reduce the dataset size.

In the query, the first statement sets the random seed value. The next statement creates a temporary table called down_sampling that includes the customer data and the sum of amounts for each customer. The LEFT JOIN clause in the first CTE combines the customer table with the receipt table, calculating the sum of amounts for each customer.

In the second CTE, two additional columns are created using a CASE expression. These columns are flags that indicate whether a customer has made a purchase or not. Then, in the third CTE, the ROW_NUMBER() window function is used to assign a row number to each customer record within each class defined by the is_buy_flag column.

Finally, in the last statement of the query, the temporary table down_sampling is filtered to keep only the rows whose within-class row number does not exceed the smaller of the two class counts (buying and non-buying customers). This ensures that the number of rows for each class is balanced in the resulting table.
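A minimal sketch of this downsampling. The CTE names and the MIN-over-class-counts filter are assumptions, and only customer_id and sum_amount are carried through here for brevity:

SELECT SETSEED(0.1);

DROP TABLE IF EXISTS down_sampling;
CREATE TEMP TABLE down_sampling AS (
    WITH sales_amount AS (
        -- total sales per customer, 0 for customers with no receipts
        SELECT c.customer_id, COALESCE(SUM(r.amount), 0) AS sum_amount
        FROM customer c
        LEFT JOIN receipt r USING (customer_id)
        GROUP BY c.customer_id
    ),
    flagged AS (
        SELECT
            customer_id,
            sum_amount,
            CASE WHEN sum_amount > 0 THEN 1 ELSE 0 END AS is_buy_flag
        FROM sales_amount
    ),
    numbered AS (
        -- random order within each class
        SELECT *, ROW_NUMBER() OVER (PARTITION BY is_buy_flag ORDER BY RANDOM()) AS row
        FROM flagged
    )
    SELECT *
    FROM numbered
    WHERE row <= (SELECT MIN(cnt)
                  FROM (SELECT COUNT(*) AS cnt FROM flagged GROUP BY is_buy_flag) t)
);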
 
 
Commentary :

This SQL code creates two new tables from an existing table named customer.

The first part of the code drops a table named customer_std if it exists and then creates a new table named customer_std. The new table keeps a subset of the columns of the customer table: customer_id, customer_name, gender_cd, birth_day, age, postal_cd, application_store_cd, application_date, and status_cd. The purpose of creating this new table is not clear from the code alone; it might be created to standardize or preprocess the data.

The second part of the code drops a table named gender_std if it exists and then creates a new table named gender_std. The new table has two columns named gender_cd and gender. The gender_cd column is a distinct list of all gender codes found in the customer table, while the gender column is the corresponding gender description. This table might be used for mapping the gender codes to their respective gender descriptions.
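A minimal sketch of the two statements described above (column lists follow the commentary):

DROP TABLE IF EXISTS customer_std;
CREATE TABLE customer_std AS (
    SELECT
        customer_id,
        customer_name,
        gender_cd,
        birth_day,
        age,
        postal_cd,
        application_store_cd,
        application_date,
        status_cd
    FROM customer
);

DROP TABLE IF EXISTS gender_std;
CREATE TABLE gender_std AS (
    SELECT DISTINCT gender_cd, gender
    FROM customer
);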
 
Commentary : 

This code selects the first 3 rows from the customer_std table, which was created earlier in the code. The customer_std table is a copy of the customer table with a subset of columns, so this code is simply displaying the values of those columns for the first three customers in the table. This is a basic check to confirm that the data has been correctly imported into the new table.
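The check described above amounts to a query like:

SELECT * FROM customer_std LIMIT 3;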
 
 
Commentary :

This code creates a new table called "product_full" by joining two tables "product" and "category".

The "product" table contains information about each product, such as its product code, category codes, unit price, and unit cost.

The "category" table contains category codes and category names for each category level.

The "JOIN" clause connects the two tables based on the category_small_cd column. By using the "USING" keyword, the query matches rows where the values in category_small_cd are equal in both tables.

The new table "product_full" includes all the columns from the "product" table, plus additional columns for category names, which are obtained by looking up the category codes in the "category" table.
 
 
Commentary :

This SQL code drops the table product_full if it exists and creates a new table with the same name. The new table has nine columns: product_cd, category_major_cd, category_major_name, category_medium_cd, category_medium_name, category_small_cd, category_small_name, unit_price, and unit_cost. The first seven columns are of type VARCHAR with varying lengths, while the last two columns are of type INTEGER.

By specifying the data types and lengths of the columns, the code is defining the schema of the product_full table. This ensures that each column has a specific data type and length, which can improve the performance of queries on this table, as well as prevent errors that may occur if data of the wrong type or length is inserted.
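Taking the VARCHAR lengths from the column list given further below, the DDL looks roughly like this:

DROP TABLE IF EXISTS product_full;
CREATE TABLE product_full (
    product_cd VARCHAR(10),
    category_major_cd VARCHAR(2),
    category_major_name VARCHAR(20),
    category_medium_cd VARCHAR(4),
    category_medium_name VARCHAR(20),
    category_small_cd VARCHAR(6),
    category_small_name VARCHAR(20),
    unit_price INTEGER,
    unit_cost INTEGER
);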
 
 
Commentary :

This code imports data from a CSV file located at '/tmp/data/S_product_full_UTF-8_header.csv' into a new table called 'product_full' that has already been created in the database.

The COPY command in PostgreSQL allows you to copy data from a file to a table or vice versa. The FROM keyword is used to specify the path to the CSV file, and the WITH keyword is used to provide options for the COPY command. In this case, CSV specifies that the file is in Comma-Separated Value format, HEADER specifies that the first row of the file contains column names, and ENCODING 'UTF-8' specifies the character encoding used in the file.

Since the table 'product_full' has already been created in the database, the COPY command inserts the data from the CSV file into the existing table.
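A minimal sketch of the COPY command as described (the exact option spelling in the original notebook may differ):

COPY product_full FROM '/tmp/data/S_product_full_UTF-8_header.csv' WITH CSV HEADER ENCODING 'UTF-8';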
 
 
Commentary :

This code drops the table named product_full if it exists, and creates a new table with the same name and schema. The new table has 9 columns:

product_cd column is of type VARCHAR(10)
category_major_cd column is of type VARCHAR(2)
category_major_name column is of type VARCHAR(20)
category_medium_cd column is of type VARCHAR(4)
category_medium_name column is of type VARCHAR(20)
category_small_cd column is of type VARCHAR(6)
category_small_name column is of type VARCHAR(20)
unit_price column is of type INTEGER
unit_cost column is of type INTEGER

The VARCHAR data type is used to store variable-length character strings with the specified maximum length. The INTEGER data type is used to store whole numbers.

This code creates an empty table with the specified schema, and no data is added to the table.
 
 
Commentary :

This code is importing data from a CSV file named 'S_product_full_UTF-8_noh.csv' into the table 'product_full' using the COPY command in PostgreSQL.

The WITH CSV option specifies that the data is in CSV format. The ENCODING option specifies the character encoding of the file, which is UTF-8 in this case.

Since the file doesn't have a header row, the column names and types are specified explicitly in a preceding SQL statement which creates the 'product_full' table.

Once the data is imported into the table, the SELECT statement is used to retrieve the first 3 rows of the 'product_full' table to confirm that the data has been imported correctly.
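A minimal sketch of the import and the follow-up check. The /tmp/data/ directory is assumed from the earlier example; note the absence of the HEADER option, since this file has no header row:

COPY product_full FROM '/tmp/data/S_product_full_UTF-8_noh.csv' WITH CSV ENCODING 'UTF-8';
SELECT * FROM product_full LIMIT 3;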
 
Commentary :

This SQL code block first drops the table named product_full if it exists, and then creates a new table with the same name and schema.

The schema of the product_full table has 9 columns with the following data types:

product_cd: VARCHAR(10) (i.e., a variable-length character string with a maximum length of 10)
category_major_cd: VARCHAR(2)
category_major_name: VARCHAR(20)
category_medium_cd: VARCHAR(4)
category_medium_name: VARCHAR(20)
category_small_cd: VARCHAR(6)
category_small_name: VARCHAR(20)
unit_price: INTEGER
unit_cost: INTEGER

The purpose of this code block is to define the structure of the product_full table before data is loaded into it.
 
 
Commentary :

This code copies data from a TSV (tab-separated values) file located at '/tmp/data/S_product_full_UTF-8_header.tsv' and inserts it into the 'product_full' table in the current database.

The COPY command specifies that the input file is in CSV format with a header row as the first row of the file, and that the delimiter is a tab character ('\t') rather than a comma. The ENCODING parameter specifies that the file is encoded in UTF-8.

This command assumes that the 'product_full' table already exists and has the same structure as the data in the input file (product_cd, category_major_cd, category_major_name, category_medium_cd, category_medium_name, category_small_cd, category_small_name, unit_price, unit_cost). The command will insert the data from the file into the table, creating new rows in the table for each row of data in the file.
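A minimal sketch of the TSV import as described (E'\t' is PostgreSQL's escape-string syntax for a tab character):

COPY product_full FROM '/tmp/data/S_product_full_UTF-8_header.tsv' WITH CSV HEADER DELIMITER E'\t' ENCODING 'UTF-8';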
 
 
Commentary :

This code executes a SQL query to select the first three rows from the product_full table. The SELECT statement is followed by *, which means to select all columns from the table. LIMIT 3 limits the result set to the first three rows of the table.

The result will display the data in the product_full table according to the columns specified in the table. The exact output will depend on the data stored in the table.
 

 
