参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック（構造化データ加工編）」

The Data Scientist Society Github :

GitHub - The-Japan-DataScientist-Society/100knocks-preprocess: データサイエンス100本ノック（構造化データ加工編）

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Data Science 100 Knocks (Structured Data Processing) URL :

100knocks-preprocess/docker/work/answer/ans_preprocess_knock_R.ipynb at master · The-Japan-DataScientist-Society/100knocks-preprocess

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers.
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

You can get preprocess_knock files to practice python, SQL, R from my github account

ChanhiYasutomi - Repositories

ChanhiYasutomi has 20 repositories available. Follow their code on GitHub.

Introduction

To begin with, run the following cells
Import the required libraries and read the data from the database (PostgreSQL)
The libraries that are expected to be used are imported in the following cells
If there are other libraries you wish to use, install them as appropriate using install.packages().
Names, addresses, etc. are dummy data and are not real.

In [ ]:

require("RPostgreSQL") 
require("tidyr") 
require("dplyr") 
require("stringr") 
require("caret") 
require("lubridate") 
require("rsample") 
require("recipes") 
require("themis") 

host <- "db" 
port <- Sys.getenv()["PG_PORT"] 
dbname <- Sys.getenv()["PG_DATABASE"] 
user <- Sys.getenv()["PG_USER"] 
password <- Sys.getenv()["PG_PASSWORD"] 

con <- dbConnect(PostgreSQL(), host=host, port=port, dbname=dbname, user=user, password=password) 
df_customer <- dbGetQuery(con,"SELECT * FROM customer") 
df_category <- dbGetQuery(con,"SELECT * FROM category") 
df_product <- dbGetQuery(con,"SELECT * FROM product") 
df_receipt <- dbGetQuery(con,"SELECT * FROM receipt") 
df_store <- dbGetQuery(con,"SELECT * FROM store") 
df_geocode <- dbGetQuery(con,"SELECT * FROM geocode")

Loading required package: RPostgreSQL

Loading required package: DBI

Loading required package: tidyr

Loading required package: dplyr


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: stringr

Loading required package: caret

Loading required package: ggplot2

Loading required package: lattice

Loading required package: lubridate


Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union


Loading required package: rsample

Loading required package: recipes


Attaching package: ‘recipes’


The following object is masked from ‘package:stringr’:

    fixed


The following object is masked from ‘package:stats’:

    step


Loading required package: themis

Registered S3 methods overwritten by 'themis':
  method                  from   
  bake.step_downsample    recipes
  bake.step_upsample      recipes
  prep.step_downsample    recipes
  prep.step_upsample      recipes
  tidy.step_downsample    recipes
  tidy.step_upsample      recipes
  tunable.step_downsample recipes
  tunable.step_upsample   recipes


Attaching package: ‘themis’


The following objects are masked from ‘package:recipes’:

    step_downsample, step_upsample

Practice Questions

R-061: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), convert the total sales amount to ordinary logarithm (bottom 10) and display 10 items with customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.

In [ ]:

df_receipt %>% 
 filter(!grepl("^Z", customer_id)) %>% 
 group_by(customer_id) %>% 
 summarise(sum_amount = sum(amount), .groups = "drop") %>% 
 mutate(log_amount = log((sum_amount + 0.5), base = 10)) %>% 
 slice(1:10)

A tibble: 10 × 3

customer_id	sum_amount	log_amount
<chr>	<int>	<dbl>
CS001113000004	1298	3.113442
CS001114000005	626	2.796921
CS001115000010	3044	3.483516
CS001205000004	1988	3.298526
CS001205000006	3337	3.523421
CS001211000025	456	2.659441
CS001212000027	448	2.651762
CS001212000031	296	2.472025
CS001212000046	228	2.358886
CS001212000070	456	2.659441

Commentary :

This code performs the following operations on the df_receipt data frame:

It filters out the rows where the customer_id starts with "Z" using the filter function.

It groups the data by customer_id using the group_by function.

It summarizes the amount column for each customer_id by taking the sum using the summarise function.

It drops the groups created in step 2 using the .groups = "drop" argument.

It creates a new column called log_amount by taking the logarithm of sum_amount plus 0.5, using base 10.

It selects the first 10 rows of the resulting data frame using the slice function.

Overall, this code creates a new column with the logarithm of the total amount spent by each customer in the df_receipt data frame, which is a common transformation used to normalize skewed data.

R-062: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), convert the total sales amount to the natural logarithm (bottom e) and display 10 items with the customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.

In [ ]:

df_receipt %>% 
 filter(!grepl("^Z", customer_id)) %>% 
 group_by(customer_id) %>% 
 summarise(sum_amount = sum(amount), .groups = "drop") %>% 
 mutate(log_amount = log(sum_amount + 0.5)) %>% 
slice(1:10)

A tibble: 10 × 3

customer_id	sum_amount	log_amount
<chr>	<int>	<dbl>
CS001113000004	1298	7.168965
CS001114000005	626	6.440149
CS001115000010	3044	8.021092
CS001205000004	1988	7.595136
CS001205000006	3337	8.112977
CS001211000025	456	6.123589
CS001212000027	448	6.105909
CS001212000031	296	5.692047
CS001212000046	228	5.431536
CS001212000070	456	6.123589

Commentary :

This code processes a data frame called df_receipt. It does the following steps:

Filters out rows where the customer_id starts with "Z". This is done using the filter() function from the dplyr package with the !grepl() function as a condition.

Groups the remaining rows by customer_id, using the group_by() function from the dplyr package.

Calculates the total amount of each customer's purchases using the sum() function on the amount column. This is done using the summarise() function from the dplyr package.

Adds a new column called log_amount that contains the natural logarithm of the sum_amount column plus 0.5. This is done using the mutate() function from the dplyr package and the log() function from base R.

Finally, it selects the first 10 rows of the resulting data frame using the slice() function from the dplyr package.

R-063: Calculate the profit amount of each product from the unit price(unit price) and cost of the product(unit_cost) data (df_product) and display 10 results.

In [ ]:

df_product %>% 
mutate(unit_profit = unit_price - unit_cost) %>% 
slice(1:10)

A data.frame: 10 × 7

product_cd	category_major_cd	category_medium_cd	category_small_cd	unit_price	unit_cost	unit_profit
<chr>	<chr>	<chr>	<chr>	<int>	<int>	<int>
P040101001	04	0401	040101	198	149	49
P040101002	04	0401	040101	218	164	54
P040101003	04	0401	040101	230	173	57
P040101004	04	0401	040101	248	186	62
P040101005	04	0401	040101	268	201	67
P040101006	04	0401	040101	298	224	74
P040101007	04	0401	040101	338	254	84
P040101008	04	0401	040101	420	315	105
P040101009	04	0401	040101	498	374	124
P040101010	04	0401	040101	580	435	145

Commentary :

This code takes the data frame df_product and creates a new column unit_profit using the mutate() function. The unit_profit column is calculated as the difference between unit_price and unit_cost, which are existing columns in df_product.

The slice() function is then used to show the first 10 rows of the resulting data frame with the new unit_profit column included.

R-064: Calculate the overall average profit margin for each product from the unit price (unit_price) and cost (unit_cost) of the product data (df_product). Note, however, that unit_price and cost are missing.

In [ ]:

df_product %>% 
mutate(unit_profit_rate = (unit_price - unit_cost) / unit_price) %>% 
summarise(total_mean = mean(unit_profit_rate, na.rm = TRUE))

A data.frame: 1 × 1

total_mean
<dbl>
0.2491139

Commentary :

This code is performing some calculations on a dataframe named df_product.

mutate is used to create a new column named unit_profit_rate. This column is calculated by subtracting the unit_cost from the unit_price and then dividing the result by unit_price.

The summarise function is then used to compute the mean of the unit_profit_rate column. The result is assigned to a new column named total_mean.

In summary, this code is calculating the average profit rate for each unit of product sold in df_product.

R-065: For each product in the product data (df_product), find the new unit price that gives a profit margin of 30%. However, round down to the nearest yen. Then display 10 results and confirm that the profit margin is approximately around 30%. Note that there is a deficit in unit_price（unit_price） and unit_cost（unit_cost）.

In [ ]:

df_product[c("product_cd", "unit_price", "unit_cost")] %>% 
mutate(new_price = trunc(unit_cost / 0.7)) %>% 
mutate(new_profit_rate = (new_price - unit_cost)/ new_price) %>% slice(1:10)

A data.frame: 10 × 5

product_cd	unit_price	unit_cost	new_price	new_profit_rate
<chr>	<int>	<int>	<dbl>	<dbl>
P040101001	198	149	212	0.2971698
P040101002	218	164	234	0.2991453
P040101003	230	173	247	0.2995951
P040101004	248	186	265	0.2981132
P040101005	268	201	287	0.2996516
P040101006	298	224	320	0.3000000
P040101007	338	254	362	0.2983425
P040101008	420	315	450	0.3000000
P040101009	498	374	534	0.2996255
P040101010	580	435	621	0.2995169

Commentary :

This code performs the following steps:

It selects only the columns "product_cd", "unit_price", and "unit_cost" from the data frame "df_product".

It creates a new column called "new_price" by dividing the unit cost by 0.7 and truncating the result to the nearest integer. This is an arbitrary calculation intends to lower the price of the product by a certain percentage.

It creates another new column called "new_profit_rate" by subtracting the unit cost from the new price and dividing the result by the new price. This is meant to calculate the profit rate for the new price.

Finally, it selects the first 10 rows of the resulting data frame.

R-066: For each product in the product data (df_product), find a new unit price that gives a profit margin of 30%. This time, round to the nearest yen (rounding or rounding to even numbers is fine). Then display 10 results and check that the profit margin is around 30%. Note, however, that there are deficiencies in unit_price（unit_price） and unit_cost（unit_cost）.

In [ ]:

df_product[c("product_cd", "unit_price", "unit_cost")] %>% 
mutate(new_price = round(unit_cost / 0.7)) %>% 
mutate(new_profit_rate = (new_price - unit_cost) / new_price) %>% 
slice(1:10)

A data.frame: 10 × 5

product_cd	unit_price	unit_cost	new_price	new_profit_rate
<chr>	<int>	<int>	<dbl>	<dbl>
P040101001	198	149	213	0.3004695
P040101002	218	164	234	0.2991453
P040101003	230	173	247	0.2995951
P040101004	248	186	266	0.3007519
P040101005	268	201	287	0.2996516
P040101006	298	224	320	0.3000000
P040101007	338	254	363	0.3002755
P040101008	420	315	450	0.3000000
P040101009	498	374	534	0.2996255
P040101010	580	435	621	0.2995169

Commentary :

This code is performing some calculations on the product data frame (df_product).

First, it is selecting three columns, "product_cd", "unit_price", and "unit_cost".

Then, it is creating a new column, "new_price", which is calculated by dividing "unit_cost" by 0.7 (i.e., reducing the cost by 30%) and rounding the result to the nearest whole number using the round() function.

Next, it creates a new column, "new_profit_rate", which is calculated by subtracting "unit_cost" from "new_price" and dividing the result by "new_price".

Finally, it selects the first 10 rows of the resulting data frame and displays them.

This code essentially calculates the new price and profit rate for each product if the cost of production is reduced by 30%.

R-067: For each product in the product data (df_product), find a new unit price that gives a profit margin of 30%. This time, round up to the nearest yen. Then display 10 results and check that the profit margin is around 30%. Note, however, that there is a deficit in the unit price (unit_price) and cost (unit_cost).

In [ ]:

df_product[c("product_cd", "unit_price", "unit_cost")] %>% 
mutate(new_price = ceiling(unit_cost / 0.7)) %>% 
mutate(new_profit_rate = (new_price - unit_cost) / new_price) %>% 
slice(1:10)

A data.frame: 10 × 5

product_cd	unit_price	unit_cost	new_price	new_profit_rate
<chr>	<int>	<int>	<dbl>	<dbl>
P040101001	198	149	213	0.3004695
P040101002	218	164	235	0.3021277
P040101003	230	173	248	0.3024194
P040101004	248	186	266	0.3007519
P040101005	268	201	288	0.3020833
P040101006	298	224	320	0.3000000
P040101007	338	254	363	0.3002755
P040101008	420	315	451	0.3015521
P040101009	498	374	535	0.3009346
P040101010	580	435	622	0.3006431

Commentary :

This code manipulates the df_product data frame by adding two new variables new_price and new_profit_rate.

mutate(new_price = ceiling(unit_cost / 0.7)) creates a new column new_price that contains the result of the expression ceiling(unit_cost / 0.7). The ceiling() function rounds the result of the expression unit_cost / 0.7 up to the nearest integer. This results in a new price that is 30% higher than the unit cost.

mutate(new_profit_rate = (new_price - unit_cost) / new_price) creates a new column new_profit_rate that contains the result of the expression (new_price - unit_cost) / new_price. This expression calculates the new profit rate, which is the percentage of the new price that represents profit. This is calculated as the difference between the new price and the unit cost, divided by the new price.

The slice(1:10) function is used to display the first 10 rows of the resulting data frame.

R-068: For each product in the product data (df_product), find the amount including tax at a consumption tax rate of 10%, rounding down fractions of a yen and displaying 10 results. Note that the unit price (unit_price) is missing.

In [ ]:

head(cbind(product_cd = df_product$product_cd,
                        df_product$unit_price, 
                        tax_price = trunc(df_product$unit_price * 1.1)), 10)

A matrix: 10 × 3 of type chr

product_cd	unit_price	tax_price
P040101001	198	217
P040101002	218	239
P040101003	230	253
P040101004	248	272
P040101005	268	294
P040101006	298	327
P040101007	338	371
P040101008	420	462
P040101009	498	547
P040101010	580	638

Commentary :

This code uses the cbind function to combine three columns of df_product data frame, i.e., product_cd, unit_price, and tax_price, into a new data frame. The head function is then used to display the first ten rows of this new data frame.

The first column of the new data frame is labeled product_cd and contains the values of the product_cd column from df_product. The second column is labeled df_product$unit_price and contains the values of the unit_price column from df_product.

The third column is labeled tax_price and is created by applying the trunc function to the result of multiplying each value of the unit_price column by 1.1. This calculates the price of the product including a 10% tax.

Note that the use of df_product$ in the column name for the second column is unnecessary as the cbind function already extracts the unit_price column from df_product.

R-069: Combine receipt details data (df_receipt) and product data (df_product), calculate the total sales value of all products for each customer and the total sales value of products whose category major classification code (category_major_cd) is “07” (bottled canned food), and find the ratio between the two. Only customers with a sales record for category major category code “07” (bottled canned food) should be selected, and 10 results should be displayed.

In [ ]:

# Code example 1 
df_tmp_1 <- df_receipt %>% 
 group_by(customer_id) %>% 
 summarise(sum_all=sum(amount)) 

df_tmp_2 <- inner_join(df_receipt, df_product[c("product_cd","category_major_cd")], by="product_cd") %>% 
 filter(category_major_cd == "07") %>% group_by(customer_id) %>% 
 summarise(sum_07 = sum(amount), .groups = "drop") 

inner_join(df_tmp_1, df_tmp_2, by = "customer_id") %>% 
mutate(sales_rate = sum_07 / sum_all) %>% 
slice(1:10)

A tibble: 10 × 4

customer_id	sum_all	sum_07	sales_rate
<chr>	<int>	<int>	<dbl>
CS001113000004	1298	1298	1.0000000
CS001114000005	626	486	0.7763578
CS001115000010	3044	2694	0.8850197
CS001205000004	1988	346	0.1740443
CS001205000006	3337	2004	0.6005394
CS001212000027	448	200	0.4464286
CS001212000031	296	296	1.0000000
CS001212000046	228	108	0.4736842
CS001212000070	456	308	0.6754386
CS001213000018	243	145	0.5967078

Commentary :

This code performs the following operations:

Compute the total amount spent by each customer in the df_receipt data frame, and store it in df_tmp_1.

Join df_receipt and df_product data frames by product_cd, and filter out only the rows where category_major_cd is equal to "07".

Compute the sum of the amount column for each customer in the filtered data frame, and store it in df_tmp_2.

Join df_tmp_1 and df_tmp_2 data frames by customer_id.

Compute the sales rate of category major code "07" for each customer, and store it in a new column named sales_rate.

Slice the resulting data frame to show the first 10 rows.

The inner_join function is used to merge data frames on a common variable. The mutate function adds a new column to the data frame, which is calculated based on existing columns. The slice function is used to extract a subset of rows from the data frame.

In [ ]:

# Code example 2 (%>% to link all processes) 

inner_join(df_receipt, df_product[c("product_cd","category_major_cd")], by="product_cd") %>% filter(category_major_cd == "07") %>% 

group_by(customer_id) %>% 
summarise(sum_07 = sum(amount), .groups = "drop") %>% 
inner_join(df_receipt, by="customer_id") %>% 
group_by(customer_id) %>% 
summarise(sum_all=sum(amount), sum_07=max(sum_07), .groups = "drop") %>% 
mutate(sales_rate = sum_07 / sum_all) %>% 
slice(1:10)

A tibble: 10 × 4

customer_id	sum_all	sum_07	sales_rate
<chr>	<int>	<int>	<dbl>
CS001113000004	1298	1298	1.0000000
CS001114000005	626	486	0.7763578
CS001115000010	3044	2694	0.8850197
CS001205000004	1988	346	0.1740443
CS001205000006	3337	2004	0.6005394
CS001212000027	448	200	0.4464286
CS001212000031	296	296	1.0000000
CS001212000046	228	108	0.4736842
CS001212000070	456	308	0.6754386
CS001213000018	243	145	0.5967078

Commentary :

This code is calculating the sales rate of category 07 products for the top 10 customers who have bought the most of those products. Here is a breakdown of what each line does:

inner_join(df_receipt, df_product[c("product_cd","category_major_cd")], by="product_cd"): This joins the df_receipt and df_product data frames based on the product_cd column, and only keeps the product_cd and category_major_cd columns from df_product. The result is a new data frame that has information on the category of each product purchased in df_receipt.

%>% filter(category_major_cd == "07"): This filters the joined data frame from step 1 to only keep rows where the category_major_cd column is equal to "07".

%>% group_by(customer_id) %>% summarise(sum_07 = sum(amount), .groups = "drop"): This groups the filtered data frame from step 2 by customer_id and calculates the sum of the amount column for each group. The result is a new data frame that has the total amount spent on category 07 products by each customer.

%>% inner_join(df_receipt, by="customer_id"): This joins the data frame from step 3 with the original df_receipt data frame based on the customer_id column. The result is a new data frame that has the total amount spent by each customer on all products.

%>% group_by(customer_id) %>% summarise(sum_all=sum(amount), sum_07=max(sum_07), .groups = "drop"): This groups the joined data frame from step 4 by customer_id and calculates the sum of the amount column for each group, as well as the maximum value of the sum_07 column for each group. The result is a new data frame that has the total amount spent by each customer on all products, as well as the maximum amount spent by each customer on category 07 products.

%>% mutate(sales_rate = sum_07 / sum_all): This calculates the sales rate for each customer by dividing the maximum amount spent on category 07 products (sum_07) by the total amount spent on all products (sum_all).

%>% slice(1:10): This keeps only the top 10 rows of the data frame based on the sales rate column. The result is a new data frame that shows the top 10 customers with the highest sales rate for category 07 products, along with their total amount spent on all products and their maximum amount spent on category 07 products.

R-070: Calculate the number of days elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt) and display 10 items with the customer ID (customer_id), sales date and membership application date. (Note that sales_ymd is numeric and application_date is a string).

In [ ]:

df_receipt[c("customer_id", "sales_ymd")] %>% 
distinct(.,.keep_all=TRUE) %>% inner_join(df_customer[c("customer_id","application_date")], by="customer_id") %>% mutate(elapsed_days = as.integer( strptime(as.character(sales_ymd), "%Y%m%d") - strptime(application_date, "%Y%m%d"))) %>% 
select(customer_id, sales_ymd, application_date, elapsed_days) %>% 
slice(1:10)

A data.frame: 10 × 4

customer_id	sales_ymd	application_date	elapsed_days
<chr>	<int>	<chr>	<int>
CS006214000001	20181103	20150201	1371
CS008415000097	20181118	20150322	1337
CS028414000014	20170712	20150711	732
CS025415000050	20180821	20160131	933
CS003515000195	20190605	20150306	1552
CS024514000042	20181205	20151010	1152
CS040415000178	20190922	20150627	1548
CS027514000015	20191010	20151101	1439
CS025415000134	20190918	20150720	1521
CS021515000126	20171010	20150508	886

Commentary :

This code performs the following tasks:

Selects the columns 'customer_id' and 'sales_ymd' from the 'df_receipt' data frame.

Removes duplicated rows while keeping all columns ('distinct').

Joins the resulting data frame with the 'df_customer' data frame by 'customer_id'.

Adds a new column 'elapsed_days' that represents the difference in days between the 'sales_ymd' and 'application_date' columns, converted to integers.

Selects only the 'customer_id', 'sales_ymd', 'application_date', and 'elapsed_days' columns.

Returns the first 10 rows of the resulting data frame.

The inner workings of the code can be broken down into the following steps:

# Select customer_id and sales_ymd columns from df_receipt
df1 <- df_receipt[c("customer_id", "sales_ymd")]

# Remove duplicated rows while keeping all columns
df2 <- distinct(df1, .keep_all = TRUE)

# Join with df_customer on customer_id
df3 <- inner_join(df2, df_customer[c("customer_id", "application_date")], by = "customer_id")

# Add a new column with elapsed days
df4 <- mutate(df3, elapsed_days = as.integer(strptime(as.character(sales_ymd), "%Y%m%d") - strptime(application_date, "%Y%m%d")))

# Select columns and return first 10 rows
df5 <- select(df4, customer_id, sales_ymd, application_date, elapsed_days)
slice(df5, 1:10)

The code uses the dplyr package to perform data manipulations. The distinct function removes duplicated rows based on all columns, and the inner_join function joins two data frames based on a common column. The mutate function creates a new column in a data frame, and the select function selects specific columns from a data frame. Finally, the slice function returns a subset of rows from a data frame.

R-071: Calculate the number of months elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and membership application date (Note that sales_ymd is a number and application_date is a string).

In [ ]:

df_receipt[c("customer_id", "sales_ymd")] %>% 
distinct(., .keep_all = TRUE) %>% 
inner_join(df_customer[c("customer_id", "application_date")], by = "customer_id") %>% mutate(elapsed_months = trunc(time_length( interval( strptime(application_date, "%Y%m%d"), strptime(as.character(sales_ymd), "%Y%m%d") ), "month"))) %>% 

select(customer_id, sales_ymd, application_date, elapsed_months) %>% 
slice(1:10)

A data.frame: 10 × 4

customer_id	sales_ymd	application_date	elapsed_months
<chr>	<int>	<chr>	<dbl>
CS006214000001	20181103	20150201	45
CS008415000097	20181118	20150322	43
CS028414000014	20170712	20150711	24
CS025415000050	20180821	20160131	30
CS003515000195	20190605	20150306	50
CS024514000042	20181205	20151010	37
CS040415000178	20190922	20150627	50
CS027514000015	20191010	20151101	47
CS025415000134	20190918	20150720	49
CS021515000126	20171010	20150508	29

Commentary :

This code performs the following operations:

df_receipt[c("customer_id", "sales_ymd")] selects only the columns customer_id and sales_ymd from the df_receipt data frame.

distinct(., .keep_all = TRUE) removes duplicate rows while keeping all columns. This ensures that there is only one row for each unique combination of customer_id and sales_ymd.

inner_join(df_customer[c("customer_id", "application_date")], by = "customer_id") joins the result from the previous step with the df_customer data frame on the customer_id column, selecting only the customer_id and application_date columns from df_customer.

mutate(elapsed_months = trunc(time_length( interval( strptime(application_date, "%Y%m%d"), strptime(as.character(sales_ymd), "%Y%m%d") ), "month"))) creates a new column elapsed_months that calculates the number of months between the application_date and sales_ymd dates. This is done by converting the dates to the appropriate format using strptime, calculating the difference using the interval function, and then using time_length to calculate the number of months. The trunc function is used to round down to the nearest integer.

select(customer_id, sales_ymd, application_date, elapsed_months) selects only the columns customer_id, sales_ymd, application_date, and elapsed_months.

slice(1:10) selects only the first 10 rows of the resulting data frame.

Overall, this code is calculating the number of months between the application date and the sales date for each customer, based on their purchase history.

R-072: Calculate the number of years elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (df_customer) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and membership application date (Note that sales_ymd is a number and application_date is a string).

In [ ]:

df_receipt[c("customer_id", "sales_ymd")] %>% 
distinct(., .keep_all = TRUE) %>% 
inner_join(df_customer[c("customer_id", "application_date")], by = "customer_id") %>% mutate(elapsed_years = trunc(time_length(interval( strptime(application_date, "%Y%m%d"), strptime(as.character(sales_ymd), "%Y%m%d")), "year")))%>% 

select(customer_id, sales_ymd, application_date, elapsed_years) %>% 
slice(1:10)

A data.frame: 10 × 4

customer_id	sales_ymd	application_date	elapsed_years
<chr>	<int>	<chr>	<dbl>
CS006214000001	20181103	20150201	3
CS008415000097	20181118	20150322	3
CS028414000014	20170712	20150711	2
CS025415000050	20180821	20160131	2
CS003515000195	20190605	20150306	4
CS024514000042	20181205	20151010	3
CS040415000178	20190922	20150627	4
CS027514000015	20191010	20151101	3
CS025415000134	20190918	20150720	4
CS021515000126	20171010	20150508	2

Commentary :

This code performs the following operations:

Extracts two columns, customer_id and sales_ymd, from the df_receipt data frame.

Removes duplicated rows of the above two columns and keeps only the first occurrence of each unique pair of customer_id and sales_ymd.

Joins the resulting data frame with df_customer data frame on the customer_id column.

Computes the elapsed years between the application_date and sales_ymd dates for each customer by creating an interval using interval function and then using time_length function to extract the number of years. The trunc function is used to round down the result to an integer value.

Selects the customer_id, sales_ymd, application_date, and elapsed_years columns.

Finally, selects the first 10 rows of the resulting data frame using the slice function.

In summary, this code computes the elapsed years between a customer's application_date and their sales_ymd date.

R-073: Calculate the elapsed time in epoch seconds from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and (Note that sales_ymd is held as a number and application_date as a string). Note that no time information is held, so each date should represent 0:00:00:00.

In [ ]:

df_receipt[c("customer_id", "sales_ymd")] %>% 
distinct(., .keep_all = TRUE) %>% inner_join(df_customer[c("customer_id","application_date")], by="customer_id") %>% mutate(elapsed_epoch = as.numeric(strptime(as.character(sales_ymd), "%Y%m%d")) - as.numeric(strptime(application_date, "%Y%m%d"))) %>% 

select(customer_id, sales_ymd, application_date, elapsed_epoch) %>% 
slice(1:10)

A data.frame: 10 × 4

customer_id	sales_ymd	application_date	elapsed_epoch
<chr>	<int>	<chr>	<dbl>
CS006214000001	20181103	20150201	118454400
CS008415000097	20181118	20150322	115516800
CS028414000014	20170712	20150711	63244800
CS025415000050	20180821	20160131	80611200
CS003515000195	20190605	20150306	134092800
CS024514000042	20181205	20151010	99532800
CS040415000178	20190922	20150627	133747200
CS027514000015	20191010	20151101	124329600
CS025415000134	20190918	20150720	131414400
CS021515000126	20171010	20150508	76550400

Commentary :

This code is processing some data related to customers and their purchase history. Here is an explanation of the code line by line:

df_receipt[c("customer_id", "sales_ymd")]: select the columns "customer_id" and "sales_ymd" from the "df_receipt" dataframe.

distinct(., .keep_all = TRUE): remove duplicate rows while keeping all columns.

inner_join(df_customer[c("customer_id","application_date")], by="customer_id"): join the "df_customer" dataframe using the "customer_id" column.

mutate(elapsed_epoch = as.numeric(strptime(as.character(sales_ymd), "%Y%m%d")) - as.numeric(strptime(application_date, "%Y%m%d"))): create a new column called "elapsed_epoch" that calculates the number of seconds between the "sales_ymd" column and "application_date" column.

select(customer_id, sales_ymd, application_date, elapsed_epoch): select the columns "customer_id", "sales_ymd", "application_date", and "elapsed_epoch" from the joined dataframe.

slice(1:10): select the first 10 rows of the resulting dataframe.

R-074: For the sales date (sales_ymd) of the receipt details data (df_receipt), calculate the number of days elapsed since Monday of the week in question and display 10 items together with the sales date and the previous Monday (note that sales_ymd holds data in numerical form).

In [ ]:

df_receipt["sales_ymd"] %>% 
# The following converts Sunday to the last day of the previous week by shifting it one day forward because the start day is Sunday. 
# In this situation, floor_date will get Sunday of the previous week as the start day of the week, so. # 1 is added to make the start day Monday. 
mutate(monday = as.Date(floor_date( strptime(as.character(sales_ymd), "%Y%m%d") - 1 , unit = "week")) + 1) %>% 
mutate(elapsed_days = as.integer( as.Date( strptime(as.character(sales_ymd), "%Y%m%d")) - monday)) %>% 
select(sales_ymd, elapsed_days, monday) %>% 
slice(1:10)

A data.frame: 10 × 3

sales_ymd	elapsed_days	monday
<int>	<int>	<date>
20181103	5	2018-10-29
20181118	6	2018-11-12
20170712	2	2017-07-10
20190205	1	2019-02-04
20180821	1	2018-08-20
20190605	2	2019-06-03
20181205	2	2018-12-03
20190922	6	2019-09-16
20170504	3	2017-05-01
20191010	3	2019-10-07

Commentary :

This code calculates the number of elapsed days since the start of the week (Monday) for each sales date in the df_receipt data frame. Here is a breakdown of the code:

df_receipt["sales_ymd"]: This selects the sales_ymd column from the df_receipt data frame.

%>%: This is the pipe operator that passes the result from the previous operation to the next operation.

mutate(monday = as.Date(floor_date( strptime(as.character(sales_ymd), "%Y%m%d") - 1 , unit = "week")) + 1): This creates a new column called monday that represents the Monday of the week for each sales date. It does this by converting the sales_ymd column to a date object, subtracting one day, rounding down to the previous week (with Sunday being the end of the week), and adding one day to make it the start of the week (Monday).

%>%: Another pipe operator.

mutate(elapsed_days = as.integer( as.Date( strptime(as.character(sales_ymd), "%Y%m%d")) - monday)): This creates a new column called elapsed_days that represents the number of days that have elapsed since the start of the week (Monday) for each sales date. It does this by converting the sales_ymd column to a date object, subtracting the Monday of the week (from the monday column), and then converting the result to an integer.

%>%: Another pipe operator.

select(sales_ymd, elapsed_days, monday): This selects the sales_ymd, elapsed_days, and monday columns.

%>%: Another pipe operator.

slice(1:10): This selects the first ten rows of the resulting data frame.

R-075: Randomly extract 1% of the data from the customer data (df_customer) and display the first 10 items.

In [ ]:

head(sample_frac(tbl = df_customer, size = 0.01), 10)

A data.frame: 10 × 11

	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	<chr>	<chr>	<chr>	<chr>	<date>	<int>	<chr>	<chr>	<chr>	<chr>	<chr>
1	CS001612000311	岡本美佳	1	女性	1951-03-21	68	211-0015	神奈川県川崎市中原区北谷町**********	S13001	20171118	0-00000000-0
2	CS018415000209	西村由美子	1	女性	1974-11-26	44	203-0051	東京都東久留米市小山**********	S13018	20151001	C-20101026-D
3	CS020415000109	岡村里奈	1	女性	1969-02-02	50	173-0001	東京都板橋区本町**********	S13020	20150207	A-20100922-D
4	CS040403000046	辻獅童	0	男性	1977-01-23	42	226-0016	神奈川県横浜市緑区霧が丘**********	S14040	20150813	0-00000000-0
5	CS044313000008	河原育子	1	女性	1985-06-03	33	144-0056	東京都大田区西六郷**********	S13044	20161204	0-00000000-0
6	CS038502000025	小柳ケンイチ	0	男性	1963-06-01	55	134-0015	東京都江戸川区西瑞江**********	S13038	20151214	0-00000000-0
7	CS034402000064	川越明	0	男性	1972-11-27	46	213-0031	神奈川県川崎市高津区宇奈根**********	S14034	20170913	0-00000000-0
8	CS015303000005	寺西一徳	0	男性	1987-01-16	32	135-0016	東京都江東区東陽**********	S13015	20150223	0-00000000-0
9	CS029411000007	森永璃子	1	女性	1976-12-01	42	134-0085	東京都江戸川区南葛西**********	S12029	20150709	5-20101007-9
10	CS048612000001	劇団恵子	1	女性	1956-01-11	63	224-0053	神奈川県横浜市都筑区池辺町**********	S14048	20170920	0-00000000-0

Commentary :

This code selects a random sample of 1% of the rows in the df_customer dataframe using the sample_frac() function from the dplyr package, and then returns the first 10 rows of the resulting dataframe using the head() function.

The sample_frac() function is used to randomly sample a fraction of rows from a dataframe. In this case, the tbl argument specifies the dataframe to sample from, and the size argument specifies the fraction of rows to sample (0.01, or 1%). The resulting dataframe is then passed to the head() function to return the first 10 rows.

R-076: Extract 10% of the data stratified randomly from the customer data (df_customer) based on the percentage of gender code (gender_cd) and count the number of cases by gender code

In [ ]:

set.seed(71) df_customer %>% 
group_by(gender_cd) %>% 
sample_frac(0.1) %>% 
summarise(customer_num = n(), .groups = "drop")

A tibble: 3 × 2

gender_cd	customer_num
<chr>	<int>
0	298
1	1792
9	107

Commentary :

This code uses the dplyr package to randomly sample 10% of customers from each gender group and count the number of customers in each sample.

set.seed(71) sets a seed for the random number generator to ensure reproducibility of the sampling.

df_customer %>% takes the df_customer data frame as input and passes it to the next operation using the pipe operator %>%.

group_by(gender_cd) groups the data by the gender code column.

sample_frac(0.1) samples 10% of customers from each gender group. The sample_frac() function is used to randomly sample a fraction of rows from the grouped data frame. The argument 0.1 specifies the fraction to be sampled.

summarise(customer_num = n(), .groups = "drop") calculates the number of customers in each gender group. The n() function counts the number of rows in each group, and renames the resulting column to customer_num. The .groups argument is set to "drop" to remove the grouping information from the output.

The final output will be a data frame with two columns: gender_cd and customer_num. The gender_cd column specifies the gender code, and the customer_num column specifies the number of customers sampled from each gender group.

R-077: Sum the sales amounts in the receipt details data (df_receipt) by customer unit and extract outliers of the summed sales amounts. The outliers should be calculated by logarithmising the total sales amount, calculating the mean and standard deviation, and then deviating from the mean by more than 3σ (either the natural logarithm or the ordinary logarithm is acceptable). Display 10 results.

In [ ]:

df_receipt %>% 
group_by(customer_id) %>% 
summarise(sum_amount = sum(amount), .groups = "drop") %>% 
mutate(log_sum_amount = log(sum_amount + 0.5)) %>% 
filter(abs(log_sum_amount - mean(log_sum_amount)) / sd(log_sum_amount) > 3) %>% slice(1:10)

A tibble: 1 × 3

customer_id	sum_amount	log_sum_amount
<chr>	<int>	<dbl>
ZZ000000000000	12395003	16.3328

Commentary :

This code is processing a data frame called df_receipt and doing the following:

Grouping the data frame by customer_id using group_by function.

Calculating the sum of amount for each customer using summarise function and storing it in a new column called sum_amount.

Taking the logarithm of sum_amount column and storing it in a new column called log_sum_amount.

Calculating the z-score of log_sum_amount by subtracting the mean of log_sum_amount from each value and then dividing the result by the standard deviation of log_sum_amount.

Filtering out customers whose log_sum_amount z-score is greater than 3 (i.e., those whose log_sum_amount is more than three standard deviations away from the mean).

Selecting the first 10 rows of the resulting data frame using the slice function.

In summary, the code is identifying customers whose total spending is significantly higher or lower than the average by calculating their z-score and filtering out those whose z-score exceeds a certain threshold.

R-078: Sum the sales amount (amount) of the receipt details data (df_receipt) in customer units and extract outliers of the summed sales amount. However, exclude those whose customer ID starts with “Z”, as they represent non-members. Outliers are defined as the difference between the first and third quartiles using IQR, and are defined as being below “first quartile – 1.5 x IQR” or above “third quartile + 1.5 x IQR”. Show 10 results.

In [ ]:

df_receipt %>% 
group_by(customer_id) %>% 
filter(!grepl("^Z", customer_id)) %>% 
summarise(sum_amount = sum(amount), .groups = "drop") %>% 

filter( sum_amount < quantile(sum_amount)[2] - 1.5 * (quantile(sum_amount)[4] - quantile(sum_amount)[2]) | sum_amount > quantile(sum_amount)[4] + 1.5 * (quantile(sum_amount)[4] - quantile(sum_amount)[2]) ) %>% 

slice(1:10)

A tibble: 10 × 2

customer_id	sum_amount
<chr>	<int>
CS001414000048	8584
CS001605000009	18925
CS002415000594	9568
CS004414000181	9584
CS005415000137	8734
CS006414000001	9156
CS006414000029	9179
CS006415000105	10042
CS006415000147	12723
CS006415000157	10648

Commentary :

This code performs the following operations:

Group the df_receipt data frame by customer_id.

Filter out all the rows where the customer_id starts with the letter "Z".

Sum the amount for each customer_id.

Filter out the top and bottom 1.5% of the sum_amount values using the interquartile range (IQR) method.

Select the first 10 rows of the resulting data frame.

The IQR method is a way to detect and remove outliers from a dataset based on the quartiles. In this case, the upper and lower bounds are calculated as follows:

Upper bound: quantile(sum_amount)[4] + 1.5 * (quantile(sum_amount)[4] - quantile(sum_amount)[2])

Lower bound: quantile(sum_amount)[2] - 1.5 * (quantile(sum_amount)[4] - quantile(sum_amount)[2])

Any sum_amount values that fall outside of these bounds are filtered out. The resulting data frame contains the customer_id and sum_amount for the first 10 rows that meet the filter condition.

R-079: For each item of product data (df_product), check the number of missing items.

In [ ]:

sapply(df_product, function(x) sum(is.na(x)))

product_cd: 0
category_major_cd: 0
category_medium_cd: 0
category_small_cd: 0
unit_price: 7
unit_cost: 7

Commentary :

The code computes the number of missing values in each column of the df_product dataframe using the sapply() function.

sapply() applies a function to each column of the dataframe df_product. Here, the function used is function(x) sum(is.na(x)), which computes the number of missing values (NA) in the input column x. is.na() returns a logical vector of the same length as the input vector with TRUE for any missing value (NA) and FALSE otherwise. sum() is used to sum up the TRUE values in the logical vector, which gives the count of missing values in the column.

The output of the code is a named vector with column names of df_product as names and the corresponding counts of missing values as values.

R-080: Create a new product data, deleting all records where any item in the product data (df_product) is missing. Note that the number of items before and after the deletion should be displayed, and also confirm that the number of items has decreased by the number of items checked in 079.

In [ ]:

df_product_1 <- na.omit(df_product) 
paste("Before deletion:", nrow(df_product)) 
paste("After deletion:", nrow(df_product_1))

'Before deletion: 10030'
'After deletion: 10023'

Commentary :

The first line of code creates a new data frame df_product_1 that removes all rows with missing values (NA) from the original data frame df_product, using the na.omit() function.

The second line of code uses the paste() function to print a message indicating the number of rows in df_product before and after the removal of NAs. Specifically, nrow(df_product) returns the number of rows in df_product, and the first paste() call concatenates this number with the string "Before deletion:". Similarly, nrow(df_product_1) returns the number of rows in the new data frame df_product_1, and the second paste() call concatenates this number with the string "After deletion:".

This code provides a quick way to check the effect of removing NAs on the number of observations in the data frame.

Data Science 100 Knocks (Structured Data Processing) - R

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

Data Science 100 Knocks (Structured Data Processing) - R Part1 (Q1 to Q20)

Data Science 100 Knocks (Structured Data Processing) - R Part2 (Q21 to Q40)

Data Science 100 Knocks (Structured Data Processing) - R Part3 (Q41 to Q60)

Data Science 100 Knocks (Structured Data Processing) - R Part4 (Q61 to Q80)

Data Science 100 Knocks (Structured Data Processing) - R Part5 (Q81 to Q100)

Data Science 100 Knocks (Structured Data Processing) - SQL

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.