参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック（構造化データ加工編）」

The Data Scientist Society Github :

GitHub - The-Japan-DataScientist-Society/100knocks-preprocess: データサイエンス100本ノック（構造化データ加工編）

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Data Science 100 Knocks (Structured Data Processing) URL :

https://github.com/The-Japan-DataScientist-Society/100knocks-preprocess/blob/master/docker/work/answer/ans_preprocess_knock_R.ipynb

Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers.
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

You can get preprocess_knock files to practice python, SQL, R from my github account

ChanhiYasutomi - Repositories

ChanhiYasutomi has 20 repositories available. Follow their code on GitHub.

Introduction

To begin with, run the following cells
Import the required libraries and read the data from the database (PostgreSQL)
The libraries that are expected to be used are imported in the following cells
If there are other libraries you wish to use, install them as appropriate using install.packages().
Names, addresses, etc. are dummy data and are not real.

In [ ]:

require("RPostgreSQL") 
require("tidyr") 
require("dplyr") 
require("stringr") 
require("caret") 
require("lubridate") 
require("rsample") 
require("recipes") 
require("themis") 

host <- "db" 
port <- Sys.getenv()["PG_PORT"] 
dbname <- Sys.getenv()["PG_DATABASE"] 
user <- Sys.getenv()["PG_USER"] 
password <- Sys.getenv()["PG_PASSWORD"] 

con <- dbConnect(PostgreSQL(), host=host, port=port, dbname=dbname, user=user, password=password) 
df_customer <- dbGetQuery(con,"SELECT * FROM customer") 
df_category <- dbGetQuery(con,"SELECT * FROM category") 
df_product <- dbGetQuery(con,"SELECT * FROM product") 
df_receipt <- dbGetQuery(con,"SELECT * FROM receipt") 
df_store <- dbGetQuery(con,"SELECT * FROM store") 
df_geocode <- dbGetQuery(con,"SELECT * FROM geocode")

Loading required package: RPostgreSQL

Loading required package: DBI

Loading required package: tidyr

Loading required package: dplyr


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: stringr

Loading required package: caret

Loading required package: ggplot2

Loading required package: lattice

Loading required package: lubridate


Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union


Loading required package: rsample

Loading required package: recipes


Attaching package: ‘recipes’


The following object is masked from ‘package:stringr’:

    fixed


The following object is masked from ‘package:stats’:

    step


Loading required package: themis

Registered S3 methods overwritten by 'themis':
  method                  from   
  bake.step_downsample    recipes
  bake.step_upsample      recipes
  prep.step_downsample    recipes
  prep.step_upsample      recipes
  tidy.step_downsample    recipes
  tidy.step_upsample      recipes
  tunable.step_downsample recipes
  tunable.step_upsample   recipes


Attaching package: ‘themis’


The following objects are masked from ‘package:recipes’:

    step_downsample, step_upsample

Practice Questions

R-021: Count the number of cases against the receipt details data (df_receipt).

In [ ]:

nrow(df_receipt)

Commentary :

The code "nrow(df_receipt)" is written in R or a similar statistical programming language. It is used to determine the number of rows in a data frame called "df_receipt".

Here's a breakdown of the code:

"nrow()" is a built-in R function that returns the number of rows in a data frame.

"df_receipt" is the name of the data frame whose number of rows we want to determine.

For example, if the data frame "df_receipt" contains 100 rows of data, then running this code would return the value 100. This code is useful for quickly checking the size of a data frame and ensuring that it has the expected number of rows before performing further analysis or manipulating the data in some way.

R-022: Count the number of unique cases against the customer ID (customer_id) of the receipt details data (df_receipt).

In [ ]:

length(unique(df_receipt$customer_id))

Commentary :

The code "length(unique(df_receipt$customer_id))" is written in R or a similar statistical programming language. It is used to determine the number of unique customers in a data frame called "df_receipt".

Here's a breakdown of the code:

"unique()" is an R function that returns a vector containing only the unique elements of a vector. In this case, the vector being passed to the "unique()" function is "df_receipt$customer_id", which is a column in the "df_receipt" data frame that contains customer IDs.

"length()" is another built-in R function that returns the number of elements in a vector. In this case, the vector being passed to "length()" is the output of the "unique()" function, which is a vector containing unique customer IDs.

Putting it all together, the code calculates the number of unique customers in the "df_receipt" data frame by taking the length of a vector of unique customer IDs. For example, if the data frame contains 1000 rows, but only 500 unique customer IDs, then running this code would return the value 500. This code is useful for understanding the number of unique customers in a data set and for calculating customer-level metrics such as customer lifetime value.

R-023: Sum the sales amount (amount) and sales quantity (quantity) for each shop code (store_cd) in the receipt details data (df_receipt).

In [ ]:

df_receipt %>% 
group_by(store_cd) %>% 
summarise(amount = sum(amount), quantity = sum(quantity), .groups = "drop")

A tibble: 52 × 3

store_cd	amount	quantity
<chr>	<int>	<int>
S12007	638761	2099
S12013	787513	2425
S12014	725167	2358
S12029	794741	2555
S12030	684402	2403
S13001	811936	2347
S13002	727821	2340
S13003	764294	2197
S13004	779373	2390
S13005	629876	2004
S13008	809288	2491
S13009	808870	2486
S13015	780873	2248
S13016	793773	2432
S13017	748221	2376
S13018	790535	2562
S13019	827833	2541
S13020	796383	2383
S13031	705968	2336
S13032	790501	2491
S13035	715869	2219
S13037	693087	2344
S13038	708884	2337
S13039	611888	1981
S13041	728266	2233
S13043	587895	1881
S13044	520764	1729
S13051	107452	354
S13052	100314	250
S14006	712839	2284
S14010	790361	2290
S14011	805724	2434
S14012	720600	2412
S14021	699511	2231
S14022	651328	2047
S14023	727630	2258
S14024	736323	2417
S14025	755581	2394
S14026	824537	2503
S14027	714550	2303
S14028	786145	2458
S14033	725318	2282
S14034	653681	2024
S14036	203694	635
S14040	701858	2233
S14042	534689	1935
S14045	458484	1398
S14046	412646	1354
S14047	338329	1041
S14048	234276	769
S14049	230808	788
S14050	167090	580

Commentary :

The code "df_receipt %>% group_by(store_cd) %>% summarise(amount = sum(amount), quantity = sum(quantity), .groups = 'drop')" is written in R or a similar statistical programming language. It is used to group the "df_receipt" data frame by a variable called "store_cd" and then calculate the sum of the "amount" and "quantity" variables within each store group.

Here's a breakdown of the code:

The pipe operator ("%>%") is used to chain together multiple functions in a sequence. It takes the output of the previous function and uses it as the input for the next function. In this case, the pipe operator is used to chain the "group_by()" and "summarise()" functions together.

"group_by()" is an R function that is used to group a data frame by one or more variables. In this case, we are grouping the "df_receipt" data frame by the "store_cd" variable.

"summarise()" is an R function that is used to calculate summary statistics on a data frame. In this case, we are calculating the sum of the "amount" and "quantity" variables within each store group. We are also specifying the names of the new variables using the assignment operator "=".

The ".groups" argument in the "summarise()" function is used to specify how the output should be formatted. Setting it to "drop" will remove the grouping information from the output.

The result of this code will be a new data frame with one row for each store in the "df_receipt" data frame. The "amount" and "quantity" columns will contain the sums of those variables within each store group. This code is useful for calculating store-level metrics such as total sales or units sold.

R-024: Find the most recent sales date (sales_ymd) for each customer ID (customer_id) for the receipt details data (df_receipt) and display the 10 most recent sales dates (sales_ymd).

In [ ]:

df_receipt %>% 
group_by(customer_id) %>% 
summarise(max_ymd = max(sales_ymd), .groups = "drop") %>% 
slice(1:10)

A tibble: 10 × 2

customer_id	max_ymd
<chr>	<int>
CS001113000004	20190308
CS001114000005	20190731
CS001115000010	20190405
CS001205000004	20190625
CS001205000006	20190224
CS001211000025	20190322
CS001212000027	20170127
CS001212000031	20180906
CS001212000046	20170811
CS001212000070	20191018

Commentary :

The code "df_receipt %>% group_by(customer_id) %>% summarise(max_ymd = max(sales_ymd), .groups = 'drop') %>% slice(1:10)" is written in R or a similar statistical programming language. It is used to group the "df_receipt" data frame by a variable called "customer_id" and then find the latest date of purchase for each customer. Finally, it slices the first 10 rows of the resulting data frame.

Here's a breakdown of the code:

The pipe operator ("%>%") is used to chain together multiple functions in a sequence. It takes the output of the previous function and uses it as the input for the next function. In this case, the pipe operator is used to chain the "group_by()", "summarise()", and "slice()" functions together.

"group_by()" is an R function that is used to group a data frame by one or more variables. In this case, we are grouping the "df_receipt" data frame by the "customer_id" variable.

"summarise()" is an R function that is used to calculate summary statistics on a data frame. In this case, we are calculating the latest purchase date for each customer using the "max()" function on the "sales_ymd" variable.

The ".groups" argument in the "summarise()" function is used to specify how the output should be formatted. Setting it to "drop" will remove the grouping information from the output.

"slice()" is an R function that is used to extract a subset of rows from a data frame. In this case, we are selecting the first 10 rows of the resulting data frame.

The result of this code will be a new data frame with one row for each customer in the "df_receipt" data frame. The "max_ymd" column will contain the latest purchase date for each customer. Finally, the code selects the first 10 rows of the resulting data frame. This code is useful for understanding the most recent purchase date for each customer and can be used for various customer-level analysis and segmentation purposes.

R-025: For receipt details data (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display the 10 results.

In [ ]:

df_receipt %>% 
group_by(customer_id) %>% 
summarise(min_ymd = min(sales_ymd), .groups = "drop") %>% 
slice(1:10)

A tibble: 10 × 2

customer_id	min_ymd
<chr>	<int>
CS001113000004	20190308
CS001114000005	20180503
CS001115000010	20171228
CS001205000004	20170914
CS001205000006	20180207
CS001211000025	20190322
CS001212000027	20170127
CS001212000031	20180906
CS001212000046	20170811
CS001212000070	20191018

Commentary :

This code is using the %>% (pipe) operator to chain together a sequence of operations on a data frame called df_receipt.

Here's a step-by-step explanation of what each operation does:

group_by(customer_id): This groups the data by the customer_id column, so that subsequent operations are performed separately for each customer.

summarise(min_ymd = min(sales_ymd), .groups = "drop"): This creates a summary of the data within each group, calculating the minimum value of the sales_ymd column and storing it in a new column called min_ymd. The .groups = "drop" argument tells dplyr to drop the grouping information from the resulting data frame (since we don't need it for further analysis).

slice(1:10): This selects the first 10 rows of the resulting data frame. Since the data was grouped by customer_id and the summary function was used, this will give us the earliest sales_ymd value for the first 10 customers in the data set.

Overall, this code is selecting the earliest sales_ymd value for each customer in the df_receipt data frame, and returning the earliest dates for the first 10 customers in the data set.

R-026: For receipt details data (df_receipt), find the newest sales date (sales_ymd) and the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 cases where both are different.

In [ ]:

df_receipt %>% 
group_by(customer_id) %>% summarise(max_ymd = max(sales_ymd) ,min_ymd = min(sales_ymd), .groups = "drop") %>% 
filter(max_ymd != min_ymd) %>% 
slice(1:10)

A tibble: 10 × 3

customer_id	max_ymd	min_ymd
<chr>	<int>	<int>
CS001114000005	20190731	20180503
CS001115000010	20190405	20171228
CS001205000004	20190625	20170914
CS001205000006	20190224	20180207
CS001214000009	20190902	20170306
CS001214000017	20191006	20180828
CS001214000048	20190929	20171109
CS001214000052	20190617	20180208
CS001215000005	20181021	20170206
CS001215000040	20171022	20170214

Commentary :

This code is using the %>% (pipe) operator to chain together a sequence of operations on a data frame called df_receipt.

Here's a step-by-step explanation of what each operation does:

group_by(customer_id): This groups the data by the customer_id column, so that subsequent operations are performed separately for each customer.

summarise(max_ymd = max(sales_ymd), min_ymd = min(sales_ymd), .groups = "drop"): This creates a summary of the data within each group, calculating the maximum and minimum values of the sales_ymd column and storing them in new columns called max_ymd and min_ymd, respectively. The .groups = "drop" argument tells dplyr to drop the grouping information from the resulting data frame (since we don't need it for further analysis).

filter(max_ymd != min_ymd): This filters the data frame to include only those rows where the max_ymd value is not equal to the min_ymd value. This will exclude any customers who made only a single purchase during the period covered by the data set.

slice(1:10): This selects the first 10 rows of the resulting data frame. Since the data was grouped by customer_id, summarized using the max and min functions, and filtered based on those summary statistics, this will give us the first 10 customers who made more than one purchase during the period covered by the data set.

Overall, this code is selecting the customers who made more than one purchase during the period covered by the df_receipt data frame, and returning the dates of their earliest and latest purchases. The code then returns the earliest and latest purchases for the first 10 customers who meet these criteria.

R-027: For the receipt details data (df_receipt), calculate the average of the sales amount (amount) for each shop code (store_cd) and display the top 5 in descending order.

In [ ]:

df_receipt %>% 
group_by(store_cd) %>% 
summarise(mean_amount = mean(amount), .groups = "drop") %>% 
arrange(desc(mean_amount)) %>% 
slice(1:5)

A tibble: 5 × 2

store_cd	mean_amount
<chr>	<dbl>
S13052	402.8675
S13015	351.1120
S13003	350.9155
S14010	348.7913
S13001	348.4704

Commentary :

This code is using the %>% (pipe) operator to chain together a sequence of operations on a data frame called df_receipt.

Here's a step-by-step explanation of what each operation does:

group_by(store_cd): This groups the data by the store_cd column, so that subsequent operations are performed separately for each store.

summarise(mean_amount = mean(amount), .groups = "drop"): This creates a summary of the data within each group, calculating the mean value of the amount column and storing it in a new column called mean_amount. The .groups = "drop" argument tells dplyr to drop the grouping information from the resulting data frame (since we don't need it for further analysis).

arrange(desc(mean_amount)): This sorts the resulting data frame by the mean_amount column in descending order (from highest to lowest).

slice(1:5): This selects the first 5 rows of the resulting data frame. Since the data was grouped by store_cd and summarized using the mean function, this will give us the stores with the highest mean amount values across all transactions.

Overall, this code is selecting the stores with the highest mean amount values across all transactions in the df_receipt data frame, and returning the top 5 stores based on those values.

R-028: Calculate the median sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display the TOP 5 in descending order.

In [ ]:

df_receipt %>% 
group_by(store_cd) %>% 
summarise(median_amount = median(amount), .groups = "drop") %>% arrange(desc(median_amount)) %>% 
slice(1:5)

A tibble: 5 × 2

store_cd	median_amount
<chr>	<dbl>
S13052	190
S14010	188
S14050	185
S13003	180
S13018	180

Commentary :

This code is using the %>% (pipe) operator to chain together a sequence of operations on a data frame called df_receipt.

Here's a step-by-step explanation of what each operation does:

group_by(store_cd): This groups the data by the store_cd column, so that subsequent operations are performed separately for each store.

summarise(median_amount = median(amount), .groups = "drop"): This creates a summary of the data within each group, calculating the median value of the amount column and storing it in a new column called median_amount. The .groups = "drop" argument tells dplyr to drop the grouping information from the resulting data frame (since we don't need it for further analysis).

arrange(desc(median_amount)): This sorts the resulting data frame by the median_amount column in descending order (from highest to lowest).

slice(1:5): This selects the first 5 rows of the resulting data frame. Since the data was grouped by store_cd and summarized using the median function, this will give us the stores with the highest median amount values across all transactions.

Overall, this code is selecting the stores with the highest median amount values across all transactions in the df_receipt data frame, and returning the top 5 stores based on those values. The difference between this code and the previous one I explained is that this code is using the median function instead of the mean function to calculate the central tendency of the amount values for each store.

R-029: For receipt details data (df_receipt), calculate the mode value of the product code (product_cd) for each shop code (store_cd) and display 10 cases.

In [ ]:

# Code example 1 
df_receipt %>% 
group_by(store_cd, product_cd) %>% 
summarise(cnt = n(), .groups = "drop_last") %>% 
filter(cnt == cnt %>% max()) %>% 
ungroup() %>% 
slice(1:10)

A tibble: 10 × 3

store_cd	product_cd	cnt
<chr>	<chr>	<int>
S12007	P060303001	72
S12013	P060303001	107
S12014	P060303001	65
S12029	P060303001	92
S12030	P060303001	115
S13001	P060303001	67
S13002	P060303001	78
S13003	P071401001	65
S13004	P060303001	88
S13005	P040503001	36

Commentary :

This code is using the dplyr package to perform data manipulation on a data frame called df_receipt. Here is a step-by-step explanation of what the code is doing:

%>% is a pipe operator used to chain functions together, so the output of one function becomes the input of the next function.

The group_by function groups the data frame by store_cd and product_cd.

The summarise function calculates a summary statistic for each group. In this case, it is counting the number of rows in each group and creating a new column called cnt.

The .groups = "drop_last" argument is used to drop the last grouping level created by the summarise function, which is the product_cd grouping. This is done to make it easier to filter on the maximum count later on.

The filter function selects rows that meet a certain condition. In this case, it is selecting rows where the cnt column is equal to the maximum value of the cnt column in the data frame.

The ungroup function removes all grouping levels from the data frame.

The slice function selects a subset of rows based on their position. In this case, it is selecting the first 10 rows of the data frame.

So overall, this code is grouping the data by store and product, calculating the count of each group, filtering for the groups with the highest count, and then selecting the top 10 results.

In [ ]:

# Code example 2: using which.max() (multiple occurrences of mode number, narrowed down to one of them) 
# Display of the number of cases is omitted 

table_product <-table(df_receipt$store_cd,df_receipt$product_cd) 

store <- names(table_product[,1]) 

mode_product <- c() 
for (i in 1:length(store)){ 
     mode_product[i] <- names(which.max(table_product[i,])) 
} 

data.frame(store_cd = store, product_cd = mode_product) %>% 
  slice(1:10)

A data.frame: 10 × 2

store_cd	product_cd
<chr>	<chr>
S12007	P060303001
S12013	P060303001
S12014	P060303001
S12029	P060303001
S12030	P060303001
S13001	P060303001
S13002	P060303001
S13003	P071401001
S13004	P060303001
S13005	P040503001

Commentary :

This code is using the table function and for loop to find the mode (most frequent value) of each product code for each store in the df_receipt data frame.

Here's a step-by-step explanation of what the code is doing:

The table function creates a contingency table of counts for each combination of store_cd and product_cd in df_receipt. The resulting object, table_product, is a matrix where the rows correspond to each store and the columns correspond to each product code.

The names function extracts the store codes from the first column of the table_product matrix and saves them as a character vector called store.

An empty vector called mode_product is created to store the mode of each product code for each store.

The for loop iterates through each store in store and finds the mode of each product code for that store. The which.max function is used to find the index of the maximum count in each row of the table_product matrix, and the names function is used to extract the product code with that index. The resulting product code is stored in the mode_product vector.

A data frame is created with two columns: store_cd and product_cd. The store_cd column is populated with the values in the store vector, and the product_cd column is populated with the values in the mode_product vector.

The %>% operator is used to pipe the data frame to the slice function, which selects the first 10 rows of the data frame.

So overall, this code is finding the mode of each product code for each store in df_receipt, creating a data frame with these values, and then selecting the top 10 results.

R-030: Calculate the variance of the sales amount (amount) for each shop code (store_cd) against the receipt details data (df_receipt) and display five cases in descending order.

In [ ]:

var_sample <- function(x){ var(x) * (length(x) - 1) / length(x) } 

df_receipt %>% 
 group_by(store_cd) %>% 
 summarise(var_amount = var_sample(amount), .groups = "drop") %>%
 arrange(desc(var_amount)) %>% slice(1:5)

A tibble: 5 × 2

store_cd	var_amount
<chr>	<dbl>
S13052	440088.7
S14011	306314.6
S14034	296920.1
S13001	295432.0
S13015	295294.4

Commentary :

This code is using the dplyr package to calculate the sample variance of the amount variable for each store in the df_receipt data frame, and then selecting the top 5 stores with the highest variance.

Here's a step-by-step explanation of what the code is doing:

The var_sample function is defined, which calculates the sample variance of a vector of values x using the formula var(x) * (length(x) - 1) / length(x). This formula adjusts the variance by dividing by length(x) - 1 instead of length(x) to provide an unbiased estimate of the population variance.

The %>% operator is used to pipe the df_receipt data frame to the next function in the chain.

The group_by function is used to group the data by store_cd.

The summarise function calculates a summary statistic for each group. In this case, it is using the var_sample function to calculate the sample variance of the amount variable for each group, and creating a new column called var_amount.

The .groups = "drop" argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.

The arrange function is used to sort the data frame by var_amount in descending order, so that the stores with the highest variance appear first.

The %>% operator is used to pipe the sorted data frame to the slice function, which selects the first 5 rows of the data frame.

So overall, this code is grouping the data by store, calculating the sample variance of the amount variable for each group, sorting the resulting data frame by variance in descending order, and then selecting the top 5 results.

R-031: Calculate the standard deviation of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display 5 cases in descending order.

In [ ]:

var_sample <- function(x){ var(x)*(length(x)-1)/length(x) } 

std_sample <- function(x){ sqrt(var_sample(x)) } 

head( 
     df_receipt %>% 
     group_by(store_cd) %>% 
     summarise(std_amount = std_sample(amount), .groups = "drop") %>%
     arrange(desc(std_amount)) , 
n = 5
)

A tibble: 5 × 2

store_cd	std_amount
<chr>	<dbl>
S13052	663.3918
S14011	553.4569
S14034	544.9037
S13001	543.5366
S13015	543.4099

Commentary :

This code is using the dplyr package to calculate the sample standard deviation of the amount variable for each store in the df_receipt data frame, and then selecting the top 5 stores with the highest standard deviation.

Here's a step-by-step explanation of what the code is doing:

The var_sample function is defined, which calculates the sample variance of a vector of values x using the formula var(x) * (length(x) - 1) / length(x). This formula adjusts the variance by dividing by length(x) - 1 instead of length(x) to provide an unbiased estimate of the population variance.

The std_sample function is defined, which calculates the sample standard deviation of a vector of values x using the formula sqrt(var_sample(x)).

The head function is used to select the first 5 rows of the resulting data frame.

The %>% operator is used to pipe the df_receipt data frame to the next function in the chain.

The group_by function is used to group the data by store_cd.

The summarise function calculates a summary statistic for each group. In this case, it is using the std_sample function to calculate the sample standard deviation of the amount variable for each group, and creating a new column called std_amount.

The .groups = "drop" argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.

The arrange function is used to sort the data frame by std_amount in descending order, so that the stores with the highest standard deviation appear first.

The n = 5 argument is used to limit the output to the top 5 rows.

So overall, this code is grouping the data by store, calculating the sample standard deviation of the amount variable for each group, sorting the resulting data frame by standard deviation in descending order, and then selecting the top 5 results.

R-032: Find the percentile values for the sales amount (amount) in the receipt details data (df_receipt) in 25% increments.

In [ ]:

df_receipt %>% 
 summarise(amount_25per = quantile(amount, 0.25), 
           amount_50per = quantile(amount, 0.5), 
           amount_75per = quantile(amount, 0.75), 
           amount_100per = quantile(amount, 1.0))

A data.frame: 1 × 4

amount_25per	amount_50per	amount_75per	amount_100per
<dbl>	<dbl>	<dbl>	<dbl>
102	170	288	10925

Commentary :

This code is using the dplyr package to calculate the quartiles of the amount variable in the df_receipt data frame.

Here's a step-by-step explanation of what the code is doing:

The %>% operator is used to pipe the df_receipt data frame to the next function in the chain.

The summarise function is used to calculate summary statistics for the data frame. In this case, it is using the quantile function to calculate the 25th, 50th (median), 75th, and 100th percentiles of the amount variable.

The resulting data frame will have 1 row and 4 columns, with the column names amount_25per, amount_50per, amount_75per, and amount_100per, and the corresponding values of the quartiles.

So overall, this code is calculating the quartiles of the amount variable in the df_receipt data frame, and returning a summary data frame with the quartile values.

R-033: Calculate the average of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and extract those that are 330 or more.

In [ ]:

df_receipt %>% 
 group_by(store_cd) %>% 
 summarise(mean_amount = mean(amount), .groups = "drop") %>% 
 filter(mean_amount >= 330)

A tibble: 13 × 2

store_cd	mean_amount
<chr>	<dbl>
S12013	330.1941
S13001	348.4704
S13003	350.9155
S13004	330.9439
S13015	351.1120
S13019	330.2086
S13020	337.8799
S13052	402.8675
S14010	348.7913
S14011	335.7183
S14026	332.3406
S14045	330.0821
S14047	330.0771

Commentary :

This code is using the dplyr package to group the df_receipt data frame by store_cd, calculate the mean amount for each group, and then filter the resulting data frame to include only those groups where the mean amount is greater than or equal to 330.

Here's a step-by-step explanation of what the code is doing:

The %>% operator is used to pipe the df_receipt data frame to the next function in the chain.

The group_by function is used to group the data by store_cd.

The summarise function is used to calculate a summary statistic for each group. In this case, it is using the mean function to calculate the mean amount for each group, and creating a new column called mean_amount.

The .groups = "drop" argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.

The filter function is used to subset the data frame to only include rows where the mean_amount is greater than or equal to 330.

So overall, this code is grouping the data by store_cd, calculating the mean amount for each group, and then selecting only those groups where the mean amount is greater than or equal to 330. The resulting data frame will include one row for each qualifying store, with the store code and mean amount.

R-034: For the receipt details data (df_receipt), sum the amount of sales (amount) for each customer ID (customer_id) and find the average for all customers. However, exclude customer IDs starting with “Z” as they represent non-members.

In [ ]:

df_mean <- df_receipt %>% 
 filter(!grepl("^Z", customer_id)) %>% 
 group_by(customer_id) %>% 
 summarise(sum_amount = sum(amount), .groups = "drop") %>% 
 summarise(mean_amount = mean(sum_amount)) 

df_mean$mean_amount

2547.74223452926

Commentary :

This code is using the dplyr package to calculate the mean amount spent by each customer, after excluding any customer whose customer_id starts with the letter "Z".

Here's a step-by-step explanation of what the code is doing:

The %>% operator is used to pipe the df_receipt data frame to the next function in the chain.

The filter function is used to exclude any rows where the customer_id starts with the letter "Z".

The group_by function is used to group the data by customer_id.

The summarise function is used to calculate a summary statistic for each group. In this case, it is using the sum function to calculate the total amount spent by each customer, and creating a new column called sum_amount.

The .groups = "drop" argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.

Another summarise function is used to calculate the mean sum_amount across all customers.

The resulting mean_amount value is extracted using $mean_amount.

So overall, this code is filtering the data to exclude any rows where the customer_id starts with "Z", grouping the data by customer_id, calculating the total amount spent by each customer, calculating the mean sum_amount across all customers, and returning the resulting mean value. The resulting output is a single numeric value representing the mean amount spent by each customer.

R-035: For the receipt details data (df_receipt), sum the sales amount (amount) for each customer ID (customer_id) to obtain the average of all customers, extract the customers who spend more than the average and display 10 items. However, exclude customer IDs starting with “Z” as they represent non-members.

In [ ]:

df_sum <- df_receipt %>% 
 filter(!grepl("^Z", customer_id)) %>% 
 group_by(customer_id) %>% 
 summarise(sum_amount = sum(amount), .groups = "drop") 

df_mean <- df_sum %>% 
 summarise(mean_amount = mean(sum_amount)) 

df_sum %>% 
 filter(sum_amount >= df_mean$mean_amount) %>% 
 slice(1:10)

A tibble: 10 × 2

customer_id	sum_amount
<chr>	<int>
CS001115000010	3044
CS001205000006	3337
CS001214000009	4685
CS001214000017	4132
CS001214000052	5639
CS001215000040	3496
CS001304000006	3726
CS001305000005	3485
CS001305000011	4370
CS001315000180	3300

Commentary :

This code is using the dplyr package to filter the customers who have spent an amount greater than or equal to the mean amount spent by all customers, after excluding any customer whose customer_id starts with the letter "Z". It then selects the top 10 customers based on their total amount spent.

Here's a step-by-step explanation of what the code is doing:

The %>% operator is used to pipe the df_receipt data frame to the next function in the chain.

The filter function is used to exclude any rows where the customer_id starts with the letter "Z".

The group_by function is used to group the data by customer_id.

The summarise function is used to calculate a summary statistic for each group. In this case, it is using the sum function to calculate the total amount spent by each customer, and creating a new column called sum_amount.

The .groups = "drop" argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.

The df_sum data frame is created by storing the output of the above operations.

The summarise function is used again to calculate the mean sum_amount across all customers in df_sum.

The resulting mean_amount value is extracted using $mean_amount.

The %>% operator is used to pipe df_sum to the next function in the chain.

The filter function is used to select only those rows where the sum_amount is greater than or equal to the mean_amount calculated in step 7.

The slice function is used to select only the top 10 rows of the resulting data frame, based on their order in the original data frame.

So overall, this code is filtering the data to exclude any rows where the customer_id starts with "Z", grouping the data by customer_id, calculating the total amount spent by each customer, calculating the mean sum_amount across all customers, creating a data frame of customers whose total amount spent is greater than or equal to the mean amount spent by all customers, and selecting the top 10 customers based on their total amount spent.

R-036: Combine receipt details data (df_receipt) and shop data (df_store) internally and display all items of receipt details data and store name (store_name) of shop data for 10 items.

In [ ]:

inner_join(df_receipt, df_store[c("store_cd", "store_name")], by = "store_cd") %>% slice(1:10)

A data.frame: 10 × 10

sales_ymd	sales_epoch	store_cd	receipt_no	receipt_sub_no	customer_id	product_cd	quantity	amount	store_name
<int>	<int>	<chr>	<int>	<int>	<chr>	<chr>	<int>	<int>	<chr>
20181103	1541203200	S14006	112	1	CS006214000001	P070305012	1	158	葛が谷店
20181118	1542499200	S13008	1132	2	CS008415000097	P070701017	1	81	成城店
20170712	1499817600	S14028	1102	1	CS028414000014	P060101005	1	170	二ツ橋店
20190205	1549324800	S14042	1132	1	ZZ000000000000	P050301001	1	25	新山下店
20180821	1534809600	S14025	1102	2	CS025415000050	P060102007	1	90	大和店
20190605	1559692800	S13003	1112	1	CS003515000195	P050102002	1	138	狛江店
20181205	1543968000	S14024	1102	2	CS024514000042	P080101005	1	30	三田店
20190922	1569110400	S14040	1102	1	CS040415000178	P070501004	1	128	長津田店
20170504	1493856000	S13020	1112	2	ZZ000000000000	P071302010	1	770	十条仲原店
20191010	1570665600	S14027	1102	1	CS027514000015	P071101003	1	680	南藤沢店

Commentary :

This code performs an inner join operation between two data frames, df_receipt and df_store, based on a common column "store_cd". The result of this operation is then piped into the slice() function to select the first 10 rows of the resulting data frame.

Let's break down the code step by step:

inner_join(df_receipt, df_store[c("store_cd", "store_name")], by = "store_cd"): This performs an inner join between the df_receipt data frame and a subset of df_store that only contains the columns "store_cd" and "store_name". The join is done on the "store_cd" column which is present in both data frames. The result is a new data frame that contains all the rows where there is a match between the "store_cd" column in both data frames.

%>%: This is the pipe operator which takes the output of the previous command and pipes it as input to the next command.

slice(1:10): This function selects the first 10 rows of the resulting data frame from the previous command.

So, the overall purpose of this code is to select the first 10 rows of the result of an inner join operation between df_receipt and a subset of df_store. The resulting data frame will only contain rows where there is a match between the "store_cd" column in both data frames, and it will have the columns "store_cd" and "store_name" from df_store, along with all the columns from df_receipt.

R-037: Join product data (df_product) and category data (df_category) internally and display all items of the product data and 10 category sub-category names (category_small_name) of the category data.

In [ ]:

inner_join(df_product, df_category[c("category_small_cd", "category_small_name")], by = "category_small_cd") %>% slice(1:10)

A data.frame: 10 × 7

product_cd	category_major_cd	category_medium_cd	category_small_cd	unit_price	unit_cost	category_small_name
<chr>	<chr>	<chr>	<chr>	<int>	<int>	<chr>
P040101001	04	0401	040101	198	149	弁当類
P040101002	04	0401	040101	218	164	弁当類
P040101003	04	0401	040101	230	173	弁当類
P040101004	04	0401	040101	248	186	弁当類
P040101005	04	0401	040101	268	201	弁当類
P040101006	04	0401	040101	298	224	弁当類
P040101007	04	0401	040101	338	254	弁当類
P040101008	04	0401	040101	420	315	弁当類
P040101009	04	0401	040101	498	374	弁当類
P040101010	04	0401	040101	580	435	弁当類

Commentary :

This code performs an inner join operation between two data frames, df_product and df_category, based on a common column "category_small_cd". The result of this operation is then piped into the slice() function to select the first 10 rows of the resulting data frame.

Let's break down the code step by step:

inner_join(df_product, df_category[c("category_small_cd", "category_small_name")], by = "category_small_cd"): This performs an inner join between the df_product data frame and a subset of df_category that only contains the columns "category_small_cd" and "category_small_name". The join is done on the "category_small_cd" column which is present in both data frames. The result is a new data frame that contains all the rows where there is a match between the "category_small_cd" column in both data frames.

%>%: This is the pipe operator which takes the output of the previous command and pipes it as input to the next command.

slice(1:10): This function selects the first 10 rows of the resulting data frame from the previous command.

So, the overall purpose of this code is to select the first 10 rows of the result of an inner join operation between df_product and a subset of df_category. The resulting data frame will only contain rows where there is a match between the "category_small_cd" column in both data frames, and it will have the columns "category_small_cd" and "category_small_name" from df_category, along with all the columns from df_product.

R-038: Calculate the total sales amount for each customer from the customer data (df_customer) and receipt details data (df_receipt) and display 10 items. However, for customers with no sales records, the sales amount should be displayed as 0. Customers whose gender code (gender_cd) is female (1) should be included, and non-members (whose customer ID starts with “Z”) should be excluded.

In [ ]:

df_sum <- df_receipt %>% 
 group_by(customer_id) %>% 
 summarise(sum_amount = sum(amount), .groups = "drop") 

df_target <- df_customer %>% 
 filter(gender_cd == "1" & !grepl("^Z", customer_id)) 

left_join(df_target["customer_id"], df_sum, by = "customer_id") %>% 
 replace_na(list(sum_amount = 0)) %>% slice(1:10)

A data.frame: 10 × 2

customer_id	sum_amount
<chr>	<dbl>
CS021313000114	0
CS031415000172	5088
CS028811000001	0
CS001215000145	875
CS015414000103	3122
CS033513000180	868
CS035614000014	0
CS011215000048	3444
CS009413000079	0
CS040412000191	210

Commentary :

This code performs several operations to obtain a summary table for the customer data. The result of this operation is then piped into the replace_na() function to replace any missing values with a default value of 0. Finally, the result is piped into the slice() function to select the first 10 rows of the resulting data frame.

Let's break down the code step by step:

df_sum <- df_receipt %>% group_by(customer_id) %>% summarise(sum_amount = sum(amount), .groups = "drop"): This code first groups the df_receipt data frame by the "customer_id" column, and then summarizes the "amount" column using the sum() function. The resulting data frame, df_sum, contains the total amount spent by each customer. The .groups parameter is set to "drop" to remove the grouping information from the output.

df_target <- df_customer %>% filter(gender_cd == "1" & !grepl("^Z", customer_id)): This code filters the df_customer data frame to only include rows where the "gender_cd" column equals "1" and the "customer_id" column does not start with the letter "Z". The resulting data frame, df_target, contains only male customers whose IDs do not start with "Z".

left_join(df_target["customer_id"], df_sum, by = "customer_id"): This performs a left join operation between df_target["customer_id"] and df_sum based on the "customer_id" column. The result is a new data frame that contains all the rows from df_target along with the corresponding total amount spent by each customer from df_sum.

replace_na(list(sum_amount = 0)): This replaces any missing values in the "sum_amount" column with a default value of 0.

%>% slice(1:10): This function selects the first 10 rows of the resulting data frame from the previous command.

So, the overall purpose of this code is to obtain a summary table of the total amount spent by male customers whose IDs do not start with the letter "Z". The resulting data frame will have two columns, "customer_id" and "sum_amount", where "sum_amount" is the total amount spent by each customer. Any missing values in the "sum_amount" column will be replaced with 0, and the resulting data frame will be limited to the first 10 rows.

R-039: From the receipt details data (df_receipt), create a data set containing the top 20 customers with the highest number of days of sales and a data set containing the top 20 customers with the highest total sales value, respectively, and furthermore, merge these two completely externally. However, exclude non-members (whose customer ID starts with “Z”).

In [ ]:

# Code example 1 
df_data <- df_receipt %>% 
 filter(!grepl("^Z", customer_id)) %>% 
 group_by(customer_id) 

df_cnt <- df_data %>% 
 summarise(come_days = n_distinct(sales_ymd), .groups = "drop") %>%
 arrange(desc(come_days), customer_id) %>% 
 slice(1:20) 

df_sum <- df_data %>% summarise(sum_amount = sum(amount), .groups = "drop") %>%
 arrange(desc(sum_amount)) %>% 
 slice(1:20) 

full_join(df_cnt, df_sum, by = "customer_id")

A tibble: 34 × 3

customer_id	come_days	sum_amount
<chr>	<int>	<int>
CS040214000008	23	NA
CS010214000010	22	18585
CS015415000185	22	20153
CS010214000002	21	NA
CS028415000007	21	19127
CS016415000141	20	18372
CS017415000097	20	23086
CS014214000023	19	NA
CS021514000045	19	NA
CS021515000172	19	NA
CS022515000226	19	NA
CS031414000051	19	19202
CS039414000052	19	NA
CS007515000107	18	NA
CS014415000077	18	NA
CS021515000056	18	NA
CS021515000211	18	NA
CS022515000028	18	NA
CS030214000008	18	NA
CS031414000073	18	NA
CS001605000009	NA	18925
CS006515000023	NA	18372
CS011414000106	NA	18338
CS038415000104	NA	17847
CS035414000024	NA	17615
CS021515000089	NA	17580
CS032414000072	NA	16563
CS016415000101	NA	16348
CS011415000006	NA	16094
CS034415000047	NA	16083
CS007514000094	NA	15735
CS009414000059	NA	15492
CS030415000034	NA	15468
CS015515000034	NA	15300

Commentary :

This code performs several operations to obtain two summary tables for the customer data, and then performs a full join operation between these two tables based on the "customer_id" column. The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.

Let's break down the code step by step:

df_data <- df_receipt %>% filter(!grepl("^Z", customer_id)) %>% group_by(customer_id): This code first filters the df_receipt data frame to exclude any rows where the "customer_id" column starts with the letter "Z". The resulting data frame is then grouped by the "customer_id" column. The resulting df_data data frame contains all the rows from df_receipt with "Z" customer IDs removed, and is grouped by "customer_id".

df_cnt <- df_data %>% summarise(come_days = n_distinct(sales_ymd), .groups = "drop") %>% arrange(desc(come_days), customer_id) %>% slice(1:20): This code first summarizes the df_data data frame by counting the number of distinct "sales_ymd" values for each customer using the n_distinct() function. The resulting data frame, df_cnt, contains the total number of unique sales days for each customer. The data frame is then arranged in descending order of "come_days" followed by "customer_id". The first 20 rows of the resulting data frame are then selected using the slice() function.

df_sum <- df_data %>% summarise(sum_amount = sum(amount), .groups = "drop") %>% arrange(desc(sum_amount)) %>% slice(1:20): This code summarizes the df_data data frame by calculating the sum of the "amount" column for each customer using the sum() function. The resulting data frame, df_sum, contains the total amount spent by each customer. The data frame is then arranged in descending order of "sum_amount", and the first 20 rows of the resulting data frame are selected using the slice() function.

full_join(df_cnt, df_sum, by = "customer_id"): This code performs a full join operation between df_cnt and df_sum based on the "customer_id" column. The result is a new data frame that contains all the rows from both data frames, matched by "customer_id". The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.

So, the overall purpose of this code is to obtain two summary tables for the customer data, and then perform a full join operation between these two tables. The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.

In [ ]:

# Code example 2
# Code example using slice_max. 
# Rewrite with_ties = FALSE to with_ties = TRUE to change the form to include the same rank 

df_data <- df_receipt %>% 
 filter(!grepl("^Z", customer_id)) %>% 
 group_by(customer_id) 

df_cnt <- df_data %>% 
 summarise(come_days = n_distinct(sales_ymd)) %>% 
 slice_max(come_days, n = 20, with_ties = FALSE) 

df_sum <- df_data %>% 
 summarise(sum_amount = sum(amount)) %>% 
 slice_max(sum_amount, n = 20, with_ties = FALSE) 

full_join(df_cnt, df_sum, by = "customer_id")

A tibble: 34 × 3

customer_id	come_days	sum_amount
<chr>	<int>	<int>
CS040214000008	23	NA
CS010214000010	22	18585
CS015415000185	22	20153
CS010214000002	21	NA
CS028415000007	21	19127
CS016415000141	20	18372
CS017415000097	20	23086
CS014214000023	19	NA
CS021514000045	19	NA
CS021515000172	19	NA
CS022515000226	19	NA
CS031414000051	19	19202
CS039414000052	19	NA
CS007515000107	18	NA
CS014415000077	18	NA
CS021515000056	18	NA
CS021515000211	18	NA
CS022515000028	18	NA
CS030214000008	18	NA
CS031414000073	18	NA
CS001605000009	NA	18925
CS006515000023	NA	18372
CS011414000106	NA	18338
CS038415000104	NA	17847
CS035414000024	NA	17615
CS021515000089	NA	17580
CS032414000072	NA	16563
CS016415000101	NA	16348
CS011415000006	NA	16094
CS034415000047	NA	16083
CS007514000094	NA	15735
CS009414000059	NA	15492
CS030415000034	NA	15468
CS015515000034	NA	15300

Commentary :

This code is similar to the previous code we discussed, but it uses the slice_max() function instead of the arrange() function to obtain the top 20 customers based on the number of distinct sales days and the total amount spent.

Here is a breakdown of the code:

df_data <- df_receipt %>% filter(!grepl("^Z", customer_id)) %>% group_by(customer_id): This code is similar to the previous code, where it filters the df_receipt data frame to exclude any rows where the "customer_id" column starts with the letter "Z". The resulting data frame is then grouped by the "customer_id" column.

df_cnt <- df_data %>% summarise(come_days = n_distinct(sales_ymd)) %>% slice_max(come_days, n = 20, with_ties = FALSE): This code summarizes the df_data data frame by counting the number of distinct "sales_ymd" values for each customer using the n_distinct() function. The resulting data frame, df_cnt, contains the total number of unique sales days for each customer. The slice_max() function is used to select the top 20 rows based on the "come_days" column, where with_ties = FALSE specifies that only the top 20 distinct values should be selected.

df_sum <- df_data %>% summarise(sum_amount = sum(amount)) %>% slice_max(sum_amount, n = 20, with_ties = FALSE): This code summarizes the df_data data frame by calculating the sum of the "amount" column for each customer using the sum() function. The resulting data frame, df_sum, contains the total amount spent by each customer. The slice_max() function is used to select the top 20 rows based on the "sum_amount" column, where with_ties = FALSE specifies that only the top 20 distinct values should be selected.

full_join(df_cnt, df_sum, by = "customer_id"): This code performs a full join operation between df_cnt and df_sum based on the "customer_id" column. The result is a new data frame that contains all the rows from both data frames, matched by "customer_id". The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.

So, the overall purpose of this code is to obtain two summary tables for the customer data, and then perform a full join operation between these two tables. The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent, obtained using the slice_max() function instead of the arrange() function.

R-040: You want to create data combining all shops and all products. Direct product the shop data (df_store) and the product data (df_product) and calculate the number of cases.

In [ ]:

df_store_tmp <- df_store 

df_product_tmp <- df_product 

df_store_tmp["key"] <- 0 

df_product_tmp["key"] <- 0 

nrow(full_join(df_store_tmp, df_product_tmp, by = "key"))

Commentary :

This code performs a full join operation between the df_store and df_product data frames based on a common "key" column that is added to both data frames. The resulting data frame is then used to calculate the number of rows using the nrow() function.

Here is a breakdown of the code:

df_store_tmp <- df_store: This code creates a new data frame called df_store_tmp that is a copy of the df_store data frame. This is done to avoid modifying the original data frame.

df_product_tmp <- df_product: This code creates a new data frame called df_product_tmp that is a copy of the df_product data frame. This is done to avoid modifying the original data frame.

df_store_tmp["key"] <- 0: This code adds a new column called "key" to the df_store_tmp data frame and initializes all the values to zero. This column is added to provide a common column for the join operation.

df_product_tmp["key"] <- 0: This code adds a new column called "key" to the df_product_tmp data frame and initializes all the values to zero. This column is added to provide a common column for the join operation.

full_join(df_store_tmp, df_product_tmp, by = "key"): This code performs a full join operation between the df_store_tmp and df_product_tmp data frames based on the common "key" column. Since both data frames have the same value in this column for all rows, a full join will result in a Cartesian product of the two data frames, meaning that every row from df_store_tmp will be paired with every row from df_product_tmp.

nrow(full_join(df_store_tmp, df_product_tmp, by = "key")): This code calculates the number of rows in the resulting data frame from the full join operation using the nrow() function. The result is the total number of combinations of rows between df_store and df_product.

In summary, this code creates copies of the df_store and df_product data frames, adds a common "key" column to both data frames with all values set to zero, performs a full join operation between the two data frames based on the common "key" column, and calculates the number of resulting rows. The purpose of this code is to determine the total number of combinations of rows between the df_store and df_product data frames.

Data Science 100 Knocks (Structured Data Processing) - R

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

Data Science 100 Knocks (Structured Data Processing) - R Part1 (Q1 to Q20)

Data Science 100 Knocks (Structured Data Processing) - R Part2 (Q21 to Q40)

Data Science 100 Knocks (Structured Data Processing) - R Part3 (Q41 to Q60)

Data Science 100 Knocks (Structured Data Processing) - R Part4 (Q61 to Q80)

Data Science 100 Knocks (Structured Data Processing) - R Part5 (Q81 to Q100)

Data Science 100 Knocks (Structured Data Processing) - SQL

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.