参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック（構造化データ加工編）」
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :

Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.

The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers.

Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
You can get preprocess_knock files to practice python, SQL, R from my github account
Introduction
 To begin with, run the following cells
 Import the required libraries and read the data from the database (PostgreSQL)
 The libraries that are expected to be used are imported in the following cells
 If there are other libraries you wish to use, install them as appropriate using install.packages().
 Names, addresses, etc. are dummy data and are not real.
require("RPostgreSQL")
require("tidyr")
require("dplyr")
require("stringr")
require("caret")
require("lubridate")
require("rsample")
require("recipes")
require("themis")
host < "db"
port < Sys.getenv()["PG_PORT"]
dbname < Sys.getenv()["PG_DATABASE"]
user < Sys.getenv()["PG_USER"]
password < Sys.getenv()["PG_PASSWORD"]
con < dbConnect(PostgreSQL(), host=host, port=port, dbname=dbname, user=user, password=password)
df_customer < dbGetQuery(con,"SELECT * FROM customer")
df_category < dbGetQuery(con,"SELECT * FROM category")
df_product < dbGetQuery(con,"SELECT * FROM product")
df_receipt < dbGetQuery(con,"SELECT * FROM receipt")
df_store < dbGetQuery(con,"SELECT * FROM store")
df_geocode < dbGetQuery(con,"SELECT * FROM geocode")
Loading required package: RPostgreSQL Loading required package: DBI Loading required package: tidyr Loading required package: dplyr Attaching package: ‘dplyr’ The following objects are masked from ‘package:stats’: filter, lag The following objects are masked from ‘package:base’: intersect, setdiff, setequal, union Loading required package: stringr Loading required package: caret Loading required package: ggplot2 Loading required package: lattice Loading required package: lubridate Attaching package: ‘lubridate’ The following objects are masked from ‘package:base’: date, intersect, setdiff, union Loading required package: rsample Loading required package: recipes Attaching package: ‘recipes’ The following object is masked from ‘package:stringr’: fixed The following object is masked from ‘package:stats’: step Loading required package: themis Registered S3 methods overwritten by 'themis': method from bake.step_downsample recipes bake.step_upsample recipes prep.step_downsample recipes prep.step_upsample recipes tidy.step_downsample recipes tidy.step_upsample recipes tunable.step_downsample recipes tunable.step_upsample recipes Attaching package: ‘themis’ The following objects are masked from ‘package:recipes’: step_downsample, step_upsample
Practice Questions
R021: Count the number of cases against the receipt details data (df_receipt).
nrow(df_receipt)
104681
Commentary :
The code "nrow(df_receipt)" is written in R or a similar statistical programming language. It is used to determine the number of rows in a data frame called "df_receipt".
Here's a breakdown of the code:
"nrow()" is a builtin R function that returns the number of rows in a data frame.
"df_receipt" is the name of the data frame whose number of rows we want to determine.
For example, if the data frame "df_receipt" contains 100 rows of data, then running this code would return the value 100. This code is useful for quickly checking the size of a data frame and ensuring that it has the expected number of rows before performing further analysis or manipulating the data in some way.
R022: Count the number of unique cases against the customer ID (customer_id) of the receipt details data (df_receipt).
length(unique(df_receipt$customer_id))
8307
Commentary :
The code "length(unique(df_receipt$customer_id))" is written in R or a similar statistical programming language. It is used to determine the number of unique customers in a data frame called "df_receipt".
Here's a breakdown of the code:
"unique()" is an R function that returns a vector containing only the unique elements of a vector. In this case, the vector being passed to the "unique()" function is "df_receipt$customer_id", which is a column in the "df_receipt" data frame that contains customer IDs.
"length()" is another builtin R function that returns the number of elements in a vector. In this case, the vector being passed to "length()" is the output of the "unique()" function, which is a vector containing unique customer IDs.
Putting it all together, the code calculates the number of unique customers in the "df_receipt" data frame by taking the length of a vector of unique customer IDs. For example, if the data frame contains 1000 rows, but only 500 unique customer IDs, then running this code would return the value 500. This code is useful for understanding the number of unique customers in a data set and for calculating customerlevel metrics such as customer lifetime value.
R023: Sum the sales amount (amount) and sales quantity (quantity) for each shop code (store_cd) in the receipt details data (df_receipt).
df_receipt %>%
group_by(store_cd) %>%
summarise(amount = sum(amount), quantity = sum(quantity), .groups = "drop")
store_cd  amount  quantity 

<chr>  <int>  <int> 
S12007  638761  2099 
S12013  787513  2425 
S12014  725167  2358 
S12029  794741  2555 
S12030  684402  2403 
S13001  811936  2347 
S13002  727821  2340 
S13003  764294  2197 
S13004  779373  2390 
S13005  629876  2004 
S13008  809288  2491 
S13009  808870  2486 
S13015  780873  2248 
S13016  793773  2432 
S13017  748221  2376 
S13018  790535  2562 
S13019  827833  2541 
S13020  796383  2383 
S13031  705968  2336 
S13032  790501  2491 
S13035  715869  2219 
S13037  693087  2344 
S13038  708884  2337 
S13039  611888  1981 
S13041  728266  2233 
S13043  587895  1881 
S13044  520764  1729 
S13051  107452  354 
S13052  100314  250 
S14006  712839  2284 
S14010  790361  2290 
S14011  805724  2434 
S14012  720600  2412 
S14021  699511  2231 
S14022  651328  2047 
S14023  727630  2258 
S14024  736323  2417 
S14025  755581  2394 
S14026  824537  2503 
S14027  714550  2303 
S14028  786145  2458 
S14033  725318  2282 
S14034  653681  2024 
S14036  203694  635 
S14040  701858  2233 
S14042  534689  1935 
S14045  458484  1398 
S14046  412646  1354 
S14047  338329  1041 
S14048  234276  769 
S14049  230808  788 
S14050  167090  580 
Commentary :
The code "df_receipt %>% group_by(store_cd) %>% summarise(amount = sum(amount), quantity = sum(quantity), .groups = 'drop')" is written in R or a similar statistical programming language. It is used to group the "df_receipt" data frame by a variable called "store_cd" and then calculate the sum of the "amount" and "quantity" variables within each store group.
Here's a breakdown of the code:
The pipe operator ("%>%") is used to chain together multiple functions in a sequence. It takes the output of the previous function and uses it as the input for the next function. In this case, the pipe operator is used to chain the "group_by()" and "summarise()" functions together.
"group_by()" is an R function that is used to group a data frame by one or more variables. In this case, we are grouping the "df_receipt" data frame by the "store_cd" variable.
"summarise()" is an R function that is used to calculate summary statistics on a data frame. In this case, we are calculating the sum of the "amount" and "quantity" variables within each store group. We are also specifying the names of the new variables using the assignment operator "=".
The ".groups" argument in the "summarise()" function is used to specify how the output should be formatted. Setting it to "drop" will remove the grouping information from the output.
The result of this code will be a new data frame with one row for each store in the "df_receipt" data frame. The "amount" and "quantity" columns will contain the sums of those variables within each store group. This code is useful for calculating storelevel metrics such as total sales or units sold.
R024: Find the most recent sales date (sales_ymd) for each customer ID (customer_id) for the receipt details data (df_receipt) and display the 10 most recent sales dates (sales_ymd).
df_receipt %>%
group_by(customer_id) %>%
summarise(max_ymd = max(sales_ymd), .groups = "drop") %>%
slice(1:10)
customer_id  max_ymd 

<chr>  <int> 
CS001113000004  20190308 
CS001114000005  20190731 
CS001115000010  20190405 
CS001205000004  20190625 
CS001205000006  20190224 
CS001211000025  20190322 
CS001212000027  20170127 
CS001212000031  20180906 
CS001212000046  20170811 
CS001212000070  20191018 
Commentary :
The code "df_receipt %>% group_by(customer_id) %>% summarise(max_ymd = max(sales_ymd), .groups = 'drop') %>% slice(1:10)" is written in R or a similar statistical programming language. It is used to group the "df_receipt" data frame by a variable called "customer_id" and then find the latest date of purchase for each customer. Finally, it slices the first 10 rows of the resulting data frame.
Here's a breakdown of the code:
The pipe operator ("%>%") is used to chain together multiple functions in a sequence. It takes the output of the previous function and uses it as the input for the next function. In this case, the pipe operator is used to chain the "group_by()", "summarise()", and "slice()" functions together.
"group_by()" is an R function that is used to group a data frame by one or more variables. In this case, we are grouping the "df_receipt" data frame by the "customer_id" variable.
"summarise()" is an R function that is used to calculate summary statistics on a data frame. In this case, we are calculating the latest purchase date for each customer using the "max()" function on the "sales_ymd" variable.
The ".groups" argument in the "summarise()" function is used to specify how the output should be formatted. Setting it to "drop" will remove the grouping information from the output.
"slice()" is an R function that is used to extract a subset of rows from a data frame. In this case, we are selecting the first 10 rows of the resulting data frame.
The result of this code will be a new data frame with one row for each customer in the "df_receipt" data frame. The "max_ymd" column will contain the latest purchase date for each customer. Finally, the code selects the first 10 rows of the resulting data frame. This code is useful for understanding the most recent purchase date for each customer and can be used for various customerlevel analysis and segmentation purposes.
R025: For receipt details data (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display the 10 results.
df_receipt %>%
group_by(customer_id) %>%
summarise(min_ymd = min(sales_ymd), .groups = "drop") %>%
slice(1:10)
customer_id  min_ymd 

<chr>  <int> 
CS001113000004  20190308 
CS001114000005  20180503 
CS001115000010  20171228 
CS001205000004  20170914 
CS001205000006  20180207 
CS001211000025  20190322 
CS001212000027  20170127 
CS001212000031  20180906 
CS001212000046  20170811 
CS001212000070  20191018 
Commentary :
This code is using the%>%
(pipe) operator to chain together a sequence of operations on a data frame calleddf_receipt
.
Here's a stepbystep explanation of what each operation does:group_by(customer_id)
: This groups the data by thecustomer_id
column, so that subsequent operations are performed separately for each customer.summarise(min_ymd = min(sales_ymd), .groups = "drop")
: This creates a summary of the data within each group, calculating the minimum value of thesales_ymd
column and storing it in a new column calledmin_ymd
. The.groups = "drop"
argument tellsdplyr
to drop the grouping information from the resulting data frame (since we don't need it for further analysis).slice(1:10)
: This selects the first 10 rows of the resulting data frame. Since the data was grouped bycustomer_id
and the summary function was used, this will give us the earliestsales_ymd
value for the first 10 customers in the data set.
Overall, this code is selecting the earliestsales_ymd
value for each customer in thedf_receipt
data frame, and returning the earliest dates for the first 10 customers in the data set.
R026: For receipt details data (df_receipt), find the newest sales date (sales_ymd) and the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 cases where both are different.
df_receipt %>%
group_by(customer_id) %>% summarise(max_ymd = max(sales_ymd) ,min_ymd = min(sales_ymd), .groups = "drop") %>%
filter(max_ymd != min_ymd) %>%
slice(1:10)
customer_id  max_ymd  min_ymd 

<chr>  <int>  <int> 
CS001114000005  20190731  20180503 
CS001115000010  20190405  20171228 
CS001205000004  20190625  20170914 
CS001205000006  20190224  20180207 
CS001214000009  20190902  20170306 
CS001214000017  20191006  20180828 
CS001214000048  20190929  20171109 
CS001214000052  20190617  20180208 
CS001215000005  20181021  20170206 
CS001215000040  20171022  20170214 
Commentary :
This code is using the%>%
(pipe) operator to chain together a sequence of operations on a data frame calleddf_receipt
.
Here's a stepbystep explanation of what each operation does:group_by(customer_id)
: This groups the data by thecustomer_id
column, so that subsequent operations are performed separately for each customer.summarise(max_ymd = max(sales_ymd), min_ymd = min(sales_ymd), .groups = "drop")
: This creates a summary of the data within each group, calculating the maximum and minimum values of thesales_ymd
column and storing them in new columns calledmax_ymd
andmin_ymd
, respectively. The.groups = "drop"
argument tellsdplyr
to drop the grouping information from the resulting data frame (since we don't need it for further analysis).filter(max_ymd != min_ymd)
: This filters the data frame to include only those rows where themax_ymd
value is not equal to themin_ymd
value. This will exclude any customers who made only a single purchase during the period covered by the data set.slice(1:10)
: This selects the first 10 rows of the resulting data frame. Since the data was grouped bycustomer_id
, summarized using themax
andmin
functions, and filtered based on those summary statistics, this will give us the first 10 customers who made more than one purchase during the period covered by the data set.
Overall, this code is selecting the customers who made more than one purchase during the period covered by thedf_receipt
data frame, and returning the dates of their earliest and latest purchases. The code then returns the earliest and latest purchases for the first 10 customers who meet these criteria.
R027: For the receipt details data (df_receipt), calculate the average of the sales amount (amount) for each shop code (store_cd) and display the top 5 in descending order.
df_receipt %>%
group_by(store_cd) %>%
summarise(mean_amount = mean(amount), .groups = "drop") %>%
arrange(desc(mean_amount)) %>%
slice(1:5)
store_cd  mean_amount 

<chr>  <dbl> 
S13052  402.8675 
S13015  351.1120 
S13003  350.9155 
S14010  348.7913 
S13001  348.4704 
Commentary :
This code is using the%>%
(pipe) operator to chain together a sequence of operations on a data frame calleddf_receipt
.
Here's a stepbystep explanation of what each operation does:group_by(store_cd)
: This groups the data by thestore_cd
column, so that subsequent operations are performed separately for each store.summarise(mean_amount = mean(amount), .groups = "drop")
: This creates a summary of the data within each group, calculating the mean value of theamount
column and storing it in a new column calledmean_amount
. The.groups = "drop"
argument tellsdplyr
to drop the grouping information from the resulting data frame (since we don't need it for further analysis).arrange(desc(mean_amount))
: This sorts the resulting data frame by themean_amount
column in descending order (from highest to lowest).slice(1:5)
: This selects the first 5 rows of the resulting data frame. Since the data was grouped bystore_cd
and summarized using themean
function, this will give us the stores with the highest meanamount
values across all transactions.
Overall, this code is selecting the stores with the highest meanamount
values across all transactions in thedf_receipt
data frame, and returning the top 5 stores based on those values.
R028: Calculate the median sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display the TOP 5 in descending order.
df_receipt %>%
group_by(store_cd) %>%
summarise(median_amount = median(amount), .groups = "drop") %>% arrange(desc(median_amount)) %>%
slice(1:5)
store_cd  median_amount 

<chr>  <dbl> 
S13052  190 
S14010  188 
S14050  185 
S13003  180 
S13018  180 
Commentary :
This code is using the%>%
(pipe) operator to chain together a sequence of operations on a data frame calleddf_receipt
.
Here's a stepbystep explanation of what each operation does:group_by(store_cd)
: This groups the data by thestore_cd
column, so that subsequent operations are performed separately for each store.summarise(median_amount = median(amount), .groups = "drop")
: This creates a summary of the data within each group, calculating the median value of theamount
column and storing it in a new column calledmedian_amount
. The.groups = "drop"
argument tellsdplyr
to drop the grouping information from the resulting data frame (since we don't need it for further analysis).arrange(desc(median_amount))
: This sorts the resulting data frame by themedian_amount
column in descending order (from highest to lowest).slice(1:5)
: This selects the first 5 rows of the resulting data frame. Since the data was grouped bystore_cd
and summarized using themedian
function, this will give us the stores with the highest medianamount
values across all transactions.
Overall, this code is selecting the stores with the highest medianamount
values across all transactions in thedf_receipt
data frame, and returning the top 5 stores based on those values. The difference between this code and the previous one I explained is that this code is using themedian
function instead of themean
function to calculate the central tendency of theamount
values for each store.
R029: For receipt details data (df_receipt), calculate the mode value of the product code (product_cd) for each shop code (store_cd) and display 10 cases.
# Code example 1
df_receipt %>%
group_by(store_cd, product_cd) %>%
summarise(cnt = n(), .groups = "drop_last") %>%
filter(cnt == cnt %>% max()) %>%
ungroup() %>%
slice(1:10)
store_cd  product_cd  cnt 

<chr>  <chr>  <int> 
S12007  P060303001  72 
S12013  P060303001  107 
S12014  P060303001  65 
S12029  P060303001  92 
S12030  P060303001  115 
S13001  P060303001  67 
S13002  P060303001  78 
S13003  P071401001  65 
S13004  P060303001  88 
S13005  P040503001  36 
Commentary :
This code is using thedplyr
package to perform data manipulation on a data frame calleddf_receipt
. Here is a stepbystep explanation of what the code is doing:%>%
is a pipe operator used to chain functions together, so the output of one function becomes the input of the next function.
Thegroup_by
function groups the data frame bystore_cd
andproduct_cd
.
Thesummarise
function calculates a summary statistic for each group. In this case, it is counting the number of rows in each group and creating a new column calledcnt
.
The.groups = "drop_last"
argument is used to drop the last grouping level created by thesummarise
function, which is theproduct_cd
grouping. This is done to make it easier to filter on the maximum count later on.
Thefilter
function selects rows that meet a certain condition. In this case, it is selecting rows where thecnt
column is equal to the maximum value of thecnt
column in the data frame.
Theungroup
function removes all grouping levels from the data frame.
Theslice
function selects a subset of rows based on their position. In this case, it is selecting the first 10 rows of the data frame.
So overall, this code is grouping the data by store and product, calculating the count of each group, filtering for the groups with the highest count, and then selecting the top 10 results.
# Code example 2: using which.max() (multiple occurrences of mode number, narrowed down to one of them)
# Display of the number of cases is omitted
table_product <table(df_receipt$store_cd,df_receipt$product_cd)
store < names(table_product[,1])
mode_product < c()
for (i in 1:length(store)){
mode_product[i] < names(which.max(table_product[i,]))
}
data.frame(store_cd = store, product_cd = mode_product) %>%
slice(1:10)
store_cd  product_cd 

<chr>  <chr> 
S12007  P060303001 
S12013  P060303001 
S12014  P060303001 
S12029  P060303001 
S12030  P060303001 
S13001  P060303001 
S13002  P060303001 
S13003  P071401001 
S13004  P060303001 
S13005  P040503001 
Commentary :
This code is using thetable
function andfor
loop to find the mode (most frequent value) of each product code for each store in thedf_receipt
data frame.
Here's a stepbystep explanation of what the code is doing:
Thetable
function creates a contingency table of counts for each combination ofstore_cd
andproduct_cd
indf_receipt
. The resulting object,table_product
, is a matrix where the rows correspond to each store and the columns correspond to each product code.
Thenames
function extracts the store codes from the first column of thetable_product
matrix and saves them as a character vector calledstore
.
An empty vector calledmode_product
is created to store the mode of each product code for each store.
Thefor
loop iterates through each store instore
and finds the mode of each product code for that store. Thewhich.max
function is used to find the index of the maximum count in each row of thetable_product
matrix, and thenames
function is used to extract the product code with that index. The resulting product code is stored in themode_product
vector.
A data frame is created with two columns:store_cd
andproduct_cd
. Thestore_cd
column is populated with the values in thestore
vector, and theproduct_cd
column is populated with the values in themode_product
vector.
The%>%
operator is used to pipe the data frame to theslice
function, which selects the first 10 rows of the data frame.
So overall, this code is finding the mode of each product code for each store indf_receipt
, creating a data frame with these values, and then selecting the top 10 results.
R030: Calculate the variance of the sales amount (amount) for each shop code (store_cd) against the receipt details data (df_receipt) and display five cases in descending order.
var_sample < function(x){ var(x) * (length(x)  1) / length(x) }
df_receipt %>%
group_by(store_cd) %>%
summarise(var_amount = var_sample(amount), .groups = "drop") %>%
arrange(desc(var_amount)) %>% slice(1:5)
store_cd  var_amount 

<chr>  <dbl> 
S13052  440088.7 
S14011  306314.6 
S14034  296920.1 
S13001  295432.0 
S13015  295294.4 
Commentary :
This code is using thedplyr
package to calculate the sample variance of theamount
variable for each store in thedf_receipt
data frame, and then selecting the top 5 stores with the highest variance.
Here's a stepbystep explanation of what the code is doing:
Thevar_sample
function is defined, which calculates the sample variance of a vector of valuesx
using the formulavar(x) * (length(x)  1) / length(x)
. This formula adjusts the variance by dividing bylength(x)  1
instead oflength(x)
to provide an unbiased estimate of the population variance.
The%>%
operator is used to pipe thedf_receipt
data frame to the next function in the chain.
Thegroup_by
function is used to group the data bystore_cd
.
Thesummarise
function calculates a summary statistic for each group. In this case, it is using thevar_sample
function to calculate the sample variance of theamount
variable for each group, and creating a new column calledvar_amount
.
The.groups = "drop"
argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.
Thearrange
function is used to sort the data frame byvar_amount
in descending order, so that the stores with the highest variance appear first.
The%>%
operator is used to pipe the sorted data frame to theslice
function, which selects the first 5 rows of the data frame.
So overall, this code is grouping the data by store, calculating the sample variance of theamount
variable for each group, sorting the resulting data frame by variance in descending order, and then selecting the top 5 results.
R031: Calculate the standard deviation of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display 5 cases in descending order.
var_sample < function(x){ var(x)*(length(x)1)/length(x) }
std_sample < function(x){ sqrt(var_sample(x)) }
head(
df_receipt %>%
group_by(store_cd) %>%
summarise(std_amount = std_sample(amount), .groups = "drop") %>%
arrange(desc(std_amount)) ,
n = 5
)
store_cd  std_amount 

<chr>  <dbl> 
S13052  663.3918 
S14011  553.4569 
S14034  544.9037 
S13001  543.5366 
S13015  543.4099 
Commentary :
This code is using thedplyr
package to calculate the sample standard deviation of theamount
variable for each store in thedf_receipt
data frame, and then selecting the top 5 stores with the highest standard deviation.
Here's a stepbystep explanation of what the code is doing:
Thevar_sample
function is defined, which calculates the sample variance of a vector of valuesx
using the formulavar(x) * (length(x)  1) / length(x)
. This formula adjusts the variance by dividing bylength(x)  1
instead oflength(x)
to provide an unbiased estimate of the population variance.
Thestd_sample
function is defined, which calculates the sample standard deviation of a vector of valuesx
using the formulasqrt(var_sample(x))
.
Thehead
function is used to select the first 5 rows of the resulting data frame.
The%>%
operator is used to pipe thedf_receipt
data frame to the next function in the chain.
Thegroup_by
function is used to group the data bystore_cd
.
Thesummarise
function calculates a summary statistic for each group. In this case, it is using thestd_sample
function to calculate the sample standard deviation of theamount
variable for each group, and creating a new column calledstd_amount
.
The.groups = "drop"
argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.
Thearrange
function is used to sort the data frame bystd_amount
in descending order, so that the stores with the highest standard deviation appear first.
Then = 5
argument is used to limit the output to the top 5 rows.
So overall, this code is grouping the data by store, calculating the sample standard deviation of theamount
variable for each group, sorting the resulting data frame by standard deviation in descending order, and then selecting the top 5 results.
R032: Find the percentile values for the sales amount (amount) in the receipt details data (df_receipt) in 25% increments.
df_receipt %>%
summarise(amount_25per = quantile(amount, 0.25),
amount_50per = quantile(amount, 0.5),
amount_75per = quantile(amount, 0.75),
amount_100per = quantile(amount, 1.0))
amount_25per  amount_50per  amount_75per  amount_100per 

<dbl>  <dbl>  <dbl>  <dbl> 
102  170  288  10925 
Commentary :
This code is using thedplyr
package to calculate the quartiles of theamount
variable in thedf_receipt
data frame.
Here's a stepbystep explanation of what the code is doing:
The%>%
operator is used to pipe thedf_receipt
data frame to the next function in the chain.
Thesummarise
function is used to calculate summary statistics for the data frame. In this case, it is using thequantile
function to calculate the 25th, 50th (median), 75th, and 100th percentiles of theamount
variable.
The resulting data frame will have 1 row and 4 columns, with the column namesamount_25per
,amount_50per
,amount_75per
, andamount_100per
, and the corresponding values of the quartiles.
So overall, this code is calculating the quartiles of theamount
variable in thedf_receipt
data frame, and returning a summary data frame with the quartile values.
R033: Calculate the average of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and extract those that are 330 or more.
df_receipt %>%
group_by(store_cd) %>%
summarise(mean_amount = mean(amount), .groups = "drop") %>%
filter(mean_amount >= 330)
store_cd  mean_amount 

<chr>  <dbl> 
S12013  330.1941 
S13001  348.4704 
S13003  350.9155 
S13004  330.9439 
S13015  351.1120 
S13019  330.2086 
S13020  337.8799 
S13052  402.8675 
S14010  348.7913 
S14011  335.7183 
S14026  332.3406 
S14045  330.0821 
S14047  330.0771 
Commentary :
This code is using thedplyr
package to group thedf_receipt
data frame bystore_cd
, calculate the meanamount
for each group, and then filter the resulting data frame to include only those groups where the meanamount
is greater than or equal to 330.
Here's a stepbystep explanation of what the code is doing:
The%>%
operator is used to pipe thedf_receipt
data frame to the next function in the chain.
Thegroup_by
function is used to group the data bystore_cd
.
Thesummarise
function is used to calculate a summary statistic for each group. In this case, it is using themean
function to calculate the meanamount
for each group, and creating a new column calledmean_amount
.
The.groups = "drop"
argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.
Thefilter
function is used to subset the data frame to only include rows where themean_amount
is greater than or equal to 330.
So overall, this code is grouping the data bystore_cd
, calculating the meanamount
for each group, and then selecting only those groups where the meanamount
is greater than or equal to 330. The resulting data frame will include one row for each qualifying store, with the store code and mean amount.
R034: For the receipt details data (df_receipt), sum the amount of sales (amount) for each customer ID (customer_id) and find the average for all customers. However, exclude customer IDs starting with “Z” as they represent nonmembers.
df_mean < df_receipt %>%
filter(!grepl("^Z", customer_id)) %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop") %>%
summarise(mean_amount = mean(sum_amount))
df_mean$mean_amount
2547.74223452926
Commentary :
This code is using thedplyr
package to calculate the mean amount spent by each customer, after excluding any customer whosecustomer_id
starts with the letter "Z".
Here's a stepbystep explanation of what the code is doing:
The%>%
operator is used to pipe thedf_receipt
data frame to the next function in the chain.
Thefilter
function is used to exclude any rows where thecustomer_id
starts with the letter "Z".
Thegroup_by
function is used to group the data bycustomer_id
.
Thesummarise
function is used to calculate a summary statistic for each group. In this case, it is using thesum
function to calculate the totalamount
spent by each customer, and creating a new column calledsum_amount
.
The.groups = "drop"
argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.
Anothersummarise
function is used to calculate the meansum_amount
across all customers.
The resultingmean_amount
value is extracted using$mean_amount
.
So overall, this code is filtering the data to exclude any rows where thecustomer_id
starts with "Z", grouping the data bycustomer_id
, calculating the totalamount
spent by each customer, calculating the meansum_amount
across all customers, and returning the resulting mean value. The resulting output is a single numeric value representing the mean amount spent by each customer.
R035: For the receipt details data (df_receipt), sum the sales amount (amount) for each customer ID (customer_id) to obtain the average of all customers, extract the customers who spend more than the average and display 10 items. However, exclude customer IDs starting with “Z” as they represent nonmembers.
df_sum < df_receipt %>%
filter(!grepl("^Z", customer_id)) %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop")
df_mean < df_sum %>%
summarise(mean_amount = mean(sum_amount))
df_sum %>%
filter(sum_amount >= df_mean$mean_amount) %>%
slice(1:10)
customer_id  sum_amount 

<chr>  <int> 
CS001115000010  3044 
CS001205000006  3337 
CS001214000009  4685 
CS001214000017  4132 
CS001214000052  5639 
CS001215000040  3496 
CS001304000006  3726 
CS001305000005  3485 
CS001305000011  4370 
CS001315000180  3300 
Commentary :
This code is using thedplyr
package to filter the customers who have spent an amount greater than or equal to the mean amount spent by all customers, after excluding any customer whosecustomer_id
starts with the letter "Z". It then selects the top 10 customers based on their totalamount
spent.
Here's a stepbystep explanation of what the code is doing:
The%>%
operator is used to pipe thedf_receipt
data frame to the next function in the chain.
Thefilter
function is used to exclude any rows where thecustomer_id
starts with the letter "Z".
Thegroup_by
function is used to group the data bycustomer_id
.
Thesummarise
function is used to calculate a summary statistic for each group. In this case, it is using thesum
function to calculate the totalamount
spent by each customer, and creating a new column calledsum_amount
.
The.groups = "drop"
argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.
Thedf_sum
data frame is created by storing the output of the above operations.
Thesummarise
function is used again to calculate the meansum_amount
across all customers indf_sum
.
The resultingmean_amount
value is extracted using$mean_amount
.
The%>%
operator is used to pipedf_sum
to the next function in the chain.
Thefilter
function is used to select only those rows where thesum_amount
is greater than or equal to themean_amount
calculated in step 7.
Theslice
function is used to select only the top 10 rows of the resulting data frame, based on their order in the original data frame.
So overall, this code is filtering the data to exclude any rows where thecustomer_id
starts with "Z", grouping the data bycustomer_id
, calculating the totalamount
spent by each customer, calculating the meansum_amount
across all customers, creating a data frame of customers whose totalamount
spent is greater than or equal to the mean amount spent by all customers, and selecting the top 10 customers based on their totalamount
spent.
R036: Combine receipt details data (df_receipt) and shop data (df_store) internally and display all items of receipt details data and store name (store_name) of shop data for 10 items.
inner_join(df_receipt, df_store[c("store_cd", "store_name")], by = "store_cd") %>% slice(1:10)
sales_ymd  sales_epoch  store_cd  receipt_no  receipt_sub_no  customer_id  product_cd  quantity  amount  store_name 

<int>  <int>  <chr>  <int>  <int>  <chr>  <chr>  <int>  <int>  <chr> 
20181103  1541203200  S14006  112  1  CS006214000001  P070305012  1  158  葛が谷店 
20181118  1542499200  S13008  1132  2  CS008415000097  P070701017  1  81  成城店 
20170712  1499817600  S14028  1102  1  CS028414000014  P060101005  1  170  二ツ橋店 
20190205  1549324800  S14042  1132  1  ZZ000000000000  P050301001  1  25  新山下店 
20180821  1534809600  S14025  1102  2  CS025415000050  P060102007  1  90  大和店 
20190605  1559692800  S13003  1112  1  CS003515000195  P050102002  1  138  狛江店 
20181205  1543968000  S14024  1102  2  CS024514000042  P080101005  1  30  三田店 
20190922  1569110400  S14040  1102  1  CS040415000178  P070501004  1  128  長津田店 
20170504  1493856000  S13020  1112  2  ZZ000000000000  P071302010  1  770  十条仲原店 
20191010  1570665600  S14027  1102  1  CS027514000015  P071101003  1  680  南藤沢店 
Commentary :
This code performs an inner join operation between two data frames,df_receipt
anddf_store
, based on a common column "store_cd". The result of this operation is then piped into theslice()
function to select the first 10 rows of the resulting data frame.
Let's break down the code step by step:inner_join(df_receipt, df_store[c("store_cd", "store_name")], by = "store_cd")
: This performs an inner join between thedf_receipt
data frame and a subset ofdf_store
that only contains the columns "store_cd" and "store_name". The join is done on the "store_cd" column which is present in both data frames. The result is a new data frame that contains all the rows where there is a match between the "store_cd" column in both data frames.%>%
: This is the pipe operator which takes the output of the previous command and pipes it as input to the next command.slice(1:10)
: This function selects the first 10 rows of the resulting data frame from the previous command.
So, the overall purpose of this code is to select the first 10 rows of the result of an inner join operation betweendf_receipt
and a subset ofdf_store
. The resulting data frame will only contain rows where there is a match between the "store_cd" column in both data frames, and it will have the columns "store_cd" and "store_name" fromdf_store
, along with all the columns fromdf_receipt
.
R037: Join product data (df_product) and category data (df_category) internally and display all items of the product data and 10 category subcategory names (category_small_name) of the category data.
inner_join(df_product, df_category[c("category_small_cd", "category_small_name")], by = "category_small_cd") %>% slice(1:10)
product_cd  category_major_cd  category_medium_cd  category_small_cd  unit_price  unit_cost  category_small_name 

<chr>  <chr>  <chr>  <chr>  <int>  <int>  <chr> 
P040101001  04  0401  040101  198  149  弁当類 
P040101002  04  0401  040101  218  164  弁当類 
P040101003  04  0401  040101  230  173  弁当類 
P040101004  04  0401  040101  248  186  弁当類 
P040101005  04  0401  040101  268  201  弁当類 
P040101006  04  0401  040101  298  224  弁当類 
P040101007  04  0401  040101  338  254  弁当類 
P040101008  04  0401  040101  420  315  弁当類 
P040101009  04  0401  040101  498  374  弁当類 
P040101010  04  0401  040101  580  435  弁当類 
Commentary :
This code performs an inner join operation between two data frames,df_product
anddf_category
, based on a common column "category_small_cd". The result of this operation is then piped into theslice()
function to select the first 10 rows of the resulting data frame.
Let's break down the code step by step:inner_join(df_product, df_category[c("category_small_cd", "category_small_name")], by = "category_small_cd")
: This performs an inner join between thedf_product
data frame and a subset ofdf_category
that only contains the columns "category_small_cd" and "category_small_name". The join is done on the "category_small_cd" column which is present in both data frames. The result is a new data frame that contains all the rows where there is a match between the "category_small_cd" column in both data frames.%>%
: This is the pipe operator which takes the output of the previous command and pipes it as input to the next command.slice(1:10)
: This function selects the first 10 rows of the resulting data frame from the previous command.
So, the overall purpose of this code is to select the first 10 rows of the result of an inner join operation betweendf_product
and a subset ofdf_category
. The resulting data frame will only contain rows where there is a match between the "category_small_cd" column in both data frames, and it will have the columns "category_small_cd" and "category_small_name" fromdf_category
, along with all the columns fromdf_product
.
R038: Calculate the total sales amount for each customer from the customer data (df_customer) and receipt details data (df_receipt) and display 10 items. However, for customers with no sales records, the sales amount should be displayed as 0. Customers whose gender code (gender_cd) is female (1) should be included, and nonmembers (whose customer ID starts with “Z”) should be excluded.
df_sum < df_receipt %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop")
df_target < df_customer %>%
filter(gender_cd == "1" & !grepl("^Z", customer_id))
left_join(df_target["customer_id"], df_sum, by = "customer_id") %>%
replace_na(list(sum_amount = 0)) %>% slice(1:10)
customer_id  sum_amount 

<chr>  <dbl> 
CS021313000114  0 
CS031415000172  5088 
CS028811000001  0 
CS001215000145  875 
CS015414000103  3122 
CS033513000180  868 
CS035614000014  0 
CS011215000048  3444 
CS009413000079  0 
CS040412000191  210 
Commentary :
This code performs several operations to obtain a summary table for the customer data. The result of this operation is then piped into thereplace_na()
function to replace any missing values with a default value of 0. Finally, the result is piped into theslice()
function to select the first 10 rows of the resulting data frame.
Let's break down the code step by step:df_sum < df_receipt %>% group_by(customer_id) %>% summarise(sum_amount = sum(amount), .groups = "drop")
: This code first groups thedf_receipt
data frame by the "customer_id" column, and then summarizes the "amount" column using thesum()
function. The resulting data frame,df_sum
, contains the total amount spent by each customer. The.groups
parameter is set to "drop" to remove the grouping information from the output.df_target < df_customer %>% filter(gender_cd == "1" & !grepl("^Z", customer_id))
: This code filters thedf_customer
data frame to only include rows where the "gender_cd" column equals "1" and the "customer_id" column does not start with the letter "Z". The resulting data frame,df_target
, contains only male customers whose IDs do not start with "Z".left_join(df_target["customer_id"], df_sum, by = "customer_id")
: This performs a left join operation betweendf_target["customer_id"]
anddf_sum
based on the "customer_id" column. The result is a new data frame that contains all the rows fromdf_target
along with the corresponding total amount spent by each customer fromdf_sum
.replace_na(list(sum_amount = 0))
: This replaces any missing values in the "sum_amount" column with a default value of 0.%>% slice(1:10)
: This function selects the first 10 rows of the resulting data frame from the previous command.
So, the overall purpose of this code is to obtain a summary table of the total amount spent by male customers whose IDs do not start with the letter "Z". The resulting data frame will have two columns, "customer_id" and "sum_amount", where "sum_amount" is the total amount spent by each customer. Any missing values in the "sum_amount" column will be replaced with 0, and the resulting data frame will be limited to the first 10 rows.
R039: From the receipt details data (df_receipt), create a data set containing the top 20 customers with the highest number of days of sales and a data set containing the top 20 customers with the highest total sales value, respectively, and furthermore, merge these two completely externally. However, exclude nonmembers (whose customer ID starts with “Z”).
# Code example 1
df_data < df_receipt %>%
filter(!grepl("^Z", customer_id)) %>%
group_by(customer_id)
df_cnt < df_data %>%
summarise(come_days = n_distinct(sales_ymd), .groups = "drop") %>%
arrange(desc(come_days), customer_id) %>%
slice(1:20)
df_sum < df_data %>% summarise(sum_amount = sum(amount), .groups = "drop") %>%
arrange(desc(sum_amount)) %>%
slice(1:20)
full_join(df_cnt, df_sum, by = "customer_id")
customer_id  come_days  sum_amount 

<chr>  <int>  <int> 
CS040214000008  23  NA 
CS010214000010  22  18585 
CS015415000185  22  20153 
CS010214000002  21  NA 
CS028415000007  21  19127 
CS016415000141  20  18372 
CS017415000097  20  23086 
CS014214000023  19  NA 
CS021514000045  19  NA 
CS021515000172  19  NA 
CS022515000226  19  NA 
CS031414000051  19  19202 
CS039414000052  19  NA 
CS007515000107  18  NA 
CS014415000077  18  NA 
CS021515000056  18  NA 
CS021515000211  18  NA 
CS022515000028  18  NA 
CS030214000008  18  NA 
CS031414000073  18  NA 
CS001605000009  NA  18925 
CS006515000023  NA  18372 
CS011414000106  NA  18338 
CS038415000104  NA  17847 
CS035414000024  NA  17615 
CS021515000089  NA  17580 
CS032414000072  NA  16563 
CS016415000101  NA  16348 
CS011415000006  NA  16094 
CS034415000047  NA  16083 
CS007514000094  NA  15735 
CS009414000059  NA  15492 
CS030415000034  NA  15468 
CS015515000034  NA  15300 
Commentary :
This code performs several operations to obtain two summary tables for the customer data, and then performs a full join operation between these two tables based on the "customer_id" column. The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.
Let's break down the code step by step:df_data < df_receipt %>% filter(!grepl("^Z", customer_id)) %>% group_by(customer_id)
: This code first filters thedf_receipt
data frame to exclude any rows where the "customer_id" column starts with the letter "Z". The resulting data frame is then grouped by the "customer_id" column. The resultingdf_data
data frame contains all the rows fromdf_receipt
with "Z" customer IDs removed, and is grouped by "customer_id".df_cnt < df_data %>% summarise(come_days = n_distinct(sales_ymd), .groups = "drop") %>% arrange(desc(come_days), customer_id) %>% slice(1:20)
: This code first summarizes thedf_data
data frame by counting the number of distinct "sales_ymd" values for each customer using then_distinct()
function. The resulting data frame,df_cnt
, contains the total number of unique sales days for each customer. The data frame is then arranged in descending order of "come_days" followed by "customer_id". The first 20 rows of the resulting data frame are then selected using theslice()
function.df_sum < df_data %>% summarise(sum_amount = sum(amount), .groups = "drop") %>% arrange(desc(sum_amount)) %>% slice(1:20)
: This code summarizes thedf_data
data frame by calculating the sum of the "amount" column for each customer using thesum()
function. The resulting data frame,df_sum
, contains the total amount spent by each customer. The data frame is then arranged in descending order of "sum_amount", and the first 20 rows of the resulting data frame are selected using theslice()
function.full_join(df_cnt, df_sum, by = "customer_id")
: This code performs a full join operation betweendf_cnt
anddf_sum
based on the "customer_id" column. The result is a new data frame that contains all the rows from both data frames, matched by "customer_id". The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.
So, the overall purpose of this code is to obtain two summary tables for the customer data, and then perform a full join operation between these two tables. The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.
# Code example 2
# Code example using slice_max.
# Rewrite with_ties = FALSE to with_ties = TRUE to change the form to include the same rank
df_data < df_receipt %>%
filter(!grepl("^Z", customer_id)) %>%
group_by(customer_id)
df_cnt < df_data %>%
summarise(come_days = n_distinct(sales_ymd)) %>%
slice_max(come_days, n = 20, with_ties = FALSE)
df_sum < df_data %>%
summarise(sum_amount = sum(amount)) %>%
slice_max(sum_amount, n = 20, with_ties = FALSE)
full_join(df_cnt, df_sum, by = "customer_id")
customer_id  come_days  sum_amount 

<chr>  <int>  <int> 
CS040214000008  23  NA 
CS010214000010  22  18585 
CS015415000185  22  20153 
CS010214000002  21  NA 
CS028415000007  21  19127 
CS016415000141  20  18372 
CS017415000097  20  23086 
CS014214000023  19  NA 
CS021514000045  19  NA 
CS021515000172  19  NA 
CS022515000226  19  NA 
CS031414000051  19  19202 
CS039414000052  19  NA 
CS007515000107  18  NA 
CS014415000077  18  NA 
CS021515000056  18  NA 
CS021515000211  18  NA 
CS022515000028  18  NA 
CS030214000008  18  NA 
CS031414000073  18  NA 
CS001605000009  NA  18925 
CS006515000023  NA  18372 
CS011414000106  NA  18338 
CS038415000104  NA  17847 
CS035414000024  NA  17615 
CS021515000089  NA  17580 
CS032414000072  NA  16563 
CS016415000101  NA  16348 
CS011415000006  NA  16094 
CS034415000047  NA  16083 
CS007514000094  NA  15735 
CS009414000059  NA  15492 
CS030415000034  NA  15468 
CS015515000034  NA  15300 
Commentary :
This code is similar to the previous code we discussed, but it uses theslice_max()
function instead of thearrange()
function to obtain the top 20 customers based on the number of distinct sales days and the total amount spent.
Here is a breakdown of the code:df_data < df_receipt %>% filter(!grepl("^Z", customer_id)) %>% group_by(customer_id)
: This code is similar to the previous code, where it filters thedf_receipt
data frame to exclude any rows where the "customer_id" column starts with the letter "Z". The resulting data frame is then grouped by the "customer_id" column.df_cnt < df_data %>% summarise(come_days = n_distinct(sales_ymd)) %>% slice_max(come_days, n = 20, with_ties = FALSE)
: This code summarizes thedf_data
data frame by counting the number of distinct "sales_ymd" values for each customer using then_distinct()
function. The resulting data frame,df_cnt
, contains the total number of unique sales days for each customer. Theslice_max()
function is used to select the top 20 rows based on the "come_days" column, wherewith_ties = FALSE
specifies that only the top 20 distinct values should be selected.df_sum < df_data %>% summarise(sum_amount = sum(amount)) %>% slice_max(sum_amount, n = 20, with_ties = FALSE)
: This code summarizes thedf_data
data frame by calculating the sum of the "amount" column for each customer using thesum()
function. The resulting data frame,df_sum
, contains the total amount spent by each customer. Theslice_max()
function is used to select the top 20 rows based on the "sum_amount" column, wherewith_ties = FALSE
specifies that only the top 20 distinct values should be selected.full_join(df_cnt, df_sum, by = "customer_id")
: This code performs a full join operation betweendf_cnt
anddf_sum
based on the "customer_id" column. The result is a new data frame that contains all the rows from both data frames, matched by "customer_id". The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.
So, the overall purpose of this code is to obtain two summary tables for the customer data, and then perform a full join operation between these two tables. The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent, obtained using theslice_max()
function instead of thearrange()
function.
R040: You want to create data combining all shops and all products. Direct product the shop data (df_store) and the product data (df_product) and calculate the number of cases.
df_store_tmp < df_store
df_product_tmp < df_product
df_store_tmp["key"] < 0
df_product_tmp["key"] < 0
nrow(full_join(df_store_tmp, df_product_tmp, by = "key"))
531590
Commentary :
This code performs a full join operation between thedf_store
anddf_product
data frames based on a common "key" column that is added to both data frames. The resulting data frame is then used to calculate the number of rows using thenrow()
function.
Here is a breakdown of the code:df_store_tmp < df_store
: This code creates a new data frame calleddf_store_tmp
that is a copy of thedf_store
data frame. This is done to avoid modifying the original data frame.df_product_tmp < df_product
: This code creates a new data frame calleddf_product_tmp
that is a copy of thedf_product
data frame. This is done to avoid modifying the original data frame.df_store_tmp["key"] < 0
: This code adds a new column called "key" to thedf_store_tmp
data frame and initializes all the values to zero. This column is added to provide a common column for the join operation.df_product_tmp["key"] < 0
: This code adds a new column called "key" to thedf_product_tmp
data frame and initializes all the values to zero. This column is added to provide a common column for the join operation.full_join(df_store_tmp, df_product_tmp, by = "key")
: This code performs a full join operation between thedf_store_tmp
anddf_product_tmp
data frames based on the common "key" column. Since both data frames have the same value in this column for all rows, a full join will result in a Cartesian product of the two data frames, meaning that every row fromdf_store_tmp
will be paired with every row fromdf_product_tmp
.nrow(full_join(df_store_tmp, df_product_tmp, by = "key"))
: This code calculates the number of rows in the resulting data frame from the full join operation using thenrow()
function. The result is the total number of combinations of rows betweendf_store
anddf_product
.
In summary, this code creates copies of thedf_store
anddf_product
data frames, adds a common "key" column to both data frames with all values set to zero, performs a full join operation between the two data frames based on the common "key" column, and calculates the number of resulting rows. The purpose of this code is to determine the total number of combinations of rows between thedf_store
anddf_product
data frames.
Comment