参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック（構造化データ加工編）」

The Data Scientist Society Github :

Data Science 100 Knocks (Structured Data Processing) URL :

Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.

The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers.

Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

You can get preprocess_knock files to practice python, SQL, R from my github account

## Introduction

- To begin with, run the following cells
- Import the required libraries and read the data from the database (PostgreSQL)
- The libraries that are expected to be used are imported in the following cells
- If there are other libraries you wish to use, install them as appropriate using install.packages().
- Names, addresses, etc. are dummy data and are not real.

`require("RPostgreSQL") `

require("tidyr")

require("dplyr")

require("stringr")

require("caret")

require("lubridate")

require("rsample")

require("recipes")

require("themis")

host <- "db"

port <- Sys.getenv()["PG_PORT"]

dbname <- Sys.getenv()["PG_DATABASE"]

user <- Sys.getenv()["PG_USER"]

password <- Sys.getenv()["PG_PASSWORD"]

con <- dbConnect(PostgreSQL(), host=host, port=port, dbname=dbname, user=user, password=password)

df_customer <- dbGetQuery(con,"SELECT * FROM customer")

df_category <- dbGetQuery(con,"SELECT * FROM category")

df_product <- dbGetQuery(con,"SELECT * FROM product")

df_receipt <- dbGetQuery(con,"SELECT * FROM receipt")

df_store <- dbGetQuery(con,"SELECT * FROM store")

df_geocode <- dbGetQuery(con,"SELECT * FROM geocode")

Loading required package: RPostgreSQL Loading required package: DBI Loading required package: tidyr Loading required package: dplyr Attaching package: ‘dplyr’ The following objects are masked from ‘package:stats’: filter, lag The following objects are masked from ‘package:base’: intersect, setdiff, setequal, union Loading required package: stringr Loading required package: caret Loading required package: ggplot2 Loading required package: lattice Loading required package: lubridate Attaching package: ‘lubridate’ The following objects are masked from ‘package:base’: date, intersect, setdiff, union Loading required package: rsample Loading required package: recipes Attaching package: ‘recipes’ The following object is masked from ‘package:stringr’: fixed The following object is masked from ‘package:stats’: step Loading required package: themis Registered S3 methods overwritten by 'themis': method from bake.step_downsample recipes bake.step_upsample recipes prep.step_downsample recipes prep.step_upsample recipes tidy.step_downsample recipes tidy.step_upsample recipes tunable.step_downsample recipes tunable.step_upsample recipes Attaching package: ‘themis’ The following objects are masked from ‘package:recipes’: step_downsample, step_upsample

## Practice Questions

R-021: Count the number of cases against the receipt details data (df_receipt).

nrow(df_receipt)

`104681`

Commentary :

The code "nrow(df_receipt)" is written in R or a similar statistical programming language. It is used to determine the number of rows in a data frame called "df_receipt".

Here's a breakdown of the code:

"nrow()" is a built-in R function that returns the number of rows in a data frame.

"df_receipt" is the name of the data frame whose number of rows we want to determine.

For example, if the data frame "df_receipt" contains 100 rows of data, then running this code would return the value 100. This code is useful for quickly checking the size of a data frame and ensuring that it has the expected number of rows before performing further analysis or manipulating the data in some way.

R-022: Count the number of unique cases against the customer ID (customer_id) of the receipt details data (df_receipt).

length(unique(df_receipt$customer_id))

`8307`

Commentary :

The code "length(unique(df_receipt$customer_id))" is written in R or a similar statistical programming language. It is used to determine the number of unique customers in a data frame called "df_receipt".

Here's a breakdown of the code:

"unique()" is an R function that returns a vector containing only the unique elements of a vector. In this case, the vector being passed to the "unique()" function is "df_receipt$customer_id", which is a column in the "df_receipt" data frame that contains customer IDs.

"length()" is another built-in R function that returns the number of elements in a vector. In this case, the vector being passed to "length()" is the output of the "unique()" function, which is a vector containing unique customer IDs.

Putting it all together, the code calculates the number of unique customers in the "df_receipt" data frame by taking the length of a vector of unique customer IDs. For example, if the data frame contains 1000 rows, but only 500 unique customer IDs, then running this code would return the value 500. This code is useful for understanding the number of unique customers in a data set and for calculating customer-level metrics such as customer lifetime value.

R-023: Sum the sales amount (amount) and sales quantity (quantity) for each shop code (store_cd) in the receipt details data (df_receipt).

`df_receipt %>% `

group_by(store_cd) %>%

summarise(amount = sum(amount), quantity = sum(quantity), .groups = "drop")

store_cd | amount | quantity |
---|---|---|

<chr> | <int> | <int> |

S12007 | 638761 | 2099 |

S12013 | 787513 | 2425 |

S12014 | 725167 | 2358 |

S12029 | 794741 | 2555 |

S12030 | 684402 | 2403 |

S13001 | 811936 | 2347 |

S13002 | 727821 | 2340 |

S13003 | 764294 | 2197 |

S13004 | 779373 | 2390 |

S13005 | 629876 | 2004 |

S13008 | 809288 | 2491 |

S13009 | 808870 | 2486 |

S13015 | 780873 | 2248 |

S13016 | 793773 | 2432 |

S13017 | 748221 | 2376 |

S13018 | 790535 | 2562 |

S13019 | 827833 | 2541 |

S13020 | 796383 | 2383 |

S13031 | 705968 | 2336 |

S13032 | 790501 | 2491 |

S13035 | 715869 | 2219 |

S13037 | 693087 | 2344 |

S13038 | 708884 | 2337 |

S13039 | 611888 | 1981 |

S13041 | 728266 | 2233 |

S13043 | 587895 | 1881 |

S13044 | 520764 | 1729 |

S13051 | 107452 | 354 |

S13052 | 100314 | 250 |

S14006 | 712839 | 2284 |

S14010 | 790361 | 2290 |

S14011 | 805724 | 2434 |

S14012 | 720600 | 2412 |

S14021 | 699511 | 2231 |

S14022 | 651328 | 2047 |

S14023 | 727630 | 2258 |

S14024 | 736323 | 2417 |

S14025 | 755581 | 2394 |

S14026 | 824537 | 2503 |

S14027 | 714550 | 2303 |

S14028 | 786145 | 2458 |

S14033 | 725318 | 2282 |

S14034 | 653681 | 2024 |

S14036 | 203694 | 635 |

S14040 | 701858 | 2233 |

S14042 | 534689 | 1935 |

S14045 | 458484 | 1398 |

S14046 | 412646 | 1354 |

S14047 | 338329 | 1041 |

S14048 | 234276 | 769 |

S14049 | 230808 | 788 |

S14050 | 167090 | 580 |

Commentary :

The code "df_receipt %>% group_by(store_cd) %>% summarise(amount = sum(amount), quantity = sum(quantity), .groups = 'drop')" is written in R or a similar statistical programming language. It is used to group the "df_receipt" data frame by a variable called "store_cd" and then calculate the sum of the "amount" and "quantity" variables within each store group.

Here's a breakdown of the code:

The pipe operator ("%>%") is used to chain together multiple functions in a sequence. It takes the output of the previous function and uses it as the input for the next function. In this case, the pipe operator is used to chain the "group_by()" and "summarise()" functions together.

"group_by()" is an R function that is used to group a data frame by one or more variables. In this case, we are grouping the "df_receipt" data frame by the "store_cd" variable.

"summarise()" is an R function that is used to calculate summary statistics on a data frame. In this case, we are calculating the sum of the "amount" and "quantity" variables within each store group. We are also specifying the names of the new variables using the assignment operator "=".

The ".groups" argument in the "summarise()" function is used to specify how the output should be formatted. Setting it to "drop" will remove the grouping information from the output.

The result of this code will be a new data frame with one row for each store in the "df_receipt" data frame. The "amount" and "quantity" columns will contain the sums of those variables within each store group. This code is useful for calculating store-level metrics such as total sales or units sold.

R-024: Find the most recent sales date (sales_ymd) for each customer ID (customer_id) for the receipt details data (df_receipt) and display the 10 most recent sales dates (sales_ymd).

`df_receipt %>% `

group_by(customer_id) %>%

summarise(max_ymd = max(sales_ymd), .groups = "drop") %>%

slice(1:10)

customer_id | max_ymd |
---|---|

<chr> | <int> |

CS001113000004 | 20190308 |

CS001114000005 | 20190731 |

CS001115000010 | 20190405 |

CS001205000004 | 20190625 |

CS001205000006 | 20190224 |

CS001211000025 | 20190322 |

CS001212000027 | 20170127 |

CS001212000031 | 20180906 |

CS001212000046 | 20170811 |

CS001212000070 | 20191018 |

Commentary :

The code "df_receipt %>% group_by(customer_id) %>% summarise(max_ymd = max(sales_ymd), .groups = 'drop') %>% slice(1:10)" is written in R or a similar statistical programming language. It is used to group the "df_receipt" data frame by a variable called "customer_id" and then find the latest date of purchase for each customer. Finally, it slices the first 10 rows of the resulting data frame.

Here's a breakdown of the code:

The pipe operator ("%>%") is used to chain together multiple functions in a sequence. It takes the output of the previous function and uses it as the input for the next function. In this case, the pipe operator is used to chain the "group_by()", "summarise()", and "slice()" functions together.

"group_by()" is an R function that is used to group a data frame by one or more variables. In this case, we are grouping the "df_receipt" data frame by the "customer_id" variable.

"summarise()" is an R function that is used to calculate summary statistics on a data frame. In this case, we are calculating the latest purchase date for each customer using the "max()" function on the "sales_ymd" variable.

The ".groups" argument in the "summarise()" function is used to specify how the output should be formatted. Setting it to "drop" will remove the grouping information from the output.

"slice()" is an R function that is used to extract a subset of rows from a data frame. In this case, we are selecting the first 10 rows of the resulting data frame.

The result of this code will be a new data frame with one row for each customer in the "df_receipt" data frame. The "max_ymd" column will contain the latest purchase date for each customer. Finally, the code selects the first 10 rows of the resulting data frame. This code is useful for understanding the most recent purchase date for each customer and can be used for various customer-level analysis and segmentation purposes.

R-025: For receipt details data (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display the 10 results.

`df_receipt %>% `

group_by(customer_id) %>%

summarise(min_ymd = min(sales_ymd), .groups = "drop") %>%

slice(1:10)

customer_id | min_ymd |
---|---|

<chr> | <int> |

CS001113000004 | 20190308 |

CS001114000005 | 20180503 |

CS001115000010 | 20171228 |

CS001205000004 | 20170914 |

CS001205000006 | 20180207 |

CS001211000025 | 20190322 |

CS001212000027 | 20170127 |

CS001212000031 | 20180906 |

CS001212000046 | 20170811 |

CS001212000070 | 20191018 |

Commentary :

This code is using the`%>%`

(pipe) operator to chain together a sequence of operations on a data frame called`df_receipt`

.

Here's a step-by-step explanation of what each operation does:`group_by(customer_id)`

: This groups the data by the`customer_id`

column, so that subsequent operations are performed separately for each customer.`summarise(min_ymd = min(sales_ymd), .groups = "drop")`

: This creates a summary of the data within each group, calculating the minimum value of the`sales_ymd`

column and storing it in a new column called`min_ymd`

. The`.groups = "drop"`

argument tells`dplyr`

to drop the grouping information from the resulting data frame (since we don't need it for further analysis).`slice(1:10)`

: This selects the first 10 rows of the resulting data frame. Since the data was grouped by`customer_id`

and the summary function was used, this will give us the earliest`sales_ymd`

value for the first 10 customers in the data set.

Overall, this code is selecting the earliest`sales_ymd`

value for each customer in the`df_receipt`

data frame, and returning the earliest dates for the first 10 customers in the data set.

R-026: For receipt details data (df_receipt), find the newest sales date (sales_ymd) and the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 cases where both are different.

`df_receipt %>% `

group_by(customer_id) %>% summarise(max_ymd = max(sales_ymd) ,min_ymd = min(sales_ymd), .groups = "drop") %>%

filter(max_ymd != min_ymd) %>%

slice(1:10)

customer_id | max_ymd | min_ymd |
---|---|---|

<chr> | <int> | <int> |

CS001114000005 | 20190731 | 20180503 |

CS001115000010 | 20190405 | 20171228 |

CS001205000004 | 20190625 | 20170914 |

CS001205000006 | 20190224 | 20180207 |

CS001214000009 | 20190902 | 20170306 |

CS001214000017 | 20191006 | 20180828 |

CS001214000048 | 20190929 | 20171109 |

CS001214000052 | 20190617 | 20180208 |

CS001215000005 | 20181021 | 20170206 |

CS001215000040 | 20171022 | 20170214 |

Commentary :

This code is using the`%>%`

(pipe) operator to chain together a sequence of operations on a data frame called`df_receipt`

.

Here's a step-by-step explanation of what each operation does:`group_by(customer_id)`

: This groups the data by the`customer_id`

column, so that subsequent operations are performed separately for each customer.`summarise(max_ymd = max(sales_ymd), min_ymd = min(sales_ymd), .groups = "drop")`

: This creates a summary of the data within each group, calculating the maximum and minimum values of the`sales_ymd`

column and storing them in new columns called`max_ymd`

and`min_ymd`

, respectively. The`.groups = "drop"`

argument tells`dplyr`

to drop the grouping information from the resulting data frame (since we don't need it for further analysis).`filter(max_ymd != min_ymd)`

: This filters the data frame to include only those rows where the`max_ymd`

value is not equal to the`min_ymd`

value. This will exclude any customers who made only a single purchase during the period covered by the data set.`slice(1:10)`

: This selects the first 10 rows of the resulting data frame. Since the data was grouped by`customer_id`

, summarized using the`max`

and`min`

functions, and filtered based on those summary statistics, this will give us the first 10 customers who made more than one purchase during the period covered by the data set.

Overall, this code is selecting the customers who made more than one purchase during the period covered by the`df_receipt`

data frame, and returning the dates of their earliest and latest purchases. The code then returns the earliest and latest purchases for the first 10 customers who meet these criteria.

R-027: For the receipt details data (df_receipt), calculate the average of the sales amount (amount) for each shop code (store_cd) and display the top 5 in descending order.

`df_receipt %>% `

group_by(store_cd) %>%

summarise(mean_amount = mean(amount), .groups = "drop") %>%

arrange(desc(mean_amount)) %>%

slice(1:5)

store_cd | mean_amount |
---|---|

<chr> | <dbl> |

S13052 | 402.8675 |

S13015 | 351.1120 |

S13003 | 350.9155 |

S14010 | 348.7913 |

S13001 | 348.4704 |

Commentary :

This code is using the`%>%`

(pipe) operator to chain together a sequence of operations on a data frame called`df_receipt`

.

Here's a step-by-step explanation of what each operation does:`group_by(store_cd)`

: This groups the data by the`store_cd`

column, so that subsequent operations are performed separately for each store.`summarise(mean_amount = mean(amount), .groups = "drop")`

: This creates a summary of the data within each group, calculating the mean value of the`amount`

column and storing it in a new column called`mean_amount`

. The`.groups = "drop"`

argument tells`dplyr`

to drop the grouping information from the resulting data frame (since we don't need it for further analysis).`arrange(desc(mean_amount))`

: This sorts the resulting data frame by the`mean_amount`

column in descending order (from highest to lowest).`slice(1:5)`

: This selects the first 5 rows of the resulting data frame. Since the data was grouped by`store_cd`

and summarized using the`mean`

function, this will give us the stores with the highest mean`amount`

values across all transactions.

Overall, this code is selecting the stores with the highest mean`amount`

values across all transactions in the`df_receipt`

data frame, and returning the top 5 stores based on those values.

R-028: Calculate the median sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display the TOP 5 in descending order.

`df_receipt %>% `

group_by(store_cd) %>%

summarise(median_amount = median(amount), .groups = "drop") %>% arrange(desc(median_amount)) %>%

slice(1:5)

store_cd | median_amount |
---|---|

<chr> | <dbl> |

S13052 | 190 |

S14010 | 188 |

S14050 | 185 |

S13003 | 180 |

S13018 | 180 |

Commentary :

This code is using the`%>%`

(pipe) operator to chain together a sequence of operations on a data frame called`df_receipt`

.

Here's a step-by-step explanation of what each operation does:`group_by(store_cd)`

: This groups the data by the`store_cd`

column, so that subsequent operations are performed separately for each store.`summarise(median_amount = median(amount), .groups = "drop")`

: This creates a summary of the data within each group, calculating the median value of the`amount`

column and storing it in a new column called`median_amount`

. The`.groups = "drop"`

argument tells`dplyr`

to drop the grouping information from the resulting data frame (since we don't need it for further analysis).`arrange(desc(median_amount))`

: This sorts the resulting data frame by the`median_amount`

column in descending order (from highest to lowest).`slice(1:5)`

: This selects the first 5 rows of the resulting data frame. Since the data was grouped by`store_cd`

and summarized using the`median`

function, this will give us the stores with the highest median`amount`

values across all transactions.

Overall, this code is selecting the stores with the highest median`amount`

values across all transactions in the`df_receipt`

data frame, and returning the top 5 stores based on those values. The difference between this code and the previous one I explained is that this code is using the`median`

function instead of the`mean`

function to calculate the central tendency of the`amount`

values for each store.

R-029: For receipt details data (df_receipt), calculate the mode value of the product code (product_cd) for each shop code (store_cd) and display 10 cases.

`# Code example 1 `

df_receipt %>%

group_by(store_cd, product_cd) %>%

summarise(cnt = n(), .groups = "drop_last") %>%

filter(cnt == cnt %>% max()) %>%

ungroup() %>%

slice(1:10)

store_cd | product_cd | cnt |
---|---|---|

<chr> | <chr> | <int> |

S12007 | P060303001 | 72 |

S12013 | P060303001 | 107 |

S12014 | P060303001 | 65 |

S12029 | P060303001 | 92 |

S12030 | P060303001 | 115 |

S13001 | P060303001 | 67 |

S13002 | P060303001 | 78 |

S13003 | P071401001 | 65 |

S13004 | P060303001 | 88 |

S13005 | P040503001 | 36 |

Commentary :

This code is using the`dplyr`

package to perform data manipulation on a data frame called`df_receipt`

. Here is a step-by-step explanation of what the code is doing:`%>%`

is a pipe operator used to chain functions together, so the output of one function becomes the input of the next function.

The`group_by`

function groups the data frame by`store_cd`

and`product_cd`

.

The`summarise`

function calculates a summary statistic for each group. In this case, it is counting the number of rows in each group and creating a new column called`cnt`

.

The`.groups = "drop_last"`

argument is used to drop the last grouping level created by the`summarise`

function, which is the`product_cd`

grouping. This is done to make it easier to filter on the maximum count later on.

The`filter`

function selects rows that meet a certain condition. In this case, it is selecting rows where the`cnt`

column is equal to the maximum value of the`cnt`

column in the data frame.

The`ungroup`

function removes all grouping levels from the data frame.

The`slice`

function selects a subset of rows based on their position. In this case, it is selecting the first 10 rows of the data frame.

So overall, this code is grouping the data by store and product, calculating the count of each group, filtering for the groups with the highest count, and then selecting the top 10 results.

`# Code example 2: using which.max() (multiple occurrences of mode number, narrowed down to one of them) `

# Display of the number of cases is omitted

table_product <-table(df_receipt$store_cd,df_receipt$product_cd)

store <- names(table_product[,1])

mode_product <- c()

for (i in 1:length(store)){

mode_product[i] <- names(which.max(table_product[i,]))

}

data.frame(store_cd = store, product_cd = mode_product) %>%

slice(1:10)

store_cd | product_cd |
---|---|

<chr> | <chr> |

S12007 | P060303001 |

S12013 | P060303001 |

S12014 | P060303001 |

S12029 | P060303001 |

S12030 | P060303001 |

S13001 | P060303001 |

S13002 | P060303001 |

S13003 | P071401001 |

S13004 | P060303001 |

S13005 | P040503001 |

Commentary :

This code is using the`table`

function and`for`

loop to find the mode (most frequent value) of each product code for each store in the`df_receipt`

data frame.

Here's a step-by-step explanation of what the code is doing:

The`table`

function creates a contingency table of counts for each combination of`store_cd`

and`product_cd`

in`df_receipt`

. The resulting object,`table_product`

, is a matrix where the rows correspond to each store and the columns correspond to each product code.

The`names`

function extracts the store codes from the first column of the`table_product`

matrix and saves them as a character vector called`store`

.

An empty vector called`mode_product`

is created to store the mode of each product code for each store.

The`for`

loop iterates through each store in`store`

and finds the mode of each product code for that store. The`which.max`

function is used to find the index of the maximum count in each row of the`table_product`

matrix, and the`names`

function is used to extract the product code with that index. The resulting product code is stored in the`mode_product`

vector.

A data frame is created with two columns:`store_cd`

and`product_cd`

. The`store_cd`

column is populated with the values in the`store`

vector, and the`product_cd`

column is populated with the values in the`mode_product`

vector.

The`%>%`

operator is used to pipe the data frame to the`slice`

function, which selects the first 10 rows of the data frame.

So overall, this code is finding the mode of each product code for each store in`df_receipt`

, creating a data frame with these values, and then selecting the top 10 results.

R-030: Calculate the variance of the sales amount (amount) for each shop code (store_cd) against the receipt details data (df_receipt) and display five cases in descending order.

`var_sample <- function(x){ var(x) * (length(x) - 1) / length(x) } `

df_receipt %>%

group_by(store_cd) %>%

summarise(var_amount = var_sample(amount), .groups = "drop") %>%

arrange(desc(var_amount)) %>% slice(1:5)

store_cd | var_amount |
---|---|

<chr> | <dbl> |

S13052 | 440088.7 |

S14011 | 306314.6 |

S14034 | 296920.1 |

S13001 | 295432.0 |

S13015 | 295294.4 |

Commentary :

This code is using the`dplyr`

package to calculate the sample variance of the`amount`

variable for each store in the`df_receipt`

data frame, and then selecting the top 5 stores with the highest variance.

Here's a step-by-step explanation of what the code is doing:

The`var_sample`

function is defined, which calculates the sample variance of a vector of values`x`

using the formula`var(x) * (length(x) - 1) / length(x)`

. This formula adjusts the variance by dividing by`length(x) - 1`

instead of`length(x)`

to provide an unbiased estimate of the population variance.

The`%>%`

operator is used to pipe the`df_receipt`

data frame to the next function in the chain.

The`group_by`

function is used to group the data by`store_cd`

.

The`summarise`

function calculates a summary statistic for each group. In this case, it is using the`var_sample`

function to calculate the sample variance of the`amount`

variable for each group, and creating a new column called`var_amount`

.

The`.groups = "drop"`

argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.

The`arrange`

function is used to sort the data frame by`var_amount`

in descending order, so that the stores with the highest variance appear first.

The`%>%`

operator is used to pipe the sorted data frame to the`slice`

function, which selects the first 5 rows of the data frame.

So overall, this code is grouping the data by store, calculating the sample variance of the`amount`

variable for each group, sorting the resulting data frame by variance in descending order, and then selecting the top 5 results.

R-031: Calculate the standard deviation of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display 5 cases in descending order.

`var_sample <- function(x){ var(x)*(length(x)-1)/length(x) } `

std_sample <- function(x){ sqrt(var_sample(x)) }

head(

df_receipt %>%

group_by(store_cd) %>%

summarise(std_amount = std_sample(amount), .groups = "drop") %>%

arrange(desc(std_amount)) ,

n = 5

)

store_cd | std_amount |
---|---|

<chr> | <dbl> |

S13052 | 663.3918 |

S14011 | 553.4569 |

S14034 | 544.9037 |

S13001 | 543.5366 |

S13015 | 543.4099 |

Commentary :

This code is using the`dplyr`

package to calculate the sample standard deviation of the`amount`

variable for each store in the`df_receipt`

data frame, and then selecting the top 5 stores with the highest standard deviation.

Here's a step-by-step explanation of what the code is doing:

The`var_sample`

function is defined, which calculates the sample variance of a vector of values`x`

using the formula`var(x) * (length(x) - 1) / length(x)`

. This formula adjusts the variance by dividing by`length(x) - 1`

instead of`length(x)`

to provide an unbiased estimate of the population variance.

The`std_sample`

function is defined, which calculates the sample standard deviation of a vector of values`x`

using the formula`sqrt(var_sample(x))`

.

The`head`

function is used to select the first 5 rows of the resulting data frame.

The`%>%`

operator is used to pipe the`df_receipt`

data frame to the next function in the chain.

The`group_by`

function is used to group the data by`store_cd`

.

The`summarise`

function calculates a summary statistic for each group. In this case, it is using the`std_sample`

function to calculate the sample standard deviation of the`amount`

variable for each group, and creating a new column called`std_amount`

.

The`.groups = "drop"`

argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.

The`arrange`

function is used to sort the data frame by`std_amount`

in descending order, so that the stores with the highest standard deviation appear first.

The`n = 5`

argument is used to limit the output to the top 5 rows.

So overall, this code is grouping the data by store, calculating the sample standard deviation of the`amount`

variable for each group, sorting the resulting data frame by standard deviation in descending order, and then selecting the top 5 results.

R-032: Find the percentile values for the sales amount (amount) in the receipt details data (df_receipt) in 25% increments.

`df_receipt %>% `

summarise(amount_25per = quantile(amount, 0.25),

amount_50per = quantile(amount, 0.5),

amount_75per = quantile(amount, 0.75),

amount_100per = quantile(amount, 1.0))

amount_25per | amount_50per | amount_75per | amount_100per |
---|---|---|---|

<dbl> | <dbl> | <dbl> | <dbl> |

102 | 170 | 288 | 10925 |

Commentary :

This code is using the`dplyr`

package to calculate the quartiles of the`amount`

variable in the`df_receipt`

data frame.

Here's a step-by-step explanation of what the code is doing:

The`%>%`

operator is used to pipe the`df_receipt`

data frame to the next function in the chain.

The`summarise`

function is used to calculate summary statistics for the data frame. In this case, it is using the`quantile`

function to calculate the 25th, 50th (median), 75th, and 100th percentiles of the`amount`

variable.

The resulting data frame will have 1 row and 4 columns, with the column names`amount_25per`

,`amount_50per`

,`amount_75per`

, and`amount_100per`

, and the corresponding values of the quartiles.

So overall, this code is calculating the quartiles of the`amount`

variable in the`df_receipt`

data frame, and returning a summary data frame with the quartile values.

R-033: Calculate the average of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and extract those that are 330 or more.

`df_receipt %>% `

group_by(store_cd) %>%

summarise(mean_amount = mean(amount), .groups = "drop") %>%

filter(mean_amount >= 330)

store_cd | mean_amount |
---|---|

<chr> | <dbl> |

S12013 | 330.1941 |

S13001 | 348.4704 |

S13003 | 350.9155 |

S13004 | 330.9439 |

S13015 | 351.1120 |

S13019 | 330.2086 |

S13020 | 337.8799 |

S13052 | 402.8675 |

S14010 | 348.7913 |

S14011 | 335.7183 |

S14026 | 332.3406 |

S14045 | 330.0821 |

S14047 | 330.0771 |

Commentary :

This code is using the`dplyr`

package to group the`df_receipt`

data frame by`store_cd`

, calculate the mean`amount`

for each group, and then filter the resulting data frame to include only those groups where the mean`amount`

is greater than or equal to 330.

Here's a step-by-step explanation of what the code is doing:

The`%>%`

operator is used to pipe the`df_receipt`

data frame to the next function in the chain.

The`group_by`

function is used to group the data by`store_cd`

.

The`summarise`

function is used to calculate a summary statistic for each group. In this case, it is using the`mean`

function to calculate the mean`amount`

for each group, and creating a new column called`mean_amount`

.

The`.groups = "drop"`

argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.

The`filter`

function is used to subset the data frame to only include rows where the`mean_amount`

is greater than or equal to 330.

So overall, this code is grouping the data by`store_cd`

, calculating the mean`amount`

for each group, and then selecting only those groups where the mean`amount`

is greater than or equal to 330. The resulting data frame will include one row for each qualifying store, with the store code and mean amount.

R-034: For the receipt details data (df_receipt), sum the amount of sales (amount) for each customer ID (customer_id) and find the average for all customers. However, exclude customer IDs starting with “Z” as they represent non-members.

`df_mean <- df_receipt %>% `

filter(!grepl("^Z", customer_id)) %>%

group_by(customer_id) %>%

summarise(sum_amount = sum(amount), .groups = "drop") %>%

summarise(mean_amount = mean(sum_amount))

df_mean$mean_amount

`2547.74223452926`

Commentary :

This code is using the`dplyr`

package to calculate the mean amount spent by each customer, after excluding any customer whose`customer_id`

starts with the letter "Z".

Here's a step-by-step explanation of what the code is doing:

The`%>%`

operator is used to pipe the`df_receipt`

data frame to the next function in the chain.

The`filter`

function is used to exclude any rows where the`customer_id`

starts with the letter "Z".

The`group_by`

function is used to group the data by`customer_id`

.

The`summarise`

function is used to calculate a summary statistic for each group. In this case, it is using the`sum`

function to calculate the total`amount`

spent by each customer, and creating a new column called`sum_amount`

.

The`.groups = "drop"`

argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.

Another`summarise`

function is used to calculate the mean`sum_amount`

across all customers.

The resulting`mean_amount`

value is extracted using`$mean_amount`

.

So overall, this code is filtering the data to exclude any rows where the`customer_id`

starts with "Z", grouping the data by`customer_id`

, calculating the total`amount`

spent by each customer, calculating the mean`sum_amount`

across all customers, and returning the resulting mean value. The resulting output is a single numeric value representing the mean amount spent by each customer.

R-035: For the receipt details data (df_receipt), sum the sales amount (amount) for each customer ID (customer_id) to obtain the average of all customers, extract the customers who spend more than the average and display 10 items. However, exclude customer IDs starting with “Z” as they represent non-members.

`df_sum <- df_receipt %>% `

filter(!grepl("^Z", customer_id)) %>%

group_by(customer_id) %>%

summarise(sum_amount = sum(amount), .groups = "drop")

df_mean <- df_sum %>%

summarise(mean_amount = mean(sum_amount))

df_sum %>%

filter(sum_amount >= df_mean$mean_amount) %>%

slice(1:10)

customer_id | sum_amount |
---|---|

<chr> | <int> |

CS001115000010 | 3044 |

CS001205000006 | 3337 |

CS001214000009 | 4685 |

CS001214000017 | 4132 |

CS001214000052 | 5639 |

CS001215000040 | 3496 |

CS001304000006 | 3726 |

CS001305000005 | 3485 |

CS001305000011 | 4370 |

CS001315000180 | 3300 |

Commentary :

This code is using the`dplyr`

package to filter the customers who have spent an amount greater than or equal to the mean amount spent by all customers, after excluding any customer whose`customer_id`

starts with the letter "Z". It then selects the top 10 customers based on their total`amount`

spent.

Here's a step-by-step explanation of what the code is doing:

The`%>%`

operator is used to pipe the`df_receipt`

data frame to the next function in the chain.

The`filter`

function is used to exclude any rows where the`customer_id`

starts with the letter "Z".

The`group_by`

function is used to group the data by`customer_id`

.

The`summarise`

function is used to calculate a summary statistic for each group. In this case, it is using the`sum`

function to calculate the total`amount`

spent by each customer, and creating a new column called`sum_amount`

.

The`.groups = "drop"`

argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.

The`df_sum`

data frame is created by storing the output of the above operations.

The`summarise`

function is used again to calculate the mean`sum_amount`

across all customers in`df_sum`

.

The resulting`mean_amount`

value is extracted using`$mean_amount`

.

The`%>%`

operator is used to pipe`df_sum`

to the next function in the chain.

The`filter`

function is used to select only those rows where the`sum_amount`

is greater than or equal to the`mean_amount`

calculated in step 7.

The`slice`

function is used to select only the top 10 rows of the resulting data frame, based on their order in the original data frame.

So overall, this code is filtering the data to exclude any rows where the`customer_id`

starts with "Z", grouping the data by`customer_id`

, calculating the total`amount`

spent by each customer, calculating the mean`sum_amount`

across all customers, creating a data frame of customers whose total`amount`

spent is greater than or equal to the mean amount spent by all customers, and selecting the top 10 customers based on their total`amount`

spent.

R-036: Combine receipt details data (df_receipt) and shop data (df_store) internally and display all items of receipt details data and store name (store_name) of shop data for 10 items.

`inner_join(df_receipt, df_store[c("store_cd", "store_name")], by = "store_cd") %>% slice(1:10)`

sales_ymd | sales_epoch | store_cd | receipt_no | receipt_sub_no | customer_id | product_cd | quantity | amount | store_name |
---|---|---|---|---|---|---|---|---|---|

<int> | <int> | <chr> | <int> | <int> | <chr> | <chr> | <int> | <int> | <chr> |

20181103 | 1541203200 | S14006 | 112 | 1 | CS006214000001 | P070305012 | 1 | 158 | 葛が谷店 |

20181118 | 1542499200 | S13008 | 1132 | 2 | CS008415000097 | P070701017 | 1 | 81 | 成城店 |

20170712 | 1499817600 | S14028 | 1102 | 1 | CS028414000014 | P060101005 | 1 | 170 | 二ツ橋店 |

20190205 | 1549324800 | S14042 | 1132 | 1 | ZZ000000000000 | P050301001 | 1 | 25 | 新山下店 |

20180821 | 1534809600 | S14025 | 1102 | 2 | CS025415000050 | P060102007 | 1 | 90 | 大和店 |

20190605 | 1559692800 | S13003 | 1112 | 1 | CS003515000195 | P050102002 | 1 | 138 | 狛江店 |

20181205 | 1543968000 | S14024 | 1102 | 2 | CS024514000042 | P080101005 | 1 | 30 | 三田店 |

20190922 | 1569110400 | S14040 | 1102 | 1 | CS040415000178 | P070501004 | 1 | 128 | 長津田店 |

20170504 | 1493856000 | S13020 | 1112 | 2 | ZZ000000000000 | P071302010 | 1 | 770 | 十条仲原店 |

20191010 | 1570665600 | S14027 | 1102 | 1 | CS027514000015 | P071101003 | 1 | 680 | 南藤沢店 |

Commentary :

This code performs an inner join operation between two data frames,`df_receipt`

and`df_store`

, based on a common column "store_cd". The result of this operation is then piped into the`slice()`

function to select the first 10 rows of the resulting data frame.

Let's break down the code step by step:`inner_join(df_receipt, df_store[c("store_cd", "store_name")], by = "store_cd")`

: This performs an inner join between the`df_receipt`

data frame and a subset of`df_store`

that only contains the columns "store_cd" and "store_name". The join is done on the "store_cd" column which is present in both data frames. The result is a new data frame that contains all the rows where there is a match between the "store_cd" column in both data frames.`%>%`

: This is the pipe operator which takes the output of the previous command and pipes it as input to the next command.`slice(1:10)`

: This function selects the first 10 rows of the resulting data frame from the previous command.

So, the overall purpose of this code is to select the first 10 rows of the result of an inner join operation between`df_receipt`

and a subset of`df_store`

. The resulting data frame will only contain rows where there is a match between the "store_cd" column in both data frames, and it will have the columns "store_cd" and "store_name" from`df_store`

, along with all the columns from`df_receipt`

.

R-037: Join product data (df_product) and category data (df_category) internally and display all items of the product data and 10 category sub-category names (category_small_name) of the category data.

`inner_join(df_product, df_category[c("category_small_cd", "category_small_name")], by = "category_small_cd") %>% slice(1:10)`

product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_small_name |
---|---|---|---|---|---|---|

<chr> | <chr> | <chr> | <chr> | <int> | <int> | <chr> |

P040101001 | 04 | 0401 | 040101 | 198 | 149 | 弁当類 |

P040101002 | 04 | 0401 | 040101 | 218 | 164 | 弁当類 |

P040101003 | 04 | 0401 | 040101 | 230 | 173 | 弁当類 |

P040101004 | 04 | 0401 | 040101 | 248 | 186 | 弁当類 |

P040101005 | 04 | 0401 | 040101 | 268 | 201 | 弁当類 |

P040101006 | 04 | 0401 | 040101 | 298 | 224 | 弁当類 |

P040101007 | 04 | 0401 | 040101 | 338 | 254 | 弁当類 |

P040101008 | 04 | 0401 | 040101 | 420 | 315 | 弁当類 |

P040101009 | 04 | 0401 | 040101 | 498 | 374 | 弁当類 |

P040101010 | 04 | 0401 | 040101 | 580 | 435 | 弁当類 |

Commentary :

This code performs an inner join operation between two data frames,`df_product`

and`df_category`

, based on a common column "category_small_cd". The result of this operation is then piped into the`slice()`

function to select the first 10 rows of the resulting data frame.

Let's break down the code step by step:`inner_join(df_product, df_category[c("category_small_cd", "category_small_name")], by = "category_small_cd")`

: This performs an inner join between the`df_product`

data frame and a subset of`df_category`

that only contains the columns "category_small_cd" and "category_small_name". The join is done on the "category_small_cd" column which is present in both data frames. The result is a new data frame that contains all the rows where there is a match between the "category_small_cd" column in both data frames.`%>%`

: This is the pipe operator which takes the output of the previous command and pipes it as input to the next command.`slice(1:10)`

: This function selects the first 10 rows of the resulting data frame from the previous command.

So, the overall purpose of this code is to select the first 10 rows of the result of an inner join operation between`df_product`

and a subset of`df_category`

. The resulting data frame will only contain rows where there is a match between the "category_small_cd" column in both data frames, and it will have the columns "category_small_cd" and "category_small_name" from`df_category`

, along with all the columns from`df_product`

.

R-038: Calculate the total sales amount for each customer from the customer data (df_customer) and receipt details data (df_receipt) and display 10 items. However, for customers with no sales records, the sales amount should be displayed as 0. Customers whose gender code (gender_cd) is female (1) should be included, and non-members (whose customer ID starts with “Z”) should be excluded.

`df_sum <- df_receipt %>% `

group_by(customer_id) %>%

summarise(sum_amount = sum(amount), .groups = "drop")

df_target <- df_customer %>%

filter(gender_cd == "1" & !grepl("^Z", customer_id))

left_join(df_target["customer_id"], df_sum, by = "customer_id") %>%

replace_na(list(sum_amount = 0)) %>% slice(1:10)

customer_id | sum_amount |
---|---|

<chr> | <dbl> |

CS021313000114 | 0 |

CS031415000172 | 5088 |

CS028811000001 | 0 |

CS001215000145 | 875 |

CS015414000103 | 3122 |

CS033513000180 | 868 |

CS035614000014 | 0 |

CS011215000048 | 3444 |

CS009413000079 | 0 |

CS040412000191 | 210 |

Commentary :

This code performs several operations to obtain a summary table for the customer data. The result of this operation is then piped into the`replace_na()`

function to replace any missing values with a default value of 0. Finally, the result is piped into the`slice()`

function to select the first 10 rows of the resulting data frame.

Let's break down the code step by step:`df_sum <- df_receipt %>% group_by(customer_id) %>% summarise(sum_amount = sum(amount), .groups = "drop")`

: This code first groups the`df_receipt`

data frame by the "customer_id" column, and then summarizes the "amount" column using the`sum()`

function. The resulting data frame,`df_sum`

, contains the total amount spent by each customer. The`.groups`

parameter is set to "drop" to remove the grouping information from the output.`df_target <- df_customer %>% filter(gender_cd == "1" & !grepl("^Z", customer_id))`

: This code filters the`df_customer`

data frame to only include rows where the "gender_cd" column equals "1" and the "customer_id" column does not start with the letter "Z". The resulting data frame,`df_target`

, contains only male customers whose IDs do not start with "Z".`left_join(df_target["customer_id"], df_sum, by = "customer_id")`

: This performs a left join operation between`df_target["customer_id"]`

and`df_sum`

based on the "customer_id" column. The result is a new data frame that contains all the rows from`df_target`

along with the corresponding total amount spent by each customer from`df_sum`

.`replace_na(list(sum_amount = 0))`

: This replaces any missing values in the "sum_amount" column with a default value of 0.`%>% slice(1:10)`

: This function selects the first 10 rows of the resulting data frame from the previous command.

So, the overall purpose of this code is to obtain a summary table of the total amount spent by male customers whose IDs do not start with the letter "Z". The resulting data frame will have two columns, "customer_id" and "sum_amount", where "sum_amount" is the total amount spent by each customer. Any missing values in the "sum_amount" column will be replaced with 0, and the resulting data frame will be limited to the first 10 rows.

R-039: From the receipt details data (df_receipt), create a data set containing the top 20 customers with the highest number of days of sales and a data set containing the top 20 customers with the highest total sales value, respectively, and furthermore, merge these two completely externally. However, exclude non-members (whose customer ID starts with “Z”).

`# Code example 1 `

df_data <- df_receipt %>%

filter(!grepl("^Z", customer_id)) %>%

group_by(customer_id)

df_cnt <- df_data %>%

summarise(come_days = n_distinct(sales_ymd), .groups = "drop") %>%

arrange(desc(come_days), customer_id) %>%

slice(1:20)

df_sum <- df_data %>% summarise(sum_amount = sum(amount), .groups = "drop") %>%

arrange(desc(sum_amount)) %>%

slice(1:20)

full_join(df_cnt, df_sum, by = "customer_id")

customer_id | come_days | sum_amount |
---|---|---|

<chr> | <int> | <int> |

CS040214000008 | 23 | NA |

CS010214000010 | 22 | 18585 |

CS015415000185 | 22 | 20153 |

CS010214000002 | 21 | NA |

CS028415000007 | 21 | 19127 |

CS016415000141 | 20 | 18372 |

CS017415000097 | 20 | 23086 |

CS014214000023 | 19 | NA |

CS021514000045 | 19 | NA |

CS021515000172 | 19 | NA |

CS022515000226 | 19 | NA |

CS031414000051 | 19 | 19202 |

CS039414000052 | 19 | NA |

CS007515000107 | 18 | NA |

CS014415000077 | 18 | NA |

CS021515000056 | 18 | NA |

CS021515000211 | 18 | NA |

CS022515000028 | 18 | NA |

CS030214000008 | 18 | NA |

CS031414000073 | 18 | NA |

CS001605000009 | NA | 18925 |

CS006515000023 | NA | 18372 |

CS011414000106 | NA | 18338 |

CS038415000104 | NA | 17847 |

CS035414000024 | NA | 17615 |

CS021515000089 | NA | 17580 |

CS032414000072 | NA | 16563 |

CS016415000101 | NA | 16348 |

CS011415000006 | NA | 16094 |

CS034415000047 | NA | 16083 |

CS007514000094 | NA | 15735 |

CS009414000059 | NA | 15492 |

CS030415000034 | NA | 15468 |

CS015515000034 | NA | 15300 |

Commentary :

This code performs several operations to obtain two summary tables for the customer data, and then performs a full join operation between these two tables based on the "customer_id" column. The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.

Let's break down the code step by step:`df_data <- df_receipt %>% filter(!grepl("^Z", customer_id)) %>% group_by(customer_id)`

: This code first filters the`df_receipt`

data frame to exclude any rows where the "customer_id" column starts with the letter "Z". The resulting data frame is then grouped by the "customer_id" column. The resulting`df_data`

data frame contains all the rows from`df_receipt`

with "Z" customer IDs removed, and is grouped by "customer_id".`df_cnt <- df_data %>% summarise(come_days = n_distinct(sales_ymd), .groups = "drop") %>% arrange(desc(come_days), customer_id) %>% slice(1:20)`

: This code first summarizes the`df_data`

data frame by counting the number of distinct "sales_ymd" values for each customer using the`n_distinct()`

function. The resulting data frame,`df_cnt`

, contains the total number of unique sales days for each customer. The data frame is then arranged in descending order of "come_days" followed by "customer_id". The first 20 rows of the resulting data frame are then selected using the`slice()`

function.`df_sum <- df_data %>% summarise(sum_amount = sum(amount), .groups = "drop") %>% arrange(desc(sum_amount)) %>% slice(1:20)`

: This code summarizes the`df_data`

data frame by calculating the sum of the "amount" column for each customer using the`sum()`

function. The resulting data frame,`df_sum`

, contains the total amount spent by each customer. The data frame is then arranged in descending order of "sum_amount", and the first 20 rows of the resulting data frame are selected using the`slice()`

function.`full_join(df_cnt, df_sum, by = "customer_id")`

: This code performs a full join operation between`df_cnt`

and`df_sum`

based on the "customer_id" column. The result is a new data frame that contains all the rows from both data frames, matched by "customer_id". The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.

So, the overall purpose of this code is to obtain two summary tables for the customer data, and then perform a full join operation between these two tables. The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.

`# Code example 2`

# Code example using slice_max.

# Rewrite with_ties = FALSE to with_ties = TRUE to change the form to include the same rank

df_data <- df_receipt %>%

filter(!grepl("^Z", customer_id)) %>%

group_by(customer_id)

df_cnt <- df_data %>%

summarise(come_days = n_distinct(sales_ymd)) %>%

slice_max(come_days, n = 20, with_ties = FALSE)

df_sum <- df_data %>%

summarise(sum_amount = sum(amount)) %>%

slice_max(sum_amount, n = 20, with_ties = FALSE)

full_join(df_cnt, df_sum, by = "customer_id")

customer_id | come_days | sum_amount |
---|---|---|

<chr> | <int> | <int> |

CS040214000008 | 23 | NA |

CS010214000010 | 22 | 18585 |

CS015415000185 | 22 | 20153 |

CS010214000002 | 21 | NA |

CS028415000007 | 21 | 19127 |

CS016415000141 | 20 | 18372 |

CS017415000097 | 20 | 23086 |

CS014214000023 | 19 | NA |

CS021514000045 | 19 | NA |

CS021515000172 | 19 | NA |

CS022515000226 | 19 | NA |

CS031414000051 | 19 | 19202 |

CS039414000052 | 19 | NA |

CS007515000107 | 18 | NA |

CS014415000077 | 18 | NA |

CS021515000056 | 18 | NA |

CS021515000211 | 18 | NA |

CS022515000028 | 18 | NA |

CS030214000008 | 18 | NA |

CS031414000073 | 18 | NA |

CS001605000009 | NA | 18925 |

CS006515000023 | NA | 18372 |

CS011414000106 | NA | 18338 |

CS038415000104 | NA | 17847 |

CS035414000024 | NA | 17615 |

CS021515000089 | NA | 17580 |

CS032414000072 | NA | 16563 |

CS016415000101 | NA | 16348 |

CS011415000006 | NA | 16094 |

CS034415000047 | NA | 16083 |

CS007514000094 | NA | 15735 |

CS009414000059 | NA | 15492 |

CS030415000034 | NA | 15468 |

CS015515000034 | NA | 15300 |

Commentary :

This code is similar to the previous code we discussed, but it uses the`slice_max()`

function instead of the`arrange()`

function to obtain the top 20 customers based on the number of distinct sales days and the total amount spent.

Here is a breakdown of the code:`df_data <- df_receipt %>% filter(!grepl("^Z", customer_id)) %>% group_by(customer_id)`

: This code is similar to the previous code, where it filters the`df_receipt`

data frame to exclude any rows where the "customer_id" column starts with the letter "Z". The resulting data frame is then grouped by the "customer_id" column.`df_cnt <- df_data %>% summarise(come_days = n_distinct(sales_ymd)) %>% slice_max(come_days, n = 20, with_ties = FALSE)`

: This code summarizes the`df_data`

data frame by counting the number of distinct "sales_ymd" values for each customer using the`n_distinct()`

function. The resulting data frame,`df_cnt`

, contains the total number of unique sales days for each customer. The`slice_max()`

function is used to select the top 20 rows based on the "come_days" column, where`with_ties = FALSE`

specifies that only the top 20 distinct values should be selected.`df_sum <- df_data %>% summarise(sum_amount = sum(amount)) %>% slice_max(sum_amount, n = 20, with_ties = FALSE)`

: This code summarizes the`df_data`

data frame by calculating the sum of the "amount" column for each customer using the`sum()`

function. The resulting data frame,`df_sum`

, contains the total amount spent by each customer. The`slice_max()`

function is used to select the top 20 rows based on the "sum_amount" column, where`with_ties = FALSE`

specifies that only the top 20 distinct values should be selected.`full_join(df_cnt, df_sum, by = "customer_id")`

: This code performs a full join operation between`df_cnt`

and`df_sum`

based on the "customer_id" column. The result is a new data frame that contains all the rows from both data frames, matched by "customer_id". The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.

So, the overall purpose of this code is to obtain two summary tables for the customer data, and then perform a full join operation between these two tables. The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent, obtained using the`slice_max()`

function instead of the`arrange()`

function.

R-040: You want to create data combining all shops and all products. Direct product the shop data (df_store) and the product data (df_product) and calculate the number of cases.

`df_store_tmp <- df_store `

df_product_tmp <- df_product

df_store_tmp["key"] <- 0

df_product_tmp["key"] <- 0

nrow(full_join(df_store_tmp, df_product_tmp, by = "key"))

`531590`

Commentary :

This code performs a full join operation between the`df_store`

and`df_product`

data frames based on a common "key" column that is added to both data frames. The resulting data frame is then used to calculate the number of rows using the`nrow()`

function.

Here is a breakdown of the code:`df_store_tmp <- df_store`

: This code creates a new data frame called`df_store_tmp`

that is a copy of the`df_store`

data frame. This is done to avoid modifying the original data frame.`df_product_tmp <- df_product`

: This code creates a new data frame called`df_product_tmp`

that is a copy of the`df_product`

data frame. This is done to avoid modifying the original data frame.`df_store_tmp["key"] <- 0`

: This code adds a new column called "key" to the`df_store_tmp`

data frame and initializes all the values to zero. This column is added to provide a common column for the join operation.`df_product_tmp["key"] <- 0`

: This code adds a new column called "key" to the`df_product_tmp`

data frame and initializes all the values to zero. This column is added to provide a common column for the join operation.`full_join(df_store_tmp, df_product_tmp, by = "key")`

: This code performs a full join operation between the`df_store_tmp`

and`df_product_tmp`

data frames based on the common "key" column. Since both data frames have the same value in this column for all rows, a full join will result in a Cartesian product of the two data frames, meaning that every row from`df_store_tmp`

will be paired with every row from`df_product_tmp`

.`nrow(full_join(df_store_tmp, df_product_tmp, by = "key"))`

: This code calculates the number of rows in the resulting data frame from the full join operation using the`nrow()`

function. The result is the total number of combinations of rows between`df_store`

and`df_product`

.

In summary, this code creates copies of the`df_store`

and`df_product`

data frames, adds a common "key" column to both data frames with all values set to zero, performs a full join operation between the two data frames based on the common "key" column, and calculates the number of resulting rows. The purpose of this code is to determine the total number of combinations of rows between the`df_store`

and`df_product`

data frames.

## Comment