参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック(構造化データ加工編)」
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
-
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers.
-
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
You can get preprocess_knock files to practice python, SQL, R from my github account
Introduction
- To begin with, run the following cells
- Import the required libraries and read the data from the database (PostgreSQL)
- The libraries that are expected to be used are imported in the following cells
- If there are other libraries you wish to use, install them as appropriate using install.packages().
- Names, addresses, etc. are dummy data and are not real.
require("RPostgreSQL")
require("tidyr")
require("dplyr")
require("stringr")
require("caret")
require("lubridate")
require("rsample")
require("recipes")
require("themis")
host <- "db"
port <- Sys.getenv()["PG_PORT"]
dbname <- Sys.getenv()["PG_DATABASE"]
user <- Sys.getenv()["PG_USER"]
password <- Sys.getenv()["PG_PASSWORD"]
con <- dbConnect(PostgreSQL(), host=host, port=port, dbname=dbname, user=user, password=password)
df_customer <- dbGetQuery(con,"SELECT * FROM customer")
df_category <- dbGetQuery(con,"SELECT * FROM category")
df_product <- dbGetQuery(con,"SELECT * FROM product")
df_receipt <- dbGetQuery(con,"SELECT * FROM receipt")
df_store <- dbGetQuery(con,"SELECT * FROM store")
df_geocode <- dbGetQuery(con,"SELECT * FROM geocode")
Loading required package: RPostgreSQL
Loading required package: DBI
Loading required package: tidyr
Loading required package: dplyr
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
Loading required package: stringr
Loading required package: caret
Loading required package: ggplot2
Loading required package: lattice
Loading required package: lubridate
Attaching package: ‘lubridate’
The following objects are masked from ‘package:base’:
date, intersect, setdiff, union
Loading required package: rsample
Loading required package: recipes
Attaching package: ‘recipes’
The following object is masked from ‘package:stringr’:
fixed
The following object is masked from ‘package:stats’:
step
Loading required package: themis
Registered S3 methods overwritten by 'themis':
method from
bake.step_downsample recipes
bake.step_upsample recipes
prep.step_downsample recipes
prep.step_upsample recipes
tidy.step_downsample recipes
tidy.step_upsample recipes
tunable.step_downsample recipes
tunable.step_upsample recipes
Attaching package: ‘themis’
The following objects are masked from ‘package:recipes’:
step_downsample, step_upsample
Practice Questions
R-021: Count the number of cases against the receipt details data (df_receipt).
nrow(df_receipt)
104681
Commentary :
The code "nrow(df_receipt)" is written in R or a similar statistical programming language. It is used to determine the number of rows in a data frame called "df_receipt".
Here's a breakdown of the code:
"nrow()" is a built-in R function that returns the number of rows in a data frame.
"df_receipt" is the name of the data frame whose number of rows we want to determine.
For example, if the data frame "df_receipt" contains 100 rows of data, then running this code would return the value 100. This code is useful for quickly checking the size of a data frame and ensuring that it has the expected number of rows before performing further analysis or manipulating the data in some way.
R-022: Count the number of unique cases against the customer ID (customer_id) of the receipt details data (df_receipt).
length(unique(df_receipt$customer_id))
8307
Commentary :
The code "length(unique(df_receipt$customer_id))" is written in R or a similar statistical programming language. It is used to determine the number of unique customers in a data frame called "df_receipt".
Here's a breakdown of the code:
"unique()" is an R function that returns a vector containing only the unique elements of a vector. In this case, the vector being passed to the "unique()" function is "df_receipt$customer_id", which is a column in the "df_receipt" data frame that contains customer IDs.
"length()" is another built-in R function that returns the number of elements in a vector. In this case, the vector being passed to "length()" is the output of the "unique()" function, which is a vector containing unique customer IDs.
Putting it all together, the code calculates the number of unique customers in the "df_receipt" data frame by taking the length of a vector of unique customer IDs. For example, if the data frame contains 1000 rows, but only 500 unique customer IDs, then running this code would return the value 500. This code is useful for understanding the number of unique customers in a data set and for calculating customer-level metrics such as customer lifetime value.
R-023: Sum the sales amount (amount) and sales quantity (quantity) for each shop code (store_cd) in the receipt details data (df_receipt).
df_receipt %>%
group_by(store_cd) %>%
summarise(amount = sum(amount), quantity = sum(quantity), .groups = "drop")
| store_cd | amount | quantity |
|---|---|---|
| <chr> | <int> | <int> |
| S12007 | 638761 | 2099 |
| S12013 | 787513 | 2425 |
| S12014 | 725167 | 2358 |
| S12029 | 794741 | 2555 |
| S12030 | 684402 | 2403 |
| S13001 | 811936 | 2347 |
| S13002 | 727821 | 2340 |
| S13003 | 764294 | 2197 |
| S13004 | 779373 | 2390 |
| S13005 | 629876 | 2004 |
| S13008 | 809288 | 2491 |
| S13009 | 808870 | 2486 |
| S13015 | 780873 | 2248 |
| S13016 | 793773 | 2432 |
| S13017 | 748221 | 2376 |
| S13018 | 790535 | 2562 |
| S13019 | 827833 | 2541 |
| S13020 | 796383 | 2383 |
| S13031 | 705968 | 2336 |
| S13032 | 790501 | 2491 |
| S13035 | 715869 | 2219 |
| S13037 | 693087 | 2344 |
| S13038 | 708884 | 2337 |
| S13039 | 611888 | 1981 |
| S13041 | 728266 | 2233 |
| S13043 | 587895 | 1881 |
| S13044 | 520764 | 1729 |
| S13051 | 107452 | 354 |
| S13052 | 100314 | 250 |
| S14006 | 712839 | 2284 |
| S14010 | 790361 | 2290 |
| S14011 | 805724 | 2434 |
| S14012 | 720600 | 2412 |
| S14021 | 699511 | 2231 |
| S14022 | 651328 | 2047 |
| S14023 | 727630 | 2258 |
| S14024 | 736323 | 2417 |
| S14025 | 755581 | 2394 |
| S14026 | 824537 | 2503 |
| S14027 | 714550 | 2303 |
| S14028 | 786145 | 2458 |
| S14033 | 725318 | 2282 |
| S14034 | 653681 | 2024 |
| S14036 | 203694 | 635 |
| S14040 | 701858 | 2233 |
| S14042 | 534689 | 1935 |
| S14045 | 458484 | 1398 |
| S14046 | 412646 | 1354 |
| S14047 | 338329 | 1041 |
| S14048 | 234276 | 769 |
| S14049 | 230808 | 788 |
| S14050 | 167090 | 580 |
Commentary :
The code "df_receipt %>% group_by(store_cd) %>% summarise(amount = sum(amount), quantity = sum(quantity), .groups = 'drop')" is written in R or a similar statistical programming language. It is used to group the "df_receipt" data frame by a variable called "store_cd" and then calculate the sum of the "amount" and "quantity" variables within each store group.
Here's a breakdown of the code:
The pipe operator ("%>%") is used to chain together multiple functions in a sequence. It takes the output of the previous function and uses it as the input for the next function. In this case, the pipe operator is used to chain the "group_by()" and "summarise()" functions together.
"group_by()" is an R function that is used to group a data frame by one or more variables. In this case, we are grouping the "df_receipt" data frame by the "store_cd" variable.
"summarise()" is an R function that is used to calculate summary statistics on a data frame. In this case, we are calculating the sum of the "amount" and "quantity" variables within each store group. We are also specifying the names of the new variables using the assignment operator "=".
The ".groups" argument in the "summarise()" function is used to specify how the output should be formatted. Setting it to "drop" will remove the grouping information from the output.
The result of this code will be a new data frame with one row for each store in the "df_receipt" data frame. The "amount" and "quantity" columns will contain the sums of those variables within each store group. This code is useful for calculating store-level metrics such as total sales or units sold.
R-024: Find the most recent sales date (sales_ymd) for each customer ID (customer_id) for the receipt details data (df_receipt) and display the 10 most recent sales dates (sales_ymd).
df_receipt %>%
group_by(customer_id) %>%
summarise(max_ymd = max(sales_ymd), .groups = "drop") %>%
slice(1:10)
| customer_id | max_ymd |
|---|---|
| <chr> | <int> |
| CS001113000004 | 20190308 |
| CS001114000005 | 20190731 |
| CS001115000010 | 20190405 |
| CS001205000004 | 20190625 |
| CS001205000006 | 20190224 |
| CS001211000025 | 20190322 |
| CS001212000027 | 20170127 |
| CS001212000031 | 20180906 |
| CS001212000046 | 20170811 |
| CS001212000070 | 20191018 |
Commentary :
The code "df_receipt %>% group_by(customer_id) %>% summarise(max_ymd = max(sales_ymd), .groups = 'drop') %>% slice(1:10)" is written in R or a similar statistical programming language. It is used to group the "df_receipt" data frame by a variable called "customer_id" and then find the latest date of purchase for each customer. Finally, it slices the first 10 rows of the resulting data frame.
Here's a breakdown of the code:
The pipe operator ("%>%") is used to chain together multiple functions in a sequence. It takes the output of the previous function and uses it as the input for the next function. In this case, the pipe operator is used to chain the "group_by()", "summarise()", and "slice()" functions together.
"group_by()" is an R function that is used to group a data frame by one or more variables. In this case, we are grouping the "df_receipt" data frame by the "customer_id" variable.
"summarise()" is an R function that is used to calculate summary statistics on a data frame. In this case, we are calculating the latest purchase date for each customer using the "max()" function on the "sales_ymd" variable.
The ".groups" argument in the "summarise()" function is used to specify how the output should be formatted. Setting it to "drop" will remove the grouping information from the output.
"slice()" is an R function that is used to extract a subset of rows from a data frame. In this case, we are selecting the first 10 rows of the resulting data frame.
The result of this code will be a new data frame with one row for each customer in the "df_receipt" data frame. The "max_ymd" column will contain the latest purchase date for each customer. Finally, the code selects the first 10 rows of the resulting data frame. This code is useful for understanding the most recent purchase date for each customer and can be used for various customer-level analysis and segmentation purposes.
R-025: For receipt details data (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display the 10 results.
df_receipt %>%
group_by(customer_id) %>%
summarise(min_ymd = min(sales_ymd), .groups = "drop") %>%
slice(1:10)
| customer_id | min_ymd |
|---|---|
| <chr> | <int> |
| CS001113000004 | 20190308 |
| CS001114000005 | 20180503 |
| CS001115000010 | 20171228 |
| CS001205000004 | 20170914 |
| CS001205000006 | 20180207 |
| CS001211000025 | 20190322 |
| CS001212000027 | 20170127 |
| CS001212000031 | 20180906 |
| CS001212000046 | 20170811 |
| CS001212000070 | 20191018 |
Commentary :
This code is using the%>%(pipe) operator to chain together a sequence of operations on a data frame calleddf_receipt.
Here's a step-by-step explanation of what each operation does:group_by(customer_id): This groups the data by thecustomer_idcolumn, so that subsequent operations are performed separately for each customer.summarise(min_ymd = min(sales_ymd), .groups = "drop"): This creates a summary of the data within each group, calculating the minimum value of thesales_ymdcolumn and storing it in a new column calledmin_ymd. The.groups = "drop"argument tellsdplyrto drop the grouping information from the resulting data frame (since we don't need it for further analysis).slice(1:10): This selects the first 10 rows of the resulting data frame. Since the data was grouped bycustomer_idand the summary function was used, this will give us the earliestsales_ymdvalue for the first 10 customers in the data set.
Overall, this code is selecting the earliestsales_ymdvalue for each customer in thedf_receiptdata frame, and returning the earliest dates for the first 10 customers in the data set.
R-026: For receipt details data (df_receipt), find the newest sales date (sales_ymd) and the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 cases where both are different.
df_receipt %>%
group_by(customer_id) %>% summarise(max_ymd = max(sales_ymd) ,min_ymd = min(sales_ymd), .groups = "drop") %>%
filter(max_ymd != min_ymd) %>%
slice(1:10)
| customer_id | max_ymd | min_ymd |
|---|---|---|
| <chr> | <int> | <int> |
| CS001114000005 | 20190731 | 20180503 |
| CS001115000010 | 20190405 | 20171228 |
| CS001205000004 | 20190625 | 20170914 |
| CS001205000006 | 20190224 | 20180207 |
| CS001214000009 | 20190902 | 20170306 |
| CS001214000017 | 20191006 | 20180828 |
| CS001214000048 | 20190929 | 20171109 |
| CS001214000052 | 20190617 | 20180208 |
| CS001215000005 | 20181021 | 20170206 |
| CS001215000040 | 20171022 | 20170214 |
Commentary :
This code is using the%>%(pipe) operator to chain together a sequence of operations on a data frame calleddf_receipt.
Here's a step-by-step explanation of what each operation does:group_by(customer_id): This groups the data by thecustomer_idcolumn, so that subsequent operations are performed separately for each customer.summarise(max_ymd = max(sales_ymd), min_ymd = min(sales_ymd), .groups = "drop"): This creates a summary of the data within each group, calculating the maximum and minimum values of thesales_ymdcolumn and storing them in new columns calledmax_ymdandmin_ymd, respectively. The.groups = "drop"argument tellsdplyrto drop the grouping information from the resulting data frame (since we don't need it for further analysis).filter(max_ymd != min_ymd): This filters the data frame to include only those rows where themax_ymdvalue is not equal to themin_ymdvalue. This will exclude any customers who made only a single purchase during the period covered by the data set.slice(1:10): This selects the first 10 rows of the resulting data frame. Since the data was grouped bycustomer_id, summarized using themaxandminfunctions, and filtered based on those summary statistics, this will give us the first 10 customers who made more than one purchase during the period covered by the data set.
Overall, this code is selecting the customers who made more than one purchase during the period covered by thedf_receiptdata frame, and returning the dates of their earliest and latest purchases. The code then returns the earliest and latest purchases for the first 10 customers who meet these criteria.
R-027: For the receipt details data (df_receipt), calculate the average of the sales amount (amount) for each shop code (store_cd) and display the top 5 in descending order.
df_receipt %>%
group_by(store_cd) %>%
summarise(mean_amount = mean(amount), .groups = "drop") %>%
arrange(desc(mean_amount)) %>%
slice(1:5)
| store_cd | mean_amount |
|---|---|
| <chr> | <dbl> |
| S13052 | 402.8675 |
| S13015 | 351.1120 |
| S13003 | 350.9155 |
| S14010 | 348.7913 |
| S13001 | 348.4704 |
Commentary :
This code is using the%>%(pipe) operator to chain together a sequence of operations on a data frame calleddf_receipt.
Here's a step-by-step explanation of what each operation does:group_by(store_cd): This groups the data by thestore_cdcolumn, so that subsequent operations are performed separately for each store.summarise(mean_amount = mean(amount), .groups = "drop"): This creates a summary of the data within each group, calculating the mean value of theamountcolumn and storing it in a new column calledmean_amount. The.groups = "drop"argument tellsdplyrto drop the grouping information from the resulting data frame (since we don't need it for further analysis).arrange(desc(mean_amount)): This sorts the resulting data frame by themean_amountcolumn in descending order (from highest to lowest).slice(1:5): This selects the first 5 rows of the resulting data frame. Since the data was grouped bystore_cdand summarized using themeanfunction, this will give us the stores with the highest meanamountvalues across all transactions.
Overall, this code is selecting the stores with the highest meanamountvalues across all transactions in thedf_receiptdata frame, and returning the top 5 stores based on those values.
R-028: Calculate the median sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display the TOP 5 in descending order.
df_receipt %>%
group_by(store_cd) %>%
summarise(median_amount = median(amount), .groups = "drop") %>% arrange(desc(median_amount)) %>%
slice(1:5)
| store_cd | median_amount |
|---|---|
| <chr> | <dbl> |
| S13052 | 190 |
| S14010 | 188 |
| S14050 | 185 |
| S13003 | 180 |
| S13018 | 180 |
Commentary :
This code is using the%>%(pipe) operator to chain together a sequence of operations on a data frame calleddf_receipt.
Here's a step-by-step explanation of what each operation does:group_by(store_cd): This groups the data by thestore_cdcolumn, so that subsequent operations are performed separately for each store.summarise(median_amount = median(amount), .groups = "drop"): This creates a summary of the data within each group, calculating the median value of theamountcolumn and storing it in a new column calledmedian_amount. The.groups = "drop"argument tellsdplyrto drop the grouping information from the resulting data frame (since we don't need it for further analysis).arrange(desc(median_amount)): This sorts the resulting data frame by themedian_amountcolumn in descending order (from highest to lowest).slice(1:5): This selects the first 5 rows of the resulting data frame. Since the data was grouped bystore_cdand summarized using themedianfunction, this will give us the stores with the highest medianamountvalues across all transactions.
Overall, this code is selecting the stores with the highest medianamountvalues across all transactions in thedf_receiptdata frame, and returning the top 5 stores based on those values. The difference between this code and the previous one I explained is that this code is using themedianfunction instead of themeanfunction to calculate the central tendency of theamountvalues for each store.
R-029: For receipt details data (df_receipt), calculate the mode value of the product code (product_cd) for each shop code (store_cd) and display 10 cases.
# Code example 1
df_receipt %>%
group_by(store_cd, product_cd) %>%
summarise(cnt = n(), .groups = "drop_last") %>%
filter(cnt == cnt %>% max()) %>%
ungroup() %>%
slice(1:10)
| store_cd | product_cd | cnt |
|---|---|---|
| <chr> | <chr> | <int> |
| S12007 | P060303001 | 72 |
| S12013 | P060303001 | 107 |
| S12014 | P060303001 | 65 |
| S12029 | P060303001 | 92 |
| S12030 | P060303001 | 115 |
| S13001 | P060303001 | 67 |
| S13002 | P060303001 | 78 |
| S13003 | P071401001 | 65 |
| S13004 | P060303001 | 88 |
| S13005 | P040503001 | 36 |
Commentary :
This code is using thedplyrpackage to perform data manipulation on a data frame calleddf_receipt. Here is a step-by-step explanation of what the code is doing:%>%is a pipe operator used to chain functions together, so the output of one function becomes the input of the next function.
Thegroup_byfunction groups the data frame bystore_cdandproduct_cd.
Thesummarisefunction calculates a summary statistic for each group. In this case, it is counting the number of rows in each group and creating a new column calledcnt.
The.groups = "drop_last"argument is used to drop the last grouping level created by thesummarisefunction, which is theproduct_cdgrouping. This is done to make it easier to filter on the maximum count later on.
Thefilterfunction selects rows that meet a certain condition. In this case, it is selecting rows where thecntcolumn is equal to the maximum value of thecntcolumn in the data frame.
Theungroupfunction removes all grouping levels from the data frame.
Theslicefunction selects a subset of rows based on their position. In this case, it is selecting the first 10 rows of the data frame.
So overall, this code is grouping the data by store and product, calculating the count of each group, filtering for the groups with the highest count, and then selecting the top 10 results.
# Code example 2: using which.max() (multiple occurrences of mode number, narrowed down to one of them)
# Display of the number of cases is omitted
table_product <-table(df_receipt$store_cd,df_receipt$product_cd)
store <- names(table_product[,1])
mode_product <- c()
for (i in 1:length(store)){
mode_product[i] <- names(which.max(table_product[i,]))
}
data.frame(store_cd = store, product_cd = mode_product) %>%
slice(1:10)
| store_cd | product_cd |
|---|---|
| <chr> | <chr> |
| S12007 | P060303001 |
| S12013 | P060303001 |
| S12014 | P060303001 |
| S12029 | P060303001 |
| S12030 | P060303001 |
| S13001 | P060303001 |
| S13002 | P060303001 |
| S13003 | P071401001 |
| S13004 | P060303001 |
| S13005 | P040503001 |
Commentary :
This code is using thetablefunction andforloop to find the mode (most frequent value) of each product code for each store in thedf_receiptdata frame.
Here's a step-by-step explanation of what the code is doing:
Thetablefunction creates a contingency table of counts for each combination ofstore_cdandproduct_cdindf_receipt. The resulting object,table_product, is a matrix where the rows correspond to each store and the columns correspond to each product code.
Thenamesfunction extracts the store codes from the first column of thetable_productmatrix and saves them as a character vector calledstore.
An empty vector calledmode_productis created to store the mode of each product code for each store.
Theforloop iterates through each store instoreand finds the mode of each product code for that store. Thewhich.maxfunction is used to find the index of the maximum count in each row of thetable_productmatrix, and thenamesfunction is used to extract the product code with that index. The resulting product code is stored in themode_productvector.
A data frame is created with two columns:store_cdandproduct_cd. Thestore_cdcolumn is populated with the values in thestorevector, and theproduct_cdcolumn is populated with the values in themode_productvector.
The%>%operator is used to pipe the data frame to theslicefunction, which selects the first 10 rows of the data frame.
So overall, this code is finding the mode of each product code for each store indf_receipt, creating a data frame with these values, and then selecting the top 10 results.
R-030: Calculate the variance of the sales amount (amount) for each shop code (store_cd) against the receipt details data (df_receipt) and display five cases in descending order.
var_sample <- function(x){ var(x) * (length(x) - 1) / length(x) }
df_receipt %>%
group_by(store_cd) %>%
summarise(var_amount = var_sample(amount), .groups = "drop") %>%
arrange(desc(var_amount)) %>% slice(1:5)
| store_cd | var_amount |
|---|---|
| <chr> | <dbl> |
| S13052 | 440088.7 |
| S14011 | 306314.6 |
| S14034 | 296920.1 |
| S13001 | 295432.0 |
| S13015 | 295294.4 |
Commentary :
This code is using thedplyrpackage to calculate the sample variance of theamountvariable for each store in thedf_receiptdata frame, and then selecting the top 5 stores with the highest variance.
Here's a step-by-step explanation of what the code is doing:
Thevar_samplefunction is defined, which calculates the sample variance of a vector of valuesxusing the formulavar(x) * (length(x) - 1) / length(x). This formula adjusts the variance by dividing bylength(x) - 1instead oflength(x)to provide an unbiased estimate of the population variance.
The%>%operator is used to pipe thedf_receiptdata frame to the next function in the chain.
Thegroup_byfunction is used to group the data bystore_cd.
Thesummarisefunction calculates a summary statistic for each group. In this case, it is using thevar_samplefunction to calculate the sample variance of theamountvariable for each group, and creating a new column calledvar_amount.
The.groups = "drop"argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.
Thearrangefunction is used to sort the data frame byvar_amountin descending order, so that the stores with the highest variance appear first.
The%>%operator is used to pipe the sorted data frame to theslicefunction, which selects the first 5 rows of the data frame.
So overall, this code is grouping the data by store, calculating the sample variance of theamountvariable for each group, sorting the resulting data frame by variance in descending order, and then selecting the top 5 results.
R-031: Calculate the standard deviation of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display 5 cases in descending order.
var_sample <- function(x){ var(x)*(length(x)-1)/length(x) }
std_sample <- function(x){ sqrt(var_sample(x)) }
head(
df_receipt %>%
group_by(store_cd) %>%
summarise(std_amount = std_sample(amount), .groups = "drop") %>%
arrange(desc(std_amount)) ,
n = 5
)
| store_cd | std_amount |
|---|---|
| <chr> | <dbl> |
| S13052 | 663.3918 |
| S14011 | 553.4569 |
| S14034 | 544.9037 |
| S13001 | 543.5366 |
| S13015 | 543.4099 |
Commentary :
This code is using thedplyrpackage to calculate the sample standard deviation of theamountvariable for each store in thedf_receiptdata frame, and then selecting the top 5 stores with the highest standard deviation.
Here's a step-by-step explanation of what the code is doing:
Thevar_samplefunction is defined, which calculates the sample variance of a vector of valuesxusing the formulavar(x) * (length(x) - 1) / length(x). This formula adjusts the variance by dividing bylength(x) - 1instead oflength(x)to provide an unbiased estimate of the population variance.
Thestd_samplefunction is defined, which calculates the sample standard deviation of a vector of valuesxusing the formulasqrt(var_sample(x)).
Theheadfunction is used to select the first 5 rows of the resulting data frame.
The%>%operator is used to pipe thedf_receiptdata frame to the next function in the chain.
Thegroup_byfunction is used to group the data bystore_cd.
Thesummarisefunction calculates a summary statistic for each group. In this case, it is using thestd_samplefunction to calculate the sample standard deviation of theamountvariable for each group, and creating a new column calledstd_amount.
The.groups = "drop"argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.
Thearrangefunction is used to sort the data frame bystd_amountin descending order, so that the stores with the highest standard deviation appear first.
Then = 5argument is used to limit the output to the top 5 rows.
So overall, this code is grouping the data by store, calculating the sample standard deviation of theamountvariable for each group, sorting the resulting data frame by standard deviation in descending order, and then selecting the top 5 results.
R-032: Find the percentile values for the sales amount (amount) in the receipt details data (df_receipt) in 25% increments.
df_receipt %>%
summarise(amount_25per = quantile(amount, 0.25),
amount_50per = quantile(amount, 0.5),
amount_75per = quantile(amount, 0.75),
amount_100per = quantile(amount, 1.0))
| amount_25per | amount_50per | amount_75per | amount_100per |
|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> |
| 102 | 170 | 288 | 10925 |
Commentary :
This code is using thedplyrpackage to calculate the quartiles of theamountvariable in thedf_receiptdata frame.
Here's a step-by-step explanation of what the code is doing:
The%>%operator is used to pipe thedf_receiptdata frame to the next function in the chain.
Thesummarisefunction is used to calculate summary statistics for the data frame. In this case, it is using thequantilefunction to calculate the 25th, 50th (median), 75th, and 100th percentiles of theamountvariable.
The resulting data frame will have 1 row and 4 columns, with the column namesamount_25per,amount_50per,amount_75per, andamount_100per, and the corresponding values of the quartiles.
So overall, this code is calculating the quartiles of theamountvariable in thedf_receiptdata frame, and returning a summary data frame with the quartile values.
R-033: Calculate the average of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and extract those that are 330 or more.
df_receipt %>%
group_by(store_cd) %>%
summarise(mean_amount = mean(amount), .groups = "drop") %>%
filter(mean_amount >= 330)
| store_cd | mean_amount |
|---|---|
| <chr> | <dbl> |
| S12013 | 330.1941 |
| S13001 | 348.4704 |
| S13003 | 350.9155 |
| S13004 | 330.9439 |
| S13015 | 351.1120 |
| S13019 | 330.2086 |
| S13020 | 337.8799 |
| S13052 | 402.8675 |
| S14010 | 348.7913 |
| S14011 | 335.7183 |
| S14026 | 332.3406 |
| S14045 | 330.0821 |
| S14047 | 330.0771 |
Commentary :
This code is using thedplyrpackage to group thedf_receiptdata frame bystore_cd, calculate the meanamountfor each group, and then filter the resulting data frame to include only those groups where the meanamountis greater than or equal to 330.
Here's a step-by-step explanation of what the code is doing:
The%>%operator is used to pipe thedf_receiptdata frame to the next function in the chain.
Thegroup_byfunction is used to group the data bystore_cd.
Thesummarisefunction is used to calculate a summary statistic for each group. In this case, it is using themeanfunction to calculate the meanamountfor each group, and creating a new column calledmean_amount.
The.groups = "drop"argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.
Thefilterfunction is used to subset the data frame to only include rows where themean_amountis greater than or equal to 330.
So overall, this code is grouping the data bystore_cd, calculating the meanamountfor each group, and then selecting only those groups where the meanamountis greater than or equal to 330. The resulting data frame will include one row for each qualifying store, with the store code and mean amount.
R-034: For the receipt details data (df_receipt), sum the amount of sales (amount) for each customer ID (customer_id) and find the average for all customers. However, exclude customer IDs starting with “Z” as they represent non-members.
df_mean <- df_receipt %>%
filter(!grepl("^Z", customer_id)) %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop") %>%
summarise(mean_amount = mean(sum_amount))
df_mean$mean_amount
2547.74223452926
Commentary :
This code is using thedplyrpackage to calculate the mean amount spent by each customer, after excluding any customer whosecustomer_idstarts with the letter "Z".
Here's a step-by-step explanation of what the code is doing:
The%>%operator is used to pipe thedf_receiptdata frame to the next function in the chain.
Thefilterfunction is used to exclude any rows where thecustomer_idstarts with the letter "Z".
Thegroup_byfunction is used to group the data bycustomer_id.
Thesummarisefunction is used to calculate a summary statistic for each group. In this case, it is using thesumfunction to calculate the totalamountspent by each customer, and creating a new column calledsum_amount.
The.groups = "drop"argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.
Anothersummarisefunction is used to calculate the meansum_amountacross all customers.
The resultingmean_amountvalue is extracted using$mean_amount.
So overall, this code is filtering the data to exclude any rows where thecustomer_idstarts with "Z", grouping the data bycustomer_id, calculating the totalamountspent by each customer, calculating the meansum_amountacross all customers, and returning the resulting mean value. The resulting output is a single numeric value representing the mean amount spent by each customer.
R-035: For the receipt details data (df_receipt), sum the sales amount (amount) for each customer ID (customer_id) to obtain the average of all customers, extract the customers who spend more than the average and display 10 items. However, exclude customer IDs starting with “Z” as they represent non-members.
df_sum <- df_receipt %>%
filter(!grepl("^Z", customer_id)) %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop")
df_mean <- df_sum %>%
summarise(mean_amount = mean(sum_amount))
df_sum %>%
filter(sum_amount >= df_mean$mean_amount) %>%
slice(1:10)
| customer_id | sum_amount |
|---|---|
| <chr> | <int> |
| CS001115000010 | 3044 |
| CS001205000006 | 3337 |
| CS001214000009 | 4685 |
| CS001214000017 | 4132 |
| CS001214000052 | 5639 |
| CS001215000040 | 3496 |
| CS001304000006 | 3726 |
| CS001305000005 | 3485 |
| CS001305000011 | 4370 |
| CS001315000180 | 3300 |
Commentary :
This code is using thedplyrpackage to filter the customers who have spent an amount greater than or equal to the mean amount spent by all customers, after excluding any customer whosecustomer_idstarts with the letter "Z". It then selects the top 10 customers based on their totalamountspent.
Here's a step-by-step explanation of what the code is doing:
The%>%operator is used to pipe thedf_receiptdata frame to the next function in the chain.
Thefilterfunction is used to exclude any rows where thecustomer_idstarts with the letter "Z".
Thegroup_byfunction is used to group the data bycustomer_id.
Thesummarisefunction is used to calculate a summary statistic for each group. In this case, it is using thesumfunction to calculate the totalamountspent by each customer, and creating a new column calledsum_amount.
The.groups = "drop"argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.
Thedf_sumdata frame is created by storing the output of the above operations.
Thesummarisefunction is used again to calculate the meansum_amountacross all customers indf_sum.
The resultingmean_amountvalue is extracted using$mean_amount.
The%>%operator is used to pipedf_sumto the next function in the chain.
Thefilterfunction is used to select only those rows where thesum_amountis greater than or equal to themean_amountcalculated in step 7.
Theslicefunction is used to select only the top 10 rows of the resulting data frame, based on their order in the original data frame.
So overall, this code is filtering the data to exclude any rows where thecustomer_idstarts with "Z", grouping the data bycustomer_id, calculating the totalamountspent by each customer, calculating the meansum_amountacross all customers, creating a data frame of customers whose totalamountspent is greater than or equal to the mean amount spent by all customers, and selecting the top 10 customers based on their totalamountspent.
R-036: Combine receipt details data (df_receipt) and shop data (df_store) internally and display all items of receipt details data and store name (store_name) of shop data for 10 items.
inner_join(df_receipt, df_store[c("store_cd", "store_name")], by = "store_cd") %>% slice(1:10)
| sales_ymd | sales_epoch | store_cd | receipt_no | receipt_sub_no | customer_id | product_cd | quantity | amount | store_name |
|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <chr> | <int> | <int> | <chr> | <chr> | <int> | <int> | <chr> |
| 20181103 | 1541203200 | S14006 | 112 | 1 | CS006214000001 | P070305012 | 1 | 158 | 葛が谷店 |
| 20181118 | 1542499200 | S13008 | 1132 | 2 | CS008415000097 | P070701017 | 1 | 81 | 成城店 |
| 20170712 | 1499817600 | S14028 | 1102 | 1 | CS028414000014 | P060101005 | 1 | 170 | 二ツ橋店 |
| 20190205 | 1549324800 | S14042 | 1132 | 1 | ZZ000000000000 | P050301001 | 1 | 25 | 新山下店 |
| 20180821 | 1534809600 | S14025 | 1102 | 2 | CS025415000050 | P060102007 | 1 | 90 | 大和店 |
| 20190605 | 1559692800 | S13003 | 1112 | 1 | CS003515000195 | P050102002 | 1 | 138 | 狛江店 |
| 20181205 | 1543968000 | S14024 | 1102 | 2 | CS024514000042 | P080101005 | 1 | 30 | 三田店 |
| 20190922 | 1569110400 | S14040 | 1102 | 1 | CS040415000178 | P070501004 | 1 | 128 | 長津田店 |
| 20170504 | 1493856000 | S13020 | 1112 | 2 | ZZ000000000000 | P071302010 | 1 | 770 | 十条仲原店 |
| 20191010 | 1570665600 | S14027 | 1102 | 1 | CS027514000015 | P071101003 | 1 | 680 | 南藤沢店 |
Commentary :
This code performs an inner join operation between two data frames,df_receiptanddf_store, based on a common column "store_cd". The result of this operation is then piped into theslice()function to select the first 10 rows of the resulting data frame.
Let's break down the code step by step:inner_join(df_receipt, df_store[c("store_cd", "store_name")], by = "store_cd"): This performs an inner join between thedf_receiptdata frame and a subset ofdf_storethat only contains the columns "store_cd" and "store_name". The join is done on the "store_cd" column which is present in both data frames. The result is a new data frame that contains all the rows where there is a match between the "store_cd" column in both data frames.%>%: This is the pipe operator which takes the output of the previous command and pipes it as input to the next command.slice(1:10): This function selects the first 10 rows of the resulting data frame from the previous command.
So, the overall purpose of this code is to select the first 10 rows of the result of an inner join operation betweendf_receiptand a subset ofdf_store. The resulting data frame will only contain rows where there is a match between the "store_cd" column in both data frames, and it will have the columns "store_cd" and "store_name" fromdf_store, along with all the columns fromdf_receipt.
R-037: Join product data (df_product) and category data (df_category) internally and display all items of the product data and 10 category sub-category names (category_small_name) of the category data.
inner_join(df_product, df_category[c("category_small_cd", "category_small_name")], by = "category_small_cd") %>% slice(1:10)
| product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_small_name |
|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | <int> | <int> | <chr> |
| P040101001 | 04 | 0401 | 040101 | 198 | 149 | 弁当類 |
| P040101002 | 04 | 0401 | 040101 | 218 | 164 | 弁当類 |
| P040101003 | 04 | 0401 | 040101 | 230 | 173 | 弁当類 |
| P040101004 | 04 | 0401 | 040101 | 248 | 186 | 弁当類 |
| P040101005 | 04 | 0401 | 040101 | 268 | 201 | 弁当類 |
| P040101006 | 04 | 0401 | 040101 | 298 | 224 | 弁当類 |
| P040101007 | 04 | 0401 | 040101 | 338 | 254 | 弁当類 |
| P040101008 | 04 | 0401 | 040101 | 420 | 315 | 弁当類 |
| P040101009 | 04 | 0401 | 040101 | 498 | 374 | 弁当類 |
| P040101010 | 04 | 0401 | 040101 | 580 | 435 | 弁当類 |
Commentary :
This code performs an inner join operation between two data frames,df_productanddf_category, based on a common column "category_small_cd". The result of this operation is then piped into theslice()function to select the first 10 rows of the resulting data frame.
Let's break down the code step by step:inner_join(df_product, df_category[c("category_small_cd", "category_small_name")], by = "category_small_cd"): This performs an inner join between thedf_productdata frame and a subset ofdf_categorythat only contains the columns "category_small_cd" and "category_small_name". The join is done on the "category_small_cd" column which is present in both data frames. The result is a new data frame that contains all the rows where there is a match between the "category_small_cd" column in both data frames.%>%: This is the pipe operator which takes the output of the previous command and pipes it as input to the next command.slice(1:10): This function selects the first 10 rows of the resulting data frame from the previous command.
So, the overall purpose of this code is to select the first 10 rows of the result of an inner join operation betweendf_productand a subset ofdf_category. The resulting data frame will only contain rows where there is a match between the "category_small_cd" column in both data frames, and it will have the columns "category_small_cd" and "category_small_name" fromdf_category, along with all the columns fromdf_product.
R-038: Calculate the total sales amount for each customer from the customer data (df_customer) and receipt details data (df_receipt) and display 10 items. However, for customers with no sales records, the sales amount should be displayed as 0. Customers whose gender code (gender_cd) is female (1) should be included, and non-members (whose customer ID starts with “Z”) should be excluded.
df_sum <- df_receipt %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop")
df_target <- df_customer %>%
filter(gender_cd == "1" & !grepl("^Z", customer_id))
left_join(df_target["customer_id"], df_sum, by = "customer_id") %>%
replace_na(list(sum_amount = 0)) %>% slice(1:10)
| customer_id | sum_amount |
|---|---|
| <chr> | <dbl> |
| CS021313000114 | 0 |
| CS031415000172 | 5088 |
| CS028811000001 | 0 |
| CS001215000145 | 875 |
| CS015414000103 | 3122 |
| CS033513000180 | 868 |
| CS035614000014 | 0 |
| CS011215000048 | 3444 |
| CS009413000079 | 0 |
| CS040412000191 | 210 |
Commentary :
This code performs several operations to obtain a summary table for the customer data. The result of this operation is then piped into thereplace_na()function to replace any missing values with a default value of 0. Finally, the result is piped into theslice()function to select the first 10 rows of the resulting data frame.
Let's break down the code step by step:df_sum <- df_receipt %>% group_by(customer_id) %>% summarise(sum_amount = sum(amount), .groups = "drop"): This code first groups thedf_receiptdata frame by the "customer_id" column, and then summarizes the "amount" column using thesum()function. The resulting data frame,df_sum, contains the total amount spent by each customer. The.groupsparameter is set to "drop" to remove the grouping information from the output.df_target <- df_customer %>% filter(gender_cd == "1" & !grepl("^Z", customer_id)): This code filters thedf_customerdata frame to only include rows where the "gender_cd" column equals "1" and the "customer_id" column does not start with the letter "Z". The resulting data frame,df_target, contains only male customers whose IDs do not start with "Z".left_join(df_target["customer_id"], df_sum, by = "customer_id"): This performs a left join operation betweendf_target["customer_id"]anddf_sumbased on the "customer_id" column. The result is a new data frame that contains all the rows fromdf_targetalong with the corresponding total amount spent by each customer fromdf_sum.replace_na(list(sum_amount = 0)): This replaces any missing values in the "sum_amount" column with a default value of 0.%>% slice(1:10): This function selects the first 10 rows of the resulting data frame from the previous command.
So, the overall purpose of this code is to obtain a summary table of the total amount spent by male customers whose IDs do not start with the letter "Z". The resulting data frame will have two columns, "customer_id" and "sum_amount", where "sum_amount" is the total amount spent by each customer. Any missing values in the "sum_amount" column will be replaced with 0, and the resulting data frame will be limited to the first 10 rows.
R-039: From the receipt details data (df_receipt), create a data set containing the top 20 customers with the highest number of days of sales and a data set containing the top 20 customers with the highest total sales value, respectively, and furthermore, merge these two completely externally. However, exclude non-members (whose customer ID starts with “Z”).
# Code example 1
df_data <- df_receipt %>%
filter(!grepl("^Z", customer_id)) %>%
group_by(customer_id)
df_cnt <- df_data %>%
summarise(come_days = n_distinct(sales_ymd), .groups = "drop") %>%
arrange(desc(come_days), customer_id) %>%
slice(1:20)
df_sum <- df_data %>% summarise(sum_amount = sum(amount), .groups = "drop") %>%
arrange(desc(sum_amount)) %>%
slice(1:20)
full_join(df_cnt, df_sum, by = "customer_id")
| customer_id | come_days | sum_amount |
|---|---|---|
| <chr> | <int> | <int> |
| CS040214000008 | 23 | NA |
| CS010214000010 | 22 | 18585 |
| CS015415000185 | 22 | 20153 |
| CS010214000002 | 21 | NA |
| CS028415000007 | 21 | 19127 |
| CS016415000141 | 20 | 18372 |
| CS017415000097 | 20 | 23086 |
| CS014214000023 | 19 | NA |
| CS021514000045 | 19 | NA |
| CS021515000172 | 19 | NA |
| CS022515000226 | 19 | NA |
| CS031414000051 | 19 | 19202 |
| CS039414000052 | 19 | NA |
| CS007515000107 | 18 | NA |
| CS014415000077 | 18 | NA |
| CS021515000056 | 18 | NA |
| CS021515000211 | 18 | NA |
| CS022515000028 | 18 | NA |
| CS030214000008 | 18 | NA |
| CS031414000073 | 18 | NA |
| CS001605000009 | NA | 18925 |
| CS006515000023 | NA | 18372 |
| CS011414000106 | NA | 18338 |
| CS038415000104 | NA | 17847 |
| CS035414000024 | NA | 17615 |
| CS021515000089 | NA | 17580 |
| CS032414000072 | NA | 16563 |
| CS016415000101 | NA | 16348 |
| CS011415000006 | NA | 16094 |
| CS034415000047 | NA | 16083 |
| CS007514000094 | NA | 15735 |
| CS009414000059 | NA | 15492 |
| CS030415000034 | NA | 15468 |
| CS015515000034 | NA | 15300 |
Commentary :
This code performs several operations to obtain two summary tables for the customer data, and then performs a full join operation between these two tables based on the "customer_id" column. The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.
Let's break down the code step by step:df_data <- df_receipt %>% filter(!grepl("^Z", customer_id)) %>% group_by(customer_id): This code first filters thedf_receiptdata frame to exclude any rows where the "customer_id" column starts with the letter "Z". The resulting data frame is then grouped by the "customer_id" column. The resultingdf_datadata frame contains all the rows fromdf_receiptwith "Z" customer IDs removed, and is grouped by "customer_id".df_cnt <- df_data %>% summarise(come_days = n_distinct(sales_ymd), .groups = "drop") %>% arrange(desc(come_days), customer_id) %>% slice(1:20): This code first summarizes thedf_datadata frame by counting the number of distinct "sales_ymd" values for each customer using then_distinct()function. The resulting data frame,df_cnt, contains the total number of unique sales days for each customer. The data frame is then arranged in descending order of "come_days" followed by "customer_id". The first 20 rows of the resulting data frame are then selected using theslice()function.df_sum <- df_data %>% summarise(sum_amount = sum(amount), .groups = "drop") %>% arrange(desc(sum_amount)) %>% slice(1:20): This code summarizes thedf_datadata frame by calculating the sum of the "amount" column for each customer using thesum()function. The resulting data frame,df_sum, contains the total amount spent by each customer. The data frame is then arranged in descending order of "sum_amount", and the first 20 rows of the resulting data frame are selected using theslice()function.full_join(df_cnt, df_sum, by = "customer_id"): This code performs a full join operation betweendf_cntanddf_sumbased on the "customer_id" column. The result is a new data frame that contains all the rows from both data frames, matched by "customer_id". The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.
So, the overall purpose of this code is to obtain two summary tables for the customer data, and then perform a full join operation between these two tables. The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.
# Code example 2
# Code example using slice_max.
# Rewrite with_ties = FALSE to with_ties = TRUE to change the form to include the same rank
df_data <- df_receipt %>%
filter(!grepl("^Z", customer_id)) %>%
group_by(customer_id)
df_cnt <- df_data %>%
summarise(come_days = n_distinct(sales_ymd)) %>%
slice_max(come_days, n = 20, with_ties = FALSE)
df_sum <- df_data %>%
summarise(sum_amount = sum(amount)) %>%
slice_max(sum_amount, n = 20, with_ties = FALSE)
full_join(df_cnt, df_sum, by = "customer_id")
| customer_id | come_days | sum_amount |
|---|---|---|
| <chr> | <int> | <int> |
| CS040214000008 | 23 | NA |
| CS010214000010 | 22 | 18585 |
| CS015415000185 | 22 | 20153 |
| CS010214000002 | 21 | NA |
| CS028415000007 | 21 | 19127 |
| CS016415000141 | 20 | 18372 |
| CS017415000097 | 20 | 23086 |
| CS014214000023 | 19 | NA |
| CS021514000045 | 19 | NA |
| CS021515000172 | 19 | NA |
| CS022515000226 | 19 | NA |
| CS031414000051 | 19 | 19202 |
| CS039414000052 | 19 | NA |
| CS007515000107 | 18 | NA |
| CS014415000077 | 18 | NA |
| CS021515000056 | 18 | NA |
| CS021515000211 | 18 | NA |
| CS022515000028 | 18 | NA |
| CS030214000008 | 18 | NA |
| CS031414000073 | 18 | NA |
| CS001605000009 | NA | 18925 |
| CS006515000023 | NA | 18372 |
| CS011414000106 | NA | 18338 |
| CS038415000104 | NA | 17847 |
| CS035414000024 | NA | 17615 |
| CS021515000089 | NA | 17580 |
| CS032414000072 | NA | 16563 |
| CS016415000101 | NA | 16348 |
| CS011415000006 | NA | 16094 |
| CS034415000047 | NA | 16083 |
| CS007514000094 | NA | 15735 |
| CS009414000059 | NA | 15492 |
| CS030415000034 | NA | 15468 |
| CS015515000034 | NA | 15300 |
Commentary :
This code is similar to the previous code we discussed, but it uses theslice_max()function instead of thearrange()function to obtain the top 20 customers based on the number of distinct sales days and the total amount spent.
Here is a breakdown of the code:df_data <- df_receipt %>% filter(!grepl("^Z", customer_id)) %>% group_by(customer_id): This code is similar to the previous code, where it filters thedf_receiptdata frame to exclude any rows where the "customer_id" column starts with the letter "Z". The resulting data frame is then grouped by the "customer_id" column.df_cnt <- df_data %>% summarise(come_days = n_distinct(sales_ymd)) %>% slice_max(come_days, n = 20, with_ties = FALSE): This code summarizes thedf_datadata frame by counting the number of distinct "sales_ymd" values for each customer using then_distinct()function. The resulting data frame,df_cnt, contains the total number of unique sales days for each customer. Theslice_max()function is used to select the top 20 rows based on the "come_days" column, wherewith_ties = FALSEspecifies that only the top 20 distinct values should be selected.df_sum <- df_data %>% summarise(sum_amount = sum(amount)) %>% slice_max(sum_amount, n = 20, with_ties = FALSE): This code summarizes thedf_datadata frame by calculating the sum of the "amount" column for each customer using thesum()function. The resulting data frame,df_sum, contains the total amount spent by each customer. Theslice_max()function is used to select the top 20 rows based on the "sum_amount" column, wherewith_ties = FALSEspecifies that only the top 20 distinct values should be selected.full_join(df_cnt, df_sum, by = "customer_id"): This code performs a full join operation betweendf_cntanddf_sumbased on the "customer_id" column. The result is a new data frame that contains all the rows from both data frames, matched by "customer_id". The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.
So, the overall purpose of this code is to obtain two summary tables for the customer data, and then perform a full join operation between these two tables. The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent, obtained using theslice_max()function instead of thearrange()function.
R-040: You want to create data combining all shops and all products. Direct product the shop data (df_store) and the product data (df_product) and calculate the number of cases.
df_store_tmp <- df_store
df_product_tmp <- df_product
df_store_tmp["key"] <- 0
df_product_tmp["key"] <- 0
nrow(full_join(df_store_tmp, df_product_tmp, by = "key"))
531590
Commentary :
This code performs a full join operation between thedf_storeanddf_productdata frames based on a common "key" column that is added to both data frames. The resulting data frame is then used to calculate the number of rows using thenrow()function.
Here is a breakdown of the code:df_store_tmp <- df_store: This code creates a new data frame calleddf_store_tmpthat is a copy of thedf_storedata frame. This is done to avoid modifying the original data frame.df_product_tmp <- df_product: This code creates a new data frame calleddf_product_tmpthat is a copy of thedf_productdata frame. This is done to avoid modifying the original data frame.df_store_tmp["key"] <- 0: This code adds a new column called "key" to thedf_store_tmpdata frame and initializes all the values to zero. This column is added to provide a common column for the join operation.df_product_tmp["key"] <- 0: This code adds a new column called "key" to thedf_product_tmpdata frame and initializes all the values to zero. This column is added to provide a common column for the join operation.full_join(df_store_tmp, df_product_tmp, by = "key"): This code performs a full join operation between thedf_store_tmpanddf_product_tmpdata frames based on the common "key" column. Since both data frames have the same value in this column for all rows, a full join will result in a Cartesian product of the two data frames, meaning that every row fromdf_store_tmpwill be paired with every row fromdf_product_tmp.nrow(full_join(df_store_tmp, df_product_tmp, by = "key")): This code calculates the number of rows in the resulting data frame from the full join operation using thenrow()function. The result is the total number of combinations of rows betweendf_storeanddf_product.
In summary, this code creates copies of thedf_storeanddf_productdata frames, adds a common "key" column to both data frames with all values set to zero, performs a full join operation between the two data frames based on the common "key" column, and calculates the number of resulting rows. The purpose of this code is to determine the total number of combinations of rows between thedf_storeanddf_productdata frames.









Comment