参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック(構造化データ加工編)」
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
-
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers.
-
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
You can get preprocess_knock files to practice python, SQL, R from my github account
Introduction
- To begin with, run the following cells
- Import the required libraries and read the data from the database (PostgreSQL)
- The libraries that are expected to be used are imported in the following cells
- If there are other libraries you wish to use, install them as appropriate using install.packages().
- Names, addresses, etc. are dummy data and are not real.
require("RPostgreSQL")
require("tidyr")
require("dplyr")
require("stringr")
require("caret")
require("lubridate")
require("rsample")
require("recipes")
require("themis")
host <- "db"
port <- Sys.getenv()["PG_PORT"]
dbname <- Sys.getenv()["PG_DATABASE"]
user <- Sys.getenv()["PG_USER"]
password <- Sys.getenv()["PG_PASSWORD"]
con <- dbConnect(PostgreSQL(), host=host, port=port, dbname=dbname, user=user, password=password)
df_customer <- dbGetQuery(con,"SELECT * FROM customer")
df_category <- dbGetQuery(con,"SELECT * FROM category")
df_product <- dbGetQuery(con,"SELECT * FROM product")
df_receipt <- dbGetQuery(con,"SELECT * FROM receipt")
df_store <- dbGetQuery(con,"SELECT * FROM store")
df_geocode <- dbGetQuery(con,"SELECT * FROM geocode")
Loading required package: RPostgreSQL Loading required package: DBI Loading required package: tidyr Loading required package: dplyr Attaching package: ‘dplyr’ The following objects are masked from ‘package:stats’: filter, lag The following objects are masked from ‘package:base’: intersect, setdiff, setequal, union Loading required package: stringr Loading required package: caret Loading required package: ggplot2 Loading required package: lattice Loading required package: lubridate Attaching package: ‘lubridate’ The following objects are masked from ‘package:base’: date, intersect, setdiff, union Loading required package: rsample Loading required package: recipes Attaching package: ‘recipes’ The following object is masked from ‘package:stringr’: fixed The following object is masked from ‘package:stats’: step Loading required package: themis Registered S3 methods overwritten by 'themis': method from bake.step_downsample recipes bake.step_upsample recipes prep.step_downsample recipes prep.step_upsample recipes tidy.step_downsample recipes tidy.step_upsample recipes tunable.step_downsample recipes tunable.step_upsample recipes Attaching package: ‘themis’ The following objects are masked from ‘package:recipes’: step_downsample, step_upsample
Practice Questions
R-021: Count the number of cases against the receipt details data (df_receipt).
nrow(df_receipt)
104681
Commentary :
The code "nrow(df_receipt)" is written in R or a similar statistical programming language. It is used to determine the number of rows in a data frame called "df_receipt".
Here's a breakdown of the code:
"nrow()" is a built-in R function that returns the number of rows in a data frame.
"df_receipt" is the name of the data frame whose number of rows we want to determine.
For example, if the data frame "df_receipt" contains 100 rows of data, then running this code would return the value 100. This code is useful for quickly checking the size of a data frame and ensuring that it has the expected number of rows before performing further analysis or manipulating the data in some way.
R-022: Count the number of unique cases against the customer ID (customer_id) of the receipt details data (df_receipt).
length(unique(df_receipt$customer_id))
8307
Commentary :
The code "length(unique(df_receipt$customer_id))" is written in R or a similar statistical programming language. It is used to determine the number of unique customers in a data frame called "df_receipt".
Here's a breakdown of the code:
"unique()" is an R function that returns a vector containing only the unique elements of a vector. In this case, the vector being passed to the "unique()" function is "df_receipt$customer_id", which is a column in the "df_receipt" data frame that contains customer IDs.
"length()" is another built-in R function that returns the number of elements in a vector. In this case, the vector being passed to "length()" is the output of the "unique()" function, which is a vector containing unique customer IDs.
Putting it all together, the code calculates the number of unique customers in the "df_receipt" data frame by taking the length of a vector of unique customer IDs. For example, if the data frame contains 1000 rows, but only 500 unique customer IDs, then running this code would return the value 500. This code is useful for understanding the number of unique customers in a data set and for calculating customer-level metrics such as customer lifetime value.
R-023: Sum the sales amount (amount) and sales quantity (quantity) for each shop code (store_cd) in the receipt details data (df_receipt).
df_receipt %>%
group_by(store_cd) %>%
summarise(amount = sum(amount), quantity = sum(quantity), .groups = "drop")
store_cd | amount | quantity |
---|---|---|
<chr> | <int> | <int> |
S12007 | 638761 | 2099 |
S12013 | 787513 | 2425 |
S12014 | 725167 | 2358 |
S12029 | 794741 | 2555 |
S12030 | 684402 | 2403 |
S13001 | 811936 | 2347 |
S13002 | 727821 | 2340 |
S13003 | 764294 | 2197 |
S13004 | 779373 | 2390 |
S13005 | 629876 | 2004 |
S13008 | 809288 | 2491 |
S13009 | 808870 | 2486 |
S13015 | 780873 | 2248 |
S13016 | 793773 | 2432 |
S13017 | 748221 | 2376 |
S13018 | 790535 | 2562 |
S13019 | 827833 | 2541 |
S13020 | 796383 | 2383 |
S13031 | 705968 | 2336 |
S13032 | 790501 | 2491 |
S13035 | 715869 | 2219 |
S13037 | 693087 | 2344 |
S13038 | 708884 | 2337 |
S13039 | 611888 | 1981 |
S13041 | 728266 | 2233 |
S13043 | 587895 | 1881 |
S13044 | 520764 | 1729 |
S13051 | 107452 | 354 |
S13052 | 100314 | 250 |
S14006 | 712839 | 2284 |
S14010 | 790361 | 2290 |
S14011 | 805724 | 2434 |
S14012 | 720600 | 2412 |
S14021 | 699511 | 2231 |
S14022 | 651328 | 2047 |
S14023 | 727630 | 2258 |
S14024 | 736323 | 2417 |
S14025 | 755581 | 2394 |
S14026 | 824537 | 2503 |
S14027 | 714550 | 2303 |
S14028 | 786145 | 2458 |
S14033 | 725318 | 2282 |
S14034 | 653681 | 2024 |
S14036 | 203694 | 635 |
S14040 | 701858 | 2233 |
S14042 | 534689 | 1935 |
S14045 | 458484 | 1398 |
S14046 | 412646 | 1354 |
S14047 | 338329 | 1041 |
S14048 | 234276 | 769 |
S14049 | 230808 | 788 |
S14050 | 167090 | 580 |
Commentary :
The code "df_receipt %>% group_by(store_cd) %>% summarise(amount = sum(amount), quantity = sum(quantity), .groups = 'drop')" is written in R or a similar statistical programming language. It is used to group the "df_receipt" data frame by a variable called "store_cd" and then calculate the sum of the "amount" and "quantity" variables within each store group.
Here's a breakdown of the code:
The pipe operator ("%>%") is used to chain together multiple functions in a sequence. It takes the output of the previous function and uses it as the input for the next function. In this case, the pipe operator is used to chain the "group_by()" and "summarise()" functions together.
"group_by()" is an R function that is used to group a data frame by one or more variables. In this case, we are grouping the "df_receipt" data frame by the "store_cd" variable.
"summarise()" is an R function that is used to calculate summary statistics on a data frame. In this case, we are calculating the sum of the "amount" and "quantity" variables within each store group. We are also specifying the names of the new variables using the assignment operator "=".
The ".groups" argument in the "summarise()" function is used to specify how the output should be formatted. Setting it to "drop" will remove the grouping information from the output.
The result of this code will be a new data frame with one row for each store in the "df_receipt" data frame. The "amount" and "quantity" columns will contain the sums of those variables within each store group. This code is useful for calculating store-level metrics such as total sales or units sold.
R-024: Find the most recent sales date (sales_ymd) for each customer ID (customer_id) for the receipt details data (df_receipt) and display the 10 most recent sales dates (sales_ymd).
df_receipt %>%
group_by(customer_id) %>%
summarise(max_ymd = max(sales_ymd), .groups = "drop") %>%
slice(1:10)
customer_id | max_ymd |
---|---|
<chr> | <int> |
CS001113000004 | 20190308 |
CS001114000005 | 20190731 |
CS001115000010 | 20190405 |
CS001205000004 | 20190625 |
CS001205000006 | 20190224 |
CS001211000025 | 20190322 |
CS001212000027 | 20170127 |
CS001212000031 | 20180906 |
CS001212000046 | 20170811 |
CS001212000070 | 20191018 |
Commentary :
The code "df_receipt %>% group_by(customer_id) %>% summarise(max_ymd = max(sales_ymd), .groups = 'drop') %>% slice(1:10)" is written in R or a similar statistical programming language. It is used to group the "df_receipt" data frame by a variable called "customer_id" and then find the latest date of purchase for each customer. Finally, it slices the first 10 rows of the resulting data frame.
Here's a breakdown of the code:
The pipe operator ("%>%") is used to chain together multiple functions in a sequence. It takes the output of the previous function and uses it as the input for the next function. In this case, the pipe operator is used to chain the "group_by()", "summarise()", and "slice()" functions together.
"group_by()" is an R function that is used to group a data frame by one or more variables. In this case, we are grouping the "df_receipt" data frame by the "customer_id" variable.
"summarise()" is an R function that is used to calculate summary statistics on a data frame. In this case, we are calculating the latest purchase date for each customer using the "max()" function on the "sales_ymd" variable.
The ".groups" argument in the "summarise()" function is used to specify how the output should be formatted. Setting it to "drop" will remove the grouping information from the output.
"slice()" is an R function that is used to extract a subset of rows from a data frame. In this case, we are selecting the first 10 rows of the resulting data frame.
The result of this code will be a new data frame with one row for each customer in the "df_receipt" data frame. The "max_ymd" column will contain the latest purchase date for each customer. Finally, the code selects the first 10 rows of the resulting data frame. This code is useful for understanding the most recent purchase date for each customer and can be used for various customer-level analysis and segmentation purposes.
R-025: For receipt details data (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display the 10 results.
df_receipt %>%
group_by(customer_id) %>%
summarise(min_ymd = min(sales_ymd), .groups = "drop") %>%
slice(1:10)
customer_id | min_ymd |
---|---|
<chr> | <int> |
CS001113000004 | 20190308 |
CS001114000005 | 20180503 |
CS001115000010 | 20171228 |
CS001205000004 | 20170914 |
CS001205000006 | 20180207 |
CS001211000025 | 20190322 |
CS001212000027 | 20170127 |
CS001212000031 | 20180906 |
CS001212000046 | 20170811 |
CS001212000070 | 20191018 |
Commentary :
This code is using the%>%
(pipe) operator to chain together a sequence of operations on a data frame calleddf_receipt
.
Here's a step-by-step explanation of what each operation does:group_by(customer_id)
: This groups the data by thecustomer_id
column, so that subsequent operations are performed separately for each customer.summarise(min_ymd = min(sales_ymd), .groups = "drop")
: This creates a summary of the data within each group, calculating the minimum value of thesales_ymd
column and storing it in a new column calledmin_ymd
. The.groups = "drop"
argument tellsdplyr
to drop the grouping information from the resulting data frame (since we don't need it for further analysis).slice(1:10)
: This selects the first 10 rows of the resulting data frame. Since the data was grouped bycustomer_id
and the summary function was used, this will give us the earliestsales_ymd
value for the first 10 customers in the data set.
Overall, this code is selecting the earliestsales_ymd
value for each customer in thedf_receipt
data frame, and returning the earliest dates for the first 10 customers in the data set.
R-026: For receipt details data (df_receipt), find the newest sales date (sales_ymd) and the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 cases where both are different.
df_receipt %>%
group_by(customer_id) %>% summarise(max_ymd = max(sales_ymd) ,min_ymd = min(sales_ymd), .groups = "drop") %>%
filter(max_ymd != min_ymd) %>%
slice(1:10)
customer_id | max_ymd | min_ymd |
---|---|---|
<chr> | <int> | <int> |
CS001114000005 | 20190731 | 20180503 |
CS001115000010 | 20190405 | 20171228 |
CS001205000004 | 20190625 | 20170914 |
CS001205000006 | 20190224 | 20180207 |
CS001214000009 | 20190902 | 20170306 |
CS001214000017 | 20191006 | 20180828 |
CS001214000048 | 20190929 | 20171109 |
CS001214000052 | 20190617 | 20180208 |
CS001215000005 | 20181021 | 20170206 |
CS001215000040 | 20171022 | 20170214 |
Commentary :
This code is using the%>%
(pipe) operator to chain together a sequence of operations on a data frame calleddf_receipt
.
Here's a step-by-step explanation of what each operation does:group_by(customer_id)
: This groups the data by thecustomer_id
column, so that subsequent operations are performed separately for each customer.summarise(max_ymd = max(sales_ymd), min_ymd = min(sales_ymd), .groups = "drop")
: This creates a summary of the data within each group, calculating the maximum and minimum values of thesales_ymd
column and storing them in new columns calledmax_ymd
andmin_ymd
, respectively. The.groups = "drop"
argument tellsdplyr
to drop the grouping information from the resulting data frame (since we don't need it for further analysis).filter(max_ymd != min_ymd)
: This filters the data frame to include only those rows where themax_ymd
value is not equal to themin_ymd
value. This will exclude any customers who made only a single purchase during the period covered by the data set.slice(1:10)
: This selects the first 10 rows of the resulting data frame. Since the data was grouped bycustomer_id
, summarized using themax
andmin
functions, and filtered based on those summary statistics, this will give us the first 10 customers who made more than one purchase during the period covered by the data set.
Overall, this code is selecting the customers who made more than one purchase during the period covered by thedf_receipt
data frame, and returning the dates of their earliest and latest purchases. The code then returns the earliest and latest purchases for the first 10 customers who meet these criteria.
R-027: For the receipt details data (df_receipt), calculate the average of the sales amount (amount) for each shop code (store_cd) and display the top 5 in descending order.
df_receipt %>%
group_by(store_cd) %>%
summarise(mean_amount = mean(amount), .groups = "drop") %>%
arrange(desc(mean_amount)) %>%
slice(1:5)
store_cd | mean_amount |
---|---|
<chr> | <dbl> |
S13052 | 402.8675 |
S13015 | 351.1120 |
S13003 | 350.9155 |
S14010 | 348.7913 |
S13001 | 348.4704 |
Commentary :
This code is using the%>%
(pipe) operator to chain together a sequence of operations on a data frame calleddf_receipt
.
Here's a step-by-step explanation of what each operation does:group_by(store_cd)
: This groups the data by thestore_cd
column, so that subsequent operations are performed separately for each store.summarise(mean_amount = mean(amount), .groups = "drop")
: This creates a summary of the data within each group, calculating the mean value of theamount
column and storing it in a new column calledmean_amount
. The.groups = "drop"
argument tellsdplyr
to drop the grouping information from the resulting data frame (since we don't need it for further analysis).arrange(desc(mean_amount))
: This sorts the resulting data frame by themean_amount
column in descending order (from highest to lowest).slice(1:5)
: This selects the first 5 rows of the resulting data frame. Since the data was grouped bystore_cd
and summarized using themean
function, this will give us the stores with the highest meanamount
values across all transactions.
Overall, this code is selecting the stores with the highest meanamount
values across all transactions in thedf_receipt
data frame, and returning the top 5 stores based on those values.
R-028: Calculate the median sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display the TOP 5 in descending order.
df_receipt %>%
group_by(store_cd) %>%
summarise(median_amount = median(amount), .groups = "drop") %>% arrange(desc(median_amount)) %>%
slice(1:5)
store_cd | median_amount |
---|---|
<chr> | <dbl> |
S13052 | 190 |
S14010 | 188 |
S14050 | 185 |
S13003 | 180 |
S13018 | 180 |
Commentary :
This code is using the%>%
(pipe) operator to chain together a sequence of operations on a data frame calleddf_receipt
.
Here's a step-by-step explanation of what each operation does:group_by(store_cd)
: This groups the data by thestore_cd
column, so that subsequent operations are performed separately for each store.summarise(median_amount = median(amount), .groups = "drop")
: This creates a summary of the data within each group, calculating the median value of theamount
column and storing it in a new column calledmedian_amount
. The.groups = "drop"
argument tellsdplyr
to drop the grouping information from the resulting data frame (since we don't need it for further analysis).arrange(desc(median_amount))
: This sorts the resulting data frame by themedian_amount
column in descending order (from highest to lowest).slice(1:5)
: This selects the first 5 rows of the resulting data frame. Since the data was grouped bystore_cd
and summarized using themedian
function, this will give us the stores with the highest medianamount
values across all transactions.
Overall, this code is selecting the stores with the highest medianamount
values across all transactions in thedf_receipt
data frame, and returning the top 5 stores based on those values. The difference between this code and the previous one I explained is that this code is using themedian
function instead of themean
function to calculate the central tendency of theamount
values for each store.
R-029: For receipt details data (df_receipt), calculate the mode value of the product code (product_cd) for each shop code (store_cd) and display 10 cases.
# Code example 1
df_receipt %>%
group_by(store_cd, product_cd) %>%
summarise(cnt = n(), .groups = "drop_last") %>%
filter(cnt == cnt %>% max()) %>%
ungroup() %>%
slice(1:10)
store_cd | product_cd | cnt |
---|---|---|
<chr> | <chr> | <int> |
S12007 | P060303001 | 72 |
S12013 | P060303001 | 107 |
S12014 | P060303001 | 65 |
S12029 | P060303001 | 92 |
S12030 | P060303001 | 115 |
S13001 | P060303001 | 67 |
S13002 | P060303001 | 78 |
S13003 | P071401001 | 65 |
S13004 | P060303001 | 88 |
S13005 | P040503001 | 36 |
Commentary :
This code is using thedplyr
package to perform data manipulation on a data frame calleddf_receipt
. Here is a step-by-step explanation of what the code is doing:%>%
is a pipe operator used to chain functions together, so the output of one function becomes the input of the next function.
Thegroup_by
function groups the data frame bystore_cd
andproduct_cd
.
Thesummarise
function calculates a summary statistic for each group. In this case, it is counting the number of rows in each group and creating a new column calledcnt
.
The.groups = "drop_last"
argument is used to drop the last grouping level created by thesummarise
function, which is theproduct_cd
grouping. This is done to make it easier to filter on the maximum count later on.
Thefilter
function selects rows that meet a certain condition. In this case, it is selecting rows where thecnt
column is equal to the maximum value of thecnt
column in the data frame.
Theungroup
function removes all grouping levels from the data frame.
Theslice
function selects a subset of rows based on their position. In this case, it is selecting the first 10 rows of the data frame.
So overall, this code is grouping the data by store and product, calculating the count of each group, filtering for the groups with the highest count, and then selecting the top 10 results.
# Code example 2: using which.max() (multiple occurrences of mode number, narrowed down to one of them)
# Display of the number of cases is omitted
table_product <-table(df_receipt$store_cd,df_receipt$product_cd)
store <- names(table_product[,1])
mode_product <- c()
for (i in 1:length(store)){
mode_product[i] <- names(which.max(table_product[i,]))
}
data.frame(store_cd = store, product_cd = mode_product) %>%
slice(1:10)
store_cd | product_cd |
---|---|
<chr> | <chr> |
S12007 | P060303001 |
S12013 | P060303001 |
S12014 | P060303001 |
S12029 | P060303001 |
S12030 | P060303001 |
S13001 | P060303001 |
S13002 | P060303001 |
S13003 | P071401001 |
S13004 | P060303001 |
S13005 | P040503001 |
Commentary :
This code is using thetable
function andfor
loop to find the mode (most frequent value) of each product code for each store in thedf_receipt
data frame.
Here's a step-by-step explanation of what the code is doing:
Thetable
function creates a contingency table of counts for each combination ofstore_cd
andproduct_cd
indf_receipt
. The resulting object,table_product
, is a matrix where the rows correspond to each store and the columns correspond to each product code.
Thenames
function extracts the store codes from the first column of thetable_product
matrix and saves them as a character vector calledstore
.
An empty vector calledmode_product
is created to store the mode of each product code for each store.
Thefor
loop iterates through each store instore
and finds the mode of each product code for that store. Thewhich.max
function is used to find the index of the maximum count in each row of thetable_product
matrix, and thenames
function is used to extract the product code with that index. The resulting product code is stored in themode_product
vector.
A data frame is created with two columns:store_cd
andproduct_cd
. Thestore_cd
column is populated with the values in thestore
vector, and theproduct_cd
column is populated with the values in themode_product
vector.
The%>%
operator is used to pipe the data frame to theslice
function, which selects the first 10 rows of the data frame.
So overall, this code is finding the mode of each product code for each store indf_receipt
, creating a data frame with these values, and then selecting the top 10 results.
R-030: Calculate the variance of the sales amount (amount) for each shop code (store_cd) against the receipt details data (df_receipt) and display five cases in descending order.
var_sample <- function(x){ var(x) * (length(x) - 1) / length(x) }
df_receipt %>%
group_by(store_cd) %>%
summarise(var_amount = var_sample(amount), .groups = "drop") %>%
arrange(desc(var_amount)) %>% slice(1:5)
store_cd | var_amount |
---|---|
<chr> | <dbl> |
S13052 | 440088.7 |
S14011 | 306314.6 |
S14034 | 296920.1 |
S13001 | 295432.0 |
S13015 | 295294.4 |
Commentary :
This code is using thedplyr
package to calculate the sample variance of theamount
variable for each store in thedf_receipt
data frame, and then selecting the top 5 stores with the highest variance.
Here's a step-by-step explanation of what the code is doing:
Thevar_sample
function is defined, which calculates the sample variance of a vector of valuesx
using the formulavar(x) * (length(x) - 1) / length(x)
. This formula adjusts the variance by dividing bylength(x) - 1
instead oflength(x)
to provide an unbiased estimate of the population variance.
The%>%
operator is used to pipe thedf_receipt
data frame to the next function in the chain.
Thegroup_by
function is used to group the data bystore_cd
.
Thesummarise
function calculates a summary statistic for each group. In this case, it is using thevar_sample
function to calculate the sample variance of theamount
variable for each group, and creating a new column calledvar_amount
.
The.groups = "drop"
argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.
Thearrange
function is used to sort the data frame byvar_amount
in descending order, so that the stores with the highest variance appear first.
The%>%
operator is used to pipe the sorted data frame to theslice
function, which selects the first 5 rows of the data frame.
So overall, this code is grouping the data by store, calculating the sample variance of theamount
variable for each group, sorting the resulting data frame by variance in descending order, and then selecting the top 5 results.
R-031: Calculate the standard deviation of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display 5 cases in descending order.
var_sample <- function(x){ var(x)*(length(x)-1)/length(x) }
std_sample <- function(x){ sqrt(var_sample(x)) }
head(
df_receipt %>%
group_by(store_cd) %>%
summarise(std_amount = std_sample(amount), .groups = "drop") %>%
arrange(desc(std_amount)) ,
n = 5
)
store_cd | std_amount |
---|---|
<chr> | <dbl> |
S13052 | 663.3918 |
S14011 | 553.4569 |
S14034 | 544.9037 |
S13001 | 543.5366 |
S13015 | 543.4099 |
Commentary :
This code is using thedplyr
package to calculate the sample standard deviation of theamount
variable for each store in thedf_receipt
data frame, and then selecting the top 5 stores with the highest standard deviation.
Here's a step-by-step explanation of what the code is doing:
Thevar_sample
function is defined, which calculates the sample variance of a vector of valuesx
using the formulavar(x) * (length(x) - 1) / length(x)
. This formula adjusts the variance by dividing bylength(x) - 1
instead oflength(x)
to provide an unbiased estimate of the population variance.
Thestd_sample
function is defined, which calculates the sample standard deviation of a vector of valuesx
using the formulasqrt(var_sample(x))
.
Thehead
function is used to select the first 5 rows of the resulting data frame.
The%>%
operator is used to pipe thedf_receipt
data frame to the next function in the chain.
Thegroup_by
function is used to group the data bystore_cd
.
Thesummarise
function calculates a summary statistic for each group. In this case, it is using thestd_sample
function to calculate the sample standard deviation of theamount
variable for each group, and creating a new column calledstd_amount
.
The.groups = "drop"
argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.
Thearrange
function is used to sort the data frame bystd_amount
in descending order, so that the stores with the highest standard deviation appear first.
Then = 5
argument is used to limit the output to the top 5 rows.
So overall, this code is grouping the data by store, calculating the sample standard deviation of theamount
variable for each group, sorting the resulting data frame by standard deviation in descending order, and then selecting the top 5 results.
R-032: Find the percentile values for the sales amount (amount) in the receipt details data (df_receipt) in 25% increments.
df_receipt %>%
summarise(amount_25per = quantile(amount, 0.25),
amount_50per = quantile(amount, 0.5),
amount_75per = quantile(amount, 0.75),
amount_100per = quantile(amount, 1.0))
amount_25per | amount_50per | amount_75per | amount_100per |
---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> |
102 | 170 | 288 | 10925 |
Commentary :
This code is using thedplyr
package to calculate the quartiles of theamount
variable in thedf_receipt
data frame.
Here's a step-by-step explanation of what the code is doing:
The%>%
operator is used to pipe thedf_receipt
data frame to the next function in the chain.
Thesummarise
function is used to calculate summary statistics for the data frame. In this case, it is using thequantile
function to calculate the 25th, 50th (median), 75th, and 100th percentiles of theamount
variable.
The resulting data frame will have 1 row and 4 columns, with the column namesamount_25per
,amount_50per
,amount_75per
, andamount_100per
, and the corresponding values of the quartiles.
So overall, this code is calculating the quartiles of theamount
variable in thedf_receipt
data frame, and returning a summary data frame with the quartile values.
R-033: Calculate the average of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and extract those that are 330 or more.
df_receipt %>%
group_by(store_cd) %>%
summarise(mean_amount = mean(amount), .groups = "drop") %>%
filter(mean_amount >= 330)
store_cd | mean_amount |
---|---|
<chr> | <dbl> |
S12013 | 330.1941 |
S13001 | 348.4704 |
S13003 | 350.9155 |
S13004 | 330.9439 |
S13015 | 351.1120 |
S13019 | 330.2086 |
S13020 | 337.8799 |
S13052 | 402.8675 |
S14010 | 348.7913 |
S14011 | 335.7183 |
S14026 | 332.3406 |
S14045 | 330.0821 |
S14047 | 330.0771 |
Commentary :
This code is using thedplyr
package to group thedf_receipt
data frame bystore_cd
, calculate the meanamount
for each group, and then filter the resulting data frame to include only those groups where the meanamount
is greater than or equal to 330.
Here's a step-by-step explanation of what the code is doing:
The%>%
operator is used to pipe thedf_receipt
data frame to the next function in the chain.
Thegroup_by
function is used to group the data bystore_cd
.
Thesummarise
function is used to calculate a summary statistic for each group. In this case, it is using themean
function to calculate the meanamount
for each group, and creating a new column calledmean_amount
.
The.groups = "drop"
argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.
Thefilter
function is used to subset the data frame to only include rows where themean_amount
is greater than or equal to 330.
So overall, this code is grouping the data bystore_cd
, calculating the meanamount
for each group, and then selecting only those groups where the meanamount
is greater than or equal to 330. The resulting data frame will include one row for each qualifying store, with the store code and mean amount.
R-034: For the receipt details data (df_receipt), sum the amount of sales (amount) for each customer ID (customer_id) and find the average for all customers. However, exclude customer IDs starting with “Z” as they represent non-members.
df_mean <- df_receipt %>%
filter(!grepl("^Z", customer_id)) %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop") %>%
summarise(mean_amount = mean(sum_amount))
df_mean$mean_amount
2547.74223452926
Commentary :
This code is using thedplyr
package to calculate the mean amount spent by each customer, after excluding any customer whosecustomer_id
starts with the letter "Z".
Here's a step-by-step explanation of what the code is doing:
The%>%
operator is used to pipe thedf_receipt
data frame to the next function in the chain.
Thefilter
function is used to exclude any rows where thecustomer_id
starts with the letter "Z".
Thegroup_by
function is used to group the data bycustomer_id
.
Thesummarise
function is used to calculate a summary statistic for each group. In this case, it is using thesum
function to calculate the totalamount
spent by each customer, and creating a new column calledsum_amount
.
The.groups = "drop"
argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.
Anothersummarise
function is used to calculate the meansum_amount
across all customers.
The resultingmean_amount
value is extracted using$mean_amount
.
So overall, this code is filtering the data to exclude any rows where thecustomer_id
starts with "Z", grouping the data bycustomer_id
, calculating the totalamount
spent by each customer, calculating the meansum_amount
across all customers, and returning the resulting mean value. The resulting output is a single numeric value representing the mean amount spent by each customer.
R-035: For the receipt details data (df_receipt), sum the sales amount (amount) for each customer ID (customer_id) to obtain the average of all customers, extract the customers who spend more than the average and display 10 items. However, exclude customer IDs starting with “Z” as they represent non-members.
df_sum <- df_receipt %>%
filter(!grepl("^Z", customer_id)) %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop")
df_mean <- df_sum %>%
summarise(mean_amount = mean(sum_amount))
df_sum %>%
filter(sum_amount >= df_mean$mean_amount) %>%
slice(1:10)
customer_id | sum_amount |
---|---|
<chr> | <int> |
CS001115000010 | 3044 |
CS001205000006 | 3337 |
CS001214000009 | 4685 |
CS001214000017 | 4132 |
CS001214000052 | 5639 |
CS001215000040 | 3496 |
CS001304000006 | 3726 |
CS001305000005 | 3485 |
CS001305000011 | 4370 |
CS001315000180 | 3300 |
Commentary :
This code is using thedplyr
package to filter the customers who have spent an amount greater than or equal to the mean amount spent by all customers, after excluding any customer whosecustomer_id
starts with the letter "Z". It then selects the top 10 customers based on their totalamount
spent.
Here's a step-by-step explanation of what the code is doing:
The%>%
operator is used to pipe thedf_receipt
data frame to the next function in the chain.
Thefilter
function is used to exclude any rows where thecustomer_id
starts with the letter "Z".
Thegroup_by
function is used to group the data bycustomer_id
.
Thesummarise
function is used to calculate a summary statistic for each group. In this case, it is using thesum
function to calculate the totalamount
spent by each customer, and creating a new column calledsum_amount
.
The.groups = "drop"
argument is used to drop the grouping information from the output, making it a flat data frame rather than a grouped tibble.
Thedf_sum
data frame is created by storing the output of the above operations.
Thesummarise
function is used again to calculate the meansum_amount
across all customers indf_sum
.
The resultingmean_amount
value is extracted using$mean_amount
.
The%>%
operator is used to pipedf_sum
to the next function in the chain.
Thefilter
function is used to select only those rows where thesum_amount
is greater than or equal to themean_amount
calculated in step 7.
Theslice
function is used to select only the top 10 rows of the resulting data frame, based on their order in the original data frame.
So overall, this code is filtering the data to exclude any rows where thecustomer_id
starts with "Z", grouping the data bycustomer_id
, calculating the totalamount
spent by each customer, calculating the meansum_amount
across all customers, creating a data frame of customers whose totalamount
spent is greater than or equal to the mean amount spent by all customers, and selecting the top 10 customers based on their totalamount
spent.
R-036: Combine receipt details data (df_receipt) and shop data (df_store) internally and display all items of receipt details data and store name (store_name) of shop data for 10 items.
inner_join(df_receipt, df_store[c("store_cd", "store_name")], by = "store_cd") %>% slice(1:10)
sales_ymd | sales_epoch | store_cd | receipt_no | receipt_sub_no | customer_id | product_cd | quantity | amount | store_name |
---|---|---|---|---|---|---|---|---|---|
<int> | <int> | <chr> | <int> | <int> | <chr> | <chr> | <int> | <int> | <chr> |
20181103 | 1541203200 | S14006 | 112 | 1 | CS006214000001 | P070305012 | 1 | 158 | 葛が谷店 |
20181118 | 1542499200 | S13008 | 1132 | 2 | CS008415000097 | P070701017 | 1 | 81 | 成城店 |
20170712 | 1499817600 | S14028 | 1102 | 1 | CS028414000014 | P060101005 | 1 | 170 | 二ツ橋店 |
20190205 | 1549324800 | S14042 | 1132 | 1 | ZZ000000000000 | P050301001 | 1 | 25 | 新山下店 |
20180821 | 1534809600 | S14025 | 1102 | 2 | CS025415000050 | P060102007 | 1 | 90 | 大和店 |
20190605 | 1559692800 | S13003 | 1112 | 1 | CS003515000195 | P050102002 | 1 | 138 | 狛江店 |
20181205 | 1543968000 | S14024 | 1102 | 2 | CS024514000042 | P080101005 | 1 | 30 | 三田店 |
20190922 | 1569110400 | S14040 | 1102 | 1 | CS040415000178 | P070501004 | 1 | 128 | 長津田店 |
20170504 | 1493856000 | S13020 | 1112 | 2 | ZZ000000000000 | P071302010 | 1 | 770 | 十条仲原店 |
20191010 | 1570665600 | S14027 | 1102 | 1 | CS027514000015 | P071101003 | 1 | 680 | 南藤沢店 |
Commentary :
This code performs an inner join operation between two data frames,df_receipt
anddf_store
, based on a common column "store_cd". The result of this operation is then piped into theslice()
function to select the first 10 rows of the resulting data frame.
Let's break down the code step by step:inner_join(df_receipt, df_store[c("store_cd", "store_name")], by = "store_cd")
: This performs an inner join between thedf_receipt
data frame and a subset ofdf_store
that only contains the columns "store_cd" and "store_name". The join is done on the "store_cd" column which is present in both data frames. The result is a new data frame that contains all the rows where there is a match between the "store_cd" column in both data frames.%>%
: This is the pipe operator which takes the output of the previous command and pipes it as input to the next command.slice(1:10)
: This function selects the first 10 rows of the resulting data frame from the previous command.
So, the overall purpose of this code is to select the first 10 rows of the result of an inner join operation betweendf_receipt
and a subset ofdf_store
. The resulting data frame will only contain rows where there is a match between the "store_cd" column in both data frames, and it will have the columns "store_cd" and "store_name" fromdf_store
, along with all the columns fromdf_receipt
.
R-037: Join product data (df_product) and category data (df_category) internally and display all items of the product data and 10 category sub-category names (category_small_name) of the category data.
inner_join(df_product, df_category[c("category_small_cd", "category_small_name")], by = "category_small_cd") %>% slice(1:10)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_small_name |
---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <int> | <int> | <chr> |
P040101001 | 04 | 0401 | 040101 | 198 | 149 | 弁当類 |
P040101002 | 04 | 0401 | 040101 | 218 | 164 | 弁当類 |
P040101003 | 04 | 0401 | 040101 | 230 | 173 | 弁当類 |
P040101004 | 04 | 0401 | 040101 | 248 | 186 | 弁当類 |
P040101005 | 04 | 0401 | 040101 | 268 | 201 | 弁当類 |
P040101006 | 04 | 0401 | 040101 | 298 | 224 | 弁当類 |
P040101007 | 04 | 0401 | 040101 | 338 | 254 | 弁当類 |
P040101008 | 04 | 0401 | 040101 | 420 | 315 | 弁当類 |
P040101009 | 04 | 0401 | 040101 | 498 | 374 | 弁当類 |
P040101010 | 04 | 0401 | 040101 | 580 | 435 | 弁当類 |
Commentary :
This code performs an inner join operation between two data frames,df_product
anddf_category
, based on a common column "category_small_cd". The result of this operation is then piped into theslice()
function to select the first 10 rows of the resulting data frame.
Let's break down the code step by step:inner_join(df_product, df_category[c("category_small_cd", "category_small_name")], by = "category_small_cd")
: This performs an inner join between thedf_product
data frame and a subset ofdf_category
that only contains the columns "category_small_cd" and "category_small_name". The join is done on the "category_small_cd" column which is present in both data frames. The result is a new data frame that contains all the rows where there is a match between the "category_small_cd" column in both data frames.%>%
: This is the pipe operator which takes the output of the previous command and pipes it as input to the next command.slice(1:10)
: This function selects the first 10 rows of the resulting data frame from the previous command.
So, the overall purpose of this code is to select the first 10 rows of the result of an inner join operation betweendf_product
and a subset ofdf_category
. The resulting data frame will only contain rows where there is a match between the "category_small_cd" column in both data frames, and it will have the columns "category_small_cd" and "category_small_name" fromdf_category
, along with all the columns fromdf_product
.
R-038: Calculate the total sales amount for each customer from the customer data (df_customer) and receipt details data (df_receipt) and display 10 items. However, for customers with no sales records, the sales amount should be displayed as 0. Customers whose gender code (gender_cd) is female (1) should be included, and non-members (whose customer ID starts with “Z”) should be excluded.
df_sum <- df_receipt %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop")
df_target <- df_customer %>%
filter(gender_cd == "1" & !grepl("^Z", customer_id))
left_join(df_target["customer_id"], df_sum, by = "customer_id") %>%
replace_na(list(sum_amount = 0)) %>% slice(1:10)
customer_id | sum_amount |
---|---|
<chr> | <dbl> |
CS021313000114 | 0 |
CS031415000172 | 5088 |
CS028811000001 | 0 |
CS001215000145 | 875 |
CS015414000103 | 3122 |
CS033513000180 | 868 |
CS035614000014 | 0 |
CS011215000048 | 3444 |
CS009413000079 | 0 |
CS040412000191 | 210 |
Commentary :
This code performs several operations to obtain a summary table for the customer data. The result of this operation is then piped into thereplace_na()
function to replace any missing values with a default value of 0. Finally, the result is piped into theslice()
function to select the first 10 rows of the resulting data frame.
Let's break down the code step by step:df_sum <- df_receipt %>% group_by(customer_id) %>% summarise(sum_amount = sum(amount), .groups = "drop")
: This code first groups thedf_receipt
data frame by the "customer_id" column, and then summarizes the "amount" column using thesum()
function. The resulting data frame,df_sum
, contains the total amount spent by each customer. The.groups
parameter is set to "drop" to remove the grouping information from the output.df_target <- df_customer %>% filter(gender_cd == "1" & !grepl("^Z", customer_id))
: This code filters thedf_customer
data frame to only include rows where the "gender_cd" column equals "1" and the "customer_id" column does not start with the letter "Z". The resulting data frame,df_target
, contains only male customers whose IDs do not start with "Z".left_join(df_target["customer_id"], df_sum, by = "customer_id")
: This performs a left join operation betweendf_target["customer_id"]
anddf_sum
based on the "customer_id" column. The result is a new data frame that contains all the rows fromdf_target
along with the corresponding total amount spent by each customer fromdf_sum
.replace_na(list(sum_amount = 0))
: This replaces any missing values in the "sum_amount" column with a default value of 0.%>% slice(1:10)
: This function selects the first 10 rows of the resulting data frame from the previous command.
So, the overall purpose of this code is to obtain a summary table of the total amount spent by male customers whose IDs do not start with the letter "Z". The resulting data frame will have two columns, "customer_id" and "sum_amount", where "sum_amount" is the total amount spent by each customer. Any missing values in the "sum_amount" column will be replaced with 0, and the resulting data frame will be limited to the first 10 rows.
R-039: From the receipt details data (df_receipt), create a data set containing the top 20 customers with the highest number of days of sales and a data set containing the top 20 customers with the highest total sales value, respectively, and furthermore, merge these two completely externally. However, exclude non-members (whose customer ID starts with “Z”).
# Code example 1
df_data <- df_receipt %>%
filter(!grepl("^Z", customer_id)) %>%
group_by(customer_id)
df_cnt <- df_data %>%
summarise(come_days = n_distinct(sales_ymd), .groups = "drop") %>%
arrange(desc(come_days), customer_id) %>%
slice(1:20)
df_sum <- df_data %>% summarise(sum_amount = sum(amount), .groups = "drop") %>%
arrange(desc(sum_amount)) %>%
slice(1:20)
full_join(df_cnt, df_sum, by = "customer_id")
customer_id | come_days | sum_amount |
---|---|---|
<chr> | <int> | <int> |
CS040214000008 | 23 | NA |
CS010214000010 | 22 | 18585 |
CS015415000185 | 22 | 20153 |
CS010214000002 | 21 | NA |
CS028415000007 | 21 | 19127 |
CS016415000141 | 20 | 18372 |
CS017415000097 | 20 | 23086 |
CS014214000023 | 19 | NA |
CS021514000045 | 19 | NA |
CS021515000172 | 19 | NA |
CS022515000226 | 19 | NA |
CS031414000051 | 19 | 19202 |
CS039414000052 | 19 | NA |
CS007515000107 | 18 | NA |
CS014415000077 | 18 | NA |
CS021515000056 | 18 | NA |
CS021515000211 | 18 | NA |
CS022515000028 | 18 | NA |
CS030214000008 | 18 | NA |
CS031414000073 | 18 | NA |
CS001605000009 | NA | 18925 |
CS006515000023 | NA | 18372 |
CS011414000106 | NA | 18338 |
CS038415000104 | NA | 17847 |
CS035414000024 | NA | 17615 |
CS021515000089 | NA | 17580 |
CS032414000072 | NA | 16563 |
CS016415000101 | NA | 16348 |
CS011415000006 | NA | 16094 |
CS034415000047 | NA | 16083 |
CS007514000094 | NA | 15735 |
CS009414000059 | NA | 15492 |
CS030415000034 | NA | 15468 |
CS015515000034 | NA | 15300 |
Commentary :
This code performs several operations to obtain two summary tables for the customer data, and then performs a full join operation between these two tables based on the "customer_id" column. The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.
Let's break down the code step by step:df_data <- df_receipt %>% filter(!grepl("^Z", customer_id)) %>% group_by(customer_id)
: This code first filters thedf_receipt
data frame to exclude any rows where the "customer_id" column starts with the letter "Z". The resulting data frame is then grouped by the "customer_id" column. The resultingdf_data
data frame contains all the rows fromdf_receipt
with "Z" customer IDs removed, and is grouped by "customer_id".df_cnt <- df_data %>% summarise(come_days = n_distinct(sales_ymd), .groups = "drop") %>% arrange(desc(come_days), customer_id) %>% slice(1:20)
: This code first summarizes thedf_data
data frame by counting the number of distinct "sales_ymd" values for each customer using then_distinct()
function. The resulting data frame,df_cnt
, contains the total number of unique sales days for each customer. The data frame is then arranged in descending order of "come_days" followed by "customer_id". The first 20 rows of the resulting data frame are then selected using theslice()
function.df_sum <- df_data %>% summarise(sum_amount = sum(amount), .groups = "drop") %>% arrange(desc(sum_amount)) %>% slice(1:20)
: This code summarizes thedf_data
data frame by calculating the sum of the "amount" column for each customer using thesum()
function. The resulting data frame,df_sum
, contains the total amount spent by each customer. The data frame is then arranged in descending order of "sum_amount", and the first 20 rows of the resulting data frame are selected using theslice()
function.full_join(df_cnt, df_sum, by = "customer_id")
: This code performs a full join operation betweendf_cnt
anddf_sum
based on the "customer_id" column. The result is a new data frame that contains all the rows from both data frames, matched by "customer_id". The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.
So, the overall purpose of this code is to obtain two summary tables for the customer data, and then perform a full join operation between these two tables. The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.
# Code example 2
# Code example using slice_max.
# Rewrite with_ties = FALSE to with_ties = TRUE to change the form to include the same rank
df_data <- df_receipt %>%
filter(!grepl("^Z", customer_id)) %>%
group_by(customer_id)
df_cnt <- df_data %>%
summarise(come_days = n_distinct(sales_ymd)) %>%
slice_max(come_days, n = 20, with_ties = FALSE)
df_sum <- df_data %>%
summarise(sum_amount = sum(amount)) %>%
slice_max(sum_amount, n = 20, with_ties = FALSE)
full_join(df_cnt, df_sum, by = "customer_id")
customer_id | come_days | sum_amount |
---|---|---|
<chr> | <int> | <int> |
CS040214000008 | 23 | NA |
CS010214000010 | 22 | 18585 |
CS015415000185 | 22 | 20153 |
CS010214000002 | 21 | NA |
CS028415000007 | 21 | 19127 |
CS016415000141 | 20 | 18372 |
CS017415000097 | 20 | 23086 |
CS014214000023 | 19 | NA |
CS021514000045 | 19 | NA |
CS021515000172 | 19 | NA |
CS022515000226 | 19 | NA |
CS031414000051 | 19 | 19202 |
CS039414000052 | 19 | NA |
CS007515000107 | 18 | NA |
CS014415000077 | 18 | NA |
CS021515000056 | 18 | NA |
CS021515000211 | 18 | NA |
CS022515000028 | 18 | NA |
CS030214000008 | 18 | NA |
CS031414000073 | 18 | NA |
CS001605000009 | NA | 18925 |
CS006515000023 | NA | 18372 |
CS011414000106 | NA | 18338 |
CS038415000104 | NA | 17847 |
CS035414000024 | NA | 17615 |
CS021515000089 | NA | 17580 |
CS032414000072 | NA | 16563 |
CS016415000101 | NA | 16348 |
CS011415000006 | NA | 16094 |
CS034415000047 | NA | 16083 |
CS007514000094 | NA | 15735 |
CS009414000059 | NA | 15492 |
CS030415000034 | NA | 15468 |
CS015515000034 | NA | 15300 |
Commentary :
This code is similar to the previous code we discussed, but it uses theslice_max()
function instead of thearrange()
function to obtain the top 20 customers based on the number of distinct sales days and the total amount spent.
Here is a breakdown of the code:df_data <- df_receipt %>% filter(!grepl("^Z", customer_id)) %>% group_by(customer_id)
: This code is similar to the previous code, where it filters thedf_receipt
data frame to exclude any rows where the "customer_id" column starts with the letter "Z". The resulting data frame is then grouped by the "customer_id" column.df_cnt <- df_data %>% summarise(come_days = n_distinct(sales_ymd)) %>% slice_max(come_days, n = 20, with_ties = FALSE)
: This code summarizes thedf_data
data frame by counting the number of distinct "sales_ymd" values for each customer using then_distinct()
function. The resulting data frame,df_cnt
, contains the total number of unique sales days for each customer. Theslice_max()
function is used to select the top 20 rows based on the "come_days" column, wherewith_ties = FALSE
specifies that only the top 20 distinct values should be selected.df_sum <- df_data %>% summarise(sum_amount = sum(amount)) %>% slice_max(sum_amount, n = 20, with_ties = FALSE)
: This code summarizes thedf_data
data frame by calculating the sum of the "amount" column for each customer using thesum()
function. The resulting data frame,df_sum
, contains the total amount spent by each customer. Theslice_max()
function is used to select the top 20 rows based on the "sum_amount" column, wherewith_ties = FALSE
specifies that only the top 20 distinct values should be selected.full_join(df_cnt, df_sum, by = "customer_id")
: This code performs a full join operation betweendf_cnt
anddf_sum
based on the "customer_id" column. The result is a new data frame that contains all the rows from both data frames, matched by "customer_id". The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent.
So, the overall purpose of this code is to obtain two summary tables for the customer data, and then perform a full join operation between these two tables. The resulting data frame contains the top 20 customers with the most distinct sales days and the top 20 customers with the highest total amount spent, obtained using theslice_max()
function instead of thearrange()
function.
R-040: You want to create data combining all shops and all products. Direct product the shop data (df_store) and the product data (df_product) and calculate the number of cases.
df_store_tmp <- df_store
df_product_tmp <- df_product
df_store_tmp["key"] <- 0
df_product_tmp["key"] <- 0
nrow(full_join(df_store_tmp, df_product_tmp, by = "key"))
531590
Commentary :
This code performs a full join operation between thedf_store
anddf_product
data frames based on a common "key" column that is added to both data frames. The resulting data frame is then used to calculate the number of rows using thenrow()
function.
Here is a breakdown of the code:df_store_tmp <- df_store
: This code creates a new data frame calleddf_store_tmp
that is a copy of thedf_store
data frame. This is done to avoid modifying the original data frame.df_product_tmp <- df_product
: This code creates a new data frame calleddf_product_tmp
that is a copy of thedf_product
data frame. This is done to avoid modifying the original data frame.df_store_tmp["key"] <- 0
: This code adds a new column called "key" to thedf_store_tmp
data frame and initializes all the values to zero. This column is added to provide a common column for the join operation.df_product_tmp["key"] <- 0
: This code adds a new column called "key" to thedf_product_tmp
data frame and initializes all the values to zero. This column is added to provide a common column for the join operation.full_join(df_store_tmp, df_product_tmp, by = "key")
: This code performs a full join operation between thedf_store_tmp
anddf_product_tmp
data frames based on the common "key" column. Since both data frames have the same value in this column for all rows, a full join will result in a Cartesian product of the two data frames, meaning that every row fromdf_store_tmp
will be paired with every row fromdf_product_tmp
.nrow(full_join(df_store_tmp, df_product_tmp, by = "key"))
: This code calculates the number of rows in the resulting data frame from the full join operation using thenrow()
function. The result is the total number of combinations of rows betweendf_store
anddf_product
.
In summary, this code creates copies of thedf_store
anddf_product
data frames, adds a common "key" column to both data frames with all values set to zero, performs a full join operation between the two data frames based on the common "key" column, and calculates the number of resulting rows. The purpose of this code is to determine the total number of combinations of rows between thedf_store
anddf_product
data frames.
Comment