参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック(構造化データ加工編)」
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
-
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers.
-
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
You can get preprocess_knock files to practice python, SQL, R from my github account
Introduction
- To begin with, run the following cells
- Import the required libraries and read the data from the database (PostgreSQL)
- The libraries that are expected to be used are imported in the following cells
- If there are other libraries you wish to use, install them as appropriate using install.packages().
- Names, addresses, etc. are dummy data and are not real.
require("RPostgreSQL")
require("tidyr")
require("dplyr")
require("stringr")
require("caret")
require("lubridate")
require("rsample")
require("recipes")
require("themis")
host <- "db"
port <- Sys.getenv()["PG_PORT"]
dbname <- Sys.getenv()["PG_DATABASE"]
user <- Sys.getenv()["PG_USER"]
password <- Sys.getenv()["PG_PASSWORD"]
con <- dbConnect(PostgreSQL(), host=host, port=port, dbname=dbname, user=user, password=password)
df_customer <- dbGetQuery(con,"SELECT * FROM customer")
df_category <- dbGetQuery(con,"SELECT * FROM category")
df_product <- dbGetQuery(con,"SELECT * FROM product")
df_receipt <- dbGetQuery(con,"SELECT * FROM receipt")
df_store <- dbGetQuery(con,"SELECT * FROM store")
df_geocode <- dbGetQuery(con,"SELECT * FROM geocode")
Loading required package: RPostgreSQL Loading required package: DBI Loading required package: tidyr Loading required package: dplyr Attaching package: ‘dplyr’ The following objects are masked from ‘package:stats’: filter, lag The following objects are masked from ‘package:base’: intersect, setdiff, setequal, union Loading required package: stringr Loading required package: caret Loading required package: ggplot2 Loading required package: lattice Loading required package: lubridate Attaching package: ‘lubridate’ The following objects are masked from ‘package:base’: date, intersect, setdiff, union Loading required package: rsample Loading required package: recipes Attaching package: ‘recipes’ The following object is masked from ‘package:stringr’: fixed The following object is masked from ‘package:stats’: step Loading required package: themis Registered S3 methods overwritten by 'themis': method from bake.step_downsample recipes bake.step_upsample recipes prep.step_downsample recipes prep.step_upsample recipes tidy.step_downsample recipes tidy.step_upsample recipes tunable.step_downsample recipes tunable.step_upsample recipes Attaching package: ‘themis’ The following objects are masked from ‘package:recipes’: step_downsample, step_upsample
Practice Questions
R-041: Sum up the sales amount (amount) of the receipt details data (df_receipt) by date (sales_ymd) and calculate the increase or decrease in sales amount from the date of the last sale. And display 10 results.
df_receipt %>%
group_by(sales_ymd) %>%
summarise(sum_amount = sum(amount), .groups = "drop") %>%
mutate(lag_ymd = lag(sales_ymd), lag_amount = lag(sum_amount), diff_amount = sum_amount - lag_amount) %>%
slice(1:10)
sales_ymd | sum_amount | lag_ymd | lag_amount | diff_amount |
---|---|---|---|---|
<int> | <int> | <int> | <int> | <int> |
20170101 | 33723 | NA | NA | NA |
20170102 | 24165 | 20170101 | 33723 | -9558 |
20170103 | 27503 | 20170102 | 24165 | 3338 |
20170104 | 36165 | 20170103 | 27503 | 8662 |
20170105 | 37830 | 20170104 | 36165 | 1665 |
20170106 | 32387 | 20170105 | 37830 | -5443 |
20170107 | 23415 | 20170106 | 32387 | -8972 |
20170108 | 24737 | 20170107 | 23415 | 1322 |
20170109 | 26718 | 20170108 | 24737 | 1981 |
20170110 | 20143 | 20170109 | 26718 | -6575 |
Commentary :
This code uses thedplyr
package in R to manipulate a data frame calleddf_receipt
. Here is a step-by-step explanation of what the code is doing:df_receipt %>%
- This is a pipe operator in R which passes thedf_receipt
data frame to the next function in the pipeline as its first argument.group_by(sales_ymd) %>%
- This groups the data frame by thesales_ymd
column, which contains dates or date-time data. The%>%
operator again pipes this grouped data frame to the next function in the pipeline.summarise(sum_amount = sum(amount), .groups = "drop") %>%
- This summarizes the grouped data frame by calculating the sum of theamount
column for each uniquesales_ymd
group. The resulting data frame will have two columns:sales_ymd
andsum_amount
. The.groups = "drop"
argument is used to drop the grouping information added bygroup_by
, since it is no longer needed. The%>%
operator again pipes this summarized data frame to the next function in the pipeline.mutate(lag_ymd = lag(sales_ymd), lag_amount = lag(sum_amount), diff_amount = sum_amount - lag_amount) %>%
- This adds three new columns to the summarized data frame. The first two columns,lag_ymd
andlag_amount
, contain the lagged values of thesales_ymd
andsum_amount
columns, respectively. The third column,diff_amount
, contains the difference between thesum_amount
andlag_amount
columns. The%>%
operator again pipes this modified data frame to the next function in the pipeline.slice(1:10)
- This selects the first 10 rows of the modified data frame. The resulting data frame will have the columnssales_ymd
,sum_amount
,lag_ymd
,lag_amount
, anddiff_amount
, and will contain the first 10 rows of the originaldf_receipt
data frame after it has been grouped, summarized, and modified.
Overall, this code performs some kind of time-series analysis on sales data, calculating the total amount of sales for each date, and then calculating the difference between each day's sales and the previous day's sales. The resulting data frame should give insight into the daily sales trends over time.
R-042: Tabulate the sales amount (amount) of the receipt details data (df_receipt) by date (sales_ymd), and for each date, combine the data of the last, previous and three previous sales dates. Then display 10 results.
# Code example 1: Vertical holding case
df_sum <-df_receipt %>%
group_by(sales_ymd) %>%
summarise(sum_amount = sum(amount), .groups = "drop")
for (i in 1:3) {
df_tmp <- df_sum %>%
mutate(lag_ymd = lag(sales_ymd, n = i),
lag_amount = lag(sum_amount, n = i))
if (i == 1) {
df_lag <- df_tmp
}
else {
df_lag <- rbind(df_lag, df_tmp)
}
}
df_lag %>%
drop_na(everything()) %>%
arrange(sales_ymd, lag_ymd) %>%
slice(1:10)
sales_ymd | sum_amount | lag_ymd | lag_amount |
---|---|---|---|
<int> | <int> | <int> | <int> |
20170102 | 24165 | 20170101 | 33723 |
20170103 | 27503 | 20170101 | 33723 |
20170103 | 27503 | 20170102 | 24165 |
20170104 | 36165 | 20170101 | 33723 |
20170104 | 36165 | 20170102 | 24165 |
20170104 | 36165 | 20170103 | 27503 |
20170105 | 37830 | 20170102 | 24165 |
20170105 | 37830 | 20170103 | 27503 |
20170105 | 37830 | 20170104 | 36165 |
20170106 | 32387 | 20170103 | 27503 |
Commentary :
This code is also manipulating a data frame calleddf_receipt
using thedplyr
package in R. Here is a step-by-step explanation of what the code is doing:df_sum <-df_receipt %>% group_by(sales_ymd) %>% summarise(sum_amount = sum(amount), .groups = "drop")
- This is similar to the first line of code in the previous example, which groups thedf_receipt
data frame by thesales_ymd
column, calculates the sum of theamount
column for each uniquesales_ymd
group, and stores the resulting data frame in a new variable calleddf_sum
.for (i in 1:3) {
- This initiates afor
loop that will iterate 3 times.df_tmp <- df_sum %>% mutate(lag_ymd = lag(sales_ymd, n = i), lag_amount = lag(sum_amount, n = i))
- This creates a new temporary data frame calleddf_tmp
that is created by lagging thesales_ymd
andsum_amount
columns in thedf_sum
data frame byi
days using thelag()
function fromdplyr
. The resulting data frame will have three columns:sales_ymd
,sum_amount
,lag_ymd
, andlag_amount
.if (i == 1) { df_lag <- df_tmp }
- If this is the first iteration of the loop, thedf_lag
data frame is set equal todf_tmp
.else { df_lag <- rbind(df_lag, df_tmp) }
- If this is not the first iteration of the loop,df_tmp
is appended to the existingdf_lag
data frame usingrbind()
function fromdplyr
.}
- Thefor
loop ends here.df_lag %>% drop_na(everything()) %>% arrange(sales_ymd, lag_ymd) %>% slice(1:10)
- This takes thedf_lag
data frame that was created in the loop, drops any rows with missing data usingdrop_na()
, arranges the rows bysales_ymd
andlag_ymd
usingarrange()
, and then selects the first 10 rows of the resulting data frame usingslice()
.
The purpose of this code is to create a new data frame calleddf_lag
that contains the lagged values of thesales_ymd
andsum_amount
columns for each of the previous 3 days. The resulting data frame can be used to examine the sales trends over the previous 3 days, and may be useful in predicting future sales. The final line of code simply selects the first 10 rows of the resulting data frame for display or further analysis.
# Code example 2: Horizontal holding case
df_sum <-df_receipt %>%
group_by(sales_ymd) %>%
summarise(sum_amount = sum(amount), .groups = "drop")
for (i in 1:3) {
col_name_1 <- paste("lag_ymd_", i , sep="_")
col_name_2 <- paste("lag_amount", i , sep="_")
df_tmp <- df_sum %>%
mutate(!!col_name_1 := lag(sales_ymd, n = i),
!!col_name_2 := lag(sum_amount, n = i))
if (i == 1) {
df_lag <- df_tmp
}
else {
df_lag <- cbind(df_lag, df_tmp[c(col_name_1, col_name_2)])
}
}
df_lag %>%
drop_na(everything()) %>%
arrange(sales_ymd) %>%
slice(1:10)
sales_ymd | sum_amount | lag_ymd__1 | lag_amount_1 | lag_ymd__2 | lag_amount_2 | lag_ymd__3 | lag_amount_3 |
---|---|---|---|---|---|---|---|
<int> | <int> | <int> | <int> | <int> | <int> | <int> | <int> |
20170104 | 36165 | 20170103 | 27503 | 20170102 | 24165 | 20170101 | 33723 |
20170105 | 37830 | 20170104 | 36165 | 20170103 | 27503 | 20170102 | 24165 |
20170106 | 32387 | 20170105 | 37830 | 20170104 | 36165 | 20170103 | 27503 |
20170107 | 23415 | 20170106 | 32387 | 20170105 | 37830 | 20170104 | 36165 |
20170108 | 24737 | 20170107 | 23415 | 20170106 | 32387 | 20170105 | 37830 |
20170109 | 26718 | 20170108 | 24737 | 20170107 | 23415 | 20170106 | 32387 |
20170110 | 20143 | 20170109 | 26718 | 20170108 | 24737 | 20170107 | 23415 |
20170111 | 24287 | 20170110 | 20143 | 20170109 | 26718 | 20170108 | 24737 |
20170112 | 23526 | 20170111 | 24287 | 20170110 | 20143 | 20170109 | 26718 |
20170113 | 28004 | 20170112 | 23526 | 20170111 | 24287 | 20170110 | 20143 |
Commentary :
This code is also manipulating a data frame calleddf_receipt
using thedplyr
package in R. Here is a step-by-step explanation of what the code is doing:df_sum <-df_receipt %>% group_by(sales_ymd) %>% summarise(sum_amount = sum(amount), .groups = "drop")
- This is similar to the first line of code in the previous examples, which groups thedf_receipt
data frame by thesales_ymd
column, calculates the sum of theamount
column for each uniquesales_ymd
group, and stores the resulting data frame in a new variable calleddf_sum
.for (i in 1:3) {
- This initiates afor
loop that will iterate 3 times.col_name_1 <- paste("lag_ymd_", i , sep="_")
- This creates a new character string variable calledcol_name_1
that concatenates the prefix "lag_ymd_" with the current value ofi
.col_name_2 <- paste("lag_amount", i , sep="_")
- This creates a new character string variable calledcol_name_2
that concatenates the prefix "lag_amount_" with the current value ofi
.df_tmp <- df_sum %>% mutate(!!col_name_1 := lag(sales_ymd, n = i), !!col_name_2 := lag(sum_amount, n = i))
- This creates a new temporary data frame calleddf_tmp
that is created by lagging thesales_ymd
andsum_amount
columns in thedf_sum
data frame byi
days using thelag()
function fromdplyr
. However, instead of adding new columns with static names like in the previous example, this code dynamically creates column names by using the string variables created in steps 3 and 4. The!!
operator is used to unquote the variable names so that they are evaluated as column names.if (i == 1) { df_lag <- df_tmp }
- If this is the first iteration of the loop, thedf_lag
data frame is set equal todf_tmp
.else { df_lag <- cbind(df_lag, df_tmp[c(col_name_1, col_name_2)]) }
- If this is not the first iteration of the loop, thedf_tmp
data frame is appended to the existingdf_lag
data frame usingcbind()
. However, instead of selecting specific columns like in the previous example, this code dynamically selects the columns whose names match the values ofcol_name_1
andcol_name_2
.}
- Thefor
loop ends here.df_lag %>% drop_na(everything()) %>% arrange(sales_ymd) %>% slice(1:10)
- This takes thedf_lag
data frame that was created in the loop, drops any rows with missing data usingdrop_na()
, arranges the rows bysales_ymd
usingarrange()
, and then selects the first 10 rows of the resulting data frame usingslice()
.
The purpose of this code is similar to the previous example, which is to create a new data frame calleddf_lag
that contains the lagged values of thesales_ymd
andsum_amount
columns for each of the previous 3 days. However, instead of hard-coding column names, this code dynamically generates
R-043: Combine receipt details data (df_receipt) and customer data (df_customer) to create sales summary data, summing the sales amount (amount) by gender code (gender_cd) and age (calculated from age). Gender_cd should be 0 for male, 1 for female and 9 for unknown.
However, the item structure should consist of four items: age, female sales amount, male sales amount and sales amount of unknown gender (vertical cross tabulation of age and horizontal cross tabulation of gender). In addition, the age shall be in deciles of 10 years.
df_sales_summary <- df_customer[c("customer_id", "gender_cd", "age")] %>%
mutate(era = trunc(age / 10) * 10) %>%
inner_join(df_receipt, by = "customer_id") %>%
group_by(gender_cd, era) %>%
summarise(sum_amount = sum(amount), .groups = "drop") %>%
spread( gender_cd, sum_amount, fill = 0) %>%
rename(male = "0", female = "1", unknown = "9")
df_sales_summary
era | male | female | unknown |
---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> |
10 | 1591 | 149836 | 4317 |
20 | 72940 | 1363724 | 44328 |
30 | 177322 | 693047 | 50441 |
40 | 19355 | 9320791 | 483512 |
50 | 54320 | 6685192 | 342923 |
60 | 272469 | 987741 | 71418 |
70 | 13435 | 29764 | 2427 |
80 | 46360 | 262923 | 5111 |
90 | 0 | 6260 | 0 |
Commentary :
This code creates a summary of sales amounts by customer gender and age era (i.e., decade of age), using data from two data frames:df_customer
anddf_receipt
.
First, the code selects three columns fromdf_customer
(customer_id
,gender_cd
, andage
) and creates a new variableera
by rounding down age to the nearest decade.
Next, it joins thedf_receipt
data frame bycustomer_id
, creating a new data frame with information about customer gender, age era, and sales amounts.
The data is then grouped bygender_cd
andera
, and the total sales amount is calculated usingsummarise()
.spread()
is used to pivot the data so that there is one row for each era, with columns for total sales amounts for each gender.
Finally,rename()
is used to give more descriptive names to the columns representing male, female, and unknown gender.
The result is a data frame calleddf_sales_summary
that summarizes total sales amounts by gender and age era.
R-044: The sales summary data (df_sales_summary) created in 043 had sales by gender carried horizontally. From this data, convert the gender into three items: age, gender code and sales amount, by having the gender carried vertically. The gender code should be “00” for male, “01” for female and “99” for unknown.
df_sales_summary %>%
gather(key = gender_cd, value = sum_amount, male, female, unknown) %>%
mutate(gender_cd = case_when(
gender_cd == "male" ~ "00",
gender_cd == "female" ~ "01",
gender_cd == "unknown" ~ "99",
))
era | gender_cd | sum_amount |
---|---|---|
<dbl> | <chr> | <dbl> |
10 | 00 | 1591 |
20 | 00 | 72940 |
30 | 00 | 177322 |
40 | 00 | 19355 |
50 | 00 | 54320 |
60 | 00 | 272469 |
70 | 00 | 13435 |
80 | 00 | 46360 |
90 | 00 | 0 |
10 | 01 | 149836 |
20 | 01 | 1363724 |
30 | 01 | 693047 |
40 | 01 | 9320791 |
50 | 01 | 6685192 |
60 | 01 | 987741 |
70 | 01 | 29764 |
80 | 01 | 262923 |
90 | 01 | 6260 |
10 | 99 | 4317 |
20 | 99 | 44328 |
30 | 99 | 50441 |
40 | 99 | 483512 |
50 | 99 | 342923 |
60 | 99 | 71418 |
70 | 99 | 2427 |
80 | 99 | 5111 |
90 | 99 | 0 |
Commentary :
This code modifies thedf_sales_summary
data frame that was created in the previous code block.gather()
function is used to transform the data from wide format to long format by stacking the columnsmale
,female
, andunknown
into a single column calledsum_amount
, and creating a new column calledgender_cd
to store the previous column names (male
,female
, orunknown
).mutate()
function is used to recode the values in thegender_cd
column. It uses thecase_when()
function to create a new column with three possible values: "00" for male, "01" for female, and "99" for unknown gender.
Overall, this code is preparing thedf_sales_summary
data frame for merging with other data frames, by reshaping it into long format and recoding the gender values to a standard format.
R-045: The date of birth (birth_day) in the customer data (df_customer) is held in date type. Convert this to a string in YYYYMMDD format and display 10 cases together with the customer ID (customer_id).
df_tmp <- cbind(df_customer["customer_id"],
strftime(
df_customer$birth_day,
format = "%Y%m%d" ))
colnames(df_tmp) <- c("customer_id", "birth_day")
head(df_tmp,10)
customer_id | birth_day | |
---|---|---|
<chr> | <chr> | |
1 | CS021313000114 | 19810429 |
2 | CS037613000071 | 19520401 |
3 | CS031415000172 | 19761004 |
4 | CS028811000001 | 19330327 |
5 | CS001215000145 | 19950329 |
6 | CS020401000016 | 19740915 |
7 | CS015414000103 | 19770809 |
8 | CS029403000008 | 19730817 |
9 | CS015804000004 | 19310502 |
10 | CS033513000180 | 19620711 |
Commentary :
This code creates a new data frame calleddf_tmp
by selecting thecustomer_id
andbirth_day
columns from thedf_customer
data frame.
Thedf_customer["customer_id"]
syntax selects thecustomer_id
column as a data frame (not as a vector), which is necessary for combining it with another column usingcbind()
.
Thestrftime()
function is used to format thebirth_day
column as a character string in the%Y%m%d
format, which represents the year, month, and day of birth as a 8-digit number.colnames()
function is used to assign new column names to the data frame, so that the columns are named "customer_id" and "birth_day".
Finally,head()
function is used to show the first 10 rows of thedf_tmp
data frame.
Overall, this code creates a new data frame with two columns that can be used to merge with other data frames based on customer ID or birth date.
R-046: The application date (application_date) of the customer data (df_customer) holds data in a string type in YYYYMMDD format. Convert this to a date type and display 10 cases together with the customer ID (customer_id).
df_tmp <- cbind(df_customer["customer_id"],
strptime(df_customer$application_date, "%Y%m%d"))
colnames(df_tmp) <- c("customer_id", "application_date")
head(df_tmp, 10)
customer_id | application_date | |
---|---|---|
<chr> | <dttm> | |
1 | CS021313000114 | 2015-09-05 |
2 | CS037613000071 | 2015-04-14 |
3 | CS031415000172 | 2015-05-29 |
4 | CS028811000001 | 2016-01-15 |
5 | CS001215000145 | 2017-06-05 |
6 | CS020401000016 | 2015-02-25 |
7 | CS015414000103 | 2015-07-22 |
8 | CS029403000008 | 2015-05-15 |
9 | CS015804000004 | 2015-06-07 |
10 | CS033513000180 | 2015-07-28 |
Commentary :
This code creates a new data frame calleddf_tmp
by selecting thecustomer_id
andapplication_date
columns from thedf_customer
data frame.
Thedf_customer["customer_id"]
syntax selects thecustomer_id
column as a data frame (not as a vector), which is necessary for combining it with another column usingcbind()
.
Thestrptime()
function is used to convert theapplication_date
column, which is a character string in the%Y%m%d
format representing the year, month, and day of application as an 8-digit number, to a POSIXlt class object that represents date and time in R. The%Y%m%d
argument specifies the format of the character string to be converted.colnames()
function is used to assign new column names to the data frame, so that the columns are named "customer_id" and "application_date".
Finally,head()
function is used to show the first 10 rows of thedf_tmp
data frame.
Overall, this code creates a new data frame with two columns that can be used to merge with other data frames based on customer ID or application date.
R-047: The sales date (sales_ymd) of the receipt details data (df_receipt) holds data in numeric type in YYYYMMDD format. Convert this to a date type and display 10 items together with the receipt number (receipt_no) and receipt sub-number (receipt_sub_no).
df_tmp <- cbind(df_receipt[c("receipt_no", "receipt_sub_no")], strptime(as.character(df_receipt$sales_ymd), "%Y%m%d"))
colnames(df_tmp) <- c("receipt_no", "receipt_sub_no", "sales_ymd")
head(df_tmp, 10)
receipt_no | receipt_sub_no | sales_ymd | |
---|---|---|---|
<int> | <int> | <dttm> | |
1 | 112 | 1 | 2018-11-03 |
2 | 1132 | 2 | 2018-11-18 |
3 | 1102 | 1 | 2017-07-12 |
4 | 1132 | 1 | 2019-02-05 |
5 | 1102 | 2 | 2018-08-21 |
6 | 1112 | 1 | 2019-06-05 |
7 | 1102 | 2 | 2018-12-05 |
8 | 1102 | 1 | 2019-09-22 |
9 | 1112 | 2 | 2017-05-04 |
10 | 1102 | 1 | 2019-10-10 |
Commentary :
This code creates a new data frame calleddf_tmp
by selecting thereceipt_no
,receipt_sub_no
, andsales_ymd
columns from thedf_receipt
data frame, and converting thesales_ymd
column from an integer inyyyymmdd
format to a POSIXlt class object that represents date and time in R.
Thedf_receipt[c("receipt_no", "receipt_sub_no")]
syntax selects thereceipt_no
andreceipt_sub_no
columns as a data frame (not as a vector), which is necessary for combining them with another column usingcbind()
.as.character()
is used to convert thesales_ymd
column from an integer to a character vector before passing it to thestrptime()
function.
Thestrptime()
function is used to convert the character vectorsales_ymd
to a POSIXlt class object that represents date and time in R. The%Y%m%d
argument specifies the format of the character string to be converted.colnames()
function is used to assign new column names to the data frame, so that the columns are named "receipt_no", "receipt_sub_no", and "sales_ymd".
Finally,head()
function is used to show the first 10 rows of thedf_tmp
data frame.
Overall, this code creates a new data frame with three columns that can be used to merge with other data frames based on receipt number, receipt sub-number, or sales date.
R-048: The sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) holds data in UNIX seconds of numeric type. Convert it to date type and display 10 cases together with receipt number (receipt_no) and receipt sub-number (receipt_sub_no).
df_tmp <- cbind(df_receipt[c("receipt_no", "receipt_sub_no")], as.POSIXct(df_receipt$sales_epoch, origin="1970-01-01"))
colnames(df_tmp) <- c("receipt_no", "receipt_sub_no", "sales_ymd")
head(df_tmp, 10)
receipt_no | receipt_sub_no | sales_ymd | |
---|---|---|---|
<int> | <int> | <dttm> | |
1 | 112 | 1 | 2018-11-03 |
2 | 1132 | 2 | 2018-11-18 |
3 | 1102 | 1 | 2017-07-12 |
4 | 1132 | 1 | 2019-02-05 |
5 | 1102 | 2 | 2018-08-21 |
6 | 1112 | 1 | 2019-06-05 |
7 | 1102 | 2 | 2018-12-05 |
8 | 1102 | 1 | 2019-09-22 |
9 | 1112 | 2 | 2017-05-04 |
10 | 1102 | 1 | 2019-10-10 |
Commentary :
This code creates a new data framedf_tmp
by combining selected columns from the original data framedf_receipt
. Specifically, it selects the columns "receipt_no", "receipt_sub_no", and "sales_epoch" fromdf_receipt
and binds them together with the functioncbind()
.
The third column "sales_epoch" contains the number of seconds since January 1, 1970, known as the Unix epoch. Theas.POSIXct()
function is then used to convert this numerical value to a POSIXct object, which represents a date and time in R. Theorigin
argument is set to "1970-01-01" to indicate that the epoch starts at midnight on January 1, 1970.
Finally, the functioncolnames()
is used to rename the columns ofdf_tmp
to "receipt_no", "receipt_sub_no", and "sales_ymd", respectively.
The resultingdf_tmp
data frame contains the same two columns asdf_receipt
("receipt_no" and "receipt_sub_no"), plus a new column "sales_ymd" with the converted date and time information from "sales_epoch". Thehead()
function is used to display the first 10 rows ofdf_tmp
.
R-049: Convert the sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) to date type, extract only the “年(year)” and display 10 items with receipt number (receipt_no) and receipt sub-number (receipt_sub_no).
df_tmp <- cbind(df_receipt[c("receipt_no", "receipt_sub_no")], substring(as.POSIXct(df_receipt$sales_epoch, origin = "1970-01-01"), 1, 4))
colnames(df_tmp) <- c("receipt_no", "receipt_sub_no", "sales_year")
head(df_tmp, 10)
receipt_no | receipt_sub_no | sales_year | |
---|---|---|---|
<int> | <int> | <chr> | |
1 | 112 | 1 | 2018 |
2 | 1132 | 2 | 2018 |
3 | 1102 | 1 | 2017 |
4 | 1132 | 1 | 2019 |
5 | 1102 | 2 | 2018 |
6 | 1112 | 1 | 2019 |
7 | 1102 | 2 | 2018 |
8 | 1102 | 1 | 2019 |
9 | 1112 | 2 | 2017 |
10 | 1102 | 1 | 2019 |
Commentary :
This code is creating a new data frame nameddf_tmp
by binding three columns together from thedf_receipt
data frame. The first two columns,receipt_no
andreceipt_sub_no
, are being selected directly fromdf_receipt
.
The third column,sales_year
, is being created using thesubstring()
function to extract the first 4 characters from the output ofas.POSIXct()
, which is being used to convert thesales_epoch
column to a date/time format.
Theas.POSIXct()
function is using theorigin
parameter to specify that the Unix epoch time (1970-01-01
) is being used as the reference point. Thesubstring()
function is then used to extract the year from this date/time format.
Finally,colnames()
is used to assign the column names "receipt_no", "receipt_sub_no", and "sales_year" to the corresponding columns in the new data frame. Thehead()
function is then used to display the first 10 rows of the new data frame.
R-050: Convert the sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) to date type, extract only the “月(month)” and display 10 items with receipt number (receipt_no) and receipt sub number (receipt_sub_no). Note that the “月(month)” should be retrieved with two digits filled with zero.
df_tmp <- cbind(df_receipt[c("receipt_no", "receipt_sub_no")], substring(as.POSIXct(df_receipt$sales_epoch, origin = "1970-01-01"), 6, 7))
colnames(df_tmp) <- c("receipt_no", "receipt_sub_no", "sales_ymd")
head(df_tmp, 10)
receipt_no | receipt_sub_no | sales_ymd | |
---|---|---|---|
<int> | <int> | <chr> | |
1 | 112 | 1 | 11 |
2 | 1132 | 2 | 11 |
3 | 1102 | 1 | 07 |
4 | 1132 | 1 | 02 |
5 | 1102 | 2 | 08 |
6 | 1112 | 1 | 06 |
7 | 1102 | 2 | 12 |
8 | 1102 | 1 | 09 |
9 | 1112 | 2 | 05 |
10 | 1102 | 1 | 10 |
Commentary :
This code creates a new data framedf_tmp
that contains the columns "receipt_no", "receipt_sub_no", and "sales_ymd". The "sales_ymd" column is derived from the "sales_epoch" column indf_receipt
.
Specifically, thesubstring()
function is used to extract the month component from the "sales_epoch" column by specifying the starting position as 6 (which corresponds to the first character of the month component) and the ending position as 7 (which corresponds to the last character of the month component). The resulting character string represents the month of the sales date.
The new data framedf_tmp
is created by combining the selected columns fromdf_receipt
with the extracted month component using thecbind()
function. Finally, the column names ofdf_tmp
are updated to "receipt_no", "receipt_sub_no", and "sales_ymd".
Overall, this code extracts the month of each sales date from the "sales_epoch" column ofdf_receipt
and adds it as a new column named "sales_ymd" to a new data framedf_tmp
.
R-051: Convert the sales epoch seconds of the receipt details data (df_receipt) to date type, extract only the “日(day)” and display 10 items with receipt number (receipt_no) and receipt sub-number (receipt_sub_no). Note that the “日(day)” should be extracted with two digits filled with zero.
df_tmp <- cbind(df_receipt[c("receipt_no", "receipt_sub_no")], substring(as.POSIXct(df_receipt$sales_epoch, origin = "1970-01-01"), 9, 10))
colnames(df_tmp) <- c("receipt_no", "receipt_sub_no", "sales_day")
head(df_tmp, 10)
receipt_no | receipt_sub_no | sales_day | |
---|---|---|---|
<int> | <int> | <chr> | |
1 | 112 | 1 | 03 |
2 | 1132 | 2 | 18 |
3 | 1102 | 1 | 12 |
4 | 1132 | 1 | 05 |
5 | 1102 | 2 | 21 |
6 | 1112 | 1 | 05 |
7 | 1102 | 2 | 05 |
8 | 1102 | 1 | 22 |
9 | 1112 | 2 | 04 |
10 | 1102 | 1 | 10 |
Commentary :
This code creates a new data frame calleddf_tmp
that has three columns - "receipt_no", "receipt_sub_no", and "sales_day". The "receipt_no" and "receipt_sub_no" columns are copied from the "df_receipt" data frame. The "sales_day" column is created by extracting the day from the "sales_epoch" column of "df_receipt" data frame.
The "sales_epoch" column is first converted to a POSIXct object using theas.POSIXct
function with the origin set to "1970-01-01". This is a common approach to converting Unix timestamps, which are measured in seconds since January 1, 1970, into R's datetime format. Thesubstring
function is then used to extract the day portion of the datetime object. Thesubstring
function takes three arguments - the input string, the starting position of the substring, and the ending position of the substring. In this case, the substring starts at position 9 (which is the day portion of the datetime string) and ends at position 10.
Finally, the column names are updated to "receipt_no", "receipt_sub_no", and "sales_day" using thecolnames
function. The resulting data framedf_tmp
shows the "receipt_no", "receipt_sub_no", and "sales_day" columns for the first 10 rows of data.
R-052: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), binaryise the total sales amount to 0 for amounts less than 2,000 yen and 1 for amounts greater than 2,000 yen, and display 10 items with the customer ID and total sales amount. Display 10 items together with the customer ID and the total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.
df_receipt %>%
filter(!grepl("^Z", customer_id)) %>%
group_by(customer_id) %>%
summarise(sum_amount=sum(amount), .groups = "drop") %>%
mutate(sales_flg = ifelse(sum_amount > 2000, 1, 0)) %>%
slice(1:10)
customer_id | sum_amount | sales_flg |
---|---|---|
<chr> | <int> | <dbl> |
CS001113000004 | 1298 | 0 |
CS001114000005 | 626 | 0 |
CS001115000010 | 3044 | 1 |
CS001205000004 | 1988 | 0 |
CS001205000006 | 3337 | 1 |
CS001211000025 | 456 | 0 |
CS001212000027 | 448 | 0 |
CS001212000031 | 296 | 0 |
CS001212000046 | 228 | 0 |
CS001212000070 | 456 | 0 |
Commentary :
This code performs the following operations on thedf_receipt
dataframe:
Filters out all the rows where thecustomer_id
column starts with the letter "Z".Groups
the remaining rows bycustomer_id
.
Calculates the sum of theamount
column for eachcustomer_id
.
Creates a new columnsales_flg
that is set to 1 if the sum of theamount
column is greater than 2000, and 0 otherwise.
Selects the first 10 rows of the resulting dataframe.
The%>%
operator is used to chain the operations together, making the code more concise and easier to read.
Here's a step-by-step breakdown of what's happening:filter(!grepl("^Z", customer_id))
: Thefilter()
function removes all rows where thecustomer_id
column starts with the letter "Z". The!grepl()
function returns a logical vector indicating whether each element in thecustomer_id
column does not match the pattern^Z
(i.e., does not start with the letter "Z"). The!
operator negates the logical vector, so thatfilter()
keeps only the rows where the vector isTRUE
.group_by(customer_id)
: Thegroup_by()
function groups the remaining rows bycustomer_id
.summarise(sum_amount=sum(amount), .groups = "drop")
: Thesummarise()
function calculates the sum of theamount
column for eachcustomer_id
, and stores the result in a new column calledsum_amount
. The.groups = "drop"
argument tellssummarise()
to drop the grouping structure after summarizing the data.mutate(sales_flg = ifelse(sum_amount > 2000, 1, 0))
: Themutate()
function creates a new column calledsales_flg
, which is set to 1 if thesum_amount
column is greater than 2000, and 0 otherwise. Theifelse()
function is a vectorized version of the standardif...else
statement, which returns a vector of the same length as the condition vector, with elements selected from either theyes
orno
vectors depending on whether the element of the condition vector isTRUE
orFALSE
.slice(1:10)
: Theslice()
function selects the first 10 rows of the resulting dataframe.
R-053: Binarise the postcode (postal_cd) of the customer data (df_customer) to 1 for Tokyo (the first three digits are 100 to 209) and 0 for all others. Further, combine it with the receipt details data (df_receipt) and count the number of customers with sales results for the entire period, for each binary value created.
df_customer[c("customer_id", "postal_cd")] %>%
mutate(postal_flg =
ifelse(100 <= as.integer(str_sub(postal_cd, start = 1, end = 3))
& as.integer(str_sub(postal_cd, start = 1, end = 3)) <= 209, 1, 0)) %>% inner_join(df_receipt, by = "customer_id") %>%
group_by(postal_flg) %>%
summarise(customer_cnt = n_distinct(customer_id), .groups = "drop")
postal_flg | customer_cnt |
---|---|
<dbl> | <int> |
0 | 3906 |
1 | 4400 |
Commentary :
This code performs the following tasks:
Selects the columns "customer_id" and "postal_cd" from the data framedf_customer
.
Creates a new column "postal_flg" that is set to 1 if the first 3 characters of "postal_cd" represent a postal code between 100 and 209 (inclusive) and 0 otherwise. Theifelse
function is used to perform this check.
Joins this modifieddf_customer
data frame with thedf_receipt
data frame on the "customer_id" column.
Groups the resulting data frame by the "postal_flg" column.
Calculates the number of unique customer IDs for each value of "postal_flg" using then_distinct
function.
Drops the grouping information using the ".groups = 'drop'" argument.
Returns the resulting data frame with columns "postal_flg" and "customer_cnt".
Overall, this code is calculating the number of customers indf_receipt
that have postal codes within the range of 100-209, grouped by whether or not they have a postal code in that range.
R-054: The address of the customer data (df_customer) is one of 埼玉県(Saitama-prefecture), 千葉県(Chiba-prefecture), 東京都(Tokyo-Metropolitan) or 神奈川県(Kanagawa-prefecture). Create a code value for each prefecture, and display 10 cases together with the customer ID and address. The values should be 11 for 埼玉県(Saitama-prefecture), 12 for 千葉県(Chiba-prefecture), 13 for 東京都(Tokyo-Metropolitan) and 14 for 神奈川県(Kanagawa-prefecture).
# Code example 1 (cut out at fixed)
df_customer %>%
mutate(prefecture_cd =
case_when(
str_sub(address, start = 1, end = 3) == "埼玉県" ~ "11",
str_sub(address, start = 1, end = 3) == "千葉県" ~ "12",
str_sub(address, start = 1, end = 3) == "東京都" ~ "13",
str_sub(address, start = 1, end = 3) == "神奈川" ~ "14")) %>% select(customer_id, address, prefecture_cd) %>%
slice(1:10)
customer_id | address | prefecture_cd |
---|---|---|
<chr> | <chr> | <chr> |
CS021313000114 | 神奈川県伊勢原市粟窪********** | 14 |
CS037613000071 | 東京都江東区南砂********** | 13 |
CS031415000172 | 東京都渋谷区代々木********** | 13 |
CS028811000001 | 神奈川県横浜市泉区和泉町********** | 14 |
CS001215000145 | 東京都大田区仲六郷********** | 13 |
CS020401000016 | 東京都板橋区若木********** | 13 |
CS015414000103 | 東京都江東区北砂********** | 13 |
CS029403000008 | 千葉県浦安市海楽********** | 12 |
CS015804000004 | 東京都江東区北砂********** | 13 |
CS033513000180 | 神奈川県横浜市旭区善部町********** | 14 |
Commentary :
This code performs the following actions:
It takes a data framedf_customer
as input.
It creates a new columnprefecture_cd
in thedf_customer
data frame.
The values of theprefecture_cd
column are based on the values of theaddress
column.
If the first three characters of theaddress
column are "埼玉県", the value of theprefecture_cd
column is set to "11".
If the first three characters of theaddress
column are "千葉県", the value of theprefecture_cd
column is set to "12".
If the first three characters of theaddress
column are "東京都", the value of theprefecture_cd
column is set to "13".
If the first three characters of theaddress
column are "神奈川", the value of theprefecture_cd
column is set to "14".
Thecustomer_id
,address
, andprefecture_cd
columns are selected for the output.
Theslice(1:10)
function is used to show the first 10 rows of the resulting data frame.
# Code example 2 (using regular expressions)
df_customer %>%
mutate(prefecture_cd =
case_when(
str_extract(address, pattern = "^.*?[都道府県]") == "埼玉県" ~ "11",
str_extract(address, pattern = "^.*?[都道府県]") == "千葉県" ~ "12",
str_extract(address, pattern = "^.*?[都道府県]") == "東京都" ~ "13",
str_extract(address, pattern = "^.*?[都道府県]") == "神奈川県" ~ "14")) %>% select(customer_id, address, prefecture_cd) %>%
slice(1:10)
customer_id | address | prefecture_cd |
---|---|---|
<chr> | <chr> | <chr> |
CS021313000114 | 神奈川県伊勢原市粟窪********** | 14 |
CS037613000071 | 東京都江東区南砂********** | 13 |
CS031415000172 | 東京都渋谷区代々木********** | 13 |
CS028811000001 | 神奈川県横浜市泉区和泉町********** | 14 |
CS001215000145 | 東京都大田区仲六郷********** | 13 |
CS020401000016 | 東京都板橋区若木********** | 13 |
CS015414000103 | 東京都江東区北砂********** | 13 |
CS029403000008 | 千葉県浦安市海楽********** | 12 |
CS015804000004 | 東京都江東区北砂********** | 13 |
CS033513000180 | 神奈川県横浜市旭区善部町********** | 14 |
Commentary :
This code uses thedplyr
package in R to create a new columnprefecture_cd
in thedf_customer
data frame that maps prefecture names to prefecture codes for a subset of prefectures in Japan. Specifically, it uses themutate()
function to add a new column to the data frame and thecase_when()
function to map prefecture names to codes based on a regular expression pattern matching.
Here is a line-by-line explanation of the code:df_customer %>%
starts a pipeline with thedf_customer
data frame as the input.mutate(prefecture_cd =
creates a new columnprefecture_cd
in the data frame.case_when(
starts a series of conditions to map prefecture names to codes.str_extract(address, pattern = "^.*?[都道府県]") == "埼玉県" ~ "11",
matches the regular expression pattern "^.*?[都道府県]" to extract the prefecture name from theaddress
column and maps "埼玉県" to "11".str_extract(address, pattern = "^.*?[都道府県]") == "千葉県" ~ "12",
does the same for "千葉県" and maps it to "12".str_extract(address, pattern = "^.*?[都道府県]") == "東京都" ~ "13",
does the same for "東京都" and maps it to "13".str_extract(address, pattern = "^.*?[都道府県]") == "神奈川県" ~ "14"))
does the same for "神奈川県" and maps it to "14".select(customer_id, address, prefecture_cd)
selects only the columnscustomer_id
,address
, andprefecture_cd
from the data frame.slice(1:10)
selects only the first 10 rows of the resulting data frame.
R-055: Sum the sales amount (amount) of the receipt details (df_receipt) data for each customer ID (customer_id) and find the quartile point of the total amount. Then, create category values for the total sales amount for each customer based on the following criteria, and display the 10 results together with the customer ID and the total sales amount. The category values shall be 1-4 in order.
- Minimum value and above and below the first quartile – Assign 1.
- First quartile and above but below the second quartile – Assign 2.
- Second quartile and above but below the third quartile – Assign 3.
- Third quartile and above – Assign 4.
df_receipt %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop") %>%
mutate(pct_group = case_when(
sum_amount < quantile(sum_amount)[2] ~ "1",
sum_amount < quantile(sum_amount)[3] ~ "2",
sum_amount < quantile(sum_amount)[4] ~ "3",
quantile(sum_amount)[4] <= sum_amount ~ "4"
)) %>%
slice(1:10)
customer_id | sum_amount | pct_group |
---|---|---|
<chr> | <int> | <chr> |
CS001113000004 | 1298 | 2 |
CS001114000005 | 626 | 2 |
CS001115000010 | 3044 | 3 |
CS001205000004 | 1988 | 3 |
CS001205000006 | 3337 | 3 |
CS001211000025 | 456 | 1 |
CS001212000027 | 448 | 1 |
CS001212000031 | 296 | 1 |
CS001212000046 | 228 | 1 |
CS001212000070 | 456 | 1 |
Commentary :
This code performs the following operations:
It takes a dataframedf_receipt
.
It groups the rows indf_receipt
bycustomer_id
.
For each group of rows with the samecustomer_id
, it calculates the total sum of theamount
column using thesum()
function, and stores the result in a new column calledsum_amount
.
It drops the grouping structure using.groups = "drop"
.
It creates a new columnpct_group
using thecase_when()
function, which assigns each row to a quartile based on the value of thesum_amount
column. The quartiles are determined using thequantile()
function.
Finally, it selects the first 10 rows of the resulting dataframe and stores it in a new dataframe calleddf_tmp
.
For example, ifquantile(sum_amount)
returns[100, 200, 300, 400]
, then the resultingpct_group
column will have the value "1" for rows wheresum_amount
is less than 200, "2" for rows wheresum_amount
is between 200 and 299, "3" for rows wheresum_amount
is between 300 and 399, and "4" for rows wheresum_amount
is greater than or equal to 400.
R-056: Calculate the age in 10-year age increments based on the age (age) of the customer data (df_customer) and display 10 cases together with the customer ID (customer_id) and date of birth (birth_day). However, all age groups above 60 years should be assumed to be in the 60s. The name of the category representing the age is optional.
# Code example 1
df_customer[c("customer_id", "birth_day", "age")] %>%
mutate(era = trunc(age / 10) * 10) %>%
mutate(era = case_when( era < 60 ~ era, era >= 60 ~ 60 )) %>%
slice(1:10)
customer_id | birth_day | age | era |
---|---|---|---|
<chr> | <date> | <int> | <dbl> |
CS021313000114 | 1981-04-29 | 37 | 30 |
CS037613000071 | 1952-04-01 | 66 | 60 |
CS031415000172 | 1976-10-04 | 42 | 40 |
CS028811000001 | 1933-03-27 | 86 | 60 |
CS001215000145 | 1995-03-29 | 24 | 20 |
CS020401000016 | 1974-09-15 | 44 | 40 |
CS015414000103 | 1977-08-09 | 41 | 40 |
CS029403000008 | 1973-08-17 | 45 | 40 |
CS015804000004 | 1931-05-02 | 87 | 60 |
CS033513000180 | 1962-07-11 | 56 | 50 |
Commentary :
This code manipulates a data framedf_customer
containing customer data. The code selects three columns:customer_id
,birth_day
, andage
. Then, it calculates the era of each customer based on their age.
The firstmutate
function calculates the era of each customer by dividing their age by 10, truncating the result to the nearest integer, and then multiplying the integer by 10. For example, if a customer is 43 years old, their era will be 40. If a customer is 25 years old, their era will be 20.
The secondmutate
function replaces era values greater than or equal to 60 with 60. This is done to group all customers aged 60 or older into a single era category.
Finally, the code selects the first 10 rows of the resulting data frame, which contains the customer ID, birth date, age, and calculated era for each customer.
# Code example 2 (when using min)
df_customer[c("customer_id", "birth_day","age")] %>%
mutate(era = sapply(1:nrow(df_customer),
function(x){
return (min(trunc(.[x, 3] / 10) * 10, 60))
})) %>%
slice(1:10)
customer_id | birth_day | age | era |
---|---|---|---|
<chr> | <date> | <int> | <dbl> |
CS021313000114 | 1981-04-29 | 37 | 30 |
CS037613000071 | 1952-04-01 | 66 | 60 |
CS031415000172 | 1976-10-04 | 42 | 40 |
CS028811000001 | 1933-03-27 | 86 | 60 |
CS001215000145 | 1995-03-29 | 24 | 20 |
CS020401000016 | 1974-09-15 | 44 | 40 |
CS015414000103 | 1977-08-09 | 41 | 40 |
CS029403000008 | 1973-08-17 | 45 | 40 |
CS015804000004 | 1931-05-02 | 87 | 60 |
CS033513000180 | 1962-07-11 | 56 | 50 |
Commentary :
This code is used to calculate the "era" for each customer based on their birth date. Here's what each line does:
Select only the "customer_id", "birth_day", and "age" columns from the df_customer data frame.
Use the "mutate" function to add a new column called "era". The value of this column is calculated by dividing the "age" column by 10, taking the floor (i.e., rounding down to the nearest integer), and then multiplying by 10. This effectively rounds the age down to the nearest multiple of 10, giving an approximate "era" for each customer.
Use the "mutate" function again to adjust the era values. If the era is less than 60, it is left unchanged. If the era is greater than or equal to 60, it is set to 60.
Use the "slice" function to select the first 10 rows of the resulting data frame.
Note that this code uses the "sapply" function to apply the calculation to each row of the data frame. While this works, it can be slow for very large data frames. A more efficient approach would be to use vectorized operations instead.
R-057: Using the extraction results of 056 and the gender code (gender_cd), create new category data representing the combination of gender x age and display 10 cases. The value of the category representing the combination shall be arbitrary.
# Generate a sex and age code by concatenating a one-digit sex code and a two-digit age code.
df_customer[c("customer_id", "gender_cd", "birth_day", "age")] %>%
mutate(era = trunc(age / 10) * 10) %>%
mutate(era = case_when( era < 60 ~ formatC(era, width = 2,flag = "0"),
era >= 60 ~ formatC(60, width = 2,flag = "0"))) %>%
mutate(gender_era = paste(gender_cd, era, sep = "")) %>%
slice(1:10)
customer_id | gender_cd | birth_day | age | era | gender_era |
---|---|---|---|---|---|
<chr> | <chr> | <date> | <int> | <chr> | <chr> |
CS021313000114 | 1 | 1981-04-29 | 37 | 30 | 130 |
CS037613000071 | 9 | 1952-04-01 | 66 | 60 | 960 |
CS031415000172 | 1 | 1976-10-04 | 42 | 40 | 140 |
CS028811000001 | 1 | 1933-03-27 | 86 | 60 | 160 |
CS001215000145 | 1 | 1995-03-29 | 24 | 20 | 120 |
CS020401000016 | 0 | 1974-09-15 | 44 | 40 | 040 |
CS015414000103 | 1 | 1977-08-09 | 41 | 40 | 140 |
CS029403000008 | 0 | 1973-08-17 | 45 | 40 | 040 |
CS015804000004 | 0 | 1931-05-02 | 87 | 60 | 060 |
CS033513000180 | 1 | 1962-07-11 | 56 | 50 | 150 |
Commentary :
This code manipulates a data frame calleddf_customer
, which presumably contains information about customers. Here is an explanation of each line:df_customer[c("customer_id", "gender_cd", "birth_day", "age")]
: Selects the columns "customer_id", "gender_cd", "birth_day", and "age" from the data framedf_customer
.%>%
: The pipe operator used to chain the data frame to the next operation.mutate(era = trunc(age / 10) * 10)
: Adds a new column called "era" to the data frame, which is calculated by taking the age of each customer, dividing by 10, rounding down to the nearest integer, and multiplying by 10. This effectively bins the customers into 10-year age ranges.mutate(era = case_when( era < 60 ~ formatC(era, width = 2,flag = "0"), era >= 60 ~ formatC(60, width = 2,flag = "0")))
: Modifies the "era" column by applying acase_when
statement, which converts the "era" values that are less than 60 to a two-digit string with a leading zero (e.g., "10" for ages 10-19), and sets the "era" values that are greater than or equal to 60 to "60".mutate(gender_era = paste(gender_cd, era, sep = ""))
: Adds a new column called "gender_era" to the data frame, which is created by concatenating the "gender_cd" and "era" columns using thepaste
function. Thesep = ""
argument specifies that there should be no separator between the two values.slice(1:10)
: Selects the first 10 rows of the data frame.
R-058: Create a dummy variable for the gender code (gender_cd) of the customer data (df_customer) and display 10 cases together with the customer ID (customer_id).
# Code example 1 (all code values itemised)
dummy_gender_model <- dummyVars(~gender_cd, data = df_customer, fullRank = FALSE)
dummy_gender <- predict(dummy_gender_model, df_customer)
head(cbind(df_customer["customer_id"], dummy_gender), 10)
customer_id | gender_cd0 | gender_cd1 | gender_cd9 | |
---|---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> | |
1 | CS021313000114 | 0 | 1 | 0 |
2 | CS037613000071 | 0 | 0 | 1 |
3 | CS031415000172 | 0 | 1 | 0 |
4 | CS028811000001 | 0 | 1 | 0 |
5 | CS001215000145 | 0 | 1 | 0 |
6 | CS020401000016 | 1 | 0 | 0 |
7 | CS015414000103 | 0 | 1 | 0 |
8 | CS029403000008 | 1 | 0 | 0 |
9 | CS015804000004 | 1 | 0 | 0 |
10 | CS033513000180 | 0 | 1 | 0 |
Commentary :
This code creates dummy variables for a categorical variable "gender_cd" in the data framedf_customer
. It first creates a model object using thedummyVars
function from the "caret" package. ThefullRank
argument is set toFALSE
to remove the first level of the categorical variable (which can be inferred from the other levels).
The second line applies the model to the original data framedf_customer
to create a new data frame with the dummy variables. Thepredict
function with thedummy_gender_model
as the first argument anddf_customer
as the second argument applies the model to the data frame to create the dummy variables.
Finally, the code combines thecustomer_id
column fromdf_customer
with the dummy variables using thecbind
function and displays the first 10 rows usinghead()
.
# Code example 2 (one item can be deleted by setting fullRank=TRUE)
dummy_gender_model <- dummyVars(~gender_cd, data = df_customer, fullRank = TRUE)
dummy_gender <- predict(dummy_gender_model, df_customer)
head(cbind(df_customer["customer_id"], dummy_gender), 3)
customer_id | gender_cd1 | gender_cd9 | |
---|---|---|---|
<chr> | <dbl> | <dbl> | |
1 | CS021313000114 | 1 | 0 |
2 | CS037613000071 | 0 | 1 |
3 | CS031415000172 | 1 | 0 |
Commentary :
This code performs one-hot encoding on thegender_cd
column of thedf_customer
dataframe using thedummyVars()
andpredict()
functions from thecaret
package.
ThedummyVars()
function creates a set of dummy variables, representing the categories of the categorical variable specified in the formula. In this case, the formula~gender_cd
specifies that thegender_cd
variable should be used to create the dummy variables. ThefullRank = TRUE
argument specifies that the reference level should be included in the dummy variable encoding.
Thepredict()
function is then used to apply the encoding to the originaldf_customer
dataframe. The resulting encoded dataframe is stored in thedummy_gender
variable.
Finally, thehead()
function is used to display the first three rows of thedummy_gender
dataframe, along with thecustomer_id
column of the originaldf_customer
dataframe, which is appended usingcbind()
. This allows us to see the encoding of the first few customers. The dummy variable encoding for each customer will consist of a 1 in the column corresponding to their gender and a 0 in the column corresponding to the other gender.
R-059: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), standardise the total sales amount to mean 0 and standard deviation 1 and display 10 cases with the customer ID and total sales amount. The standard deviation used for standardisation may be either the square root of the variance or the square root of the unbiased variance. However, exclude customer IDs starting with “Z”, as they represent non-members.
# Standardised by standard deviation by square root of unbiased variance due to use of R scale
df_receipt %>%
filter(!grepl("^Z", customer_id)) %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop") %>%
mutate(std_amount = scale(sum_amount, center = TRUE, scale = TRUE)) %>%
slice(1:10)
customer_id | sum_amount | std_amount |
---|---|---|
<chr> | <int> | <dbl[,1]> |
CS001113000004 | 1298 | -0.4593502 |
CS001114000005 | 626 | -0.7063478 |
CS001115000010 | 3044 | 0.1824025 |
CS001205000004 | 1988 | -0.2057366 |
CS001205000006 | 3337 | 0.2900964 |
CS001211000025 | 456 | -0.7688324 |
CS001212000027 | 448 | -0.7717728 |
CS001212000031 | 296 | -0.8276413 |
CS001212000046 | 228 | -0.8526351 |
CS001212000070 | 456 | -0.7688324 |
Commentary :
This code performs the following steps:
Filter rows in thedf_receipt
dataframe where thecustomer_id
column does not start with the character "Z". This is done using thefilter
function from thedplyr
package and thegrepl
function which matches a regular expression pattern to thecustomer_id
column.
Group the resulting rows bycustomer_id
using thegroup_by
function.
Calculate the sum of theamount
column for each group using thesummarise
function fromdplyr
.
Create a new column calledstd_amount
that contains the standardized sum of amounts for each customer. Thescale
function is used to standardize the sum amounts by centering the data at the mean (center = TRUE
) and scaling it to have a standard deviation of 1 (scale = TRUE
).
Finally, theslice
function is used to select the first 10 rows of the resulting dataframe, including thecustomer_id
andstd_amount
columns.
R-060: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), normalise the total sales amount to a minimum value of 0 and a maximum value of 1 and display 10 items with the customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.
df_receipt %>% filter(!grepl("^Z", customer_id)) %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop") %>%
mutate( scale_amount = scale(sum_amount, center = min(sum_amount), scale = max(sum_amount) - min(sum_amount)))%>%
slice(1:10)
customer_id | sum_amount | scale_amount |
---|---|---|
<chr> | <int> | <dbl[,1]> |
CS001113000004 | 1298 | 0.053354188 |
CS001114000005 | 626 | 0.024157108 |
CS001115000010 | 3044 | 0.129214460 |
CS001205000004 | 1988 | 0.083333333 |
CS001205000006 | 3337 | 0.141944734 |
CS001211000025 | 456 | 0.016770942 |
CS001212000027 | 448 | 0.016423358 |
CS001212000031 | 296 | 0.009819256 |
CS001212000046 | 228 | 0.006864790 |
CS001212000070 | 456 | 0.016770942 |
Commentary :
This code performs the following operations on thedf_receipt
data frame:
Filter out any rows where thecustomer_id
starts with the letter "Z".
Group the remaining rows bycustomer_id
.
Calculate the sum ofamount
for each group ofcustomer_id
.
Mutate the resulting data frame to add a new columnscale_amount
which is a scaled version ofsum_amount
. Thescale
function scalessum_amount
so that its minimum value becomes 0 and its maximum value becomes 1.
Slice the resulting data frame to keep only the first 10 rows.
In summary, this code calculates the total amount of money spent by each customer indf_receipt
, and then scales those values so that the minimum value becomes 0 and the maximum value becomes 1. The resultingscale_amount
values can be useful for certain types of analysis, such as when you want to compare the relative spending patterns of different customers.
Comment