参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック（構造化データ加工編）」

The Data Scientist Society Github :

GitHub - The-Japan-DataScientist-Society/100knocks-preprocess: データサイエンス100本ノック（構造化データ加工編）

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Data Science 100 Knocks (Structured Data Processing) URL :

100knocks-preprocess/docker/work/answer/ans_preprocess_knock_R.ipynb at master · The-Japan-DataScientist-Society/100knocks-preprocess

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers.
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

You can get preprocess_knock files to practice python, SQL, R from my github account

ChanhiYasutomi - Repositories

ChanhiYasutomi has 20 repositories available. Follow their code on GitHub.

Introduction

To begin with, run the following cells
Import the required libraries and read the data from the database (PostgreSQL)
The libraries that are expected to be used are imported in the following cells
If there are other libraries you wish to use, install them as appropriate using install.packages().
Names, addresses, etc. are dummy data and are not real.

In [ ]:

require("RPostgreSQL") 
require("tidyr") 
require("dplyr") 
require("stringr") 
require("caret") 
require("lubridate") 
require("rsample") 
require("recipes") 
require("themis") 

host <- "db" 
port <- Sys.getenv()["PG_PORT"] 
dbname <- Sys.getenv()["PG_DATABASE"] 
user <- Sys.getenv()["PG_USER"] 
password <- Sys.getenv()["PG_PASSWORD"] 

con <- dbConnect(PostgreSQL(), host=host, port=port, dbname=dbname, user=user, password=password) 
df_customer <- dbGetQuery(con,"SELECT * FROM customer") 
df_category <- dbGetQuery(con,"SELECT * FROM category") 
df_product <- dbGetQuery(con,"SELECT * FROM product") 
df_receipt <- dbGetQuery(con,"SELECT * FROM receipt") 
df_store <- dbGetQuery(con,"SELECT * FROM store") 
df_geocode <- dbGetQuery(con,"SELECT * FROM geocode")
 

Loading required package: RPostgreSQL

Loading required package: DBI

Loading required package: tidyr

Loading required package: dplyr


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: stringr

Loading required package: caret

Loading required package: ggplot2

Loading required package: lattice

Loading required package: lubridate


Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union


Loading required package: rsample

Loading required package: recipes


Attaching package: ‘recipes’


The following object is masked from ‘package:stringr’:

    fixed


The following object is masked from ‘package:stats’:

    step


Loading required package: themis

Registered S3 methods overwritten by 'themis':
  method                  from   
  bake.step_downsample    recipes
  bake.step_upsample      recipes
  prep.step_downsample    recipes
  prep.step_upsample      recipes
  tidy.step_downsample    recipes
  tidy.step_upsample      recipes
  tunable.step_downsample recipes
  tunable.step_upsample   recipes


Attaching package: ‘themis’


The following objects are masked from ‘package:recipes’:

    step_downsample, step_upsample

Practice Questions

R-041: Sum up the sales amount (amount) of the receipt details data (df_receipt) by date (sales_ymd) and calculate the increase or decrease in sales amount from the date of the last sale. And display 10 results.

In [ ]:

df_receipt %>% 
 group_by(sales_ymd) %>% 
 summarise(sum_amount = sum(amount), .groups = "drop") %>% 
 mutate(lag_ymd = lag(sales_ymd), lag_amount = lag(sum_amount), diff_amount = sum_amount - lag_amount) %>% 

slice(1:10)

A tibble: 10 × 5

sales_ymd	sum_amount	lag_ymd	lag_amount	diff_amount
<int>	<int>	<int>	<int>	<int>
20170101	33723	NA	NA	NA
20170102	24165	20170101	33723	-9558
20170103	27503	20170102	24165	3338
20170104	36165	20170103	27503	8662
20170105	37830	20170104	36165	1665
20170106	32387	20170105	37830	-5443
20170107	23415	20170106	32387	-8972
20170108	24737	20170107	23415	1322
20170109	26718	20170108	24737	1981
20170110	20143	20170109	26718	-6575

Commentary :

This code uses the dplyr package in R to manipulate a data frame called df_receipt. Here is a step-by-step explanation of what the code is doing:

df_receipt %>% - This is a pipe operator in R which passes the df_receipt data frame to the next function in the pipeline as its first argument.

group_by(sales_ymd) %>% - This groups the data frame by the sales_ymd column, which contains dates or date-time data. The %>% operator again pipes this grouped data frame to the next function in the pipeline.

summarise(sum_amount = sum(amount), .groups = "drop") %>% - This summarizes the grouped data frame by calculating the sum of the amount column for each unique sales_ymd group. The resulting data frame will have two columns: sales_ymd and sum_amount. The .groups = "drop" argument is used to drop the grouping information added by group_by, since it is no longer needed. The %>% operator again pipes this summarized data frame to the next function in the pipeline.

mutate(lag_ymd = lag(sales_ymd), lag_amount = lag(sum_amount), diff_amount = sum_amount - lag_amount) %>% - This adds three new columns to the summarized data frame. The first two columns, lag_ymd and lag_amount, contain the lagged values of the sales_ymd and sum_amount columns, respectively. The third column, diff_amount, contains the difference between the sum_amount and lag_amount columns. The %>% operator again pipes this modified data frame to the next function in the pipeline.

slice(1:10) - This selects the first 10 rows of the modified data frame. The resulting data frame will have the columns sales_ymd, sum_amount, lag_ymd, lag_amount, and diff_amount, and will contain the first 10 rows of the original df_receipt data frame after it has been grouped, summarized, and modified.

Overall, this code performs some kind of time-series analysis on sales data, calculating the total amount of sales for each date, and then calculating the difference between each day's sales and the previous day's sales. The resulting data frame should give insight into the daily sales trends over time.

R-042: Tabulate the sales amount (amount) of the receipt details data (df_receipt) by date (sales_ymd), and for each date, combine the data of the last, previous and three previous sales dates. Then display 10 results.

In [ ]:

# Code example 1: Vertical holding case 
df_sum <-df_receipt %>% 
 group_by(sales_ymd) %>% 
 summarise(sum_amount = sum(amount), .groups = "drop") 

 for (i in 1:3) { 
    df_tmp <- df_sum %>% 
        mutate(lag_ymd = lag(sales_ymd, n = i), 
        lag_amount = lag(sum_amount, n = i)) 
    if (i == 1) { 
        df_lag <- df_tmp 
    } 
    else { 
         df_lag <- rbind(df_lag, df_tmp) 
    } 
} 

df_lag %>% 
drop_na(everything()) %>% 
arrange(sales_ymd, lag_ymd) %>% 
slice(1:10)

A tibble: 10 × 4

sales_ymd	sum_amount	lag_ymd	lag_amount
<int>	<int>	<int>	<int>
20170102	24165	20170101	33723
20170103	27503	20170101	33723
20170103	27503	20170102	24165
20170104	36165	20170101	33723
20170104	36165	20170102	24165
20170104	36165	20170103	27503
20170105	37830	20170102	24165
20170105	37830	20170103	27503
20170105	37830	20170104	36165
20170106	32387	20170103	27503

Commentary :

This code is also manipulating a data frame called df_receipt using the dplyr package in R. Here is a step-by-step explanation of what the code is doing:

df_sum <-df_receipt %>% group_by(sales_ymd) %>% summarise(sum_amount = sum(amount), .groups = "drop") - This is similar to the first line of code in the previous example, which groups the df_receipt data frame by the sales_ymd column, calculates the sum of the amount column for each unique sales_ymd group, and stores the resulting data frame in a new variable called df_sum.

for (i in 1:3) { - This initiates a for loop that will iterate 3 times.

df_tmp <- df_sum %>% mutate(lag_ymd = lag(sales_ymd, n = i), lag_amount = lag(sum_amount, n = i)) - This creates a new temporary data frame called df_tmp that is created by lagging the sales_ymd and sum_amount columns in the df_sum data frame by i days using the lag() function from dplyr. The resulting data frame will have three columns: sales_ymd, sum_amount, lag_ymd, and lag_amount.

if (i == 1) { df_lag <- df_tmp } - If this is the first iteration of the loop, the df_lag data frame is set equal to df_tmp.

else { df_lag <- rbind(df_lag, df_tmp) } - If this is not the first iteration of the loop, df_tmp is appended to the existing df_lag data frame using rbind() function from dplyr.

} - The for loop ends here.

df_lag %>% drop_na(everything()) %>% arrange(sales_ymd, lag_ymd) %>% slice(1:10) - This takes the df_lag data frame that was created in the loop, drops any rows with missing data using drop_na(), arranges the rows by sales_ymd and lag_ymd using arrange(), and then selects the first 10 rows of the resulting data frame using slice().

The purpose of this code is to create a new data frame called df_lag that contains the lagged values of the sales_ymd and sum_amount columns for each of the previous 3 days. The resulting data frame can be used to examine the sales trends over the previous 3 days, and may be useful in predicting future sales. The final line of code simply selects the first 10 rows of the resulting data frame for display or further analysis.

In [ ]:

# Code example 2: Horizontal holding case 
df_sum <-df_receipt %>% 
 group_by(sales_ymd) %>% 
 summarise(sum_amount = sum(amount), .groups = "drop") 

 for (i in 1:3) { 
     col_name_1 <- paste("lag_ymd_", i , sep="_") 
     col_name_2 <- paste("lag_amount", i , sep="_") 
     df_tmp <- df_sum %>% 
         mutate(!!col_name_1 := lag(sales_ymd, n　=　i), 
                !!col_name_2 := lag(sum_amount, n　=　i)) 
     if (i == 1)　{ 
         df_lag <- df_tmp 
     } 
     else { 
         df_lag <- cbind(df_lag, df_tmp[c(col_name_1, col_name_2)]) 
     } 
} 

df_lag %>% 
drop_na(everything()) %>% 
arrange(sales_ymd) %>% 
slice(1:10)

A data.frame: 10 × 8

sales_ymd	sum_amount	lag_ymd__1	lag_amount_1	lag_ymd__2	lag_amount_2	lag_ymd__3	lag_amount_3
<int>	<int>	<int>	<int>	<int>	<int>	<int>	<int>
20170104	36165	20170103	27503	20170102	24165	20170101	33723
20170105	37830	20170104	36165	20170103	27503	20170102	24165
20170106	32387	20170105	37830	20170104	36165	20170103	27503
20170107	23415	20170106	32387	20170105	37830	20170104	36165
20170108	24737	20170107	23415	20170106	32387	20170105	37830
20170109	26718	20170108	24737	20170107	23415	20170106	32387
20170110	20143	20170109	26718	20170108	24737	20170107	23415
20170111	24287	20170110	20143	20170109	26718	20170108	24737
20170112	23526	20170111	24287	20170110	20143	20170109	26718
20170113	28004	20170112	23526	20170111	24287	20170110	20143

Commentary :

This code is also manipulating a data frame called df_receipt using the dplyr package in R. Here is a step-by-step explanation of what the code is doing:

df_sum <-df_receipt %>% group_by(sales_ymd) %>% summarise(sum_amount = sum(amount), .groups = "drop") - This is similar to the first line of code in the previous examples, which groups the df_receipt data frame by the sales_ymd column, calculates the sum of the amount column for each unique sales_ymd group, and stores the resulting data frame in a new variable called df_sum.

for (i in 1:3) { - This initiates a for loop that will iterate 3 times.

col_name_1 <- paste("lag_ymd_", i , sep="_") - This creates a new character string variable called col_name_1 that concatenates the prefix "lag_ymd_" with the current value of i.

col_name_2 <- paste("lag_amount", i , sep="_") - This creates a new character string variable called col_name_2 that concatenates the prefix "lag_amount_" with the current value of i.

df_tmp <- df_sum %>% mutate(!!col_name_1 := lag(sales_ymd, n = i), !!col_name_2 := lag(sum_amount, n = i)) - This creates a new temporary data frame called df_tmp that is created by lagging the sales_ymd and sum_amount columns in the df_sum data frame by i days using the lag() function from dplyr. However, instead of adding new columns with static names like in the previous example, this code dynamically creates column names by using the string variables created in steps 3 and 4. The !! operator is used to unquote the variable names so that they are evaluated as column names.

if (i == 1)　{ df_lag <- df_tmp } - If this is the first iteration of the loop, the df_lag data frame is set equal to df_tmp.

else { df_lag <- cbind(df_lag, df_tmp[c(col_name_1, col_name_2)]) } - If this is not the first iteration of the loop, the df_tmp data frame is appended to the existing df_lag data frame using cbind(). However, instead of selecting specific columns like in the previous example, this code dynamically selects the columns whose names match the values of col_name_1 and col_name_2.

} - The for loop ends here.

df_lag %>% drop_na(everything()) %>% arrange(sales_ymd) %>% slice(1:10) - This takes the df_lag data frame that was created in the loop, drops any rows with missing data using drop_na(), arranges the rows by sales_ymd using arrange(), and then selects the first 10 rows of the resulting data frame using slice().

The purpose of this code is similar to the previous example, which is to create a new data frame called df_lag that contains the lagged values of the sales_ymd and sum_amount columns for each of the previous 3 days. However, instead of hard-coding column names, this code dynamically generates

R-043: Combine receipt details data (df_receipt) and customer data (df_customer) to create sales summary data, summing the sales amount (amount) by gender code (gender_cd) and age (calculated from age). Gender_cd should be 0 for male, 1 for female and 9 for unknown.

However, the item structure should consist of four items: age, female sales amount, male sales amount and sales amount of unknown gender (vertical cross tabulation of age and horizontal cross tabulation of gender). In addition, the age shall be in deciles of 10 years.

In [ ]:

df_sales_summary <- df_customer[c("customer_id", "gender_cd", "age")] %>% 
 mutate(era = trunc(age / 10) * 10) %>% 
 inner_join(df_receipt, by = "customer_id") %>% 
 group_by(gender_cd, era) %>% 
 summarise(sum_amount = sum(amount), .groups = "drop") %>% 
 spread( gender_cd, sum_amount, fill = 0) %>% 
rename(male = "0", female = "1", unknown = "9") 
df_sales_summary

A tibble: 9 × 4

era	male	female	unknown
<dbl>	<dbl>	<dbl>	<dbl>
10	1591	149836	4317
20	72940	1363724	44328
30	177322	693047	50441
40	19355	9320791	483512
50	54320	6685192	342923
60	272469	987741	71418
70	13435	29764	2427
80	46360	262923	5111
90	0	6260	0

Commentary :

This code creates a summary of sales amounts by customer gender and age era (i.e., decade of age), using data from two data frames: df_customer and df_receipt.

First, the code selects three columns from df_customer (customer_id, gender_cd, and age) and creates a new variable era by rounding down age to the nearest decade.

Next, it joins the df_receipt data frame by customer_id, creating a new data frame with information about customer gender, age era, and sales amounts.

The data is then grouped by gender_cd and era, and the total sales amount is calculated using summarise().

spread() is used to pivot the data so that there is one row for each era, with columns for total sales amounts for each gender.

Finally, rename() is used to give more descriptive names to the columns representing male, female, and unknown gender.

The result is a data frame called df_sales_summary that summarizes total sales amounts by gender and age era.

R-044: The sales summary data (df_sales_summary) created in 043 had sales by gender carried horizontally. From this data, convert the gender into three items: age, gender code and sales amount, by having the gender carried vertically. The gender code should be “00” for male, “01” for female and “99” for unknown.

In [ ]:

df_sales_summary　%>% 
 gather(key = gender_cd, value = sum_amount, male, female, unknown) %>% 
 mutate(gender_cd = case_when( 
 gender_cd == "male" ~ "00", 
 gender_cd == "female" ~ "01", 
 gender_cd == "unknown" ~ "99", 
))

A tibble: 27 × 3

era	gender_cd	sum_amount
<dbl>	<chr>	<dbl>
10	00	1591
20	00	72940
30	00	177322
40	00	19355
50	00	54320
60	00	272469
70	00	13435
80	00	46360
90	00	0
10	01	149836
20	01	1363724
30	01	693047
40	01	9320791
50	01	6685192
60	01	987741
70	01	29764
80	01	262923
90	01	6260
10	99	4317
20	99	44328
30	99	50441
40	99	483512
50	99	342923
60	99	71418
70	99	2427
80	99	5111
90	99	0

Commentary :

This code modifies the df_sales_summary data frame that was created in the previous code block.

gather() function is used to transform the data from wide format to long format by stacking the columns male, female, and unknown into a single column called sum_amount, and creating a new column called gender_cd to store the previous column names (male, female, or unknown).

mutate() function is used to recode the values in the gender_cd column. It uses the case_when() function to create a new column with three possible values: "00" for male, "01" for female, and "99" for unknown gender.

Overall, this code is preparing the df_sales_summary data frame for merging with other data frames, by reshaping it into long format and recoding the gender values to a standard format.

R-045: The date of birth (birth_day) in the customer data (df_customer) is held in date type. Convert this to a string in YYYYMMDD format and display 10 cases together with the customer ID (customer_id).

In [ ]:

df_tmp <- cbind(df_customer["customer_id"], 
               strftime( 
                   df_customer$birth_day, 
                   format = "%Y%m%d" )) 

colnames(df_tmp) <- c("customer_id", "birth_day") 

head(df_tmp,10)

A data.frame: 10 × 2

	customer_id	birth_day
	<chr>	<chr>
1	CS021313000114	19810429
2	CS037613000071	19520401
3	CS031415000172	19761004
4	CS028811000001	19330327
5	CS001215000145	19950329
6	CS020401000016	19740915
7	CS015414000103	19770809
8	CS029403000008	19730817
9	CS015804000004	19310502
10	CS033513000180	19620711

Commentary :

This code creates a new data frame called df_tmp by selecting the customer_id and birth_day columns from the df_customer data frame.

The df_customer["customer_id"] syntax selects the customer_id column as a data frame (not as a vector), which is necessary for combining it with another column using cbind().

The strftime() function is used to format the birth_day column as a character string in the %Y%m%d format, which represents the year, month, and day of birth as a 8-digit number.

colnames() function is used to assign new column names to the data frame, so that the columns are named "customer_id" and "birth_day".

Finally, head() function is used to show the first 10 rows of the df_tmp data frame.

Overall, this code creates a new data frame with two columns that can be used to merge with other data frames based on customer ID or birth date.

R-046: The application date (application_date) of the customer data (df_customer) holds data in a string type in YYYYMMDD format. Convert this to a date type and display 10 cases together with the customer ID (customer_id).

In [ ]:

df_tmp <- cbind(df_customer["customer_id"], 
strptime(df_customer$application_date, "%Y%m%d")) 

colnames(df_tmp) <- c("customer_id", "application_date") 

head(df_tmp, 10)

A data.frame: 10 × 2

	customer_id	application_date
	<chr>	<dttm>
1	CS021313000114	2015-09-05
2	CS037613000071	2015-04-14
3	CS031415000172	2015-05-29
4	CS028811000001	2016-01-15
5	CS001215000145	2017-06-05
6	CS020401000016	2015-02-25
7	CS015414000103	2015-07-22
8	CS029403000008	2015-05-15
9	CS015804000004	2015-06-07
10	CS033513000180	2015-07-28

Commentary :

This code creates a new data frame called df_tmp by selecting the customer_id and application_date columns from the df_customer data frame.

The df_customer["customer_id"] syntax selects the customer_id column as a data frame (not as a vector), which is necessary for combining it with another column using cbind().

The strptime() function is used to convert the application_date column, which is a character string in the %Y%m%d format representing the year, month, and day of application as an 8-digit number, to a POSIXlt class object that represents date and time in R. The %Y%m%d argument specifies the format of the character string to be converted.

colnames() function is used to assign new column names to the data frame, so that the columns are named "customer_id" and "application_date".

Finally, head() function is used to show the first 10 rows of the df_tmp data frame.

Overall, this code creates a new data frame with two columns that can be used to merge with other data frames based on customer ID or application date.

R-047: The sales date (sales_ymd) of the receipt details data (df_receipt) holds data in numeric type in YYYYMMDD format. Convert this to a date type and display 10 items together with the receipt number (receipt_no) and receipt sub-number (receipt_sub_no).

In [ ]:

df_tmp <- cbind(df_receipt[c("receipt_no", "receipt_sub_no")], strptime(as.character(df_receipt$sales_ymd), "%Y%m%d")) 

colnames(df_tmp) <- c("receipt_no", "receipt_sub_no", "sales_ymd") 

head(df_tmp, 10)

A data.frame: 10 × 3

	receipt_no	receipt_sub_no	sales_ymd
	<int>	<int>	<dttm>
1	112	1	2018-11-03
2	1132	2	2018-11-18
3	1102	1	2017-07-12
4	1132	1	2019-02-05
5	1102	2	2018-08-21
6	1112	1	2019-06-05
7	1102	2	2018-12-05
8	1102	1	2019-09-22
9	1112	2	2017-05-04
10	1102	1	2019-10-10

Commentary :

This code creates a new data frame called df_tmp by selecting the receipt_no, receipt_sub_no, and sales_ymd columns from the df_receipt data frame, and converting the sales_ymd column from an integer in yyyymmdd format to a POSIXlt class object that represents date and time in R.

The df_receipt[c("receipt_no", "receipt_sub_no")] syntax selects the receipt_no and receipt_sub_no columns as a data frame (not as a vector), which is necessary for combining them with another column using cbind().

as.character() is used to convert the sales_ymd column from an integer to a character vector before passing it to the strptime() function.

The strptime() function is used to convert the character vector sales_ymd to a POSIXlt class object that represents date and time in R. The %Y%m%d argument specifies the format of the character string to be converted.

colnames() function is used to assign new column names to the data frame, so that the columns are named "receipt_no", "receipt_sub_no", and "sales_ymd".

Finally, head() function is used to show the first 10 rows of the df_tmp data frame.

Overall, this code creates a new data frame with three columns that can be used to merge with other data frames based on receipt number, receipt sub-number, or sales date.

R-048: The sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) holds data in UNIX seconds of numeric type. Convert it to date type and display 10 cases together with receipt number (receipt_no) and receipt sub-number (receipt_sub_no).

In [ ]:

df_tmp <- cbind(df_receipt[c("receipt_no", "receipt_sub_no")], as.POSIXct(df_receipt$sales_epoch, origin="1970-01-01")) 

colnames(df_tmp) <- c("receipt_no", "receipt_sub_no", "sales_ymd") 

head(df_tmp, 10)

A data.frame: 10 × 3

	receipt_no	receipt_sub_no	sales_ymd
	<int>	<int>	<dttm>
1	112	1	2018-11-03
2	1132	2	2018-11-18
3	1102	1	2017-07-12
4	1132	1	2019-02-05
5	1102	2	2018-08-21
6	1112	1	2019-06-05
7	1102	2	2018-12-05
8	1102	1	2019-09-22
9	1112	2	2017-05-04
10	1102	1	2019-10-10

Commentary :

This code creates a new data frame df_tmp by combining selected columns from the original data frame df_receipt. Specifically, it selects the columns "receipt_no", "receipt_sub_no", and "sales_epoch" from df_receipt and binds them together with the function cbind().

The third column "sales_epoch" contains the number of seconds since January 1, 1970, known as the Unix epoch. The as.POSIXct() function is then used to convert this numerical value to a POSIXct object, which represents a date and time in R. The origin argument is set to "1970-01-01" to indicate that the epoch starts at midnight on January 1, 1970.

Finally, the function colnames() is used to rename the columns of df_tmp to "receipt_no", "receipt_sub_no", and "sales_ymd", respectively.

The resulting df_tmp data frame contains the same two columns as df_receipt ("receipt_no" and "receipt_sub_no"), plus a new column "sales_ymd" with the converted date and time information from "sales_epoch". The head() function is used to display the first 10 rows of df_tmp.

R-049: Convert the sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) to date type, extract only the “年(year)” and display 10 items with receipt number (receipt_no) and receipt sub-number (receipt_sub_no).

In [ ]:

df_tmp <- cbind(df_receipt[c("receipt_no", "receipt_sub_no")], substring(as.POSIXct(df_receipt$sales_epoch, origin = "1970-01-01"), 1, 4)) 

colnames(df_tmp) <- c("receipt_no", "receipt_sub_no", "sales_year") 

head(df_tmp, 10)

A data.frame: 10 × 3

	receipt_no	receipt_sub_no	sales_year
	<int>	<int>	<chr>
1	112	1	2018
2	1132	2	2018
3	1102	1	2017
4	1132	1	2019
5	1102	2	2018
6	1112	1	2019
7	1102	2	2018
8	1102	1	2019
9	1112	2	2017
10	1102	1	2019

Commentary :

This code is creating a new data frame named df_tmp by binding three columns together from the df_receipt data frame. The first two columns, receipt_no and receipt_sub_no, are being selected directly from df_receipt.

The third column, sales_year, is being created using the substring() function to extract the first 4 characters from the output of as.POSIXct(), which is being used to convert the sales_epoch column to a date/time format.

The as.POSIXct() function is using the origin parameter to specify that the Unix epoch time (1970-01-01) is being used as the reference point. The substring() function is then used to extract the year from this date/time format.

Finally, colnames() is used to assign the column names "receipt_no", "receipt_sub_no", and "sales_year" to the corresponding columns in the new data frame. The head() function is then used to display the first 10 rows of the new data frame.

R-050: Convert the sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) to date type, extract only the “月(month)” and display 10 items with receipt number (receipt_no) and receipt sub number (receipt_sub_no). Note that the “月(month)” should be retrieved with two digits filled with zero.

In [ ]:

df_tmp <- cbind(df_receipt[c("receipt_no", "receipt_sub_no")], substring(as.POSIXct(df_receipt$sales_epoch, origin = "1970-01-01"), 6, 7)) 

colnames(df_tmp) <- c("receipt_no", "receipt_sub_no", "sales_ymd") 

head(df_tmp, 10)

A data.frame: 10 × 3

	receipt_no	receipt_sub_no	sales_ymd
	<int>	<int>	<chr>
1	112	1	11
2	1132	2	11
3	1102	1	07
4	1132	1	02
5	1102	2	08
6	1112	1	06
7	1102	2	12
8	1102	1	09
9	1112	2	05
10	1102	1	10

Commentary :

This code creates a new data frame df_tmp that contains the columns "receipt_no", "receipt_sub_no", and "sales_ymd". The "sales_ymd" column is derived from the "sales_epoch" column in df_receipt.

Specifically, the substring() function is used to extract the month component from the "sales_epoch" column by specifying the starting position as 6 (which corresponds to the first character of the month component) and the ending position as 7 (which corresponds to the last character of the month component). The resulting character string represents the month of the sales date.

The new data frame df_tmp is created by combining the selected columns from df_receipt with the extracted month component using the cbind() function. Finally, the column names of df_tmp are updated to "receipt_no", "receipt_sub_no", and "sales_ymd".

Overall, this code extracts the month of each sales date from the "sales_epoch" column of df_receipt and adds it as a new column named "sales_ymd" to a new data frame df_tmp.

R-051: Convert the sales epoch seconds of the receipt details data (df_receipt) to date type, extract only the “日(day)” and display 10 items with receipt number (receipt_no) and receipt sub-number (receipt_sub_no). Note that the “日(day)” should be extracted with two digits filled with zero.

In [ ]:

df_tmp <- cbind(df_receipt[c("receipt_no", "receipt_sub_no")], substring(as.POSIXct(df_receipt$sales_epoch, origin = "1970-01-01"), 9, 10)) 

colnames(df_tmp) <- c("receipt_no", "receipt_sub_no", "sales_day") 

head(df_tmp, 10)

A data.frame: 10 × 3

	receipt_no	receipt_sub_no	sales_day
	<int>	<int>	<chr>
1	112	1	03
2	1132	2	18
3	1102	1	12
4	1132	1	05
5	1102	2	21
6	1112	1	05
7	1102	2	05
8	1102	1	22
9	1112	2	04
10	1102	1	10

Commentary :

This code creates a new data frame called df_tmp that has three columns - "receipt_no", "receipt_sub_no", and "sales_day". The "receipt_no" and "receipt_sub_no" columns are copied from the "df_receipt" data frame. The "sales_day" column is created by extracting the day from the "sales_epoch" column of "df_receipt" data frame.

The "sales_epoch" column is first converted to a POSIXct object using the as.POSIXct function with the origin set to "1970-01-01". This is a common approach to converting Unix timestamps, which are measured in seconds since January 1, 1970, into R's datetime format. The substring function is then used to extract the day portion of the datetime object. The substring function takes three arguments - the input string, the starting position of the substring, and the ending position of the substring. In this case, the substring starts at position 9 (which is the day portion of the datetime string) and ends at position 10.

Finally, the column names are updated to "receipt_no", "receipt_sub_no", and "sales_day" using the colnames function. The resulting data frame df_tmp shows the "receipt_no", "receipt_sub_no", and "sales_day" columns for the first 10 rows of data.

R-052: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), binaryise the total sales amount to 0 for amounts less than 2,000 yen and 1 for amounts greater than 2,000 yen, and display 10 items with the customer ID and total sales amount. Display 10 items together with the customer ID and the total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.

In [ ]:

df_receipt %>% 
 filter(!grepl("^Z", customer_id)) %>% 
 group_by(customer_id) %>% 
 summarise(sum_amount=sum(amount), .groups = "drop") %>% 
 mutate(sales_flg = ifelse(sum_amount > 2000, 1, 0))　%>% 
 slice(1:10)

A tibble: 10 × 3

customer_id	sum_amount	sales_flg
<chr>	<int>	<dbl>
CS001113000004	1298	0
CS001114000005	626	0
CS001115000010	3044	1
CS001205000004	1988	0
CS001205000006	3337	1
CS001211000025	456	0
CS001212000027	448	0
CS001212000031	296	0
CS001212000046	228	0
CS001212000070	456	0

Commentary :

This code performs the following operations on the df_receipt dataframe:

Filters out all the rows where the customer_id column starts with the letter "Z".Groups 

the remaining rows by customer_id.

Calculates the sum of the amount column for each customer_id.

Creates a new column sales_flg that is set to 1 if the sum of the amount column is greater than 2000, and 0 otherwise.

Selects the first 10 rows of the resulting dataframe.

The %>% operator is used to chain the operations together, making the code more concise and easier to read.

Here's a step-by-step breakdown of what's happening:

filter(!grepl("^Z", customer_id)): The filter() function removes all rows where the customer_id column starts with the letter "Z". The !grepl() function returns a logical vector indicating whether each element in the customer_id column does not match the pattern ^Z (i.e., does not start with the letter "Z"). The ! operator negates the logical vector, so that filter() keeps only the rows where the vector is TRUE.

group_by(customer_id): The group_by() function groups the remaining rows by customer_id.

summarise(sum_amount=sum(amount), .groups = "drop"): The summarise() function calculates the sum of the amount column for each customer_id, and stores the result in a new column called sum_amount. The .groups = "drop" argument tells summarise() to drop the grouping structure after summarizing the data.

mutate(sales_flg = ifelse(sum_amount > 2000, 1, 0)): The mutate() function creates a new column called sales_flg, which is set to 1 if the sum_amount column is greater than 2000, and 0 otherwise. The ifelse() function is a vectorized version of the standard if...else statement, which returns a vector of the same length as the condition vector, with elements selected from either the yes or no vectors depending on whether the element of the condition vector is TRUE or FALSE.

slice(1:10): The slice() function selects the first 10 rows of the resulting dataframe.

R-053: Binarise the postcode (postal_cd) of the customer data (df_customer) to 1 for Tokyo (the first three digits are 100 to 209) and 0 for all others. Further, combine it with the receipt details data (df_receipt) and count the number of customers with sales results for the entire period, for each binary value created.

In [ ]:

df_customer[c("customer_id", "postal_cd")] %>% 
 mutate(postal_flg = 
        ifelse(100 <= as.integer(str_sub(postal_cd, start = 1, end = 3)) 
        & as.integer(str_sub(postal_cd, start = 1, end = 3)) <= 209, 1, 0)) %>% inner_join(df_receipt, by = "customer_id") %>% 
group_by(postal_flg) %>% 
summarise(customer_cnt = n_distinct(customer_id), .groups = "drop")

A tibble: 2 × 2

postal_flg	customer_cnt
<dbl>	<int>
0	3906
1	4400

Commentary :

This code performs the following tasks:

Selects the columns "customer_id" and "postal_cd" from the data frame df_customer.

Creates a new column "postal_flg" that is set to 1 if the first 3 characters of "postal_cd" represent a postal code between 100 and 209 (inclusive) and 0 otherwise. The ifelse function is used to perform this check.

Joins this modified df_customer data frame with the df_receipt data frame on the "customer_id" column.

Groups the resulting data frame by the "postal_flg" column.

Calculates the number of unique customer IDs for each value of "postal_flg" using the n_distinct function.

Drops the grouping information using the ".groups = 'drop'" argument.

Returns the resulting data frame with columns "postal_flg" and "customer_cnt".

Overall, this code is calculating the number of customers in df_receipt that have postal codes within the range of 100-209, grouped by whether or not they have a postal code in that range.

R-054: The address of the customer data (df_customer) is one of 埼玉県(Saitama-prefecture), 千葉県(Chiba-prefecture), 東京都(Tokyo-Metropolitan) or 神奈川県(Kanagawa-prefecture). Create a code value for each prefecture, and display 10 cases together with the customer ID and address. The values should be 11 for 埼玉県(Saitama-prefecture), 12 for 千葉県(Chiba-prefecture), 13 for 東京都(Tokyo-Metropolitan) and 14 for 神奈川県(Kanagawa-prefecture).

In [ ]:

# Code example 1 (cut out at fixed) 
df_customer %>% 
    mutate(prefecture_cd = 
           case_when( 
               str_sub(address, start = 1, end = 3) == "埼玉県" ~ "11", 
               str_sub(address, start = 1, end = 3) == "千葉県" ~ "12", 
               str_sub(address, start = 1, end = 3) == "東京都" ~ "13", 
               str_sub(address, start = 1, end = 3) == "神奈川" ~ "14")) %>% select(customer_id, address, prefecture_cd) %>% 
slice(1:10)

A data.frame: 10 × 3

customer_id	address	prefecture_cd
<chr>	<chr>	<chr>
CS021313000114	神奈川県伊勢原市粟窪**********	14
CS037613000071	東京都江東区南砂**********	13
CS031415000172	東京都渋谷区代々木**********	13
CS028811000001	神奈川県横浜市泉区和泉町**********	14
CS001215000145	東京都大田区仲六郷**********	13
CS020401000016	東京都板橋区若木**********	13
CS015414000103	東京都江東区北砂**********	13
CS029403000008	千葉県浦安市海楽**********	12
CS015804000004	東京都江東区北砂**********	13
CS033513000180	神奈川県横浜市旭区善部町**********	14

Commentary :

This code performs the following actions:

It takes a data frame df_customer as input.

It creates a new column prefecture_cd in the df_customer data frame.

The values of the prefecture_cd column are based on the values of the address column.

If the first three characters of the address column are "埼玉県", the value of the prefecture_cd column is set to "11".

If the first three characters of the address column are "千葉県", the value of the prefecture_cd column is set to "12".

If the first three characters of the address column are "東京都", the value of the prefecture_cd column is set to "13".

If the first three characters of the address column are "神奈川", the value of the prefecture_cd column is set to "14".

The customer_id, address, and prefecture_cd columns are selected for the output.

The slice(1:10) function is used to show the first 10 rows of the resulting data frame.

In [ ]:

# Code example 2 (using regular expressions) 
df_customer %>% 
 mutate(prefecture_cd = 
 case_when( 
   str_extract(address, pattern = "^.*?[都道府県]") == "埼玉県" ~ "11", 
   str_extract(address, pattern = "^.*?[都道府県]") == "千葉県" ~ "12", 
   str_extract(address, pattern = "^.*?[都道府県]") == "東京都" ~ "13", 
   str_extract(address, pattern = "^.*?[都道府県]") == "神奈川県" ~ "14")) %>% select(customer_id, address, prefecture_cd) %>% 
slice(1:10)

A data.frame: 10 × 3

customer_id	address	prefecture_cd
<chr>	<chr>	<chr>
CS021313000114	神奈川県伊勢原市粟窪**********	14
CS037613000071	東京都江東区南砂**********	13
CS031415000172	東京都渋谷区代々木**********	13
CS028811000001	神奈川県横浜市泉区和泉町**********	14
CS001215000145	東京都大田区仲六郷**********	13
CS020401000016	東京都板橋区若木**********	13
CS015414000103	東京都江東区北砂**********	13
CS029403000008	千葉県浦安市海楽**********	12
CS015804000004	東京都江東区北砂**********	13
CS033513000180	神奈川県横浜市旭区善部町**********	14

Commentary :

This code uses the dplyr package in R to create a new column prefecture_cd in the df_customer data frame that maps prefecture names to prefecture codes for a subset of prefectures in Japan. Specifically, it uses the mutate() function to add a new column to the data frame and the case_when() function to map prefecture names to codes based on a regular expression pattern matching.

Here is a line-by-line explanation of the code:

df_customer %>% starts a pipeline with the df_customer data frame as the input.

mutate(prefecture_cd = creates a new column prefecture_cd in the data frame.

case_when( starts a series of conditions to map prefecture names to codes.

str_extract(address, pattern = "^.*?[都道府県]") == "埼玉県" ~ "11", matches the regular expression pattern "^.*?[都道府県]" to extract the prefecture name from the address column and maps "埼玉県" to "11".

str_extract(address, pattern = "^.*?[都道府県]") == "千葉県" ~ "12", does the same for "千葉県" and maps it to "12".

str_extract(address, pattern = "^.*?[都道府県]") == "東京都" ~ "13", does the same for "東京都" and maps it to "13".

str_extract(address, pattern = "^.*?[都道府県]") == "神奈川県" ~ "14")) does the same for "神奈川県" and maps it to "14".

select(customer_id, address, prefecture_cd) selects only the columns customer_id, address, and prefecture_cd from the data frame.

slice(1:10) selects only the first 10 rows of the resulting data frame.

R-055: Sum the sales amount (amount) of the receipt details (df_receipt) data for each customer ID (customer_id) and find the quartile point of the total amount. Then, create category values for the total sales amount for each customer based on the following criteria, and display the 10 results together with the customer ID and the total sales amount. The category values shall be 1-4 in order.

Minimum value and above and below the first quartile – Assign 1.

First quartile and above but below the second quartile – Assign 2.

Second quartile and above but below the third quartile – Assign 3.

Third quartile and above – Assign 4.

In [ ]:

df_receipt %>% 
 group_by(customer_id) %>% 
 summarise(sum_amount = sum(amount), .groups = "drop") %>% 
 mutate(pct_group = case_when( 
    sum_amount < quantile(sum_amount)[2] ~ "1", 
    sum_amount < quantile(sum_amount)[3] ~ "2", 
    sum_amount < quantile(sum_amount)[4] ~ "3", 
    quantile(sum_amount)[4] <= sum_amount ~ "4" 
    )) %>% 
    slice(1:10)

A tibble: 10 × 3

customer_id	sum_amount	pct_group
<chr>	<int>	<chr>
CS001113000004	1298	2
CS001114000005	626	2
CS001115000010	3044	3
CS001205000004	1988	3
CS001205000006	3337	3
CS001211000025	456	1
CS001212000027	448	1
CS001212000031	296	1
CS001212000046	228	1
CS001212000070	456	1

Commentary :

This code performs the following operations:

It takes a dataframe df_receipt.

It groups the rows in df_receipt by customer_id.

For each group of rows with the same customer_id, it calculates the total sum of the amount column using the sum() function, and stores the result in a new column called sum_amount.

It drops the grouping structure using .groups = "drop".

It creates a new column pct_group using the case_when() function, which assigns each row to a quartile based on the value of the sum_amount column. The quartiles are determined using the quantile() function.

Finally, it selects the first 10 rows of the resulting dataframe and stores it in a new dataframe called df_tmp.

For example, if quantile(sum_amount) returns [100, 200, 300, 400], then the resulting pct_group column will have the value "1" for rows where sum_amount is less than 200, "2" for rows where sum_amount is between 200 and 299, "3" for rows where sum_amount is between 300 and 399, and "4" for rows where sum_amount is greater than or equal to 400.

R-056: Calculate the age in 10-year age increments based on the age (age) of the customer data (df_customer) and display 10 cases together with the customer ID (customer_id) and date of birth (birth_day). However, all age groups above 60 years should be assumed to be in the 60s. The name of the category representing the age is optional.

In [ ]:

# Code example 1 

df_customer[c("customer_id", "birth_day", "age")] %>% 
mutate(era = trunc(age / 10) * 10) %>% 
mutate(era = case_when( era < 60 ~ era, era >= 60 ~　60 )) %>% 
slice(1:10)

A data.frame: 10 × 4

customer_id	birth_day	age	era
<chr>	<date>	<int>	<dbl>
CS021313000114	1981-04-29	37	30
CS037613000071	1952-04-01	66	60
CS031415000172	1976-10-04	42	40
CS028811000001	1933-03-27	86	60
CS001215000145	1995-03-29	24	20
CS020401000016	1974-09-15	44	40
CS015414000103	1977-08-09	41	40
CS029403000008	1973-08-17	45	40
CS015804000004	1931-05-02	87	60
CS033513000180	1962-07-11	56	50

Commentary :

This code manipulates a data frame df_customer containing customer data. The code selects three columns: customer_id, birth_day, and age. Then, it calculates the era of each customer based on their age.

The first mutate function calculates the era of each customer by dividing their age by 10, truncating the result to the nearest integer, and then multiplying the integer by 10. For example, if a customer is 43 years old, their era will be 40. If a customer is 25 years old, their era will be 20.

The second mutate function replaces era values greater than or equal to 60 with 60. This is done to group all customers aged 60 or older into a single era category.

Finally, the code selects the first 10 rows of the resulting data frame, which contains the customer ID, birth date, age, and calculated era for each customer.

In [ ]:

# Code example 2 (when using min) 
df_customer[c("customer_id", "birth_day","age")] %>% 
mutate(era = sapply(1:nrow(df_customer), 
            function(x){ 
                return (min(trunc(.[x, 3] / 10) * 10, 60)) 
            })) %>% 
slice(1:10)

A data.frame: 10 × 4

customer_id	birth_day	age	era
<chr>	<date>	<int>	<dbl>
CS021313000114	1981-04-29	37	30
CS037613000071	1952-04-01	66	60
CS031415000172	1976-10-04	42	40
CS028811000001	1933-03-27	86	60
CS001215000145	1995-03-29	24	20
CS020401000016	1974-09-15	44	40
CS015414000103	1977-08-09	41	40
CS029403000008	1973-08-17	45	40
CS015804000004	1931-05-02	87	60
CS033513000180	1962-07-11	56	50

Commentary :

This code is used to calculate the "era" for each customer based on their birth date. Here's what each line does:

Select only the "customer_id", "birth_day", and "age" columns from the df_customer data frame.

Use the "mutate" function to add a new column called "era". The value of this column is calculated by dividing the "age" column by 10, taking the floor (i.e., rounding down to the nearest integer), and then multiplying by 10. This effectively rounds the age down to the nearest multiple of 10, giving an approximate "era" for each customer.

Use the "mutate" function again to adjust the era values. If the era is less than 60, it is left unchanged. If the era is greater than or equal to 60, it is set to 60.

Use the "slice" function to select the first 10 rows of the resulting data frame.

Note that this code uses the "sapply" function to apply the calculation to each row of the data frame. While this works, it can be slow for very large data frames. A more efficient approach would be to use vectorized operations instead.

R-057: Using the extraction results of 056 and the gender code (gender_cd), create new category data representing the combination of gender x age and display 10 cases. The value of the category representing the combination shall be arbitrary.

In [ ]:

# Generate a sex and age code by concatenating a one-digit sex code and a two-digit age code. 

df_customer[c("customer_id", "gender_cd", "birth_day", "age")] %>% 
 mutate(era = trunc(age / 10) * 10) %>% 
 mutate(era = case_when( era < 60 ~ formatC(era, width = 2,flag = "0"), 
 era >= 60 ~　formatC(60, width = 2,flag = "0"))) %>% 
 mutate(gender_era = paste(gender_cd, era, sep = "")) %>% 
slice(1:10)

A data.frame: 10 × 6

customer_id	gender_cd	birth_day	age	era	gender_era
<chr>	<chr>	<date>	<int>	<chr>	<chr>
CS021313000114	1	1981-04-29	37	30	130
CS037613000071	9	1952-04-01	66	60	960
CS031415000172	1	1976-10-04	42	40	140
CS028811000001	1	1933-03-27	86	60	160
CS001215000145	1	1995-03-29	24	20	120
CS020401000016	0	1974-09-15	44	40	040
CS015414000103	1	1977-08-09	41	40	140
CS029403000008	0	1973-08-17	45	40	040
CS015804000004	0	1931-05-02	87	60	060
CS033513000180	1	1962-07-11	56	50	150

Commentary :

This code manipulates a data frame called df_customer, which presumably contains information about customers. Here is an explanation of each line:

df_customer[c("customer_id", "gender_cd", "birth_day", "age")]: Selects the columns "customer_id", "gender_cd", "birth_day", and "age" from the data frame df_customer.

%>%: The pipe operator used to chain the data frame to the next operation.

mutate(era = trunc(age / 10) * 10): Adds a new column called "era" to the data frame, which is calculated by taking the age of each customer, dividing by 10, rounding down to the nearest integer, and multiplying by 10. This effectively bins the customers into 10-year age ranges.

mutate(era = case_when( era < 60 ~ formatC(era, width = 2,flag = "0"), era >= 60 ~ formatC(60, width = 2,flag = "0"))): Modifies the "era" column by applying a case_when statement, which converts the "era" values that are less than 60 to a two-digit string with a leading zero (e.g., "10" for ages 10-19), and sets the "era" values that are greater than or equal to 60 to "60".

mutate(gender_era = paste(gender_cd, era, sep = "")): Adds a new column called "gender_era" to the data frame, which is created by concatenating the "gender_cd" and "era" columns using the paste function. The sep = "" argument specifies that there should be no separator between the two values.

slice(1:10): Selects the first 10 rows of the data frame.

R-058: Create a dummy variable for the gender code (gender_cd) of the customer data (df_customer) and display 10 cases together with the customer ID (customer_id).

In [ ]:

# Code example 1 (all code values itemised) 
dummy_gender_model <- dummyVars(~gender_cd, data = df_customer, fullRank = FALSE) 

dummy_gender <- predict(dummy_gender_model, df_customer) 

head(cbind(df_customer["customer_id"], dummy_gender), 10)

A data.frame: 10 × 4

	customer_id	gender_cd0	gender_cd1	gender_cd9
	<chr>	<dbl>	<dbl>	<dbl>
1	CS021313000114	0	1	0
2	CS037613000071	0	0	1
3	CS031415000172	0	1	0
4	CS028811000001	0	1	0
5	CS001215000145	0	1	0
6	CS020401000016	1	0	0
7	CS015414000103	0	1	0
8	CS029403000008	1	0	0
9	CS015804000004	1	0	0
10	CS033513000180	0	1	0

Commentary :

This code creates dummy variables for a categorical variable "gender_cd" in the data frame df_customer. It first creates a model object using the dummyVars function from the "caret" package. The fullRank argument is set to FALSE to remove the first level of the categorical variable (which can be inferred from the other levels).

The second line applies the model to the original data frame df_customer to create a new data frame with the dummy variables. The predict function with the dummy_gender_model as the first argument and df_customer as the second argument applies the model to the data frame to create the dummy variables.

Finally, the code combines the customer_id column from df_customer with the dummy variables using the cbind function and displays the first 10 rows using head().

In [ ]:

# Code example 2 (one item can be deleted by setting fullRank=TRUE) 
dummy_gender_model <- dummyVars(~gender_cd, data = df_customer, fullRank = TRUE) 

dummy_gender <- predict(dummy_gender_model, df_customer) 

head(cbind(df_customer["customer_id"], dummy_gender), 3)

A data.frame: 3 × 3

	customer_id	gender_cd1	gender_cd9
	<chr>	<dbl>	<dbl>
1	CS021313000114	1	0
2	CS037613000071	0	1
3	CS031415000172	1	0

Commentary :

This code performs one-hot encoding on the gender_cd column of the df_customer dataframe using the dummyVars() and predict() functions from the caret package.

The dummyVars() function creates a set of dummy variables, representing the categories of the categorical variable specified in the formula. In this case, the formula ~gender_cd specifies that the gender_cd variable should be used to create the dummy variables. The fullRank = TRUE argument specifies that the reference level should be included in the dummy variable encoding.

The predict() function is then used to apply the encoding to the original df_customer dataframe. The resulting encoded dataframe is stored in the dummy_gender variable.

Finally, the head() function is used to display the first three rows of the dummy_gender dataframe, along with the customer_id column of the original df_customer dataframe, which is appended using cbind(). This allows us to see the encoding of the first few customers. The dummy variable encoding for each customer will consist of a 1 in the column corresponding to their gender and a 0 in the column corresponding to the other gender.

R-059: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), standardise the total sales amount to mean 0 and standard deviation 1 and display 10 cases with the customer ID and total sales amount. The standard deviation used for standardisation may be either the square root of the variance or the square root of the unbiased variance. However, exclude customer IDs starting with “Z”, as they represent non-members.

In [ ]:

# Standardised by standard deviation by square root of unbiased variance due to use of R scale 

df_receipt %>% 
 filter(!grepl("^Z", customer_id)) %>% 
 group_by(customer_id) %>% 
 summarise(sum_amount = sum(amount), .groups = "drop") %>% 
 mutate(std_amount = scale(sum_amount, center = TRUE, scale = TRUE)) %>% 
 slice(1:10)

A tibble: 10 × 3

customer_id	sum_amount	std_amount
<chr>	<int>	<dbl[,1]>
CS001113000004	1298	-0.4593502
CS001114000005	626	-0.7063478
CS001115000010	3044	0.1824025
CS001205000004	1988	-0.2057366
CS001205000006	3337	0.2900964
CS001211000025	456	-0.7688324
CS001212000027	448	-0.7717728
CS001212000031	296	-0.8276413
CS001212000046	228	-0.8526351
CS001212000070	456	-0.7688324

Commentary :

This code performs the following steps:

Filter rows in the df_receipt dataframe where the customer_id column does not start with the character "Z". This is done using the filter function from the dplyr package and the grepl function which matches a regular expression pattern to the customer_id column.

Group the resulting rows by customer_id using the group_by function.

Calculate the sum of the amount column for each group using the summarise function from dplyr.

Create a new column called std_amount that contains the standardized sum of amounts for each customer. The scale function is used to standardize the sum amounts by centering the data at the mean (center = TRUE) and scaling it to have a standard deviation of 1 (scale = TRUE).

Finally, the slice function is used to select the first 10 rows of the resulting dataframe, including the customer_id and std_amount columns.

R-060: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), normalise the total sales amount to a minimum value of 0 and a maximum value of 1 and display 10 items with the customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.

In [ ]:

df_receipt %>% filter(!grepl("^Z", customer_id)) %>% 
 group_by(customer_id) %>% 
 summarise(sum_amount = sum(amount), .groups = "drop") %>% 
 mutate( scale_amount = scale(sum_amount, center = min(sum_amount), scale = max(sum_amount) - min(sum_amount)))%>% 
 slice(1:10)

A tibble: 10 × 3

customer_id	sum_amount	scale_amount
<chr>	<int>	<dbl[,1]>
CS001113000004	1298	0.053354188
CS001114000005	626	0.024157108
CS001115000010	3044	0.129214460
CS001205000004	1988	0.083333333
CS001205000006	3337	0.141944734
CS001211000025	456	0.016770942
CS001212000027	448	0.016423358
CS001212000031	296	0.009819256
CS001212000046	228	0.006864790
CS001212000070	456	0.016770942

Commentary :

This code performs the following operations on the df_receipt data frame:

Filter out any rows where the customer_id starts with the letter "Z".

Group the remaining rows by customer_id.

Calculate the sum of amount for each group of customer_id.

Mutate the resulting data frame to add a new column scale_amount which is a scaled version of sum_amount. The scale function scales sum_amount so that its minimum value becomes 0 and its maximum value becomes 1.

Slice the resulting data frame to keep only the first 10 rows.

In summary, this code calculates the total amount of money spent by each customer in df_receipt, and then scales those values so that the minimum value becomes 0 and the maximum value becomes 1. The resulting scale_amount values can be useful for certain types of analysis, such as when you want to compare the relative spending patterns of different customers.

Data Science 100 Knocks (Structured Data Processing) - R

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

Data Science 100 Knocks (Structured Data Processing) - R Part1 (Q1 to Q20)

Data Science 100 Knocks (Structured Data Processing) - R Part2 (Q21 to Q40)

Data Science 100 Knocks (Structured Data Processing) - R Part3 (Q41 to Q60)

Data Science 100 Knocks (Structured Data Processing) - R Part4 (Q61 to Q80)

Data Science 100 Knocks (Structured Data Processing) - R Part5 (Q81 to Q100)

Data Science 100 Knocks (Structured Data Processing) - SQL

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.