参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック(構造化データ加工編)」
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
-
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers.
-
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
You can get preprocess_knock files to practice python, SQL, R from my github account
Introduction
- To begin with, run the following cells
- Import the required libraries and read the data from the database (PostgreSQL)
- The libraries that are expected to be used are imported in the following cells
- If there are other libraries you wish to use, install them as appropriate using install.packages().
- Names, addresses, etc. are dummy data and are not real.
require("RPostgreSQL")
require("tidyr")
require("dplyr")
require("stringr")
require("caret")
require("lubridate")
require("rsample")
require("recipes")
require("themis")
host <- "db"
port <- Sys.getenv()["PG_PORT"]
dbname <- Sys.getenv()["PG_DATABASE"]
user <- Sys.getenv()["PG_USER"]
password <- Sys.getenv()["PG_PASSWORD"]
con <- dbConnect(PostgreSQL(), host=host, port=port, dbname=dbname, user=user, password=password)
df_customer <- dbGetQuery(con,"SELECT * FROM customer")
df_category <- dbGetQuery(con,"SELECT * FROM category")
df_product <- dbGetQuery(con,"SELECT * FROM product")
df_receipt <- dbGetQuery(con,"SELECT * FROM receipt")
df_store <- dbGetQuery(con,"SELECT * FROM store")
df_geocode <- dbGetQuery(con,"SELECT * FROM geocode")
Loading required package: RPostgreSQL Loading required package: DBI Loading required package: tidyr Loading required package: dplyr Attaching package: ‘dplyr’ The following objects are masked from ‘package:stats’: filter, lag The following objects are masked from ‘package:base’: intersect, setdiff, setequal, union Loading required package: stringr Loading required package: caret Loading required package: ggplot2 Loading required package: lattice Loading required package: lubridate Attaching package: ‘lubridate’ The following objects are masked from ‘package:base’: date, intersect, setdiff, union Loading required package: rsample Loading required package: recipes Attaching package: ‘recipes’ The following object is masked from ‘package:stringr’: fixed The following object is masked from ‘package:stats’: step Loading required package: themis Registered S3 methods overwritten by 'themis': method from bake.step_downsample recipes bake.step_upsample recipes prep.step_downsample recipes prep.step_upsample recipes tidy.step_downsample recipes tidy.step_upsample recipes tunable.step_downsample recipes tunable.step_upsample recipes Attaching package: ‘themis’ The following objects are masked from ‘package:recipes’: step_downsample, step_upsample
Practice Questions
R-061: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), convert the total sales amount to ordinary logarithm (bottom 10) and display 10 items with customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.
df_receipt %>%
filter(!grepl("^Z", customer_id)) %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop") %>%
mutate(log_amount = log((sum_amount + 0.5), base = 10)) %>%
slice(1:10)
customer_id | sum_amount | log_amount |
---|---|---|
<chr> | <int> | <dbl> |
CS001113000004 | 1298 | 3.113442 |
CS001114000005 | 626 | 2.796921 |
CS001115000010 | 3044 | 3.483516 |
CS001205000004 | 1988 | 3.298526 |
CS001205000006 | 3337 | 3.523421 |
CS001211000025 | 456 | 2.659441 |
CS001212000027 | 448 | 2.651762 |
CS001212000031 | 296 | 2.472025 |
CS001212000046 | 228 | 2.358886 |
CS001212000070 | 456 | 2.659441 |
Commentary :
This code performs the following operations on thedf_receipt
data frame:
It filters out the rows where the customer_id starts with "Z" using thefilter
function.
It groups the data bycustomer_id
using thegroup_by
function.
It summarizes theamount
column for eachcustomer_id
by taking the sum using thesummarise
function.
It drops the groups created in step 2 using the.groups = "drop"
argument.
It creates a new column calledlog_amount
by taking the logarithm ofsum_amount
plus 0.5, using base 10.
It selects the first 10 rows of the resulting data frame using theslice
function.
Overall, this code creates a new column with the logarithm of the total amount spent by each customer in thedf_receipt
data frame, which is a common transformation used to normalize skewed data.
R-062: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), convert the total sales amount to the natural logarithm (bottom e) and display 10 items with the customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.
df_receipt %>%
filter(!grepl("^Z", customer_id)) %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop") %>%
mutate(log_amount = log(sum_amount + 0.5)) %>%
slice(1:10)
customer_id | sum_amount | log_amount |
---|---|---|
<chr> | <int> | <dbl> |
CS001113000004 | 1298 | 7.168965 |
CS001114000005 | 626 | 6.440149 |
CS001115000010 | 3044 | 8.021092 |
CS001205000004 | 1988 | 7.595136 |
CS001205000006 | 3337 | 8.112977 |
CS001211000025 | 456 | 6.123589 |
CS001212000027 | 448 | 6.105909 |
CS001212000031 | 296 | 5.692047 |
CS001212000046 | 228 | 5.431536 |
CS001212000070 | 456 | 6.123589 |
Commentary :
This code processes a data frame calleddf_receipt
. It does the following steps:
Filters out rows where thecustomer_id
starts with "Z". This is done using thefilter()
function from thedplyr
package with the!grepl()
function as a condition.
Groups the remaining rows bycustomer_id
, using thegroup_by()
function from thedplyr
package.
Calculates the total amount of each customer's purchases using thesum()
function on theamount
column. This is done using thesummarise()
function from thedplyr
package.
Adds a new column calledlog_amount
that contains the natural logarithm of thesum_amount
column plus 0.5. This is done using themutate()
function from thedplyr
package and thelog()
function from base R.
Finally, it selects the first 10 rows of the resulting data frame using theslice()
function from thedplyr
package.
R-063: Calculate the profit amount of each product from the unit price(unit price) and cost of the product(unit_cost) data (df_product) and display 10 results.
df_product %>%
mutate(unit_profit = unit_price - unit_cost) %>%
slice(1:10)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | unit_profit |
---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <int> | <int> | <int> |
P040101001 | 04 | 0401 | 040101 | 198 | 149 | 49 |
P040101002 | 04 | 0401 | 040101 | 218 | 164 | 54 |
P040101003 | 04 | 0401 | 040101 | 230 | 173 | 57 |
P040101004 | 04 | 0401 | 040101 | 248 | 186 | 62 |
P040101005 | 04 | 0401 | 040101 | 268 | 201 | 67 |
P040101006 | 04 | 0401 | 040101 | 298 | 224 | 74 |
P040101007 | 04 | 0401 | 040101 | 338 | 254 | 84 |
P040101008 | 04 | 0401 | 040101 | 420 | 315 | 105 |
P040101009 | 04 | 0401 | 040101 | 498 | 374 | 124 |
P040101010 | 04 | 0401 | 040101 | 580 | 435 | 145 |
Commentary :
This code takes the data framedf_product
and creates a new columnunit_profit
using themutate()
function. Theunit_profit
column is calculated as the difference betweenunit_price
andunit_cost
, which are existing columns indf_product
.
Theslice()
function is then used to show the first 10 rows of the resulting data frame with the newunit_profit
column included.
R-064: Calculate the overall average profit margin for each product from the unit price (unit_price) and cost (unit_cost) of the product data (df_product). Note, however, that unit_price and cost are missing.
df_product %>%
mutate(unit_profit_rate = (unit_price - unit_cost) / unit_price) %>%
summarise(total_mean = mean(unit_profit_rate, na.rm = TRUE))
total_mean |
---|
<dbl> |
0.2491139 |
Commentary :
This code is performing some calculations on a dataframe nameddf_product
.mutate
is used to create a new column namedunit_profit_rate
. This column is calculated by subtracting theunit_cost
from theunit_price
and then dividing the result byunit_price
.
Thesummarise
function is then used to compute the mean of theunit_profit_rate
column. The result is assigned to a new column namedtotal_mean
.
In summary, this code is calculating the average profit rate for each unit of product sold indf_product
.
R-065: For each product in the product data (df_product), find the new unit price that gives a profit margin of 30%. However, round down to the nearest yen. Then display 10 results and confirm that the profit margin is approximately around 30%. Note that there is a deficit in unit_price(unit_price) and unit_cost(unit_cost).
df_product[c("product_cd", "unit_price", "unit_cost")] %>%
mutate(new_price = trunc(unit_cost / 0.7)) %>%
mutate(new_profit_rate = (new_price - unit_cost)/ new_price) %>% slice(1:10)
product_cd | unit_price | unit_cost | new_price | new_profit_rate |
---|---|---|---|---|
<chr> | <int> | <int> | <dbl> | <dbl> |
P040101001 | 198 | 149 | 212 | 0.2971698 |
P040101002 | 218 | 164 | 234 | 0.2991453 |
P040101003 | 230 | 173 | 247 | 0.2995951 |
P040101004 | 248 | 186 | 265 | 0.2981132 |
P040101005 | 268 | 201 | 287 | 0.2996516 |
P040101006 | 298 | 224 | 320 | 0.3000000 |
P040101007 | 338 | 254 | 362 | 0.2983425 |
P040101008 | 420 | 315 | 450 | 0.3000000 |
P040101009 | 498 | 374 | 534 | 0.2996255 |
P040101010 | 580 | 435 | 621 | 0.2995169 |
Commentary :
This code performs the following steps:
It selects only the columns "product_cd", "unit_price", and "unit_cost" from the data frame "df_product".
It creates a new column called "new_price" by dividing the unit cost by 0.7 and truncating the result to the nearest integer. This is an arbitrary calculation intends to lower the price of the product by a certain percentage.
It creates another new column called "new_profit_rate" by subtracting the unit cost from the new price and dividing the result by the new price. This is meant to calculate the profit rate for the new price.
Finally, it selects the first 10 rows of the resulting data frame.
R-066: For each product in the product data (df_product), find a new unit price that gives a profit margin of 30%. This time, round to the nearest yen (rounding or rounding to even numbers is fine). Then display 10 results and check that the profit margin is around 30%. Note, however, that there are deficiencies in unit_price(unit_price) and unit_cost(unit_cost).
df_product[c("product_cd", "unit_price", "unit_cost")] %>%
mutate(new_price = round(unit_cost / 0.7)) %>%
mutate(new_profit_rate = (new_price - unit_cost) / new_price) %>%
slice(1:10)
product_cd | unit_price | unit_cost | new_price | new_profit_rate |
---|---|---|---|---|
<chr> | <int> | <int> | <dbl> | <dbl> |
P040101001 | 198 | 149 | 213 | 0.3004695 |
P040101002 | 218 | 164 | 234 | 0.2991453 |
P040101003 | 230 | 173 | 247 | 0.2995951 |
P040101004 | 248 | 186 | 266 | 0.3007519 |
P040101005 | 268 | 201 | 287 | 0.2996516 |
P040101006 | 298 | 224 | 320 | 0.3000000 |
P040101007 | 338 | 254 | 363 | 0.3002755 |
P040101008 | 420 | 315 | 450 | 0.3000000 |
P040101009 | 498 | 374 | 534 | 0.2996255 |
P040101010 | 580 | 435 | 621 | 0.2995169 |
Commentary :
This code is performing some calculations on the product data frame (df_product
).
First, it is selecting three columns,"product_cd"
,"unit_price"
, and"unit_cost"
.
Then, it is creating a new column,"new_price"
, which is calculated by dividing"unit_cost"
by 0.7 (i.e., reducing the cost by 30%) and rounding the result to the nearest whole number using theround()
function.
Next, it creates a new column,"new_profit_rate"
, which is calculated by subtracting"unit_cost"
from"new_price"
and dividing the result by"new_price"
.
Finally, it selects the first 10 rows of the resulting data frame and displays them.
This code essentially calculates the new price and profit rate for each product if the cost of production is reduced by 30%.
R-067: For each product in the product data (df_product), find a new unit price that gives a profit margin of 30%. This time, round up to the nearest yen. Then display 10 results and check that the profit margin is around 30%. Note, however, that there is a deficit in the unit price (unit_price) and cost (unit_cost).
df_product[c("product_cd", "unit_price", "unit_cost")] %>%
mutate(new_price = ceiling(unit_cost / 0.7)) %>%
mutate(new_profit_rate = (new_price - unit_cost) / new_price) %>%
slice(1:10)
product_cd | unit_price | unit_cost | new_price | new_profit_rate |
---|---|---|---|---|
<chr> | <int> | <int> | <dbl> | <dbl> |
P040101001 | 198 | 149 | 213 | 0.3004695 |
P040101002 | 218 | 164 | 235 | 0.3021277 |
P040101003 | 230 | 173 | 248 | 0.3024194 |
P040101004 | 248 | 186 | 266 | 0.3007519 |
P040101005 | 268 | 201 | 288 | 0.3020833 |
P040101006 | 298 | 224 | 320 | 0.3000000 |
P040101007 | 338 | 254 | 363 | 0.3002755 |
P040101008 | 420 | 315 | 451 | 0.3015521 |
P040101009 | 498 | 374 | 535 | 0.3009346 |
P040101010 | 580 | 435 | 622 | 0.3006431 |
Commentary :
This code manipulates thedf_product
data frame by adding two new variablesnew_price
andnew_profit_rate
.mutate(new_price = ceiling(unit_cost / 0.7))
creates a new columnnew_price
that contains the result of the expressionceiling(unit_cost / 0.7)
. Theceiling()
function rounds the result of the expressionunit_cost / 0.7
up to the nearest integer. This results in a new price that is 30% higher than the unit cost.mutate(new_profit_rate = (new_price - unit_cost) / new_price)
creates a new columnnew_profit_rate
that contains the result of the expression(new_price - unit_cost) / new_price
. This expression calculates the new profit rate, which is the percentage of the new price that represents profit. This is calculated as the difference between the new price and the unit cost, divided by the new price.
Theslice(1:10)
function is used to display the first 10 rows of the resulting data frame.
R-068: For each product in the product data (df_product), find the amount including tax at a consumption tax rate of 10%, rounding down fractions of a yen and displaying 10 results. Note that the unit price (unit_price) is missing.
head(cbind(product_cd = df_product$product_cd,
df_product$unit_price,
tax_price = trunc(df_product$unit_price * 1.1)), 10)
product_cd | unit_price | tax_price |
---|---|---|
P040101001 | 198 | 217 |
P040101002 | 218 | 239 |
P040101003 | 230 | 253 |
P040101004 | 248 | 272 |
P040101005 | 268 | 294 |
P040101006 | 298 | 327 |
P040101007 | 338 | 371 |
P040101008 | 420 | 462 |
P040101009 | 498 | 547 |
P040101010 | 580 | 638 |
Commentary :
This code uses thecbind
function to combine three columns ofdf_product
data frame, i.e.,product_cd
,unit_price
, andtax_price
, into a new data frame. Thehead
function is then used to display the first ten rows of this new data frame.
The first column of the new data frame is labeledproduct_cd
and contains the values of theproduct_cd
column fromdf_product
. The second column is labeleddf_product$unit_price
and contains the values of theunit_price
column fromdf_product
.
The third column is labeledtax_price
and is created by applying thetrunc
function to the result of multiplying each value of theunit_price
column by1.1
. This calculates the price of the product including a 10% tax.
Note that the use ofdf_product$
in the column name for the second column is unnecessary as thecbind
function already extracts theunit_price
column fromdf_product
.
R-069: Combine receipt details data (df_receipt) and product data (df_product), calculate the total sales value of all products for each customer and the total sales value of products whose category major classification code (category_major_cd) is “07” (bottled canned food), and find the ratio between the two. Only customers with a sales record for category major category code “07” (bottled canned food) should be selected, and 10 results should be displayed.
# Code example 1
df_tmp_1 <- df_receipt %>%
group_by(customer_id) %>%
summarise(sum_all=sum(amount))
df_tmp_2 <- inner_join(df_receipt, df_product[c("product_cd","category_major_cd")], by="product_cd") %>%
filter(category_major_cd == "07") %>% group_by(customer_id) %>%
summarise(sum_07 = sum(amount), .groups = "drop")
inner_join(df_tmp_1, df_tmp_2, by = "customer_id") %>%
mutate(sales_rate = sum_07 / sum_all) %>%
slice(1:10)
customer_id | sum_all | sum_07 | sales_rate |
---|---|---|---|
<chr> | <int> | <int> | <dbl> |
CS001113000004 | 1298 | 1298 | 1.0000000 |
CS001114000005 | 626 | 486 | 0.7763578 |
CS001115000010 | 3044 | 2694 | 0.8850197 |
CS001205000004 | 1988 | 346 | 0.1740443 |
CS001205000006 | 3337 | 2004 | 0.6005394 |
CS001212000027 | 448 | 200 | 0.4464286 |
CS001212000031 | 296 | 296 | 1.0000000 |
CS001212000046 | 228 | 108 | 0.4736842 |
CS001212000070 | 456 | 308 | 0.6754386 |
CS001213000018 | 243 | 145 | 0.5967078 |
Commentary :
This code performs the following operations:
Compute the total amount spent by each customer in thedf_receipt
data frame, and store it indf_tmp_1
.
Joindf_receipt
anddf_product
data frames byproduct_cd
, and filter out only the rows wherecategory_major_cd
is equal to"07"
.
Compute the sum of theamount
column for each customer in the filtered data frame, and store it indf_tmp_2
.
Joindf_tmp_1
anddf_tmp_2
data frames bycustomer_id
.
Compute the sales rate of category major code"07"
for each customer, and store it in a new column namedsales_rate
.
Slice the resulting data frame to show the first 10 rows.
Theinner_join
function is used to merge data frames on a common variable. Themutate
function adds a new column to the data frame, which is calculated based on existing columns. Theslice
function is used to extract a subset of rows from the data frame.
# Code example 2 (%>% to link all processes)
inner_join(df_receipt, df_product[c("product_cd","category_major_cd")], by="product_cd") %>% filter(category_major_cd == "07") %>%
group_by(customer_id) %>%
summarise(sum_07 = sum(amount), .groups = "drop") %>%
inner_join(df_receipt, by="customer_id") %>%
group_by(customer_id) %>%
summarise(sum_all=sum(amount), sum_07=max(sum_07), .groups = "drop") %>%
mutate(sales_rate = sum_07 / sum_all) %>%
slice(1:10)
customer_id | sum_all | sum_07 | sales_rate |
---|---|---|---|
<chr> | <int> | <int> | <dbl> |
CS001113000004 | 1298 | 1298 | 1.0000000 |
CS001114000005 | 626 | 486 | 0.7763578 |
CS001115000010 | 3044 | 2694 | 0.8850197 |
CS001205000004 | 1988 | 346 | 0.1740443 |
CS001205000006 | 3337 | 2004 | 0.6005394 |
CS001212000027 | 448 | 200 | 0.4464286 |
CS001212000031 | 296 | 296 | 1.0000000 |
CS001212000046 | 228 | 108 | 0.4736842 |
CS001212000070 | 456 | 308 | 0.6754386 |
CS001213000018 | 243 | 145 | 0.5967078 |
Commentary :
This code is calculating the sales rate of category 07 products for the top 10 customers who have bought the most of those products. Here is a breakdown of what each line does:inner_join(df_receipt, df_product[c("product_cd","category_major_cd")], by="product_cd")
: This joins thedf_receipt
anddf_product
data frames based on theproduct_cd
column, and only keeps theproduct_cd
andcategory_major_cd
columns fromdf_product
. The result is a new data frame that has information on the category of each product purchased indf_receipt
.%>% filter(category_major_cd == "07")
: This filters the joined data frame from step 1 to only keep rows where thecategory_major_cd
column is equal to "07".%>% group_by(customer_id) %>% summarise(sum_07 = sum(amount), .groups = "drop")
: This groups the filtered data frame from step 2 bycustomer_id
and calculates the sum of theamount
column for each group. The result is a new data frame that has the total amount spent on category 07 products by each customer.%>% inner_join(df_receipt, by="customer_id")
: This joins the data frame from step 3 with the originaldf_receipt
data frame based on thecustomer_id
column. The result is a new data frame that has the total amount spent by each customer on all products.%>% group_by(customer_id) %>% summarise(sum_all=sum(amount), sum_07=max(sum_07), .groups = "drop")
: This groups the joined data frame from step 4 bycustomer_id
and calculates the sum of theamount
column for each group, as well as the maximum value of thesum_07
column for each group. The result is a new data frame that has the total amount spent by each customer on all products, as well as the maximum amount spent by each customer on category 07 products.%>% mutate(sales_rate = sum_07 / sum_all)
: This calculates the sales rate for each customer by dividing the maximum amount spent on category 07 products (sum_07
) by the total amount spent on all products (sum_all
).%>% slice(1:10)
: This keeps only the top 10 rows of the data frame based on the sales rate column. The result is a new data frame that shows the top 10 customers with the highest sales rate for category 07 products, along with their total amount spent on all products and their maximum amount spent on category 07 products.
R-070: Calculate the number of days elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt) and display 10 items with the customer ID (customer_id), sales date and membership application date. (Note that sales_ymd is numeric and application_date is a string).
df_receipt[c("customer_id", "sales_ymd")] %>%
distinct(.,.keep_all=TRUE) %>% inner_join(df_customer[c("customer_id","application_date")], by="customer_id") %>% mutate(elapsed_days = as.integer( strptime(as.character(sales_ymd), "%Y%m%d") - strptime(application_date, "%Y%m%d"))) %>%
select(customer_id, sales_ymd, application_date, elapsed_days) %>%
slice(1:10)
customer_id | sales_ymd | application_date | elapsed_days |
---|---|---|---|
<chr> | <int> | <chr> | <int> |
CS006214000001 | 20181103 | 20150201 | 1371 |
CS008415000097 | 20181118 | 20150322 | 1337 |
CS028414000014 | 20170712 | 20150711 | 732 |
CS025415000050 | 20180821 | 20160131 | 933 |
CS003515000195 | 20190605 | 20150306 | 1552 |
CS024514000042 | 20181205 | 20151010 | 1152 |
CS040415000178 | 20190922 | 20150627 | 1548 |
CS027514000015 | 20191010 | 20151101 | 1439 |
CS025415000134 | 20190918 | 20150720 | 1521 |
CS021515000126 | 20171010 | 20150508 | 886 |
Commentary :
This code performs the following tasks:
Selects the columns 'customer_id' and 'sales_ymd' from the 'df_receipt' data frame.
Removes duplicated rows while keeping all columns ('distinct').
Joins the resulting data frame with the 'df_customer' data frame by 'customer_id'.
Adds a new column 'elapsed_days' that represents the difference in days between the 'sales_ymd' and 'application_date' columns, converted to integers.
Selects only the 'customer_id', 'sales_ymd', 'application_date', and 'elapsed_days' columns.
Returns the first 10 rows of the resulting data frame.
The inner workings of the code can be broken down into the following steps:
# Select customer_id and sales_ymd columns from df_receipt
df1 <- df_receipt[c("customer_id", "sales_ymd")]
# Remove duplicated rows while keeping all columns
df2 <- distinct(df1, .keep_all = TRUE)
# Join with df_customer on customer_id
df3 <- inner_join(df2, df_customer[c("customer_id", "application_date")], by = "customer_id")
# Add a new column with elapsed days
df4 <- mutate(df3, elapsed_days = as.integer(strptime(as.character(sales_ymd), "%Y%m%d") - strptime(application_date, "%Y%m%d")))
# Select columns and return first 10 rows
df5 <- select(df4, customer_id, sales_ymd, application_date, elapsed_days)
slice(df5, 1:10)
The code uses the dplyr
package to perform data manipulations. The distinct
function removes duplicated rows based on all columns, and the inner_join
function joins two data frames based on a common column. The mutate
function creates a new column in a data frame, and the select
function selects specific columns from a data frame. Finally, the slice
function returns a subset of rows from a data frame.
R-071: Calculate the number of months elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and membership application date (Note that sales_ymd is a number and application_date is a string).
df_receipt[c("customer_id", "sales_ymd")] %>%
distinct(., .keep_all = TRUE) %>%
inner_join(df_customer[c("customer_id", "application_date")], by = "customer_id") %>% mutate(elapsed_months = trunc(time_length( interval( strptime(application_date, "%Y%m%d"), strptime(as.character(sales_ymd), "%Y%m%d") ), "month"))) %>%
select(customer_id, sales_ymd, application_date, elapsed_months) %>%
slice(1:10)
customer_id | sales_ymd | application_date | elapsed_months |
---|---|---|---|
<chr> | <int> | <chr> | <dbl> |
CS006214000001 | 20181103 | 20150201 | 45 |
CS008415000097 | 20181118 | 20150322 | 43 |
CS028414000014 | 20170712 | 20150711 | 24 |
CS025415000050 | 20180821 | 20160131 | 30 |
CS003515000195 | 20190605 | 20150306 | 50 |
CS024514000042 | 20181205 | 20151010 | 37 |
CS040415000178 | 20190922 | 20150627 | 50 |
CS027514000015 | 20191010 | 20151101 | 47 |
CS025415000134 | 20190918 | 20150720 | 49 |
CS021515000126 | 20171010 | 20150508 | 29 |
Commentary :
This code performs the following operations:df_receipt[c("customer_id", "sales_ymd")]
selects only the columnscustomer_id
andsales_ymd
from thedf_receipt
data frame.distinct(., .keep_all = TRUE)
removes duplicate rows while keeping all columns. This ensures that there is only one row for each unique combination ofcustomer_id
andsales_ymd
.inner_join(df_customer[c("customer_id", "application_date")], by = "customer_id")
joins the result from the previous step with thedf_customer
data frame on thecustomer_id
column, selecting only thecustomer_id
andapplication_date
columns fromdf_customer
.mutate(elapsed_months = trunc(time_length( interval( strptime(application_date, "%Y%m%d"), strptime(as.character(sales_ymd), "%Y%m%d") ), "month")))
creates a new columnelapsed_months
that calculates the number of months between theapplication_date
andsales_ymd
dates. This is done by converting the dates to the appropriate format usingstrptime
, calculating the difference using theinterval
function, and then usingtime_length
to calculate the number of months. Thetrunc
function is used to round down to the nearest integer.select(customer_id, sales_ymd, application_date, elapsed_months)
selects only the columnscustomer_id
,sales_ymd
,application_date
, andelapsed_months
.slice(1:10)
selects only the first 10 rows of the resulting data frame.
Overall, this code is calculating the number of months between the application date and the sales date for each customer, based on their purchase history.
R-072: Calculate the number of years elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (df_customer) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and membership application date (Note that sales_ymd is a number and application_date is a string).
df_receipt[c("customer_id", "sales_ymd")] %>%
distinct(., .keep_all = TRUE) %>%
inner_join(df_customer[c("customer_id", "application_date")], by = "customer_id") %>% mutate(elapsed_years = trunc(time_length(interval( strptime(application_date, "%Y%m%d"), strptime(as.character(sales_ymd), "%Y%m%d")), "year")))%>%
select(customer_id, sales_ymd, application_date, elapsed_years) %>%
slice(1:10)
customer_id | sales_ymd | application_date | elapsed_years |
---|---|---|---|
<chr> | <int> | <chr> | <dbl> |
CS006214000001 | 20181103 | 20150201 | 3 |
CS008415000097 | 20181118 | 20150322 | 3 |
CS028414000014 | 20170712 | 20150711 | 2 |
CS025415000050 | 20180821 | 20160131 | 2 |
CS003515000195 | 20190605 | 20150306 | 4 |
CS024514000042 | 20181205 | 20151010 | 3 |
CS040415000178 | 20190922 | 20150627 | 4 |
CS027514000015 | 20191010 | 20151101 | 3 |
CS025415000134 | 20190918 | 20150720 | 4 |
CS021515000126 | 20171010 | 20150508 | 2 |
Commentary :
This code performs the following operations:
Extracts two columns,customer_id
andsales_ymd
, from thedf_receipt
data frame.
Removes duplicated rows of the above two columns and keeps only the first occurrence of each unique pair ofcustomer_id
andsales_ymd
.
Joins the resulting data frame withdf_customer
data frame on thecustomer_id
column.
Computes the elapsed years between theapplication_date
andsales_ymd
dates for each customer by creating an interval usinginterval
function and then usingtime_length
function to extract the number of years. Thetrunc
function is used to round down the result to an integer value.
Selects thecustomer_id
,sales_ymd
,application_date
, andelapsed_years
columns.
Finally, selects the first 10 rows of the resulting data frame using theslice
function.
In summary, this code computes the elapsed years between a customer'sapplication_date
and theirsales_ymd
date.
R-073: Calculate the elapsed time in epoch seconds from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and (Note that sales_ymd is held as a number and application_date as a string). Note that no time information is held, so each date should represent 0:00:00:00.
df_receipt[c("customer_id", "sales_ymd")] %>%
distinct(., .keep_all = TRUE) %>% inner_join(df_customer[c("customer_id","application_date")], by="customer_id") %>% mutate(elapsed_epoch = as.numeric(strptime(as.character(sales_ymd), "%Y%m%d")) - as.numeric(strptime(application_date, "%Y%m%d"))) %>%
select(customer_id, sales_ymd, application_date, elapsed_epoch) %>%
slice(1:10)
customer_id | sales_ymd | application_date | elapsed_epoch |
---|---|---|---|
<chr> | <int> | <chr> | <dbl> |
CS006214000001 | 20181103 | 20150201 | 118454400 |
CS008415000097 | 20181118 | 20150322 | 115516800 |
CS028414000014 | 20170712 | 20150711 | 63244800 |
CS025415000050 | 20180821 | 20160131 | 80611200 |
CS003515000195 | 20190605 | 20150306 | 134092800 |
CS024514000042 | 20181205 | 20151010 | 99532800 |
CS040415000178 | 20190922 | 20150627 | 133747200 |
CS027514000015 | 20191010 | 20151101 | 124329600 |
CS025415000134 | 20190918 | 20150720 | 131414400 |
CS021515000126 | 20171010 | 20150508 | 76550400 |
Commentary :
This code is processing some data related to customers and their purchase history. Here is an explanation of the code line by line:df_receipt[c("customer_id", "sales_ymd")]
: select the columns "customer_id" and "sales_ymd" from the "df_receipt" dataframe.distinct(., .keep_all = TRUE)
: remove duplicate rows while keeping all columns.inner_join(df_customer[c("customer_id","application_date")], by="customer_id")
: join the "df_customer" dataframe using the "customer_id" column.mutate(elapsed_epoch = as.numeric(strptime(as.character(sales_ymd), "%Y%m%d")) - as.numeric(strptime(application_date, "%Y%m%d")))
: create a new column called "elapsed_epoch" that calculates the number of seconds between the "sales_ymd" column and "application_date" column.select(customer_id, sales_ymd, application_date, elapsed_epoch)
: select the columns "customer_id", "sales_ymd", "application_date", and "elapsed_epoch" from the joined dataframe.slice(1:10)
: select the first 10 rows of the resulting dataframe.
R-074: For the sales date (sales_ymd) of the receipt details data (df_receipt), calculate the number of days elapsed since Monday of the week in question and display 10 items together with the sales date and the previous Monday (note that sales_ymd holds data in numerical form).
df_receipt["sales_ymd"] %>%
# The following converts Sunday to the last day of the previous week by shifting it one day forward because the start day is Sunday.
# In this situation, floor_date will get Sunday of the previous week as the start day of the week, so. # 1 is added to make the start day Monday.
mutate(monday = as.Date(floor_date( strptime(as.character(sales_ymd), "%Y%m%d") - 1 , unit = "week")) + 1) %>%
mutate(elapsed_days = as.integer( as.Date( strptime(as.character(sales_ymd), "%Y%m%d")) - monday)) %>%
select(sales_ymd, elapsed_days, monday) %>%
slice(1:10)
sales_ymd | elapsed_days | monday |
---|---|---|
<int> | <int> | <date> |
20181103 | 5 | 2018-10-29 |
20181118 | 6 | 2018-11-12 |
20170712 | 2 | 2017-07-10 |
20190205 | 1 | 2019-02-04 |
20180821 | 1 | 2018-08-20 |
20190605 | 2 | 2019-06-03 |
20181205 | 2 | 2018-12-03 |
20190922 | 6 | 2019-09-16 |
20170504 | 3 | 2017-05-01 |
20191010 | 3 | 2019-10-07 |
Commentary :
This code calculates the number of elapsed days since the start of the week (Monday) for each sales date in thedf_receipt
data frame. Here is a breakdown of the code:df_receipt["sales_ymd"]
: This selects thesales_ymd
column from thedf_receipt
data frame.%>%
: This is the pipe operator that passes the result from the previous operation to the next operation.mutate(monday = as.Date(floor_date( strptime(as.character(sales_ymd), "%Y%m%d") - 1 , unit = "week")) + 1)
: This creates a new column calledmonday
that represents the Monday of the week for each sales date. It does this by converting thesales_ymd
column to a date object, subtracting one day, rounding down to the previous week (with Sunday being the end of the week), and adding one day to make it the start of the week (Monday).%>%
: Another pipe operator.mutate(elapsed_days = as.integer( as.Date( strptime(as.character(sales_ymd), "%Y%m%d")) - monday))
: This creates a new column calledelapsed_days
that represents the number of days that have elapsed since the start of the week (Monday) for each sales date. It does this by converting thesales_ymd
column to a date object, subtracting the Monday of the week (from themonday
column), and then converting the result to an integer.%>%
: Another pipe operator.select(sales_ymd, elapsed_days, monday)
: This selects thesales_ymd
,elapsed_days
, andmonday
columns.%>%
: Another pipe operator.slice(1:10)
: This selects the first ten rows of the resulting data frame.
R-075: Randomly extract 1% of the data from the customer data (df_customer) and display the first 10 items.
head(sample_frac(tbl = df_customer, size = 0.01), 10)
customer_id | customer_name | gender_cd | gender | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | |
---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <date> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | |
1 | CS001612000311 | 岡本 美佳 | 1 | 女性 | 1951-03-21 | 68 | 211-0015 | 神奈川県川崎市中原区北谷町********** | S13001 | 20171118 | 0-00000000-0 |
2 | CS018415000209 | 西村 由美子 | 1 | 女性 | 1974-11-26 | 44 | 203-0051 | 東京都東久留米市小山********** | S13018 | 20151001 | C-20101026-D |
3 | CS020415000109 | 岡村 里奈 | 1 | 女性 | 1969-02-02 | 50 | 173-0001 | 東京都板橋区本町********** | S13020 | 20150207 | A-20100922-D |
4 | CS040403000046 | 辻 獅童 | 0 | 男性 | 1977-01-23 | 42 | 226-0016 | 神奈川県横浜市緑区霧が丘********** | S14040 | 20150813 | 0-00000000-0 |
5 | CS044313000008 | 河原 育子 | 1 | 女性 | 1985-06-03 | 33 | 144-0056 | 東京都大田区西六郷********** | S13044 | 20161204 | 0-00000000-0 |
6 | CS038502000025 | 小柳 ケンイチ | 0 | 男性 | 1963-06-01 | 55 | 134-0015 | 東京都江戸川区西瑞江********** | S13038 | 20151214 | 0-00000000-0 |
7 | CS034402000064 | 川越 明 | 0 | 男性 | 1972-11-27 | 46 | 213-0031 | 神奈川県川崎市高津区宇奈根********** | S14034 | 20170913 | 0-00000000-0 |
8 | CS015303000005 | 寺西 一徳 | 0 | 男性 | 1987-01-16 | 32 | 135-0016 | 東京都江東区東陽********** | S13015 | 20150223 | 0-00000000-0 |
9 | CS029411000007 | 森永 璃子 | 1 | 女性 | 1976-12-01 | 42 | 134-0085 | 東京都江戸川区南葛西********** | S12029 | 20150709 | 5-20101007-9 |
10 | CS048612000001 | 劇団 恵子 | 1 | 女性 | 1956-01-11 | 63 | 224-0053 | 神奈川県横浜市都筑区池辺町********** | S14048 | 20170920 | 0-00000000-0 |
Commentary :
This code selects a random sample of 1% of the rows in thedf_customer
dataframe using thesample_frac()
function from thedplyr
package, and then returns the first 10 rows of the resulting dataframe using thehead()
function.
Thesample_frac()
function is used to randomly sample a fraction of rows from a dataframe. In this case, thetbl
argument specifies the dataframe to sample from, and thesize
argument specifies the fraction of rows to sample (0.01, or 1%). The resulting dataframe is then passed to thehead()
function to return the first 10 rows.
R-076: Extract 10% of the data stratified randomly from the customer data (df_customer) based on the percentage of gender code (gender_cd) and count the number of cases by gender code
set.seed(71) df_customer %>%
group_by(gender_cd) %>%
sample_frac(0.1) %>%
summarise(customer_num = n(), .groups = "drop")
gender_cd | customer_num |
---|---|
<chr> | <int> |
0 | 298 |
1 | 1792 |
9 | 107 |
Commentary :
This code uses the dplyr package to randomly sample 10% of customers from each gender group and count the number of customers in each sample.set.seed(71)
sets a seed for the random number generator to ensure reproducibility of the sampling.df_customer %>%
takes thedf_customer
data frame as input and passes it to the next operation using the pipe operator%>%
.group_by(gender_cd)
groups the data by the gender code column.sample_frac(0.1)
samples 10% of customers from each gender group. Thesample_frac()
function is used to randomly sample a fraction of rows from the grouped data frame. The argument0.1
specifies the fraction to be sampled.summarise(customer_num = n(), .groups = "drop")
calculates the number of customers in each gender group. Then()
function counts the number of rows in each group, and renames the resulting column tocustomer_num
. The.groups
argument is set to"drop"
to remove the grouping information from the output.
The final output will be a data frame with two columns:gender_cd
andcustomer_num
. Thegender_cd
column specifies the gender code, and thecustomer_num
column specifies the number of customers sampled from each gender group.
R-077: Sum the sales amounts in the receipt details data (df_receipt) by customer unit and extract outliers of the summed sales amounts. The outliers should be calculated by logarithmising the total sales amount, calculating the mean and standard deviation, and then deviating from the mean by more than 3σ (either the natural logarithm or the ordinary logarithm is acceptable). Display 10 results.
df_receipt %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop") %>%
mutate(log_sum_amount = log(sum_amount + 0.5)) %>%
filter(abs(log_sum_amount - mean(log_sum_amount)) / sd(log_sum_amount) > 3) %>% slice(1:10)
customer_id | sum_amount | log_sum_amount |
---|---|---|
<chr> | <int> | <dbl> |
ZZ000000000000 | 12395003 | 16.3328 |
Commentary :
This code is processing a data frame calleddf_receipt
and doing the following:
Grouping the data frame bycustomer_id
usinggroup_by
function.
Calculating the sum ofamount
for each customer usingsummarise
function and storing it in a new column calledsum_amount
.
Taking the logarithm ofsum_amount
column and storing it in a new column calledlog_sum_amount
.
Calculating the z-score oflog_sum_amount
by subtracting the mean oflog_sum_amount
from each value and then dividing the result by the standard deviation oflog_sum_amount
.
Filtering out customers whoselog_sum_amount
z-score is greater than 3 (i.e., those whoselog_sum_amount
is more than three standard deviations away from the mean).
Selecting the first 10 rows of the resulting data frame using theslice
function.
In summary, the code is identifying customers whose total spending is significantly higher or lower than the average by calculating their z-score and filtering out those whose z-score exceeds a certain threshold.
R-078: Sum the sales amount (amount) of the receipt details data (df_receipt) in customer units and extract outliers of the summed sales amount. However, exclude those whose customer ID starts with “Z”, as they represent non-members. Outliers are defined as the difference between the first and third quartiles using IQR, and are defined as being below “first quartile – 1.5 x IQR” or above “third quartile + 1.5 x IQR”. Show 10 results.
df_receipt %>%
group_by(customer_id) %>%
filter(!grepl("^Z", customer_id)) %>%
summarise(sum_amount = sum(amount), .groups = "drop") %>%
filter( sum_amount < quantile(sum_amount)[2] - 1.5 * (quantile(sum_amount)[4] - quantile(sum_amount)[2]) | sum_amount > quantile(sum_amount)[4] + 1.5 * (quantile(sum_amount)[4] - quantile(sum_amount)[2]) ) %>%
slice(1:10)
customer_id | sum_amount |
---|---|
<chr> | <int> |
CS001414000048 | 8584 |
CS001605000009 | 18925 |
CS002415000594 | 9568 |
CS004414000181 | 9584 |
CS005415000137 | 8734 |
CS006414000001 | 9156 |
CS006414000029 | 9179 |
CS006415000105 | 10042 |
CS006415000147 | 12723 |
CS006415000157 | 10648 |
Commentary :
This code performs the following operations:
Group thedf_receipt
data frame bycustomer_id
.
Filter out all the rows where thecustomer_id
starts with the letter "Z".
Sum theamount
for eachcustomer_id
.
Filter out the top and bottom 1.5% of thesum_amount
values using the interquartile range (IQR) method.
Select the first 10 rows of the resulting data frame.
The IQR method is a way to detect and remove outliers from a dataset based on the quartiles. In this case, the upper and lower bounds are calculated as follows:
Upper bound:quantile(sum_amount)[4] + 1.5 * (quantile(sum_amount)[4] - quantile(sum_amount)[2])
Lower bound:quantile(sum_amount)[2] - 1.5 * (quantile(sum_amount)[4] - quantile(sum_amount)[2])
Anysum_amount
values that fall outside of these bounds are filtered out. The resulting data frame contains thecustomer_id
andsum_amount
for the first 10 rows that meet the filter condition.
R-079: For each item of product data (df_product), check the number of missing items.
sapply(df_product, function(x) sum(is.na(x)))
- product_cd
- 0
- category_major_cd
- 0
- category_medium_cd
- 0
- category_small_cd
- 0
- unit_price
- 7
- unit_cost
- 7
Commentary :
The code computes the number of missing values in each column of thedf_product
dataframe using thesapply()
function.sapply()
applies a function to each column of the dataframedf_product
. Here, the function used isfunction(x) sum(is.na(x))
, which computes the number of missing values (NA) in the input columnx
.is.na()
returns a logical vector of the same length as the input vector withTRUE
for any missing value (NA) andFALSE
otherwise.sum()
is used to sum up theTRUE
values in the logical vector, which gives the count of missing values in the column.
The output of the code is a named vector with column names ofdf_product
as names and the corresponding counts of missing values as values.
R-080: Create a new product data, deleting all records where any item in the product data (df_product) is missing. Note that the number of items before and after the deletion should be displayed, and also confirm that the number of items has decreased by the number of items checked in 079.
df_product_1 <- na.omit(df_product)
paste("Before deletion:", nrow(df_product))
paste("After deletion:", nrow(df_product_1))
'Before deletion: 10030'
'After deletion: 10023'
Commentary :
The first line of code creates a new data framedf_product_1
that removes all rows with missing values (NA) from the original data framedf_product
, using thena.omit()
function.
The second line of code uses thepaste()
function to print a message indicating the number of rows indf_product
before and after the removal of NAs. Specifically,nrow(df_product)
returns the number of rows indf_product
, and the firstpaste()
call concatenates this number with the string "Before deletion:". Similarly,nrow(df_product_1)
returns the number of rows in the new data framedf_product_1
, and the secondpaste()
call concatenates this number with the string "After deletion:".
This code provides a quick way to check the effect of removing NAs on the number of observations in the data frame.
Comment