Reference: "Data Science 100 Knocks (Structured Data Processing)" by the Data Scientist Society Skill Definition Committee (データサイエンティスト協会スキル定義委員会)
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by the Data Scientist Society (データサイエンティスト協会スキル定義委員会) and translated from Japanese to English with DeepL.
-
I updated this file to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers.
-
Since this data was created for Japanese users, you may run into language issues while practicing, but they will not affect the exercises much.
You can get the preprocess_knock files for practicing Python, SQL, and R from my GitHub account.
Introduction
- To begin with, run the following cells
- Import the required libraries and read the data from the database (PostgreSQL)
- The libraries that are expected to be used are imported in the following cells
- If there are other libraries you wish to use, install them as appropriate using install.packages().
- Names, addresses, etc. are dummy data and are not real.
require("RPostgreSQL")
require("tidyr")
require("dplyr")
require("stringr")
require("caret")
require("lubridate")
require("rsample")
require("recipes")
require("themis")
host <- "db"
port <- Sys.getenv()["PG_PORT"]
dbname <- Sys.getenv()["PG_DATABASE"]
user <- Sys.getenv()["PG_USER"]
password <- Sys.getenv()["PG_PASSWORD"]
con <- dbConnect(PostgreSQL(), host=host, port=port, dbname=dbname, user=user, password=password)
df_customer <- dbGetQuery(con,"SELECT * FROM customer")
df_category <- dbGetQuery(con,"SELECT * FROM category")
df_product <- dbGetQuery(con,"SELECT * FROM product")
df_receipt <- dbGetQuery(con,"SELECT * FROM receipt")
df_store <- dbGetQuery(con,"SELECT * FROM store")
df_geocode <- dbGetQuery(con,"SELECT * FROM geocode")
Practice Questions
R-081: Create new product data in which the missing values of unit price (unit_price) and unit cost (unit_cost) are imputed with the mean of each column. Round the means to the nearest yen (rounding half up or rounding half to even are both acceptable). Afterwards, confirm that no column contains missing values.
price_mean <- round(mean(df_product$unit_price, na.rm = TRUE))
cost_mean <- round(mean(df_product$unit_cost, na.rm = TRUE))
df_product_2 <- df_product %>%
replace_na(list(unit_price = price_mean, unit_cost = cost_mean))
sapply(df_product_2, function(x) sum(is.na(x)))
- product_cd: 0
- category_major_cd: 0
- category_medium_cd: 0
- category_small_cd: 0
- unit_price: 0
- unit_cost: 0
Commentary :
This code performs the following actions:

1. Calculates the mean of the unit_price and unit_cost columns of df_product, rounding the results to the nearest integer.

price_mean <- round(mean(df_product$unit_price, na.rm = TRUE))
cost_mean <- round(mean(df_product$unit_cost, na.rm = TRUE))

The `round()` function rounds the means to the nearest integers, and `na.rm = TRUE` excludes missing values from the calculation.

2. Replaces missing values in the unit_price and unit_cost columns of df_product with the respective means.

df_product_2 <- df_product %>%
    replace_na(list(unit_price = price_mean, unit_cost = cost_mean))

`replace_na()` from the tidyr package substitutes the precomputed means for the missing values. The `%>%` (pipe) operator chains the steps together, feeding the output of one step into the next.

3. Uses `sapply()` to count the missing values in each column of the modified df_product_2.

sapply(df_product_2, function(x) sum(is.na(x)))

`sapply()` applies the function `function(x) sum(is.na(x))` to each column of df_product_2; `sum(is.na(x))` counts the NA values in column x.
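The same imputation can also be written with `dplyr::coalesce()`, which returns the first non-missing value at each position. A minimal sketch on toy data (the `price`/`cost` columns are stand-ins for unit_price/unit_cost, not the exercise dataset):

```r
require("dplyr")

# Toy stand-in for df_product with missing values
df_toy <- data.frame(price = c(100, NA, 300), cost = c(NA, 50, 70))

price_mean <- round(mean(df_toy$price, na.rm = TRUE))  # 200
cost_mean  <- round(mean(df_toy$cost,  na.rm = TRUE))  # 60

# coalesce() keeps the original value where present, else falls back to the mean
df_toy2 <- df_toy %>%
    mutate(price = coalesce(price, price_mean),
           cost  = coalesce(cost,  cost_mean))

sum(is.na(df_toy2))  # 0
```

`coalesce()` recycles the scalar fallback across the column, so it behaves exactly like `replace_na()` here; which to use is a matter of taste.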
R-082: Create new product data in which the missing values of unit price (unit_price) and unit cost (unit_cost) are imputed with the median of each column. Round the medians to the nearest yen (rounding half up or rounding half to even are both acceptable). Afterwards, confirm that no column contains missing values.
price_median <- round(median(df_product$unit_price, na.rm = TRUE))
cost_median <- round(median(df_product$unit_cost, na.rm = TRUE))
df_product_3 <- df_product %>%
replace_na(list(unit_price = price_median, unit_cost = cost_median))
sapply(df_product_3, function(x) sum(is.na(x)))
- product_cd: 0
- category_major_cd: 0
- category_medium_cd: 0
- category_small_cd: 0
- unit_price: 0
- unit_cost: 0
Commentary :
This code performs the following actions:

1. Calculates the median of the unit_price and unit_cost columns of df_product, rounding the results to the nearest integer.

price_median <- round(median(df_product$unit_price, na.rm = TRUE))
cost_median <- round(median(df_product$unit_cost, na.rm = TRUE))

The `median()` function computes the medians of the two columns, `round()` rounds them to the nearest integers, and `na.rm = TRUE` excludes missing values from the calculation.

2. Replaces missing values in the unit_price and unit_cost columns of df_product with the respective medians.

df_product_3 <- df_product %>%
    replace_na(list(unit_price = price_median, unit_cost = cost_median))

`replace_na()` from the tidyr package substitutes the precomputed medians, and the `%>%` (pipe) operator chains the steps together.

3. Uses `sapply()` to count the missing values in each column of the modified df_product_3.

sapply(df_product_3, function(x) sum(is.na(x)))

`sapply()` applies `function(x) sum(is.na(x))` to each column of df_product_3; `sum(is.na(x))` counts the NA values in column x.
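With dplyr 1.0 or later, both columns can be imputed in one step using `across()`, which applies the same lambda to each listed column. A sketch on a toy frame standing in for df_product:

```r
require("dplyr")

# Toy data with one missing value per column
df_toy <- data.frame(unit_price = c(200, NA, 160), unit_cost = c(NA, 110, 130))

# Impute each listed column with its own rounded median
df_toy2 <- df_toy %>%
    mutate(across(c(unit_price, unit_cost),
                  ~ ifelse(is.na(.x), round(median(.x, na.rm = TRUE)), .x)))
```

Inside `across()`, `.x` refers to the current column, so each column gets its own median without naming it twice.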
R-083: Create new product data in which the missing values of unit price (unit_price) and unit cost (unit_cost) are imputed with the median computed for each product sub-category code (category_small_cd). Round the medians to the nearest yen (rounding half up or rounding half to even are both acceptable). Afterwards, confirm that no column contains missing values.
df_product_4 <- df_product %>%
    group_by(category_small_cd) %>%
    summarise(price_median = round(median(unit_price, na.rm = TRUE)),
              cost_median = round(median(unit_cost, na.rm = TRUE)),
              .groups = "drop") %>%
    inner_join(df_product, by = "category_small_cd") %>%
    mutate(unit_price = ifelse(is.na(unit_price), price_median, unit_price),
           unit_cost = ifelse(is.na(unit_cost), cost_median, unit_cost))

sapply(df_product_4, function(x) sum(is.na(x)))
- category_small_cd: 0
- price_median: 0
- cost_median: 0
- product_cd: 0
- category_major_cd: 0
- category_medium_cd: 0
- unit_price: 0
- unit_cost: 0
Commentary :
This code performs the following actions:

1. Groups df_product by the category_small_cd column.

df_product_4 <- df_product %>%
    group_by(category_small_cd)

The `%>%` operator chains the steps together, and `group_by()` from the dplyr package groups the data frame by category_small_cd.

2. Calculates the rounded median of unit_price and unit_cost for each group and summarises the results into a new data frame.

summarise(price_median = round(median(unit_price, na.rm = TRUE)), cost_median = round(median(unit_cost, na.rm = TRUE)), .groups = "drop")

`summarise()` computes the per-group medians, `round()` rounds them to the nearest integers, `na.rm = TRUE` excludes missing values, and `.groups = "drop"` removes the grouping information from the output.

3. Joins the summarised medians back onto the original df_product by category_small_cd.

inner_join(df_product, by = "category_small_cd")

`inner_join()` from dplyr matches rows on category_small_cd; the resulting data frame has the columns of both inputs.

4. Replaces missing unit_price and unit_cost values with the median of the corresponding category_small_cd group.

mutate(unit_price = ifelse(is.na(unit_price), price_median, unit_price), unit_cost = ifelse(is.na(unit_cost), cost_median, unit_cost))

`mutate()` rewrites the two columns, and `ifelse()` substitutes the group median wherever a value is NA.

5. Uses `sapply()` to count the missing values in each column of df_product_4.

sapply(df_product_4, function(x) sum(is.na(x)))

`sum(is.na(x))` counts the NA values in each column x of df_product_4.
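The summarise-then-join pattern above can be collapsed into a single grouped `mutate()`: inside `group_by()`, `median()` is evaluated per group, so no intermediate median table or join is needed. A sketch on toy data (the `cat` column stands in for category_small_cd):

```r
require("dplyr")

# Toy data: one missing price in each of two categories
df_toy <- data.frame(cat = c("A", "A", "B", "B"),
                     unit_price = c(100, NA, 300, NA))

# Per-group imputation in one grouped mutate()
df_toy2 <- df_toy %>%
    group_by(cat) %>%
    mutate(unit_price = ifelse(is.na(unit_price),
                               round(median(unit_price, na.rm = TRUE)),
                               unit_price)) %>%
    ungroup()
```

The join-based version in the answer has the advantage of making the per-category medians available as explicit columns; this version keeps the original column layout untouched.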
R-084: For all customers in the customer data (df_customer), calculate the ratio of the 2019 sales amount to the sales amount over the whole period. Treat customers with no sales record as 0. Then extract the customers whose calculated ratio exceeds 0 and display 10 results. Also confirm that the created data contains no missing values.
df_receipt_2019 <- df_receipt %>%
    filter(20190101 <= sales_ymd & sales_ymd <= 20191231) %>%
    group_by(customer_id) %>%
    summarise(amount_2019 = sum(amount), .groups = "drop")
df_receipt_all <- df_receipt %>%
group_by(customer_id) %>%
summarise(amount_all = sum(amount), .groups = "drop")
df_sales_rate <- left_join(df_customer["customer_id"], df_receipt_2019, by = "customer_id") %>%
    left_join(df_receipt_all, by = "customer_id") %>%
    replace_na(list(amount_2019 = 0, amount_all = 0)) %>%
    mutate(amount_rate = ifelse(amount_all == 0, 0, amount_2019 / amount_all))

df_sales_rate %>%
    filter(amount_rate > 0) %>%
    slice(1:10)
customer_id | amount_2019 | amount_all | amount_rate |
---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> |
CS031415000172 | 2971 | 5088 | 0.58392296 |
CS015414000103 | 874 | 3122 | 0.27994875 |
CS011215000048 | 248 | 3444 | 0.07200929 |
CS029415000023 | 3767 | 5167 | 0.72904974 |
CS035415000029 | 5823 | 7504 | 0.77598614 |
CS023513000066 | 208 | 771 | 0.26977951 |
CS035513000134 | 463 | 1565 | 0.29584665 |
CS001515000263 | 216 | 216 | 1.00000000 |
CS006415000279 | 229 | 229 | 1.00000000 |
CS031415000106 | 215 | 7741 | 0.02777419 |
sapply(df_sales_rate, function(x) sum(is.na(x)))

- customer_id: 0
- amount_2019: 0
- amount_all: 0
- amount_rate: 0
Commentary :
This code works on the data frame df_receipt, which contains sales transactions. Here is a breakdown of what each step does:

df_receipt_2019 <- df_receipt %>%
    filter(20190101 <= sales_ymd & sales_ymd <= 20191231) %>%
    group_by(customer_id) %>%
    summarise(amount_2019 = sum(amount), .groups = "drop")

This creates df_receipt_2019 by filtering df_receipt to transactions between January 1, 2019 and December 31, 2019, grouping by customer_id, and summing each customer's spending over that period into a new column called amount_2019.

df_receipt_all <- df_receipt %>%
    group_by(customer_id) %>%
    summarise(amount_all = sum(amount), .groups = "drop")

This creates df_receipt_all by grouping df_receipt by customer_id and summing each customer's spending across all transactions into a new column called amount_all.

df_sales_rate <- left_join(df_customer["customer_id"], df_receipt_2019, by = "customer_id") %>%
    left_join(df_receipt_all, by = "customer_id") %>%
    replace_na(list(amount_2019 = 0, amount_all = 0)) %>%
    mutate(amount_rate = ifelse(amount_all == 0, 0, amount_2019 / amount_all))

This creates df_sales_rate by left-joining the customer IDs from df_customer with df_receipt_2019 and df_receipt_all, so that every customer is kept even without sales. Missing values in amount_2019 and amount_all are replaced with 0, and amount_rate is computed as amount_2019 / amount_all, with the ifelse() guard setting the rate to 0 when amount_all is 0 to avoid division by zero.

df_sales_rate %>% filter(amount_rate > 0) %>% slice(1:10)

Finally, filter(amount_rate > 0) keeps the customers who spent money in 2019, and slice(1:10) displays the first 10 of them (the first 10 rows, not the 10 highest rates).

Overall, this code computes, for every customer, the share of their total spending that occurred in 2019, which could be useful for targeted marketing or customer retention efforts.
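The join-and-default pattern, left-joining all customers, filling missing totals with 0, and guarding the division, can be reproduced end to end on toy data (toy IDs C1..C3 stand in for the real customer table):

```r
require("dplyr")
require("tidyr")

# Toy tables: C1 bought in 2019 and overall, C2 only before 2019, C3 never
customers  <- data.frame(customer_id = c("C1", "C2", "C3"))
sales_2019 <- data.frame(customer_id = "C1", amount_2019 = 50)
sales_all  <- data.frame(customer_id = c("C1", "C2"), amount_all = c(200, 80))

rate <- customers %>%
    left_join(sales_2019, by = "customer_id") %>%
    left_join(sales_all, by = "customer_id") %>%
    replace_na(list(amount_2019 = 0, amount_all = 0)) %>%           # no sales -> 0
    mutate(amount_rate = ifelse(amount_all == 0, 0, amount_2019 / amount_all))
```

C1 gets 50 / 200 = 0.25, C2 gets 0 / 80 = 0, and C3 hits the `amount_all == 0` guard, which is exactly why the guard exists: without it, C3's rate would be 0/0 = NaN.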
R-085: Create new customer data by joining the geocode data (df_geocode) to all customers in the customer data (df_customer) on postcode (postal_cd). If multiple longitude (longitude) and latitude (latitude) values are associated with a single postcode, use their averages. Display 10 results to check the output.
df_geocode_1 <- df_geocode[c("postal_cd", "longitude" ,"latitude")] %>% group_by(postal_cd) %>%
summarise(m_longiture = mean(longitude), m_latitude = mean(latitude), .groups = "drop")
df_customer_1 <- inner_join(df_customer, df_geocode_1, by = "postal_cd")
head(df_customer_1, 10)
customer_id | customer_name | gender_cd | gender | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | m_longiture | m_latitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <date> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | |
1 | CS021313000114 | 大野 あや子 | 1 | 女性 | 1981-04-29 | 37 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20150905 | 0-00000000-0 | 139.3178 | 35.41358 |
2 | CS037613000071 | 六角 雅彦 | 9 | 不明 | 1952-04-01 | 66 | 136-0076 | 東京都江東区南砂********** | S13037 | 20150414 | 0-00000000-0 | 139.8350 | 35.67193 |
3 | CS031415000172 | 宇多田 貴美子 | 1 | 女性 | 1976-10-04 | 42 | 151-0053 | 東京都渋谷区代々木********** | S13031 | 20150529 | D-20100325-C | 139.6897 | 35.67374 |
4 | CS028811000001 | 堀井 かおり | 1 | 女性 | 1933-03-27 | 86 | 245-0016 | 神奈川県横浜市泉区和泉町********** | S14028 | 20160115 | 0-00000000-0 | 139.4836 | 35.39125 |
5 | CS001215000145 | 田崎 美紀 | 1 | 女性 | 1995-03-29 | 24 | 144-0055 | 東京都大田区仲六郷********** | S13001 | 20170605 | 6-20090929-2 | 139.7078 | 35.54084 |
6 | CS020401000016 | 宮下 達士 | 0 | 男性 | 1974-09-15 | 44 | 174-0065 | 東京都板橋区若木********** | S13020 | 20150225 | 0-00000000-0 | 139.6724 | 35.77073 |
7 | CS015414000103 | 奥野 陽子 | 1 | 女性 | 1977-08-09 | 41 | 136-0073 | 東京都江東区北砂********** | S13015 | 20150722 | B-20100609-B | 139.8360 | 35.67818 |
8 | CS029403000008 | 釈 人志 | 0 | 男性 | 1973-08-17 | 45 | 279-0003 | 千葉県浦安市海楽********** | S12029 | 20150515 | 0-00000000-0 | 139.9047 | 35.65422 |
9 | CS015804000004 | 松谷 米蔵 | 0 | 男性 | 1931-05-02 | 87 | 136-0073 | 東京都江東区北砂********** | S13015 | 20150607 | 0-00000000-0 | 139.8360 | 35.67818 |
10 | CS033513000180 | 安斎 遥 | 1 | 女性 | 1962-07-11 | 56 | 241-0823 | 神奈川県横浜市旭区善部町********** | S14033 | 20150728 | 6-20080506-5 | 139.5146 | 35.45013 |
Commentary :
This code performs the following tasks:

It creates a new data frame called df_geocode_1 by selecting three columns from the df_geocode data frame (postal_cd, longitude, and latitude).

df_geocode_1 is grouped by postal_cd, and the mean longitude and latitude of each group are computed with the summarise() function. The resulting data frame contains one row per unique postal_cd, with columns postal_cd, m_longiture, and m_latitude.

The df_customer data frame is joined with df_geocode_1 using the inner_join() function, which matches rows on the common postal_cd column. The resulting data frame contains all columns from both data frames.

The head() function displays the first 10 rows of the result (df_customer_1), showing each customer together with the averaged coordinates of their postcode.
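The per-postcode averaging can equivalently be written with `across()`, which applies `mean()` to both coordinate columns at once. A sketch on toy geocode data, where postcode "100" has two coordinate pairs that get averaged:

```r
require("dplyr")

# Toy geocode table: postcode "100" appears twice with different coordinates
geo <- data.frame(postal_cd = c("100", "100", "200"),
                  longitude = c(139.0, 141.0, 135.0),
                  latitude  = c(35.0, 37.0, 34.0))

# One row per postcode; each coordinate column is averaged
geo_avg <- geo %>%
    group_by(postal_cd) %>%
    summarise(across(c(longitude, latitude), mean), .groups = "drop")
```

Unlike the answer's version, this keeps the original column names (longitude, latitude), which avoids the later join producing differently named coordinate columns.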
R-086: Join the customer data with latitude (latitude) and longitude (longitude) created in 085 with the store data (df_store), using the member application store code (application_store_cd) as the key. Using the store's latitude and longitude and the customer's latitude and longitude, calculate the distance (in km) between the application store and the customer's address, and display it together with the customer ID (customer_id), customer address, and store address (address). Use the simple formula below; libraries implementing more accurate methods are also acceptable. Display 10 results.
Latitude (radians): φ, longitude (radians): λ
Distance L = 6371 × arccos(sin φ1 · sin φ2 + cos φ1 · cos φ2 · cos(λ1 − λ2))
calc_distance <- function(x1, y1, x2, y2) {
    distance <- 6371 * acos(
        sin(x1 * pi / 180) * sin(x2 * pi / 180) +
            cos(x1 * pi / 180) * cos(x2 * pi / 180) * cos((y1 * pi / 180) - (y2 * pi / 180))
    )
    print(class(distance)) # Check the class of the calculation results just to be sure.
    return(distance)
}
inner_join(df_customer_1, df_store, by = c("application_store_cd" = "store_cd")) %>%
    rename(customer_address = address.x, store_address = address.y) %>%
    mutate(distance = calc_distance(m_latitude, m_longiture, latitude, longitude)) %>%
    select(customer_id, customer_address, store_address, distance) %>%
    slice(1:10)
[1] "numeric"
customer_id | customer_address | store_address | distance |
---|---|---|---|
<chr> | <chr> | <chr> | <dbl> |
CS021313000114 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.3944087 |
CS037613000071 | 東京都江東区南砂********** | 東京都江東区南砂一丁目 | 1.4511822 |
CS031415000172 | 東京都渋谷区代々木********** | 東京都渋谷区初台二丁目 | 0.4117335 |
CS028811000001 | 神奈川県横浜市泉区和泉町********** | 神奈川県横浜市瀬谷区二ツ橋町 | 8.0651960 |
CS001215000145 | 東京都大田区仲六郷********** | 東京都大田区仲六郷二丁目 | 1.2684210 |
CS020401000016 | 東京都板橋区若木********** | 東京都北区十条仲原三丁目 | 4.1859046 |
CS015414000103 | 東京都江東区北砂********** | 東京都江東区南砂二丁目 | 1.4496734 |
CS029403000008 | 千葉県浦安市海楽********** | 千葉県浦安市東野一丁目 | 0.8048581 |
CS015804000004 | 東京都江東区北砂********** | 東京都江東区南砂二丁目 | 1.4496734 |
CS033513000180 | 神奈川県横浜市旭区善部町********** | 神奈川県横浜市瀬谷区阿久和西一丁目 | 1.9569470 |
Commentary :
This code performs the following tasks:

It defines a custom function called calc_distance() taking four parameters x1, y1, x2, and y2 (the latitude and longitude of two points, in degrees). The function converts the coordinates to radians and computes the great-circle distance using the spherical law of cosines given in the question, returning the result in kilometres.

The df_customer_1 data frame is joined with the df_store data frame using the inner_join() function. The join matches the application_store_cd column from df_customer_1 against the store_cd column from df_store.

The rename() function gives the two address columns distinct names (customer_address and store_address), since both input tables have an address column.

A new column called distance is added with the mutate() function, calling calc_distance() with the customer's coordinates (m_latitude and m_longiture from df_customer_1) and the store's coordinates (latitude and longitude from df_store).

The select() function keeps only the customer_id, customer_address, store_address, and distance columns.

Finally, the slice() function displays the first 10 rows of the result: each customer, their address, the store address, and the distance between them.
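The formula used in the answer is the spherical law of cosines. The haversine formula is an alternative that is numerically better behaved at very short distances, where the law of cosines takes `acos()` of a value very close to 1. A base-R sketch of both, on toy coordinates (approximate locations of Tokyo Station and Shinjuku Station, assumed for illustration):

```r
deg2rad <- function(d) d * pi / 180

# Spherical law of cosines (as in the answer)
dist_slc <- function(lat1, lon1, lat2, lon2) {
    6371 * acos(sin(deg2rad(lat1)) * sin(deg2rad(lat2)) +
                cos(deg2rad(lat1)) * cos(deg2rad(lat2)) *
                cos(deg2rad(lon1) - deg2rad(lon2)))
}

# Haversine formula
dist_hav <- function(lat1, lon1, lat2, lon2) {
    dlat <- deg2rad(lat2 - lat1)
    dlon <- deg2rad(lon2 - lon1)
    a <- sin(dlat / 2)^2 +
         cos(deg2rad(lat1)) * cos(deg2rad(lat2)) * sin(dlon / 2)^2
    6371 * 2 * asin(sqrt(a))
}

# The two agree closely at city scale
dist_slc(35.681, 139.767, 35.690, 139.700)
dist_hav(35.681, 139.767, 35.690, 139.700)
```

Both are vectorised, so either could be dropped into the `mutate()` call in place of calc_distance().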
R-087: In the customer data (df_customer), the same customer may be registered more than once, e.g. because of applications at different shops. Create deduplicated customer data in which customers with the same name (customer_name) and postcode (postal_cd) are treated as the same customer, so that there is one record per customer, and report the number of customer records, the number of deduplicated records, and the number of duplicates. When merging, keep the record with the highest total sales amount; for customers with the same total sales amount or no sales record, keep the record with the lowest customer ID (customer_id).
df_sales_amount <- df_receipt %>%
    group_by(customer_id) %>%
    summarise(sum_amount = sum(amount), .groups = "drop")

df_customer_u <- left_join(df_customer, df_sales_amount, by = "customer_id") %>%
    mutate(sum_amount = ifelse(is.na(sum_amount), 0, sum_amount)) %>%
    arrange(desc(sum_amount), customer_id) %>%
    distinct(customer_name, postal_cd, .keep_all = TRUE)

print(paste("df_customer_cnt:", nrow(df_customer),
            "df_customer_u_cnt:", nrow(df_customer_u),
            "diff:", nrow(df_customer) - nrow(df_customer_u)))
[1] "df_customer_cnt: 21971 df_customer_u_cnt: 21941 diff: 30"
Commentary :
This code performs the following steps:

1. Calculate the total sales amount for each customer in df_receipt and store the result in df_sales_amount, using the group_by() and summarise() functions.

df_sales_amount <- df_receipt %>%
    group_by(customer_id) %>%
    summarise(sum_amount = sum(amount), .groups = "drop")

2. Left-join df_customer with df_sales_amount on customer_id, replace any NA values in sum_amount with 0 via mutate(), sort by descending sum_amount and then ascending customer_id, and keep only the first row for each (customer_name, postal_cd) pair with distinct(..., .keep_all = TRUE). Because of the sort order, the retained record is the one with the highest total sales, with ties and no-sales customers resolved by the lowest customer ID.

df_customer_u <- left_join(df_customer, df_sales_amount, by = "customer_id") %>%
    mutate(sum_amount = ifelse(is.na(sum_amount), 0, sum_amount)) %>%
    arrange(desc(sum_amount), customer_id) %>%
    distinct(customer_name, postal_cd, .keep_all = TRUE)

3. Print the row counts of df_customer and df_customer_u and their difference, to see how many duplicate records were removed.

print(paste("df_customer_cnt:", nrow(df_customer),
            "df_customer_u_cnt:", nrow(df_customer_u),
            "diff:", nrow(df_customer) - nrow(df_customer_u)))
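The key trick, arranging so that the record to keep sorts first and then calling `distinct()` with `.keep_all = TRUE`, keeps the first row per key. It can be checked on toy duplicates (toy IDs and names, assumed for illustration):

```r
require("dplyr")

# Two records for "Sato" (a duplicate), one for "Tanaka"
df_toy <- data.frame(customer_id = c("C2", "C1", "C3"),
                     name = c("Sato", "Sato", "Tanaka"),
                     amount = c(500, 900, 300))

# Keep the duplicate with the highest amount; ties would fall back to lowest id
df_u <- df_toy %>%
    arrange(desc(amount), customer_id) %>%
    distinct(name, .keep_all = TRUE)
```

After the arrange, Sato's C1 row (amount 900) sorts ahead of C2 (amount 500), so `distinct()` retains C1, mirroring the "highest sales wins" rule of the exercise.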
R-088: Based on the data created in 087, assign an integrated ID to the customer data according to the following specification:
- Non-duplicated customers: their own customer ID (customer_id)
- Duplicated customers: the customer ID of the record retained in the previous exercise
Also check the difference between the number of unique customer IDs and the number of unique integrated IDs.
df_customer_n = inner_join(df_customer,
df_customer_u[c("customer_id","customer_name","postal_cd")],
by = c("customer_name", "postal_cd")) %>%
rename(customer_id = customer_id.x,
integration_id = customer_id.y)
customer_id_cnt <- length(unique(df_customer_n$customer_id))
integration_id_cnt <- length(unique(df_customer_n$integration_id))
print(paste("Difference in number of IDs", customer_id_cnt - integration_id_cnt))
[1] "Difference in number of IDs 30"
Commentary :
This code performs the following operations:

It inner-joins df_customer with the customer_id, customer_name, and postal_cd columns of df_customer_u, using customer_name and postal_cd as the join keys. The result is assigned to df_customer_n.

Because both inputs contain a customer_id column, the join produces customer_id.x (the customer's own ID) and customer_id.y (the ID of the record retained in 087). The rename() call turns customer_id.x into customer_id and customer_id.y into integration_id.

The number of unique values in the customer_id and integration_id columns of df_customer_n is then counted, and the difference is printed.

Since duplicated customers share the integration_id of their retained record, the difference between the two counts (30) equals the number of duplicate records identified in 087.
R-089: You want to split the customers with sales records into training data and test data for building a prediction model. Randomly split the data in a ratio of 8:2.
set.seed(71)
#Example using rsample initial_split
df_sales_customer <- df_receipt %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop") %>%
filter(sum_amount > 0)
df_tmp <- inner_join(df_customer, df_sales_customer, by = "customer_id")
split <- initial_split(df_tmp, prop = 0.8)
df_customer_train <- training(split)
df_customer_test <- testing(split)
# Confirmation
print(paste("Percentage of training data:", nrow(df_customer_train) / nrow(df_tmp)))
print(paste("Test Data Percentage:", nrow(df_customer_test) / nrow(df_tmp)))
[1] "Percentage of training data: 0.799903684083795"
[1] "Test Data Percentage: 0.200096315916205"
Commentary :
This code splits a dataset into training and testing sets using the rsample package.

set.seed(71) fixes the random seed so that the split is reproducible.

df_sales_customer is created by grouping df_receipt by customer_id, summing amount per customer, and keeping only the customers with a positive total, i.e. customers with sales records.

df_tmp is created by joining df_customer and df_sales_customer on customer_id.

split is created with the initial_split() function from the rsample package. The prop parameter is set to 0.8, which means the dataset is split into 80% training and 20% testing data.

df_customer_train and df_customer_test, the training and testing sets, are extracted from split with training() and testing().

Finally, print() statements confirm the proportions of training and testing data.
R-090: The receipt details data (df_receipt) covers 1 Jan 2017 to 31 Oct 2019. Aggregate the sales amount (amount) by month, and create three sets of data for building time-series models, each with 12 months of training data and 6 months of test data.
#Example using caret's createTimeSlices.
df_ts_amount <- df_receipt %>%
    group_by(substr(as.character(sales_ymd), 1, 6)) %>%
    summarise(sum_amount = sum(amount), .groups = "drop")
colnames(df_ts_amount) <- c("sales_ym", "sales_amount")
timeSlice <- createTimeSlices(df_ts_amount$sales_ym, initialWindow = 12, horizon = 6, skip = 5, fixedWindow = TRUE)
df_train_1 <- df_ts_amount[timeSlice$train[[1]],]
df_train_2 <- df_ts_amount[timeSlice$train[[2]],]
df_train_3 <- df_ts_amount[timeSlice$train[[3]],]
df_test_1 <- df_ts_amount[timeSlice$test[[1]],]
df_test_2 <- df_ts_amount[timeSlice$test[[2]],]
df_test_3 <- df_ts_amount[timeSlice$test[[3]],]
# df_train_2 and df_train_3 display omitted
df_train_1
sales_ym | sales_amount |
---|---|
<chr> | <int> |
201701 | 902056 |
201702 | 764413 |
201703 | 962945 |
201704 | 847566 |
201705 | 884010 |
201706 | 894242 |
201707 | 959205 |
201708 | 954836 |
201709 | 902037 |
201710 | 905739 |
201711 | 932157 |
201712 | 939654 |
# df_test_2 and df_test_3 display omitted
df_test_1
sales_ym | sales_amount |
---|---|
<chr> | <int> |
201801 | 944509 |
201802 | 864128 |
201803 | 946588 |
201804 | 937099 |
201805 | 1004438 |
201806 | 1012329 |
Commentary :
This code is an example of how to create time-based cross-validation folds using the createTimeSlices function from the caret package.

First, the sales amount is summarised by month for the whole dataset using the group_by and summarise functions, and the resulting data frame is renamed to have more meaningful column names.

The createTimeSlices function then splits the vector of time points (here, the sales months) into a set of training and testing indices for each fold. The initialWindow parameter sets the size of the training window (12 months), the horizon parameter sets the size of the testing window (6 months), the skip parameter sets how many candidate slices to skip between folds (skip = 5 starts a new fold every 6 months), and fixedWindow = TRUE keeps the training window at a fixed size rather than letting it grow.

Finally, the df_train and df_test data frames are built for each fold from the generated indices. These data frames can be used for time-based cross-validation of a model.
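The window arithmetic can be checked on a plain sequence: 34 months (January 2017 through October 2019), initialWindow = 12, horizon = 6, and skip = 5 yield exactly three folds starting 6 months apart. This base-R sketch reimplements the assumed fixed-window semantics of caret's createTimeSlices, so it runs without caret:

```r
# Fixed-window time slicing: each fold is `initial` training points
# followed by `horizon` test points; a new fold starts every (skip + 1) points.
slices <- function(n, initial, horizon, skip) {
    starts <- seq(1, n - initial - horizon + 1, by = skip + 1)
    lapply(starts, function(s)
        list(train = s:(s + initial - 1),
             test  = (s + initial):(s + initial + horizon - 1)))
}

s <- slices(34, initial = 12, horizon = 6, skip = 5)
length(s)      # 3 folds
s[[1]]$train   # months 1..12  (2017-01 .. 2017-12)
s[[1]]$test    # months 13..18 (2018-01 .. 2018-06)
```

Fold starts land at months 1, 7, and 13, so the third fold trains on months 13-24 and tests on 25-30, matching the three train/test pairs built in the answer.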
R-091: Under-sample the customer data (df_customer) so that customers with sales records and customers without sales records appear in a 1:1 ratio.
#Example using recipes package.
df_sales_amount <- df_receipt %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop") %>%
right_join(df_customer, by = "customer_id") %>%
mutate(is_buy_flag = factor(ifelse(is.na(sum_amount), 0, 1)))
df_down_sampling <- df_sales_amount %>%
recipe() %>%
step_downsample(is_buy_flag, seed = 71) %>%
prep() %>% juice()
df_down_sampling %>%
group_by(is_buy_flag) %>%
summarise(cnt = n(), .groups = "drop")
is_buy_flag | cnt |
---|---|
<fct> | <int> |
0 | 8306 |
1 | 8306 |
Commentary :
The code performs downsampling using the `recipes` package (with `step_downsample()` from the `themis` package) in R.
First, the total sales amount for each customer is calculated and joined with the customer information. A binary factor called `is_buy_flag` is then created to indicate whether a customer has made a purchase, based on whether a sales amount is present.
Next, the `recipe()` function creates a recipe object. The `step_downsample()` function is then used to downsample the majority class (customers who did not make a purchase) to match the number of observations in the minority class (customers who made a purchase), using a random seed of 71.
The `prep()` function prepares the recipe, and the `juice()` function extracts the processed dataset. Finally, the result is verified by grouping the data by `is_buy_flag` and counting the number of observations in each group.
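For comparison, the same 1:1 under-sampling can be sketched with plain dplyr verbs instead of a recipe (this assumes dplyr >= 1.0 for `slice_sample()`; `df_down_sampling2` is an illustrative name):

```r
set.seed(71)

# size of the smaller class
n_min <- min(table(df_sales_amount$is_buy_flag))

df_down_sampling2 <- df_sales_amount %>%
  group_by(is_buy_flag) %>%
  slice_sample(n = n_min) %>%  # draw n_min rows from each class
  ungroup()
```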
R-092: Normalise to third normal form for gender in customer data (df_customer).
df_gender_std = unique(df_customer[c("gender_cd", "gender")])
df_customer_std = df_customer[, colnames(df_customer) != "gender"]
# Check data content
head(df_customer_std, n = 3)
customer_id | customer_name | gender_cd | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | |
---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <date> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | |
1 | CS021313000114 | 大野 あや子 | 1 | 1981-04-29 | 37 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20150905 | 0-00000000-0 |
2 | CS037613000071 | 六角 雅彦 | 9 | 1952-04-01 | 66 | 136-0076 | 東京都江東区南砂********** | S13037 | 20150414 | 0-00000000-0 |
3 | CS031415000172 | 宇多田 貴美子 | 1 | 1976-10-04 | 42 | 151-0053 | 東京都渋谷区代々木********** | S13031 | 20150529 | D-20100325-C |
# Check data content
head(df_gender_std, n = 3)
gender_cd | gender | |
---|---|---|
<chr> | <chr> | |
1 | 1 | 女性 |
2 | 9 | 不明 |
6 | 0 | 男性 |
Commentary :
The code above manipulates a data frame called `df_customer` that contains customer information. Here's what each line does:
`df_gender_std = unique(df_customer[c("gender_cd", "gender")])` - This line creates a new data frame called `df_gender_std`. It selects two columns from the `df_customer` data frame, "gender_cd" and "gender", and uses the `unique()` function to remove duplicate rows, producing a data frame of all unique combinations of gender codes and gender values in the original data.
`df_customer_std = df_customer[, colnames(df_customer) != "gender"]` - This line creates another new data frame called `df_customer_std`. It copies all columns from `df_customer` except the "gender" column, by selecting all columns whose names are not equal to "gender" using the `colnames()` function and the `!=` operator. The resulting data frame has the same rows as the original `df_customer` but without the "gender" column.
Together, these two lines split the original `df_customer` data frame into two parts: one containing only the unique gender codes and values, and one containing all other customer information except the gender column.
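The same split can also be written with dplyr verbs, which some readers may find clearer; a sketch equivalent to the base-R version above:

```r
# distinct() keeps unique (gender_cd, gender) pairs; select(-gender) drops the column
df_gender_std   <- df_customer %>% distinct(gender_cd, gender)
df_customer_std <- df_customer %>% select(-gender)
```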
R-093: Product data (df_product) only holds the code values for each category, but not the category names. Combine with category data (df_category) and denormalise to create new product data that holds category names.
df_product_full <- inner_join(df_product,
                              df_category[c("category_small_cd", "category_major_name", "category_medium_name", "category_small_name")],
                              by = "category_small_cd")
# Check data content
head(df_product_full, n = 3)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name | |
---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <int> | <int> | <chr> | <chr> | <chr> | |
1 | P040101001 | 04 | 0401 | 040101 | 198 | 149 | 惣菜 | 御飯類 | 弁当類 |
2 | P040101002 | 04 | 0401 | 040101 | 218 | 164 | 惣菜 | 御飯類 | 弁当類 |
3 | P040101003 | 04 | 0401 | 040101 | 230 | 173 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code above combines two data frames called `df_product` and `df_category`. Here's what each line does:
`df_product_full <- inner_join(df_product, df_category[c("category_small_cd", "category_major_name", "category_medium_name", "category_small_name")], by = "category_small_cd")` - This line creates a new data frame called `df_product_full`. It uses the `inner_join()` function to combine the `df_product` and `df_category` data frames on the common column "category_small_cd". The result contains all columns from both data frames for the rows where the "category_small_cd" values match. The `df_category` data frame is subsetted to only the columns "category_small_cd", "category_major_name", "category_medium_name", and "category_small_name", so the join adds only the category name columns.
`head(df_product_full, n = 3)` - This line displays the first three rows of `df_product_full` using the `head()` function. This is a useful way to quickly inspect the result, confirm the join succeeded, and check the format of the output.
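Note that `inner_join()` drops any product whose `category_small_cd` has no match in `df_category`. If unmatched products must be kept (with `NA` category names), a `left_join()` could be used instead; a sketch (`df_product_full_all` is an illustrative name):

```r
df_product_full_all <- df_product %>%
  left_join(select(df_category, category_small_cd, category_major_name,
                   category_medium_name, category_small_name),
            by = "category_small_cd")
```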
R-094: File out the product data with category names created in 093 with the following specifications.
|file format|with/without header|character encoding|
|:--:|:--:|:--:|
|CSV (comma-separated value)|With|UTF-8|
The path of the file output destination must be as follows.
|Output Destination|
|:--:|
|./data|
# The answer file is stored in a different location from the question file, so the path is '../data'.
write.csv(df_product_full, "../data/R_df_product_full_UTF-8_header.csv", row.names=FALSE, fileEncoding = "UTF-8")
Commentary :
The code above writes the data frame `df_product_full` to a CSV file with UTF-8 encoding. Here's what each argument in the `write.csv()` function does:
`df_product_full` - The data frame to write to a CSV file.
`"../data/R_df_product_full_UTF-8_header.csv"` - The file path and name of the output CSV file. The `..` at the beginning of the path means the file is saved in the parent directory of the current working directory; `R_df_product_full_UTF-8_header.csv` is the name of the output file.
`row.names = FALSE` - Do not include row names in the output CSV file.
`fileEncoding = "UTF-8"` - The encoding of the output CSV file. UTF-8 supports a wide range of characters from different languages, so it is a good choice for handling non-English characters.
In summary, `write.csv()` writes the `df_product_full` data frame to a UTF-8 encoded CSV file at the specified path, without row names.
R-095: File out the product data with category names created in 093 with the following specifications.
|file format|with/without header|character encoding|
|:--:|:--:|:--:|
|CSV (comma-separated value)|With|CP932|
The path of the file output destination should be as follows.
|Output destination|
|:--:|
|./data|
# The answer file is stored in a different location from the question file, so the path is '../data'.
write.csv(df_product_full, "../data/R_df_product_full_CP932_header.csv", row.names = FALSE, fileEncoding = "CP932")
Commentary :
The code above writes the data frame `df_product_full` to a CSV file with CP932 encoding. Here's what each argument in the `write.csv()` function does:
`df_product_full` - The data frame to write to a CSV file.
`"../data/R_df_product_full_CP932_header.csv"` - The file path and name of the output CSV file. The `..` at the beginning of the path means the file is saved in the parent directory of the current working directory; `R_df_product_full_CP932_header.csv` is the name of the output file.
`row.names = FALSE` - Do not include row names in the output CSV file.
`fileEncoding = "CP932"` - The encoding of the output CSV file. CP932 is a character encoding used for Japanese text, supporting the wide range of characters used in the Japanese language.
In summary, `write.csv()` writes the `df_product_full` data frame to a CP932-encoded CSV file at the specified path, without row names. This encoding is useful for handling Japanese text data.
R-096: File out the product data with category names created in 093 with the following specifications.
|file format|with/without header|character encoding|
|:--:|:--:|:--:|
|CSV (comma-separated value)|Without|UTF-8|
The path of the file output destination must be as follows.
|Output Destination|
|:--:|
|./data|
# The answer file is stored in a different location from the question file, so the path is '../data'.
write.table(df_product_full, "../data/R_df_product_full_UTF-8_noh.csv", row.names = FALSE, col.names = FALSE, sep = ",", fileEncoding = "UTF-8")
Commentary :
The code above writes the data frame `df_product_full` to a CSV file with UTF-8 encoding and no header row. Here's what each argument in the `write.table()` function does:
`df_product_full` - The data frame to write.
`"../data/R_df_product_full_UTF-8_noh.csv"` - The file path and name of the output file. The `..` at the beginning of the path means the file is saved in the parent directory of the current working directory; `R_df_product_full_UTF-8_noh.csv` is the name of the output file.
`row.names = FALSE` - Do not include row names in the output file.
`col.names = FALSE` - Do not include column names (the header row) in the output file.
`sep = ","` - Use a comma as the delimiter between columns, producing CSV output.
`fileEncoding = "UTF-8"` - The encoding of the output file. UTF-8 supports a wide range of characters from different languages, so it is a good choice for handling non-English characters.
In summary, `write.table()` writes the `df_product_full` data frame as a comma-separated, UTF-8 encoded file with neither row names nor a header row. (`write.table()` is used instead of `write.csv()` here because `write.csv()` ignores attempts to change `col.names`.)
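An alternative sketch using the readr package (this assumes readr is installed; it is not loaded in the setup cell above). `write_csv()` writes UTF-8 by default and never writes row names:

```r
readr::write_csv(df_product_full,
                 "../data/R_df_product_full_UTF-8_noh.csv",
                 col_names = FALSE)  # suppress the header row
```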
R-097: Load the file in the following format created in 094, display three data items and check that they have been imported correctly.
|File format|Header|Character encoding|
|:--:|:--:|:--:|
|CSV (comma separated)|With|UTF-8|
# The answer file is stored in a different location from the question file, so the path is '../data'.
c_class <- c(NA, "character", "character", "character", NA, NA, NA, NA, NA)
df_product_full <- read.csv("../data/R_df_product_full_UTF-8_header.csv", colClasses = c_class, fileEncoding = "UTF-8")
head(df_product_full, 3)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name | |
---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <int> | <int> | <chr> | <chr> | <chr> | |
1 | P040101001 | 04 | 0401 | 040101 | 198 | 149 | 惣菜 | 御飯類 | 弁当類 |
2 | P040101002 | 04 | 0401 | 040101 | 218 | 164 | 惣菜 | 御飯類 | 弁当類 |
3 | P040101003 | 04 | 0401 | 040101 | 230 | 173 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code above reads a CSV file called `R_df_product_full_UTF-8_header.csv` into a data frame called `df_product_full`. Here's what each argument in the `read.csv()` function does:
`"../data/R_df_product_full_UTF-8_header.csv"` - The file path and name of the input CSV file. The `..` at the beginning of the path means the file is located in the parent directory of the current working directory.
`colClasses = c_class` - Specifies the class of each column in the data frame. `c_class` is a vector with one entry per column: the first and the fifth to ninth entries are `NA`, meaning those classes are determined automatically, and the second to fourth entries are `"character"`, so the category code columns are read as character (preserving leading zeros such as "04").
`fileEncoding = "UTF-8"` - The encoding of the input CSV file. UTF-8 supports a wide range of characters from different languages.
`head(df_product_full, 3)` - Prints the first three rows of the `df_product_full` data frame to the console.
In summary, `read.csv()` reads the `R_df_product_full_UTF-8_header.csv` file into a data frame called `df_product_full` using the specified column classes and file encoding, and `head()` is used to confirm the import.
R-098: Load the file in the following format created in 096, display three data items and check that they have been imported correctly.
|File format|Header|Character encoding|
|:--:|:--:|:--:|
|CSV (comma delimited)|Without header|UTF-8|
# Code example 1 (item name to be added later)
# The answer file is stored in a different location from the question file, so the path is '../data'.
c_class <- c(NA, "character", "character", "character", NA, NA, NA, NA, NA)
df_product_full <- read.csv("../data/R_df_product_full_UTF-8_noh.csv", colClasses = c_class, header = FALSE, fileEncoding = "UTF-8")
colnames(df_product_full) <- c("product_cd", "category_major_cd", "category_medium_cd", "category_small_cd", "unit_price", "unit_cost", "category_major_name", "category_medium_name", "category_small_name")
head(df_product_full, 3)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name | |
---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <int> | <int> | <chr> | <chr> | <chr> | |
1 | P040101001 | 04 | 0401 | 040101 | 198 | 149 | 惣菜 | 御飯類 | 弁当類 |
2 | P040101002 | 04 | 0401 | 040101 | 218 | 164 | 惣菜 | 御飯類 | 弁当類 |
3 | P040101003 | 04 | 0401 | 040101 | 230 | 173 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code above reads a CSV file called `R_df_product_full_UTF-8_noh.csv` into a data frame called `df_product_full`, then assigns column names to the data frame. Here's what each argument in the `read.csv()` function does:
`"../data/R_df_product_full_UTF-8_noh.csv"` - The file path and name of the input CSV file. The `..` at the beginning of the path means the file is located in the parent directory of the current working directory.
`colClasses = c_class` - Specifies the class of each column in the data frame. The first and the fifth to ninth entries of `c_class` are `NA`, so those classes are determined automatically; the second to fourth entries are `"character"`, so the category code columns are read as character.
`header = FALSE` - The input CSV file does not have a header row.
`fileEncoding = "UTF-8"` - The encoding of the input CSV file. UTF-8 supports a wide range of characters from different languages.
`colnames(df_product_full) <- c("product_cd", "category_major_cd", "category_medium_cd", "category_small_cd", "unit_price", "unit_cost", "category_major_name", "category_medium_name", "category_small_name")` - Assigns column names to the data frame, in that order.
`head(df_product_full, 3)` - Prints the first three rows of the `df_product_full` data frame to the console.
In summary, `read.csv()` reads the header-less `R_df_product_full_UTF-8_noh.csv` file into a data frame called `df_product_full` using the specified column classes, header setting, and file encoding; the data frame is then given column names and inspected with `head()`.
# Code example 2 (define item name first)
# The answer file is stored in a different location from the question file, so the path is '../data'.
c_names <- c("product_cd", "category_major_cd", "category_medium_cd", "category_small_cd", "unit_price", "unit_cost", "category_major_name", "category_medium_name", "category_small_name")
c_class <- c(NA, "character", "character", "character", NA, NA, NA, NA, NA)
df_product_full <- read.csv("../data/R_df_product_full_UTF-8_noh.csv", col.names = c_names, colClasses = c_class, header = FALSE, fileEncoding = "UTF-8")
head(df_product_full, 3)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name | |
---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <int> | <int> | <chr> | <chr> | <chr> | |
1 | P040101001 | 04 | 0401 | 040101 | 198 | 149 | 惣菜 | 御飯類 | 弁当類 |
2 | P040101002 | 04 | 0401 | 040101 | 218 | 164 | 惣菜 | 御飯類 | 弁当類 |
3 | P040101003 | 04 | 0401 | 040101 | 230 | 173 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code above reads the same header-less CSV file, `R_df_product_full_UTF-8_noh.csv`, into a data frame called `df_product_full`, but supplies the column names directly to `read.csv()`. Here's what each argument does:
`"../data/R_df_product_full_UTF-8_noh.csv"` - The file path and name of the input CSV file. The `..` at the beginning of the path means the file is located in the parent directory of the current working directory.
`col.names = c_names` - The column names to use in the data frame. `c_names` is a character vector listing the names in the same order as the columns appear in the input file.
`colClasses = c_class` - Specifies the class of each column in the data frame. The second to fourth entries are `"character"`, so the category code columns are read as character; all other entries are `NA`, which means those columns' classes are determined automatically (an `NA` entry does not mean the values are treated as missing).
`header = FALSE` - The input CSV file does not have a header row.
`fileEncoding = "UTF-8"` - The encoding of the input CSV file. UTF-8 supports a wide range of characters from different languages.
`head(df_product_full, 3)` - Prints the first three rows of the `df_product_full` data frame to the console.
In summary, `read.csv()` reads the `R_df_product_full_UTF-8_noh.csv` file into a data frame called `df_product_full` using the specified column names, column classes, header setting, and file encoding, and the result is inspected with `head()`.
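The readr package offers a similar interface; a sketch (assuming readr is installed), using `col_names` and `col_types` in place of `col.names` and `colClasses`:

```r
df_product_full <- readr::read_csv(
  "../data/R_df_product_full_UTF-8_noh.csv",
  col_names = c_names,  # the file has no header row, so supply the names
  col_types = readr::cols(
    category_major_cd  = "c",  # keep the code columns as character
    category_medium_cd = "c",
    category_small_cd  = "c",
    .default = readr::col_guess()))  # guess the remaining column types
```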
R-099: File out the product data with category names created in 093 with the following specifications.
|file format|with/without header|character encoding|
|:--:|:--:|:--:|
|TSV (Tab Separated Values)|With|UTF-8|
The path of the file output destination should be as follows.
|Output Destination|
|:--:|
|./data|
# The answer file is stored in a different location from the question file, so the path is '../data'.
write.table(df_product_full, "../data/R_df_product_full_UTF-8_header.tsv", row.names = FALSE, sep = "\t", fileEncoding = "UTF-8")
Commentary :
The code above writes the data frame `df_product_full` to a tab-separated values (TSV) file called `R_df_product_full_UTF-8_header.tsv`. Here's what each argument in the `write.table()` function does:
`df_product_full` - The data frame to write to the TSV file.
`"../data/R_df_product_full_UTF-8_header.tsv"` - The file path and name of the output TSV file. The `..` at the beginning of the path means the file is saved in the parent directory of the current working directory.
`row.names = FALSE` - Do not include row names in the output file.
`sep = "\t"` - Use a tab character as the delimiter between columns.
`fileEncoding = "UTF-8"` - The encoding of the output TSV file. UTF-8 supports a wide range of characters from different languages.
In summary, `write.table()` writes the `df_product_full` data frame to a TSV file with a header row, tab delimiters, and UTF-8 encoding.
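An equivalent one-liner with readr can be sketched as follows (assuming readr is installed); `write_tsv()` writes a tab-separated, UTF-8 file with a header and no row names by default:

```r
readr::write_tsv(df_product_full, "../data/R_df_product_full_UTF-8_header.tsv")
```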
R-100: Load the file in the following format created in 099, display three data items and check that they have been imported correctly.
|File format|Header|Character encoding|
|:--:|:--:|:--:|
|TSV (tab-separated)|With|UTF-8|
# The answer file is stored in a different location from the question file, so the path is '../data'.
c_class <- c(NA, "character", "character", "character", NA, NA, NA, NA, NA)
df_product_tmp <- read.table("../data/R_df_product_full_UTF-8_header.tsv", colClasses = c_class, header = TRUE, fileEncoding = "UTF-8")
head(df_product_tmp,3)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name | |
---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <int> | <int> | <chr> | <chr> | <chr> | |
1 | P040101001 | 04 | 0401 | 040101 | 198 | 149 | 惣菜 | 御飯類 | 弁当類 |
2 | P040101002 | 04 | 0401 | 040101 | 218 | 164 | 惣菜 | 御飯類 | 弁当類 |
3 | P040101003 | 04 | 0401 | 040101 | 230 | 173 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code above reads the tab-separated values (TSV) file `R_df_product_full_UTF-8_header.tsv` into a data frame called `df_product_tmp`. Here's what each argument in the `read.table()` function does:
`"../data/R_df_product_full_UTF-8_header.tsv"` - The file path and name of the input TSV file.
`colClasses = c_class` - Specifies the class of each column. The second to fourth entries of `c_class` are `"character"`, so the category code columns are read as character; the remaining entries are `NA`, so those columns' classes are determined automatically.
`header = TRUE` - The first row of the input file contains the column names.
`fileEncoding = "UTF-8"` - The character encoding of the input TSV file.
After reading the file, the `head()` function displays the first three rows of the resulting `df_product_tmp` data frame.
Overall, this code is similar to the earlier code that reads a CSV file into `df_product_full`, except that it reads a TSV file and stores the contents in a temporary data frame called `df_product_tmp`. Using a temporary name allows some data cleaning or manipulation to be performed before storing the final result in a data frame with a different name.
This is the end of the 100. Well done!