Reference: "Data Science 100 Knocks (Structured Data Processing)" by the Data Scientist Society Skill Definition Committee (データサイエンティスト協会スキル定義委員会)
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by the Data Scientist Society (データサイエンティスト協会スキル定義委員会) and translated from Japanese to English with DeepL.
-
I updated this file to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers.
-
Since this data was created for Japanese users, you may run into language issues while practicing, but they will not affect the exercises much.
You can get the preprocess_knock files for practicing Python, SQL, and R from my GitHub account.
Introduction
- To begin with, run the following cells
- Import the required libraries and read the data from the database (PostgreSQL)
- The libraries that are expected to be used are imported in the following cells
- If there are other libraries you wish to use, install them as appropriate using install.packages().
- Names, addresses, etc. are dummy data and are not real.
require("RPostgreSQL")
require("tidyr")
require("dplyr")
require("stringr")
require("caret")
require("lubridate")
require("rsample")
require("recipes")
require("themis")
host <- "db"
port <- Sys.getenv()["PG_PORT"]
dbname <- Sys.getenv()["PG_DATABASE"]
user <- Sys.getenv()["PG_USER"]
password <- Sys.getenv()["PG_PASSWORD"]
con <- dbConnect(PostgreSQL(), host=host, port=port, dbname=dbname, user=user, password=password)
df_customer <- dbGetQuery(con,"SELECT * FROM customer")
df_category <- dbGetQuery(con,"SELECT * FROM category")
df_product <- dbGetQuery(con,"SELECT * FROM product")
df_receipt <- dbGetQuery(con,"SELECT * FROM receipt")
df_store <- dbGetQuery(con,"SELECT * FROM store")
df_geocode <- dbGetQuery(con,"SELECT * FROM geocode")
Practice Questions
R-081: Create new product data in which the missing values of unit price (unit_price) and unit cost (unit_cost) are imputed with the mean of each column. Round the means to the nearest yen (rounding half up or rounding half to even are both acceptable). Afterwards, confirm that no column contains missing values.
price_mean <- round(mean(df_product$unit_price, na.rm = TRUE))
cost_mean <- round(mean(df_product$unit_cost, na.rm = TRUE))
df_product_2 <- df_product %>%
replace_na(list(unit_price = price_mean, unit_cost = cost_mean))
sapply(df_product_2, function(x) sum(is.na(x)))
- product_cd: 0
- category_major_cd: 0
- category_medium_cd: 0
- category_small_cd: 0
- unit_price: 0
- unit_cost: 0
Commentary :
This code performs the following actions:

1. Calculates the mean of the unit_price and unit_cost columns of df_product, rounding the results to the nearest integer.

price_mean <- round(mean(df_product$unit_price, na.rm = TRUE))
cost_mean <- round(mean(df_product$unit_cost, na.rm = TRUE))

The `round()` function rounds the means to the nearest integers, and `na.rm = TRUE` excludes missing values from the calculation.

2. Replaces missing values in the unit_price and unit_cost columns of df_product with the respective means.

df_product_2 <- df_product %>%
    replace_na(list(unit_price = price_mean, unit_cost = cost_mean))

`replace_na()` from the tidyr package substitutes the precomputed means for the missing values. The `%>%` (pipe) operator chains the steps together, feeding the output of one step into the next.

3. Uses `sapply()` to count the missing values in each column of the modified df_product_2.

sapply(df_product_2, function(x) sum(is.na(x)))

`sapply()` applies the function `function(x) sum(is.na(x))` to each column of df_product_2; `sum(is.na(x))` counts the NA values in column x.
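The same imputation can also be written with `dplyr::coalesce()`, which returns the first non-missing value at each position. A minimal sketch on toy data (the `price`/`cost` columns are stand-ins for unit_price/unit_cost, not the exercise dataset):

```r
require("dplyr")

# Toy stand-in for df_product with missing values
df_toy <- data.frame(price = c(100, NA, 300), cost = c(NA, 50, 70))

price_mean <- round(mean(df_toy$price, na.rm = TRUE))  # 200
cost_mean  <- round(mean(df_toy$cost,  na.rm = TRUE))  # 60

# coalesce() keeps the original value where present, else falls back to the mean
df_toy2 <- df_toy %>%
    mutate(price = coalesce(price, price_mean),
           cost  = coalesce(cost,  cost_mean))

sum(is.na(df_toy2))  # 0
```

`coalesce()` recycles the scalar fallback across the column, so it behaves exactly like `replace_na()` here; which to use is a matter of taste.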
R-082: Create new product data in which the missing values of unit price (unit_price) and unit cost (unit_cost) are imputed with the median of each column. Round the medians to the nearest yen (rounding half up or rounding half to even are both acceptable). Afterwards, confirm that no column contains missing values.
price_median <- round(median(df_product$unit_price, na.rm = TRUE))
cost_median <- round(median(df_product$unit_cost, na.rm = TRUE))
df_product_3 <- df_product %>%
replace_na(list(unit_price = price_median, unit_cost = cost_median))
sapply(df_product_3, function(x) sum(is.na(x)))
- product_cd: 0
- category_major_cd: 0
- category_medium_cd: 0
- category_small_cd: 0
- unit_price: 0
- unit_cost: 0
Commentary :
This code performs the following actions:

1. Calculates the median of the unit_price and unit_cost columns of df_product, rounding the results to the nearest integer.

price_median <- round(median(df_product$unit_price, na.rm = TRUE))
cost_median <- round(median(df_product$unit_cost, na.rm = TRUE))

The `median()` function computes the medians of the two columns, `round()` rounds them to the nearest integers, and `na.rm = TRUE` excludes missing values from the calculation.

2. Replaces missing values in the unit_price and unit_cost columns of df_product with the respective medians.

df_product_3 <- df_product %>%
    replace_na(list(unit_price = price_median, unit_cost = cost_median))

`replace_na()` from the tidyr package substitutes the precomputed medians, and the `%>%` (pipe) operator chains the steps together.

3. Uses `sapply()` to count the missing values in each column of the modified df_product_3.

sapply(df_product_3, function(x) sum(is.na(x)))

`sapply()` applies `function(x) sum(is.na(x))` to each column of df_product_3; `sum(is.na(x))` counts the NA values in column x.
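With dplyr 1.0 or later, both columns can be imputed in one step using `across()`, which applies the same lambda to each listed column. A sketch on a toy frame standing in for df_product:

```r
require("dplyr")

# Toy data with one missing value per column
df_toy <- data.frame(unit_price = c(200, NA, 160), unit_cost = c(NA, 110, 130))

# Impute each listed column with its own rounded median
df_toy2 <- df_toy %>%
    mutate(across(c(unit_price, unit_cost),
                  ~ ifelse(is.na(.x), round(median(.x, na.rm = TRUE)), .x)))
```

Inside `across()`, `.x` refers to the current column, so each column gets its own median without naming it twice.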
R-083: Create new product data in which the missing values of unit price (unit_price) and unit cost (unit_cost) are imputed with the median computed for each product sub-category code (category_small_cd). Round the medians to the nearest yen (rounding half up or rounding half to even are both acceptable). Afterwards, confirm that no column contains missing values.
df_product_4 <- df_product %>%
    group_by(category_small_cd) %>%
    summarise(price_median = round(median(unit_price, na.rm = TRUE)),
              cost_median = round(median(unit_cost, na.rm = TRUE)),
              .groups = "drop") %>%
    inner_join(df_product, by = "category_small_cd") %>%
    mutate(unit_price = ifelse(is.na(unit_price), price_median, unit_price),
           unit_cost = ifelse(is.na(unit_cost), cost_median, unit_cost))

sapply(df_product_4, function(x) sum(is.na(x)))
- category_small_cd: 0
- price_median: 0
- cost_median: 0
- product_cd: 0
- category_major_cd: 0
- category_medium_cd: 0
- unit_price: 0
- unit_cost: 0
Commentary :
This code performs the following actions:

1. Groups df_product by the category_small_cd column.

df_product_4 <- df_product %>%
    group_by(category_small_cd)

The `%>%` operator chains the steps together, and `group_by()` from the dplyr package groups the data frame by category_small_cd.

2. Calculates the rounded median of unit_price and unit_cost for each group and summarises the results into a new data frame.

summarise(price_median = round(median(unit_price, na.rm = TRUE)), cost_median = round(median(unit_cost, na.rm = TRUE)), .groups = "drop")

`summarise()` computes the per-group medians, `round()` rounds them to the nearest integers, `na.rm = TRUE` excludes missing values, and `.groups = "drop"` removes the grouping information from the output.

3. Joins the summarised medians back onto the original df_product by category_small_cd.

inner_join(df_product, by = "category_small_cd")

`inner_join()` from dplyr matches rows on category_small_cd; the resulting data frame has the columns of both inputs.

4. Replaces missing unit_price and unit_cost values with the median of the corresponding category_small_cd group.

mutate(unit_price = ifelse(is.na(unit_price), price_median, unit_price), unit_cost = ifelse(is.na(unit_cost), cost_median, unit_cost))

`mutate()` rewrites the two columns, and `ifelse()` substitutes the group median wherever a value is NA.

5. Uses `sapply()` to count the missing values in each column of df_product_4.

sapply(df_product_4, function(x) sum(is.na(x)))

`sum(is.na(x))` counts the NA values in each column x of df_product_4.
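The summarise-then-join pattern above can be collapsed into a single grouped `mutate()`: inside `group_by()`, `median()` is evaluated per group, so no intermediate median table or join is needed. A sketch on toy data (the `cat` column stands in for category_small_cd):

```r
require("dplyr")

# Toy data: one missing price in each of two categories
df_toy <- data.frame(cat = c("A", "A", "B", "B"),
                     unit_price = c(100, NA, 300, NA))

# Per-group imputation in one grouped mutate()
df_toy2 <- df_toy %>%
    group_by(cat) %>%
    mutate(unit_price = ifelse(is.na(unit_price),
                               round(median(unit_price, na.rm = TRUE)),
                               unit_price)) %>%
    ungroup()
```

The join-based version in the answer has the advantage of making the per-category medians available as explicit columns; this version keeps the original column layout untouched.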
R-084: For all customers in the customer data (df_customer), calculate the ratio of the 2019 sales amount to the sales amount over the whole period. Treat customers with no sales record as 0. Then extract the customers whose calculated ratio exceeds 0 and display 10 results. Also confirm that the created data contains no missing values.
df_receipt_2019 <- df_receipt %>%
    filter(20190101 <= sales_ymd & sales_ymd <= 20191231) %>%
    group_by(customer_id) %>%
    summarise(amount_2019 = sum(amount), .groups = "drop")
df_receipt_all <- df_receipt %>%
group_by(customer_id) %>%
summarise(amount_all = sum(amount), .groups = "drop")
df_sales_rate <- left_join(df_customer["customer_id"], df_receipt_2019, by = "customer_id") %>%
    left_join(df_receipt_all, by = "customer_id") %>%
    replace_na(list(amount_2019 = 0, amount_all = 0)) %>%
    mutate(amount_rate = ifelse(amount_all == 0, 0, amount_2019 / amount_all))

df_sales_rate %>%
    filter(amount_rate > 0) %>%
    slice(1:10)
customer_id | amount_2019 | amount_all | amount_rate |
---|---|---|---|
<chr> | <dbl> | <dbl> | <dbl> |
CS031415000172 | 2971 | 5088 | 0.58392296 |
CS015414000103 | 874 | 3122 | 0.27994875 |
CS011215000048 | 248 | 3444 | 0.07200929 |
CS029415000023 | 3767 | 5167 | 0.72904974 |
CS035415000029 | 5823 | 7504 | 0.77598614 |
CS023513000066 | 208 | 771 | 0.26977951 |
CS035513000134 | 463 | 1565 | 0.29584665 |
CS001515000263 | 216 | 216 | 1.00000000 |
CS006415000279 | 229 | 229 | 1.00000000 |
CS031415000106 | 215 | 7741 | 0.02777419 |
sapply(df_sales_rate, function(x) sum(is.na(x)))

- customer_id: 0
- amount_2019: 0
- amount_all: 0
- amount_rate: 0
Commentary :
This code works on the data frame df_receipt, which contains sales transactions. Here is a breakdown of what each step does:

df_receipt_2019 <- df_receipt %>%
    filter(20190101 <= sales_ymd & sales_ymd <= 20191231) %>%
    group_by(customer_id) %>%
    summarise(amount_2019 = sum(amount), .groups = "drop")

This creates df_receipt_2019 by filtering df_receipt to transactions between January 1, 2019 and December 31, 2019, grouping by customer_id, and summing each customer's spending over that period into a new column called amount_2019.

df_receipt_all <- df_receipt %>%
    group_by(customer_id) %>%
    summarise(amount_all = sum(amount), .groups = "drop")

This creates df_receipt_all by grouping df_receipt by customer_id and summing each customer's spending across all transactions into a new column called amount_all.

df_sales_rate <- left_join(df_customer["customer_id"], df_receipt_2019, by = "customer_id") %>%
    left_join(df_receipt_all, by = "customer_id") %>%
    replace_na(list(amount_2019 = 0, amount_all = 0)) %>%
    mutate(amount_rate = ifelse(amount_all == 0, 0, amount_2019 / amount_all))

This creates df_sales_rate by left-joining the customer IDs from df_customer with df_receipt_2019 and df_receipt_all, so that every customer is kept even without sales. Missing values in amount_2019 and amount_all are replaced with 0, and amount_rate is computed as amount_2019 / amount_all, with the ifelse() guard setting the rate to 0 when amount_all is 0 to avoid division by zero.

df_sales_rate %>% filter(amount_rate > 0) %>% slice(1:10)

Finally, filter(amount_rate > 0) keeps the customers who spent money in 2019, and slice(1:10) displays the first 10 of them (the first 10 rows, not the 10 highest rates).

Overall, this code computes, for every customer, the share of their total spending that occurred in 2019, which could be useful for targeted marketing or customer retention efforts.
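The join-and-default pattern, left-joining all customers, filling missing totals with 0, and guarding the division, can be reproduced end to end on toy data (toy IDs C1..C3 stand in for the real customer table):

```r
require("dplyr")
require("tidyr")

# Toy tables: C1 bought in 2019 and overall, C2 only before 2019, C3 never
customers  <- data.frame(customer_id = c("C1", "C2", "C3"))
sales_2019 <- data.frame(customer_id = "C1", amount_2019 = 50)
sales_all  <- data.frame(customer_id = c("C1", "C2"), amount_all = c(200, 80))

rate <- customers %>%
    left_join(sales_2019, by = "customer_id") %>%
    left_join(sales_all, by = "customer_id") %>%
    replace_na(list(amount_2019 = 0, amount_all = 0)) %>%           # no sales -> 0
    mutate(amount_rate = ifelse(amount_all == 0, 0, amount_2019 / amount_all))
```

C1 gets 50 / 200 = 0.25, C2 gets 0 / 80 = 0, and C3 hits the `amount_all == 0` guard, which is exactly why the guard exists: without it, C3's rate would be 0/0 = NaN.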
R-085: Create new customer data by joining the geocode data (df_geocode) to all customers in the customer data (df_customer) on postcode (postal_cd). If multiple longitude (longitude) and latitude (latitude) values are associated with a single postcode, use their averages. Display 10 results to check the output.
df_geocode_1 <- df_geocode[c("postal_cd", "longitude" ,"latitude")] %>% group_by(postal_cd) %>%
summarise(m_longiture = mean(longitude), m_latitude = mean(latitude), .groups = "drop")
df_customer_1 <- inner_join(df_customer, df_geocode_1, by = "postal_cd")
head(df_customer_1, 10)
customer_id | customer_name | gender_cd | gender | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | m_longiture | m_latitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <date> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | |
1 | CS021313000114 | 大野 あや子 | 1 | 女性 | 1981-04-29 | 37 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20150905 | 0-00000000-0 | 139.3178 | 35.41358 |
2 | CS037613000071 | 六角 雅彦 | 9 | 不明 | 1952-04-01 | 66 | 136-0076 | 東京都江東区南砂********** | S13037 | 20150414 | 0-00000000-0 | 139.8350 | 35.67193 |
3 | CS031415000172 | 宇多田 貴美子 | 1 | 女性 | 1976-10-04 | 42 | 151-0053 | 東京都渋谷区代々木********** | S13031 | 20150529 | D-20100325-C | 139.6897 | 35.67374 |
4 | CS028811000001 | 堀井 かおり | 1 | 女性 | 1933-03-27 | 86 | 245-0016 | 神奈川県横浜市泉区和泉町********** | S14028 | 20160115 | 0-00000000-0 | 139.4836 | 35.39125 |
5 | CS001215000145 | 田崎 美紀 | 1 | 女性 | 1995-03-29 | 24 | 144-0055 | 東京都大田区仲六郷********** | S13001 | 20170605 | 6-20090929-2 | 139.7078 | 35.54084 |
6 | CS020401000016 | 宮下 達士 | 0 | 男性 | 1974-09-15 | 44 | 174-0065 | 東京都板橋区若木********** | S13020 | 20150225 | 0-00000000-0 | 139.6724 | 35.77073 |
7 | CS015414000103 | 奥野 陽子 | 1 | 女性 | 1977-08-09 | 41 | 136-0073 | 東京都江東区北砂********** | S13015 | 20150722 | B-20100609-B | 139.8360 | 35.67818 |
8 | CS029403000008 | 釈 人志 | 0 | 男性 | 1973-08-17 | 45 | 279-0003 | 千葉県浦安市海楽********** | S12029 | 20150515 | 0-00000000-0 | 139.9047 | 35.65422 |
9 | CS015804000004 | 松谷 米蔵 | 0 | 男性 | 1931-05-02 | 87 | 136-0073 | 東京都江東区北砂********** | S13015 | 20150607 | 0-00000000-0 | 139.8360 | 35.67818 |
10 | CS033513000180 | 安斎 遥 | 1 | 女性 | 1962-07-11 | 56 | 241-0823 | 神奈川県横浜市旭区善部町********** | S14033 | 20150728 | 6-20080506-5 | 139.5146 | 35.45013 |
Commentary :
This code performs the following tasks:

It creates a new data frame called df_geocode_1 by selecting three columns from the df_geocode data frame (postal_cd, longitude, and latitude).

df_geocode_1 is grouped by postal_cd, and the mean longitude and latitude of each group are computed with the summarise() function. The resulting data frame contains one row per unique postal_cd, with columns postal_cd, m_longiture, and m_latitude.

The df_customer data frame is joined with df_geocode_1 using the inner_join() function, which matches rows on the common postal_cd column. The resulting data frame contains all columns from both data frames.

The head() function displays the first 10 rows of the result (df_customer_1), showing each customer together with the averaged coordinates of their postcode.
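The per-postcode averaging can equivalently be written with `across()`, which applies `mean()` to both coordinate columns at once. A sketch on toy geocode data, where postcode "100" has two coordinate pairs that get averaged:

```r
require("dplyr")

# Toy geocode table: postcode "100" appears twice with different coordinates
geo <- data.frame(postal_cd = c("100", "100", "200"),
                  longitude = c(139.0, 141.0, 135.0),
                  latitude  = c(35.0, 37.0, 34.0))

# One row per postcode; each coordinate column is averaged
geo_avg <- geo %>%
    group_by(postal_cd) %>%
    summarise(across(c(longitude, latitude), mean), .groups = "drop")
```

Unlike the answer's version, this keeps the original column names (longitude, latitude), which avoids the later join producing differently named coordinate columns.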
R-086: Join the customer data with latitude (latitude) and longitude (longitude) created in 085 with the store data (df_store), using the member application store code (application_store_cd) as the key. Using the store's latitude and longitude and the customer's latitude and longitude, calculate the distance (in km) between the application store and the customer's address, and display it together with the customer ID (customer_id), customer address, and store address (address). Use the simple formula below; libraries implementing more accurate methods are also acceptable. Display 10 results.
Latitude (radians): φ, longitude (radians): λ
Distance L = 6371 × arccos(sin φ1 · sin φ2 + cos φ1 · cos φ2 · cos(λ1 − λ2))
calc_distance <- function(x1, y1, x2, y2) {
    distance <- 6371 * acos(
        sin(x1 * pi / 180) * sin(x2 * pi / 180) +
            cos(x1 * pi / 180) * cos(x2 * pi / 180) * cos((y1 * pi / 180) - (y2 * pi / 180))
    )
    print(class(distance)) # Check the class of the calculation results just to be sure.
    return(distance)
}
inner_join(df_customer_1, df_store, by = c("application_store_cd" = "store_cd")) %>%
    rename(customer_address = address.x, store_address = address.y) %>%
    mutate(distance = calc_distance(m_latitude, m_longiture, latitude, longitude)) %>%
    select(customer_id, customer_address, store_address, distance) %>%
    slice(1:10)
[1] "numeric"
customer_id | customer_address | store_address | distance |
---|---|---|---|
<chr> | <chr> | <chr> | <dbl> |
CS021313000114 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.3944087 |
CS037613000071 | 東京都江東区南砂********** | 東京都江東区南砂一丁目 | 1.4511822 |
CS031415000172 | 東京都渋谷区代々木********** | 東京都渋谷区初台二丁目 | 0.4117335 |
CS028811000001 | 神奈川県横浜市泉区和泉町********** | 神奈川県横浜市瀬谷区二ツ橋町 | 8.0651960 |
CS001215000145 | 東京都大田区仲六郷********** | 東京都大田区仲六郷二丁目 | 1.2684210 |
CS020401000016 | 東京都板橋区若木********** | 東京都北区十条仲原三丁目 | 4.1859046 |
CS015414000103 | 東京都江東区北砂********** | 東京都江東区南砂二丁目 | 1.4496734 |
CS029403000008 | 千葉県浦安市海楽********** | 千葉県浦安市東野一丁目 | 0.8048581 |
CS015804000004 | 東京都江東区北砂********** | 東京都江東区南砂二丁目 | 1.4496734 |
CS033513000180 | 神奈川県横浜市旭区善部町********** | 神奈川県横浜市瀬谷区阿久和西一丁目 | 1.9569470 |
Commentary :
This code performs the following tasks:

It defines a custom function called calc_distance() taking four parameters x1, y1, x2, and y2 (the latitude and longitude of two points, in degrees). The function converts the coordinates to radians and computes the great-circle distance using the spherical law of cosines given in the question, returning the result in kilometres.

The df_customer_1 data frame is joined with the df_store data frame using the inner_join() function. The join matches the application_store_cd column from df_customer_1 against the store_cd column from df_store.

The rename() function gives the two address columns distinct names (customer_address and store_address), since both input tables have an address column.

A new column called distance is added with the mutate() function, calling calc_distance() with the customer's coordinates (m_latitude and m_longiture from df_customer_1) and the store's coordinates (latitude and longitude from df_store).

The select() function keeps only the customer_id, customer_address, store_address, and distance columns.

Finally, the slice() function displays the first 10 rows of the result: each customer, their address, the store address, and the distance between them.
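The formula used in the answer is the spherical law of cosines. The haversine formula is an alternative that is numerically better behaved at very short distances, where the law of cosines takes `acos()` of a value very close to 1. A base-R sketch of both, on toy coordinates (approximate locations of Tokyo Station and Shinjuku Station, assumed for illustration):

```r
deg2rad <- function(d) d * pi / 180

# Spherical law of cosines (as in the answer)
dist_slc <- function(lat1, lon1, lat2, lon2) {
    6371 * acos(sin(deg2rad(lat1)) * sin(deg2rad(lat2)) +
                cos(deg2rad(lat1)) * cos(deg2rad(lat2)) *
                cos(deg2rad(lon1) - deg2rad(lon2)))
}

# Haversine formula
dist_hav <- function(lat1, lon1, lat2, lon2) {
    dlat <- deg2rad(lat2 - lat1)
    dlon <- deg2rad(lon2 - lon1)
    a <- sin(dlat / 2)^2 +
         cos(deg2rad(lat1)) * cos(deg2rad(lat2)) * sin(dlon / 2)^2
    6371 * 2 * asin(sqrt(a))
}

# The two agree closely at city scale
dist_slc(35.681, 139.767, 35.690, 139.700)
dist_hav(35.681, 139.767, 35.690, 139.700)
```

Both are vectorised, so either could be dropped into the `mutate()` call in place of calc_distance().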
R-087: In the customer data (df_customer), the same customer may be registered more than once, e.g. because of applications at different shops. Create deduplicated customer data in which customers with the same name (customer_name) and postcode (postal_cd) are treated as the same customer, so that there is one record per customer, and report the number of customer records, the number of deduplicated records, and the number of duplicates. When merging, keep the record with the highest total sales amount; for customers with the same total sales amount or no sales record, keep the record with the lowest customer ID (customer_id).
df_sales_amount <- df_receipt %>%
    group_by(customer_id) %>%
    summarise(sum_amount = sum(amount), .groups = "drop")

df_customer_u <- left_join(df_customer, df_sales_amount, by = "customer_id") %>%
    mutate(sum_amount = ifelse(is.na(sum_amount), 0, sum_amount)) %>%
    arrange(desc(sum_amount), customer_id) %>%
    distinct(customer_name, postal_cd, .keep_all = TRUE)

print(paste("df_customer_cnt:", nrow(df_customer),
            "df_customer_u_cnt:", nrow(df_customer_u),
            "diff:", nrow(df_customer) - nrow(df_customer_u)))
[1] "df_customer_cnt: 21971 df_customer_u_cnt: 21941 diff: 30"
Commentary :
This code performs the following steps:

1. Calculate the total sales amount for each customer in df_receipt and store the result in df_sales_amount, using the group_by() and summarise() functions.

df_sales_amount <- df_receipt %>%
    group_by(customer_id) %>%
    summarise(sum_amount = sum(amount), .groups = "drop")

2. Left-join df_customer with df_sales_amount on customer_id, replace any NA values in sum_amount with 0 via mutate(), sort by descending sum_amount and then ascending customer_id, and keep only the first row for each (customer_name, postal_cd) pair with distinct(..., .keep_all = TRUE). Because of the sort order, the retained record is the one with the highest total sales, with ties and no-sales customers resolved by the lowest customer ID.

df_customer_u <- left_join(df_customer, df_sales_amount, by = "customer_id") %>%
    mutate(sum_amount = ifelse(is.na(sum_amount), 0, sum_amount)) %>%
    arrange(desc(sum_amount), customer_id) %>%
    distinct(customer_name, postal_cd, .keep_all = TRUE)

3. Print the row counts of df_customer and df_customer_u and their difference, to see how many duplicate records were removed.

print(paste("df_customer_cnt:", nrow(df_customer),
            "df_customer_u_cnt:", nrow(df_customer_u),
            "diff:", nrow(df_customer) - nrow(df_customer_u)))
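The key trick, arranging so that the record to keep sorts first and then calling `distinct()` with `.keep_all = TRUE`, keeps the first row per key. It can be checked on toy duplicates (toy IDs and names, assumed for illustration):

```r
require("dplyr")

# Two records for "Sato" (a duplicate), one for "Tanaka"
df_toy <- data.frame(customer_id = c("C2", "C1", "C3"),
                     name = c("Sato", "Sato", "Tanaka"),
                     amount = c(500, 900, 300))

# Keep the duplicate with the highest amount; ties would fall back to lowest id
df_u <- df_toy %>%
    arrange(desc(amount), customer_id) %>%
    distinct(name, .keep_all = TRUE)
```

After the arrange, Sato's C1 row (amount 900) sorts ahead of C2 (amount 500), so `distinct()` retains C1, mirroring the "highest sales wins" rule of the exercise.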
R-088: Based on the data created in 087, assign an integrated ID to the customer data according to the following specification:
- Non-duplicated customers: their own customer ID (customer_id)
- Duplicated customers: the customer ID of the record retained in the previous exercise
Also check the difference between the number of unique customer IDs and the number of unique integrated IDs.
df_customer_n = inner_join(df_customer,
df_customer_u[c("customer_id","customer_name","postal_cd")],
by = c("customer_name", "postal_cd")) %>%
rename(customer_id = customer_id.x,
integration_id = customer_id.y)
customer_id_cnt <- length(unique(df_customer_n$customer_id))
integration_id_cnt <- length(unique(df_customer_n$integration_id))
print(paste("Difference in number of IDs", customer_id_cnt - integration_id_cnt))
[1] "Difference in number of IDs 30"
Commentary :
This code performs the following operations:

It inner-joins df_customer with the customer_id, customer_name, and postal_cd columns of df_customer_u, using customer_name and postal_cd as the join keys. The result is assigned to df_customer_n.

Because both inputs contain a customer_id column, the join produces customer_id.x (the customer's own ID) and customer_id.y (the ID of the record retained in 087). The rename() call turns customer_id.x into customer_id and customer_id.y into integration_id.

The number of unique values in the customer_id and integration_id columns of df_customer_n is then counted, and the difference is printed.

Since duplicated customers share the integration_id of their retained record, the difference between the two counts (30) equals the number of duplicate records identified in 087.
R-089: You want to split the customers with sales records into training data and test data for building a prediction model. Randomly split the data in a ratio of 8:2.
set.seed(71)
#Example using rsample initial_split
df_sales_customer <- df_receipt %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop") %>%
filter(sum_amount > 0)
df_tmp <- inner_join(df_customer, df_sales_customer, by = "customer_id")
split <- initial_split(df_tmp, prop = 0.8)
df_customer_train <- training(split)
df_customer_test <- testing(split)
# Confirmation
print(paste("Percentage of training data:", nrow(df_customer_train) / nrow(df_tmp)))
print(paste("Test Data Percentage:", nrow(df_customer_test) / nrow(df_tmp)))
[1] "Percentage of training data: 0.799903684083795"
[1] "Test Data Percentage: 0.200096315916205"
Commentary :
This code splits a dataset into training and testing sets using the rsample package.

set.seed(71) fixes the random seed so that the split is reproducible.

df_sales_customer is created by grouping df_receipt by customer_id, summing amount per customer, and keeping only the customers with a positive total, i.e. customers with sales records.

df_tmp is created by joining df_customer and df_sales_customer on customer_id.

split is created with the initial_split() function from the rsample package. The prop parameter is set to 0.8, which means the dataset is split into 80% training and 20% testing data.

df_customer_train and df_customer_test, the training and testing sets, are extracted from split with training() and testing().

Finally, print() statements confirm the proportions of training and testing data.
R-090: The receipt details data (df_receipt) covers 1 Jan 2017 to 31 Oct 2019. Aggregate the sales amount (amount) by month, and create three sets of data for building time-series models, each with 12 months of training data and 6 months of test data.
#Example using caret's createTimeSlices.
df_ts_amount <- df_receipt %>%
    group_by(substr(as.character(sales_ymd), 1, 6)) %>%
    summarise(sum_amount = sum(amount), .groups = "drop")
colnames(df_ts_amount) <- c("sales_ym", "sales_amount")
timeSlice <- createTimeSlices(df_ts_amount$sales_ym, initialWindow = 12, horizon = 6, skip = 5, fixedWindow = TRUE)
df_train_1 <- df_ts_amount[timeSlice$train[[1]],]
df_train_2 <- df_ts_amount[timeSlice$train[[2]],]
df_train_3 <- df_ts_amount[timeSlice$train[[3]],]
df_test_1 <- df_ts_amount[timeSlice$test[[1]],]
df_test_2 <- df_ts_amount[timeSlice$test[[2]],]
df_test_3 <- df_ts_amount[timeSlice$test[[3]],]
# df_train_2 and df_train_3 display omitted
df_train_1
sales_ym | sales_amount |
---|---|
<chr> | <int> |
201701 | 902056 |
201702 | 764413 |
201703 | 962945 |
201704 | 847566 |
201705 | 884010 |
201706 | 894242 |
201707 | 959205 |
201708 | 954836 |
201709 | 902037 |
201710 | 905739 |
201711 | 932157 |
201712 | 939654 |
# df_test_2 and df_test_3 display omitted
df_test_1
sales_ym | sales_amount |
---|---|
<chr> | <int> |
201801 | 944509 |
201802 | 864128 |
201803 | 946588 |
201804 | 937099 |
201805 | 1004438 |
201806 | 1012329 |
Commentary :
This code is an example of how to create time-based cross-validation folds using the createTimeSlices function from the caret package.

First, the sales amount is summarised by month for the whole dataset using the group_by and summarise functions, and the resulting data frame is renamed to have more meaningful column names.

The createTimeSlices function then splits the vector of time points (here, the sales months) into a set of training and testing indices for each fold. The initialWindow parameter sets the size of the training window (12 months), the horizon parameter sets the size of the testing window (6 months), the skip parameter sets how many candidate slices to skip between folds (skip = 5 starts a new fold every 6 months), and fixedWindow = TRUE keeps the training window at a fixed size rather than letting it grow.

Finally, the df_train and df_test data frames are built for each fold from the generated indices. These data frames can be used for time-based cross-validation of a model.
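The window arithmetic can be checked on a plain sequence: 34 months (January 2017 through October 2019), initialWindow = 12, horizon = 6, and skip = 5 yield exactly three folds starting 6 months apart. This base-R sketch reimplements the assumed fixed-window semantics of caret's createTimeSlices, so it runs without caret:

```r
# Fixed-window time slicing: each fold is `initial` training points
# followed by `horizon` test points; a new fold starts every (skip + 1) points.
slices <- function(n, initial, horizon, skip) {
    starts <- seq(1, n - initial - horizon + 1, by = skip + 1)
    lapply(starts, function(s)
        list(train = s:(s + initial - 1),
             test  = (s + initial):(s + initial + horizon - 1)))
}

s <- slices(34, initial = 12, horizon = 6, skip = 5)
length(s)      # 3 folds
s[[1]]$train   # months 1..12  (2017-01 .. 2017-12)
s[[1]]$test    # months 13..18 (2018-01 .. 2018-06)
```

Fold starts land at months 1, 7, and 13, so the third fold trains on months 13-24 and tests on 25-30, matching the three train/test pairs built in the answer.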
R-091: Under-sample the customer data (df_customer) so that customers with sales records and customers without sales records appear in a 1:1 ratio.
#Example using recipes package.
df_sales_amount <- df_receipt %>%
group_by(customer_id) %>%
summarise(sum_amount = sum(amount), .groups = "drop") %>%
right_join(df_customer, by = "customer_id") %>%
mutate(is_buy_flag = factor(ifelse(is.na(sum_amount), 0, 1)))
df_down_sampling <- df_sales_amount %>%
recipe() %>%
step_downsample(is_buy_flag, seed = 71) %>%
prep() %>% juice()
df_down_sampling %>%
group_by(is_buy_flag) %>%
summarise(cnt = n(), .groups = "drop")
is_buy_flag | cnt |
---|---|
<fct> | <int> |
0 | 8306 |
1 | 8306 |
Commentary :
The code performs downsampling using the `recipes` package (with `step_downsample()` from the `themis` package) in R.
First, the total sales amount for each customer is calculated and joined with the customer information. A binary factor called `is_buy_flag` is then created to indicate whether a customer has made a purchase, based on whether a sales amount is present.
Next, the `recipe()` function creates a recipe object. The `step_downsample()` function is then used to downsample the majority class (customers who did not make a purchase) to match the number of observations in the minority class (customers who made a purchase), using a random seed of 71.
The `prep()` function prepares the recipe, and the `juice()` function extracts the processed dataset. Finally, the result is verified by grouping the data by `is_buy_flag` and counting the number of observations in each group.
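For comparison, the same 1:1 under-sampling can be sketched with plain dplyr verbs instead of a recipe (this assumes dplyr >= 1.0 for `slice_sample()`; `df_down_sampling2` is an illustrative name):

```r
set.seed(71)

# size of the smaller class
n_min <- min(table(df_sales_amount$is_buy_flag))

df_down_sampling2 <- df_sales_amount %>%
  group_by(is_buy_flag) %>%
  slice_sample(n = n_min) %>%  # draw n_min rows from each class
  ungroup()
```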
R-092: Normalise to third normal form for gender in customer data (df_customer).
df_gender_std = unique(df_customer[c("gender_cd", "gender")])
df_customer_std = df_customer[, colnames(df_customer) != "gender"]
# Check data content
head(df_customer_std, n = 3)
customer_id | customer_name | gender_cd | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | |
---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <date> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | |
1 | CS021313000114 | 大野 あや子 | 1 | 1981-04-29 | 37 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20150905 | 0-00000000-0 |
2 | CS037613000071 | 六角 雅彦 | 9 | 1952-04-01 | 66 | 136-0076 | 東京都江東区南砂********** | S13037 | 20150414 | 0-00000000-0 |
3 | CS031415000172 | 宇多田 貴美子 | 1 | 1976-10-04 | 42 | 151-0053 | 東京都渋谷区代々木********** | S13031 | 20150529 | D-20100325-C |
# Check data content
head(df_gender_std, n = 3)
gender_cd | gender | |
---|---|---|
<chr> | <chr> | |
1 | 1 | 女性 |
2 | 9 | 不明 |
6 | 0 | 男性 |
Commentary :
The code above manipulates a data frame called `df_customer` that contains customer information. Here's what each line does:
`df_gender_std = unique(df_customer[c("gender_cd", "gender")])` - This line creates a new data frame called `df_gender_std`. It selects two columns from the `df_customer` data frame, "gender_cd" and "gender", and uses the `unique()` function to remove duplicate rows, producing a data frame of all unique combinations of gender codes and gender values in the original data.
`df_customer_std = df_customer[, colnames(df_customer) != "gender"]` - This line creates another new data frame called `df_customer_std`. It copies all columns from `df_customer` except the "gender" column, by selecting all columns whose names are not equal to "gender" using the `colnames()` function and the `!=` operator. The resulting data frame has the same rows as the original `df_customer` but without the "gender" column.
Together, these two lines split the original `df_customer` data frame into two parts: one containing only the unique gender codes and values, and one containing all other customer information except the gender column.
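The same split can also be written with dplyr verbs, which some readers may find clearer; a sketch equivalent to the base-R version above:

```r
# distinct() keeps unique (gender_cd, gender) pairs; select(-gender) drops the column
df_gender_std   <- df_customer %>% distinct(gender_cd, gender)
df_customer_std <- df_customer %>% select(-gender)
```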
R-093: Product data (df_product) only holds the code values for each category, but not the category names. Combine with category data (df_category) and denormalise to create new product data that holds category names.
df_product_full <- inner_join(df_product,
                              df_category[c("category_small_cd", "category_major_name", "category_medium_name", "category_small_name")],
                              by = "category_small_cd")
# Check data content
head(df_product_full, n = 3)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name | |
---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <int> | <int> | <chr> | <chr> | <chr> | |
1 | P040101001 | 04 | 0401 | 040101 | 198 | 149 | 惣菜 | 御飯類 | 弁当類 |
2 | P040101002 | 04 | 0401 | 040101 | 218 | 164 | 惣菜 | 御飯類 | 弁当類 |
3 | P040101003 | 04 | 0401 | 040101 | 230 | 173 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code above combines two data frames called `df_product` and `df_category`. Here's what each line does:
`df_product_full <- inner_join(df_product, df_category[c("category_small_cd", "category_major_name", "category_medium_name", "category_small_name")], by = "category_small_cd")` - This line creates a new data frame called `df_product_full`. It uses the `inner_join()` function to combine the `df_product` and `df_category` data frames on the common column "category_small_cd". The result contains all columns from both data frames for the rows where the "category_small_cd" values match. The `df_category` data frame is subsetted to only the columns "category_small_cd", "category_major_name", "category_medium_name", and "category_small_name", so the join adds only the category name columns.
`head(df_product_full, n = 3)` - This line displays the first three rows of `df_product_full` using the `head()` function. This is a useful way to quickly inspect the result, confirm the join succeeded, and check the format of the output.
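Note that `inner_join()` drops any product whose `category_small_cd` has no match in `df_category`. If unmatched products must be kept (with `NA` category names), a `left_join()` could be used instead; a sketch (`df_product_full_all` is an illustrative name):

```r
df_product_full_all <- df_product %>%
  left_join(select(df_category, category_small_cd, category_major_name,
                   category_medium_name, category_small_name),
            by = "category_small_cd")
```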
R-094: File out the product data with category names created in 093 with the following specifications.
|file format|with/without header|character encoding|
|:--:|:--:|:--:|
|CSV (comma-separated value)|With|UTF-8|
The path of the file output destination must be as follows.
|Output Destination|
|:--:|
|./data|
# The answer file is stored in a different location from the question file, so the path is '../data'.
write.csv(df_product_full, "../data/R_df_product_full_UTF-8_header.csv", row.names=FALSE, fileEncoding = "UTF-8")
Commentary :
The code above writes the data frame `df_product_full` to a CSV file with UTF-8 encoding. Here's what each argument in the `write.csv()` function does:
`df_product_full` - The data frame to write to a CSV file.
`"../data/R_df_product_full_UTF-8_header.csv"` - The file path and name of the output CSV file. The `..` at the beginning of the path means the file is saved in the parent directory of the current working directory; `R_df_product_full_UTF-8_header.csv` is the name of the output file.
`row.names = FALSE` - Do not include row names in the output CSV file.
`fileEncoding = "UTF-8"` - The encoding of the output CSV file. UTF-8 supports a wide range of characters from different languages, so it is a good choice for handling non-English characters.
In summary, `write.csv()` writes the `df_product_full` data frame to a UTF-8 encoded CSV file at the specified path, without row names.
R-095: File out the product data with category names created in 093 with the following specifications.
|file format|with/without header|character encoding|
|:--:|:--:|:--:|
|CSV (comma-separated value)|With|CP932|
The path of the file output destination should be as follows.
|Output destination|
|:--:|
|./data|
# The answer file is stored in a different location from the question file, so the path is '../data'.
write.csv(df_product_full, "../data/R_df_product_full_CP932_header.csv", row.names = FALSE, fileEncoding = "CP932")
Commentary :
The code above writes the data frame `df_product_full` to a CSV file with CP932 encoding. Here's what each argument in the `write.csv()` function does:
`df_product_full` - The data frame to write to a CSV file.
`"../data/R_df_product_full_CP932_header.csv"` - The file path and name of the output CSV file. The `..` at the beginning of the path means the file is saved in the parent directory of the current working directory; `R_df_product_full_CP932_header.csv` is the name of the output file.
`row.names = FALSE` - Do not include row names in the output CSV file.
`fileEncoding = "CP932"` - The encoding of the output CSV file. CP932 is a character encoding used for Japanese text, supporting the wide range of characters used in the Japanese language.
In summary, `write.csv()` writes the `df_product_full` data frame to a CP932-encoded CSV file at the specified path, without row names. This encoding is useful for handling Japanese text data.
R-096: File out the product data with category names created in 093 with the following specifications.
|file format|with/without header|character encoding|
|:--:|:--:|:--:|
|CSV (comma-separated value)|Without|UTF-8|
The path of the file output destination must be as follows.
|Output Destination|
|:--:|
|./data|
# The answer file is stored in a different location from the question file, so the path is '../data'.
write.table(df_product_full, "../data/R_df_product_full_UTF-8_noh.csv", row.names = FALSE, col.names = FALSE, sep = ",", fileEncoding = "UTF-8")
Commentary :
The code above writes the data frame `df_product_full` to a CSV file with UTF-8 encoding and no header row. Here's what each argument in the `write.table()` function does:
`df_product_full` - The data frame to write.
`"../data/R_df_product_full_UTF-8_noh.csv"` - The file path and name of the output file. The `..` at the beginning of the path means the file is saved in the parent directory of the current working directory; `R_df_product_full_UTF-8_noh.csv` is the name of the output file.
`row.names = FALSE` - Do not include row names in the output file.
`col.names = FALSE` - Do not include column names (the header row) in the output file.
`sep = ","` - Use a comma as the delimiter between columns, producing CSV output.
`fileEncoding = "UTF-8"` - The encoding of the output file. UTF-8 supports a wide range of characters from different languages, so it is a good choice for handling non-English characters.
In summary, `write.table()` writes the `df_product_full` data frame as a comma-separated, UTF-8 encoded file with neither row names nor a header row. (`write.table()` is used instead of `write.csv()` here because `write.csv()` ignores attempts to change `col.names`.)
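An alternative sketch using the readr package (this assumes readr is installed; it is not loaded in the setup cell above). `write_csv()` writes UTF-8 by default and never writes row names:

```r
readr::write_csv(df_product_full,
                 "../data/R_df_product_full_UTF-8_noh.csv",
                 col_names = FALSE)  # suppress the header row
```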
R-097: Load the file in the following format created in 094, display three data items and check that they have been imported correctly.
|File format|Header|Character encoding|
|:--:|:--:|:--:|
|CSV (comma separated)|With|UTF-8|
# The answer file is stored in a different location from the question file, so the path is '../data'.
c_class <- c(NA, "character", "character", "character", NA, NA, NA, NA, NA)
df_product_full <- read.csv("../data/R_df_product_full_UTF-8_header.csv", colClasses = c_class, fileEncoding = "UTF-8")
head(df_product_full, 3)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name | |
---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <int> | <int> | <chr> | <chr> | <chr> | |
1 | P040101001 | 04 | 0401 | 040101 | 198 | 149 | 惣菜 | 御飯類 | 弁当類 |
2 | P040101002 | 04 | 0401 | 040101 | 218 | 164 | 惣菜 | 御飯類 | 弁当類 |
3 | P040101003 | 04 | 0401 | 040101 | 230 | 173 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code above reads a CSV file called `R_df_product_full_UTF-8_header.csv` into a data frame called `df_product_full`. Here's what each argument in the `read.csv()` function does:
`"../data/R_df_product_full_UTF-8_header.csv"` - The file path and name of the input CSV file. The `..` at the beginning of the path means the file is located in the parent directory of the current working directory.
`colClasses = c_class` - Specifies the class of each column in the data frame. `c_class` is a vector with one entry per column: the first and the fifth to ninth entries are `NA`, meaning those classes are determined automatically, and the second to fourth entries are `"character"`, so the category code columns are read as character (preserving leading zeros such as "04").
`fileEncoding = "UTF-8"` - The encoding of the input CSV file. UTF-8 supports a wide range of characters from different languages.
`head(df_product_full, 3)` - Prints the first three rows of the `df_product_full` data frame to the console.
In summary, `read.csv()` reads the `R_df_product_full_UTF-8_header.csv` file into a data frame called `df_product_full` using the specified column classes and file encoding, and `head()` is used to confirm the import.
R-098: Load the file in the following format created in 096, display three data items and check that they have been imported correctly.
|File format|Header|Character encoding|
|:--:|:--:|:--:|
|CSV (comma delimited)|Without header|UTF-8|
# Code example 1 (item name to be added later)
# The answer file is stored in a different location from the question file, so the path is '../data'.
c_class <- c(NA, "character", "character", "character", NA, NA, NA, NA, NA)
df_product_full <- read.csv("../data/R_df_product_full_UTF-8_noh.csv", colClasses = c_class, header = FALSE, fileEncoding = "UTF-8")
colnames(df_product_full) <- c("product_cd", "category_major_cd", "category_medium_cd", "category_small_cd", "unit_price", "unit_cost", "category_major_name", "category_medium_name", "category_small_name")
head(df_product_full, 3)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name | |
---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <int> | <int> | <chr> | <chr> | <chr> | |
1 | P040101001 | 04 | 0401 | 040101 | 198 | 149 | 惣菜 | 御飯類 | 弁当類 |
2 | P040101002 | 04 | 0401 | 040101 | 218 | 164 | 惣菜 | 御飯類 | 弁当類 |
3 | P040101003 | 04 | 0401 | 040101 | 230 | 173 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code above reads a CSV file called `R_df_product_full_UTF-8_noh.csv` into a data frame called `df_product_full`, then assigns column names to the data frame. Here's what each argument in the `read.csv()` function does:
`"../data/R_df_product_full_UTF-8_noh.csv"` - The file path and name of the input CSV file. The `..` at the beginning of the path means the file is located in the parent directory of the current working directory.
`colClasses = c_class` - Specifies the class of each column in the data frame. The first and the fifth to ninth entries of `c_class` are `NA`, so those classes are determined automatically; the second to fourth entries are `"character"`, so the category code columns are read as character.
`header = FALSE` - The input CSV file does not have a header row.
`fileEncoding = "UTF-8"` - The encoding of the input CSV file. UTF-8 supports a wide range of characters from different languages.
`colnames(df_product_full) <- c("product_cd", "category_major_cd", "category_medium_cd", "category_small_cd", "unit_price", "unit_cost", "category_major_name", "category_medium_name", "category_small_name")` - Assigns column names to the data frame, in that order.
`head(df_product_full, 3)` - Prints the first three rows of the `df_product_full` data frame to the console.
In summary, `read.csv()` reads the header-less `R_df_product_full_UTF-8_noh.csv` file into a data frame called `df_product_full` using the specified column classes, header setting, and file encoding; the data frame is then given column names and inspected with `head()`.
# Code example 2 (define item name first)
# The answer file is stored in a different location from the question file, so the path is '../data'.
c_names <- c("product_cd", "category_major_cd", "category_medium_cd", "category_small_cd", "unit_price", "unit_cost", "category_major_name", "category_medium_name", "category_small_name")
c_class <- c(NA, "character", "character", "character", NA, NA, NA, NA, NA)
df_product_full <- read.csv("../data/R_df_product_full_UTF-8_noh.csv", col.names = c_names, colClasses = c_class, header = FALSE, fileEncoding = "UTF-8")
head(df_product_full, 3)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name | |
---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <int> | <int> | <chr> | <chr> | <chr> | |
1 | P040101001 | 04 | 0401 | 040101 | 198 | 149 | 惣菜 | 御飯類 | 弁当類 |
2 | P040101002 | 04 | 0401 | 040101 | 218 | 164 | 惣菜 | 御飯類 | 弁当類 |
3 | P040101003 | 04 | 0401 | 040101 | 230 | 173 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code above reads the same header-less CSV file, `R_df_product_full_UTF-8_noh.csv`, into a data frame called `df_product_full`, but supplies the column names directly to `read.csv()`. Here's what each argument does:
`"../data/R_df_product_full_UTF-8_noh.csv"` - The file path and name of the input CSV file. The `..` at the beginning of the path means the file is located in the parent directory of the current working directory.
`col.names = c_names` - The column names to use in the data frame. `c_names` is a character vector listing the names in the same order as the columns appear in the input file.
`colClasses = c_class` - Specifies the class of each column in the data frame. The second to fourth entries are `"character"`, so the category code columns are read as character; all other entries are `NA`, which means those columns' classes are determined automatically (an `NA` entry does not mean the values are treated as missing).
`header = FALSE` - The input CSV file does not have a header row.
`fileEncoding = "UTF-8"` - The encoding of the input CSV file. UTF-8 supports a wide range of characters from different languages.
`head(df_product_full, 3)` - Prints the first three rows of the `df_product_full` data frame to the console.
In summary, `read.csv()` reads the `R_df_product_full_UTF-8_noh.csv` file into a data frame called `df_product_full` using the specified column names, column classes, header setting, and file encoding, and the result is inspected with `head()`.
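The readr package offers a similar interface; a sketch (assuming readr is installed), using `col_names` and `col_types` in place of `col.names` and `colClasses`:

```r
df_product_full <- readr::read_csv(
  "../data/R_df_product_full_UTF-8_noh.csv",
  col_names = c_names,  # the file has no header row, so supply the names
  col_types = readr::cols(
    category_major_cd  = "c",  # keep the code columns as character
    category_medium_cd = "c",
    category_small_cd  = "c",
    .default = readr::col_guess()))  # guess the remaining column types
```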
R-099: File out the product data with category names created in 093 with the following specifications.
|file format|with/without header|character encoding|
|:--:|:--:|:--:|
|TSV (Tab Separated Values)|With|UTF-8|
The path of the file output destination should be as follows.
|Output Destination|
|:--:|
|./data|
# The answer file is stored in a different location from the question file, so the path is '../data'.
write.table(df_product_full, "../data/R_df_product_full_UTF-8_header.tsv", row.names = FALSE, sep = "\t", fileEncoding = "UTF-8")
Commentary :
The code above writes the data frame `df_product_full` to a tab-separated values (TSV) file called `R_df_product_full_UTF-8_header.tsv`. Here's what each argument in the `write.table()` function does:
`df_product_full` - The data frame to write to the TSV file.
`"../data/R_df_product_full_UTF-8_header.tsv"` - The file path and name of the output TSV file. The `..` at the beginning of the path means the file is saved in the parent directory of the current working directory.
`row.names = FALSE` - Do not include row names in the output file.
`sep = "\t"` - Use a tab character as the delimiter between columns.
`fileEncoding = "UTF-8"` - The encoding of the output TSV file. UTF-8 supports a wide range of characters from different languages.
In summary, `write.table()` writes the `df_product_full` data frame to a TSV file with a header row, tab delimiters, and UTF-8 encoding.
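An equivalent one-liner with readr can be sketched as follows (assuming readr is installed); `write_tsv()` writes a tab-separated, UTF-8 file with a header and no row names by default:

```r
readr::write_tsv(df_product_full, "../data/R_df_product_full_UTF-8_header.tsv")
```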
R-100: Load the file in the following format created in 099, display three data items and check that they have been imported correctly.
|File format|Header|Character encoding|
|:--:|:--:|:--:|
|TSV (tab-separated)|With|UTF-8|
# The answer file is stored in a different location from the question file, so the path is '../data'.
c_class <- c(NA, "character", "character", "character", NA, NA, NA, NA, NA)
df_product_tmp <- read.table("../data/R_df_product_full_UTF-8_header.tsv", colClasses = c_class, header = TRUE, fileEncoding = "UTF-8")
head(df_product_tmp,3)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name | |
---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <int> | <int> | <chr> | <chr> | <chr> | |
1 | P040101001 | 04 | 0401 | 040101 | 198 | 149 | 惣菜 | 御飯類 | 弁当類 |
2 | P040101002 | 04 | 0401 | 040101 | 218 | 164 | 惣菜 | 御飯類 | 弁当類 |
3 | P040101003 | 04 | 0401 | 040101 | 230 | 173 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code above reads the tab-separated values (TSV) file `R_df_product_full_UTF-8_header.tsv` into a data frame called `df_product_tmp`. Here's what each argument in the `read.table()` function does:
`"../data/R_df_product_full_UTF-8_header.tsv"` - The file path and name of the input TSV file.
`colClasses = c_class` - Specifies the class of each column. The second to fourth entries of `c_class` are `"character"`, so the category code columns are read as character; the remaining entries are `NA`, so those columns' classes are determined automatically.
`header = TRUE` - The first row of the input file contains the column names.
`fileEncoding = "UTF-8"` - The character encoding of the input TSV file.
After reading the file, the `head()` function displays the first three rows of the resulting `df_product_tmp` data frame.
Overall, this code is similar to the earlier code that reads a CSV file into `df_product_full`, except that it reads a TSV file and stores the contents in a temporary data frame called `df_product_tmp`. Using a temporary name allows some data cleaning or manipulation to be performed before storing the final result in a data frame with a different name.
This is the end of the 100. Well done!