参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック（構造化データ加工編）」

The Data Scientist Society Github :

GitHub - The-Japan-DataScientist-Society/100knocks-preprocess: データサイエンス100本ノック（構造化データ加工編）

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Data Science 100 Knocks (Structured Data Processing) URL :

100knocks-preprocess/docker/work/answer/ans_preprocess_knock_R.ipynb at master · The-Japan-DataScientist-Society/100knocks-preprocess

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers.
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

You can get preprocess_knock files to practice python, SQL, R from my github account

ChanhiYasutomi - Repositories

ChanhiYasutomi has 20 repositories available. Follow their code on GitHub.

Introduction

To begin with, run the following cells
Import the required libraries and read the data from the database (PostgreSQL)
The libraries that are expected to be used are imported in the following cells
If there are other libraries you wish to use, install them as appropriate using install.packages().
Names, addresses, etc. are dummy data and are not real.

In [ ]:

require("RPostgreSQL") 
require("tidyr") 
require("dplyr") 
require("stringr") 
require("caret") 
require("lubridate") 
require("rsample") 
require("recipes") 
require("themis") 

host <- "db" 
port <- Sys.getenv()["PG_PORT"] 
dbname <- Sys.getenv()["PG_DATABASE"] 
user <- Sys.getenv()["PG_USER"] 
password <- Sys.getenv()["PG_PASSWORD"] 

con <- dbConnect(PostgreSQL(), host=host, port=port, dbname=dbname, user=user, password=password) 
df_customer <- dbGetQuery(con,"SELECT * FROM customer") 
df_category <- dbGetQuery(con,"SELECT * FROM category") 
df_product <- dbGetQuery(con,"SELECT * FROM product") 
df_receipt <- dbGetQuery(con,"SELECT * FROM receipt") 
df_store <- dbGetQuery(con,"SELECT * FROM store") 
df_geocode <- dbGetQuery(con,"SELECT * FROM geocode")

Loading required package: RPostgreSQL

Loading required package: DBI

Loading required package: tidyr

Loading required package: dplyr


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: stringr

Loading required package: caret

Loading required package: ggplot2

Loading required package: lattice

Loading required package: lubridate


Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union


Loading required package: rsample

Loading required package: recipes


Attaching package: ‘recipes’


The following object is masked from ‘package:stringr’:

    fixed


The following object is masked from ‘package:stats’:

    step


Loading required package: themis

Registered S3 methods overwritten by 'themis':
  method                  from   
  bake.step_downsample    recipes
  bake.step_upsample      recipes
  prep.step_downsample    recipes
  prep.step_upsample      recipes
  tidy.step_downsample    recipes
  tidy.step_upsample      recipes
  tunable.step_downsample recipes
  tunable.step_upsample   recipes


Attaching package: ‘themis’


The following objects are masked from ‘package:recipes’:

    step_downsample, step_upsample

Practice Questions

R-001: Display the first 10 items of all items from the receipt details data (df_receipt) and visually check what data is held.

In [ ]:

head(df_receipt, n = 10)

A data.frame: 10 × 9

	sales_ymd	sales_epoch	store_cd	receipt_no	receipt_sub_no	customer_id	product_cd	quantity	amount
	<int>	<int>	<chr>	<int>	<int>	<chr>	<chr>	<int>	<int>
1	20181103	1541203200	S14006	112	1	CS006214000001	P070305012	1	158
2	20181118	1542499200	S13008	1132	2	CS008415000097	P070701017	1	81
3	20170712	1499817600	S14028	1102	1	CS028414000014	P060101005	1	170
4	20190205	1549324800	S14042	1132	1	ZZ000000000000	P050301001	1	25
5	20180821	1534809600	S14025	1102	2	CS025415000050	P060102007	1	90
6	20190605	1559692800	S13003	1112	1	CS003515000195	P050102002	1	138
7	20181205	1543968000	S14024	1102	2	CS024514000042	P080101005	1	30
8	20190922	1569110400	S14040	1102	1	CS040415000178	P070501004	1	128
9	20170504	1493856000	S13020	1112	2	ZZ000000000000	P071302010	1	770
10	20191010	1570665600	S14027	1102	1	CS027514000015	P071101003	1	680

Commentary :

The code head(df_receipt, n = 10) is used to display the first 10 rows of a data frame called df_receipt.

In R, a data frame is a table-like structure where each column can have a different data type, such as numeric, character, or factor. It is often used to store and manipulate data in a structured format.

The head() function in R is used to display the first few rows of a data frame. By default, it displays the first 6 rows of the data frame. However, the n argument can be used to specify the number of rows to display. In this case, n = 10 is used to display the first 10 rows of the df_receipt data frame.

Therefore, when this code is executed, the output will show the first 10 rows of the df_receipt data frame.

R-002: Display 10 items from the receipt details data (df_receipt), specifying the columns in the order of sales_date (sales_ymd), customer_id (customer_id), product_code (product_cd) and sales amount (amount).

In [ ]:

head(df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")], n = 10)

A data.frame: 10 × 4

	sales_ymd	customer_id	product_cd	amount
	<int>	<chr>	<chr>	<int>
1	20181103	CS006214000001	P070305012	158
2	20181118	CS008415000097	P070701017	81
3	20170712	CS028414000014	P060101005	170
4	20190205	ZZ000000000000	P050301001	25
5	20180821	CS025415000050	P060102007	90
6	20190605	CS003515000195	P050102002	138
7	20181205	CS024514000042	P080101005	30
8	20190922	CS040415000178	P070501004	128
9	20170504	ZZ000000000000	P071302010	770
10	20191010	CS027514000015	P071101003	680

Commentary :

The code head(df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")], n = 10) is used to display the first 10 rows of a data frame called df_receipt, but only showing the columns "sales_ymd", "customer_id", "product_cd", and "amount".

In R, a data frame is a table-like structure where each column can have a different data type, such as numeric, character, or factor. It is often used to store and manipulate data in a structured format.

The head() function in R is used to display the first few rows of a data frame. By default, it displays the first 6 rows of the data frame. However, the n argument can be used to specify the number of rows to display.

In this case, df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")] is used to subset the df_receipt data frame and select only the columns "sales_ymd", "customer_id", "product_cd", and "amount". The c() function is used to concatenate the column names into a vector.

Therefore, when this code is executed, the output will show the first 10 rows of the df_receipt data frame, but only showing the columns "sales_ymd", "customer_id", "product_cd", and "amount". This can be useful when working with large data frames and only needing to focus on a subset of the columns.

R-003: From the receipt details data (df_receipt), specify the columns in order of sales_date (sales_ymd), customer_id (customer_id), product_code (product_cd) and sales_amount (amount) and display 10 items. However, extract while renaming the items from sales_ymd to sales_date.

In [ ]:

head(rename(df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")], sales_date = sales_ymd), n = 10)

A data.frame: 10 × 4

	sales_date	customer_id	product_cd	amount
	<int>	<chr>	<chr>	<int>
1	20181103	CS006214000001	P070305012	158
2	20181118	CS008415000097	P070701017	81
3	20170712	CS028414000014	P060101005	170
4	20190205	ZZ000000000000	P050301001	25
5	20180821	CS025415000050	P060102007	90
6	20190605	CS003515000195	P050102002	138
7	20181205	CS024514000042	P080101005	30
8	20190922	CS040415000178	P070501004	128
9	20170504	ZZ000000000000	P071302010	770
10	20191010	CS027514000015	P071101003	680

Commentary :

The code head(rename(df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")], sales_date = sales_ymd), n = 10) is used to display the first 10 rows of a data frame called df_receipt, but only showing the columns "sales_ymd", "customer_id", "product_cd", and "amount", with the column "sales_ymd" renamed to "sales_date".

In R, a data frame is a table-like structure where each column can have a different data type, such as numeric, character, or factor. It is often used to store and manipulate data in a structured format.

The head() function in R is used to display the first few rows of a data frame. By default, it displays the first 6 rows of the data frame. However, the n argument can be used to specify the number of rows to display.

The rename() function in R is used to rename columns in a data frame. In this case, it is used to rename the column "sales_ymd" to "sales_date".

In this code, df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")] is used to subset the df_receipt data frame and select only the columns "sales_ymd", "customer_id", "product_cd", and "amount". The c() function is used to concatenate the column names into a vector.

Then, the rename() function is applied to the subsetted data frame using the syntax rename(data_frame, new_column_name = old_column_name). The result is a data frame with the same rows and columns as df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")], but with the column "sales_ymd" renamed to "sales_date".

Finally, the head() function is applied to the renamed data frame to show the first 10 rows.

Therefore, when this code is executed, the output will show the first 10 rows of the df_receipt data frame, but only showing the columns "sales_date", "customer_id", "product_cd", and "amount", with the column "sales_ymd" renamed to "sales_date".

R-004: From receipt details data (df_receipt), specify columns in the order of sales_date (sales_ymd), customer ID (customer_id), product code (product_cd) and sales amount (amount) and extract data satisfying the following conditions.

Customer ID (customer_id) is “CS018205000001”

In [ ]:

df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")] %>% 
filter(customer_id == "CS018205000001")

A data.frame: 12 × 4

sales_ymd	customer_id	product_cd	amount
<int>	<chr>	<chr>	<int>
20180911	CS018205000001	P071401012	2200
20180414	CS018205000001	P060104007	600
20170614	CS018205000001	P050206001	990
20170614	CS018205000001	P060702015	108
20190216	CS018205000001	P071005024	102
20180414	CS018205000001	P071101002	278
20190226	CS018205000001	P070902035	168
20190924	CS018205000001	P060805001	495
20190226	CS018205000001	P071401020	2200
20180911	CS018205000001	P071401005	1100
20190216	CS018205000001	P040101002	218
20190924	CS018205000001	P091503001	280

Commentary :

The code df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")] %>% filter(customer_id == "CS018205000001") is used to subset the df_receipt data frame and select only the columns "sales_ymd", "customer_id", "product_cd", and "amount", and then filter the resulting data frame to only include rows where the "customer_id" column is equal to "CS018205000001".

In R, a data frame is a table-like structure where each column can have a different data type, such as numeric, character, or factor. It is often used to store and manipulate data in a structured format.

The %>% operator in R is called the pipe operator and is used to chain together multiple functions into a single expression. In this case, it is used to pass the output of df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")] into the filter() function.

The filter() function in R is used to select rows from a data frame that meet certain conditions. In this case, it is used to filter the data frame to only include rows where the "customer_id" column is equal to "CS018205000001".

Therefore, when this code is executed, the output will be a data frame that contains only the columns "sales_ymd", "customer_id", "product_cd", and "amount", and only includes rows where the "customer_id" column is equal to "CS018205000001". This can be useful when analyzing a large data frame and only needing to focus on a subset of the rows that meet certain criteria.

R-005: Extract data from receipt details data (df_receipt), specifying columns in the order of sales_date (sales_ymd), customer ID (customer_id), product code (product_cd) and sales amount (amount) and satisfying all the following conditions.

Customer ID (customer_id) is “CS018205000001”.

Sales amount (amount) is 1,000 or more.

In [ ]:

df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")] %>% 
filter(customer_id == "CS018205000001" & amount >= 1000)

A data.frame: 3 × 4

sales_ymd	customer_id	product_cd	amount
<int>	<chr>	<chr>	<int>
20180911	CS018205000001	P071401012	2200
20190226	CS018205000001	P071401020	2200
20180911	CS018205000001	P071401005	1100

Commentary :

The code df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")] %>% filter(customer_id == "CS018205000001" & amount >= 1000) is used to subset the df_receipt data frame and select only the columns "sales_ymd", "customer_id", "product_cd", and "amount", and then filter the resulting data frame to only include rows where the "customer_id" column is equal to "CS018205000001" AND the "amount" column is greater than or equal to 1000.

In R, a data frame is a table-like structure where each column can have a different data type, such as numeric, character, or factor. It is often used to store and manipulate data in a structured format.

The %>% operator in R is called the pipe operator and is used to chain together multiple functions into a single expression. In this case, it is used to pass the output of df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")] into the filter() function.

The filter() function in R is used to select rows from a data frame that meet certain conditions. In this case, it is used to filter the data frame to only include rows where the "customer_id" column is equal to "CS018205000001" AND the "amount" column is greater than or equal to 1000. The & operator is used to combine the two conditions.

Therefore, when this code is executed, the output will be a data frame that contains only the columns "sales_ymd", "customer_id", "product_cd", and "amount", and only includes rows where the "customer_id" column is equal to "CS018205000001" AND the "amount" column is greater than or equal to 1000. This can be useful when analyzing a large data frame and only needing to focus on a subset of the rows that meet certain criteria.

R-006: Extract data from receipt details data (df_receipt), specifying the columns in the order of sales_date (sales_ymd), customer ID (customer_id), product code (product_cd), sales quantity (quantity) and sales amount (amount) and satisfying all the following conditions.

Customer ID (customer_id) is “CS018205000001”.

Sales amount (amount) is 1,000 or more or sales quantity (quantity) is 5 or more.

In [ ]:

df_receipt[c("sales_ymd", "customer_id", "product_cd", "quantity", "amount")] %>% filter(customer_id == "CS018205000001" & (amount >= 1000 | quantity >= 5))

A data.frame: 5 × 5

sales_ymd	customer_id	product_cd	quantity	amount
<int>	<chr>	<chr>	<int>	<int>
20180911	CS018205000001	P071401012	1	2200
20180414	CS018205000001	P060104007	6	600
20170614	CS018205000001	P050206001	5	990
20190226	CS018205000001	P071401020	1	2200
20180911	CS018205000001	P071401005	1	1100

Commentary :

The code df_receipt[c("sales_ymd", "customer_id", "product_cd", "quantity", "amount")] %>% filter(customer_id == "CS018205000001" & (amount >= 1000 | quantity >= 5)) is used to subset the df_receipt data frame and select only the columns "sales_ymd", "customer_id", "product_cd", "quantity", and "amount", and then filter the resulting data frame to only include rows where the "customer_id" column is equal to "CS018205000001" AND either the "amount" column is greater than or equal to 1000 OR the "quantity" column is greater than or equal to 5.

In R, a data frame is a table-like structure where each column can have a different data type, such as numeric, character, or factor. It is often used to store and manipulate data in a structured format.

The %>% operator in R is called the pipe operator and is used to chain together multiple functions into a single expression. In this case, it is used to pass the output of df_receipt[c("sales_ymd", "customer_id", "product_cd", "quantity", "amount")] into the filter() function.

The filter() function in R is used to select rows from a data frame that meet certain conditions. In this case, it is used to filter the data frame to only include rows where the "customer_id" column is equal to "CS018205000001" AND either the "amount" column is greater than or equal to 1000 OR the "quantity" column is greater than or equal to 5. The & operator is used to combine the two conditions, and the | operator is used to specify an OR condition.

Therefore, when this code is executed, the output will be a data frame that contains only the columns "sales_ymd", "customer_id", "product_cd", "quantity", and "amount", and only includes rows where the "customer_id" column is equal to "CS018205000001" AND either the "amount" column is greater than or equal to 1000 OR the "quantity" column is greater than or equal to 5. This can be useful when analyzing a large data frame and only needing to focus on a subset of the rows that meet certain criteria.

R-007: From the receipt details data (df_receipt), specify the columns in order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd) and sales amount (amount) and extract data that satisfy all the following conditions.

Customer ID (customer_id) is “CS018205000001”.

Sales amount (amount) is between 1,000 and 2,000.

In [ ]:

df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")] %>% 
filter(customer_id == "CS018205000001" & between(amount, 1000, 2000))

A data.frame: 1 × 4

sales_ymd	customer_id	product_cd	amount
<int>	<chr>	<chr>	<int>
20180911	CS018205000001	P071401005	1100

Commentary :

This code is using the dplyr package in R to manipulate a data frame called df_receipt. Here's a breakdown of the code:

df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")]: This is subsetting the df_receipt data frame to only include the columns "sales_ymd", "customer_id", "product_cd", and "amount".

%>%: This is the pipe operator in dplyr, which allows you to chain multiple data manipulation functions together.

filter(customer_id == "CS018205000001" & between(amount, 1000, 2000)): This is filtering the data frame to only include rows where the customer ID is "CS018205000001" and the amount is between 1000 and 2000.

So, putting it all together, the code is subsetting the df_receipt data frame to only include the columns "sales_ymd", "customer_id", "product_cd", and "amount", and then filtering the resulting data frame to only include rows where the customer ID is "CS018205000001" and the amount is between 1000 and 2000.

R-008: From the receipt details data (df_receipt), specify the columns in order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd) and sales amount (amount), and extract data satisfying all the following conditions.

Customer ID (customer_id) is “CS018205000001”.

Product code (product_cd) is other than “P071401019”.

In [ ]:

df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")] %>% 
filter(customer_id == "CS018205000001" & product_cd != "P071401019")

A data.frame: 12 × 4

sales_ymd	customer_id	product_cd	amount
<int>	<chr>	<chr>	<int>
20180911	CS018205000001	P071401012	2200
20180414	CS018205000001	P060104007	600
20170614	CS018205000001	P050206001	990
20170614	CS018205000001	P060702015	108
20190216	CS018205000001	P071005024	102
20180414	CS018205000001	P071101002	278
20190226	CS018205000001	P070902035	168
20190924	CS018205000001	P060805001	495
20190226	CS018205000001	P071401020	2200
20180911	CS018205000001	P071401005	1100
20190216	CS018205000001	P040101002	218
20190924	CS018205000001	P091503001	280

Commentary :

This code is using the dplyr package in R to manipulate a data frame called df_receipt. Here's a breakdown of the code:

df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")]: This is subsetting the df_receipt data frame to only include the columns "sales_ymd", "customer_id", "product_cd", and "amount".

%>%: This is the pipe operator in dplyr, which allows you to chain multiple data manipulation functions together.

filter(customer_id == "CS018205000001" & product_cd != "P071401019"): This is filtering the data frame to only include rows where the customer ID is "CS018205000001" and the product code is not equal to "P071401019".

So, putting it all together, the code is subsetting the df_receipt data frame to only include the columns "sales_ymd", "customer_id", "product_cd", and "amount", and then filtering the resulting data frame to only include rows where the customer ID is "CS018205000001" and the product code is not equal to "P071401019".

R-009: In the following process, rewrite OR to AND without changing the output result.

df_store %>%filter(!(prefecture_cd == "13" | floor_area > 900))

In [ ]:

df_store %>% 
filter(prefecture_cd != "13" & floor_area <= 900)

A data.frame: 3 × 10

store_cd	store_name	prefecture_cd	prefecture	address	address_kana	tel_no	longitude	latitude	floor_area
<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>
S14046	北山田店	14	神奈川県	神奈川県横浜市都筑区北山田一丁目	カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ	045-123-4049	139.5916	35.56189	831
S14011	日吉本町店	14	神奈川県	神奈川県横浜市港北区日吉本町四丁目	カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ	045-123-4033	139.6316	35.54655	890
S12013	習志野店	12	千葉県	千葉県習志野市芝園一丁目	チバケンナラシノシシバゾノイッチョウメ	047-123-4002	140.0220	35.66122	808

Commentary :

This code is using the dplyr package in R to manipulate a data frame called df_store. Here's a breakdown of the code:

%>%: This is the pipe operator in dplyr, which allows you to chain multiple data manipulation functions together.

filter(prefecture_cd != "13" & floor_area <= 900): This is filtering the data frame to only include rows where the prefecture code is not equal to "13" and the floor area is less than or equal to 900.

So, putting it all together, the code is filtering the df_store data frame to only include rows where the prefecture code is not equal to "13" and the floor area is less than or equal to 900.

R-010: From the shop data (df_store), extract all items whose shop code (store_cd) starts with “S14” and display 10 items.

In [ ]:

head(df_store %>% 
filter(startsWith(store_cd, "S14")), n = 10)

A data.frame: 10 × 10

	store_cd	store_name	prefecture_cd	prefecture	address	address_kana	tel_no	longitude	latitude	floor_area
	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>
1	S14010	菊名店	14	神奈川県	神奈川県横浜市港北区菊名一丁目	カナガワケンヨコハマシコウホククキクナイッチョウメ	045-123-4032	139.6326	35.50049	1732
2	S14033	阿久和店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4043	139.4961	35.45918	1495
3	S14036	相模原中央店	14	神奈川県	神奈川県相模原市中央二丁目	カナガワケンサガミハラシチュウオウニチョウメ	042-123-4045	139.3716	35.57327	1679
4	S14040	長津田店	14	神奈川県	神奈川県横浜市緑区長津田みなみ台五丁目	カナガワケンヨコハマシミドリクナガツタミナミダイゴチョウメ	045-123-4046	139.4994	35.52398	1548
5	S14050	阿久和西店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4053	139.4961	35.45918	1830
6	S14028	二ツ橋店	14	神奈川県	神奈川県横浜市瀬谷区二ツ橋町	カナガワケンヨコハマシセヤクフタツバシチョウ	045-123-4042	139.4963	35.46304	1574
7	S14012	本牧和田店	14	神奈川県	神奈川県横浜市中区本牧和田	カナガワケンヨコハマシナカクホンモクワダ	045-123-4034	139.6582	35.42156	1341
8	S14046	北山田店	14	神奈川県	神奈川県横浜市都筑区北山田一丁目	カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ	045-123-4049	139.5916	35.56189	831
9	S14022	逗子店	14	神奈川県	神奈川県逗子市逗子一丁目	カナガワケンズシシズシイッチョウメ	046-123-4036	139.5789	35.29642	1838
10	S14011	日吉本町店	14	神奈川県	神奈川県横浜市港北区日吉本町四丁目	カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ	045-123-4033	139.6316	35.54655	890

Commentary :

This code is using the dplyr package in R to manipulate a data frame called df_store. Here's a breakdown of the code:

df_store: This is the data frame being used.

%>%: This is the pipe operator in dplyr, which allows you to chain multiple data manipulation functions together.

filter(startsWith(store_cd, "S14")): This is filtering the data frame to only include rows where the store_cd column starts with "S14".

head(., n = 10): This is selecting the first 10 rows of the resulting data frame using the head() function.

So, putting it all together, the code is filtering the df_store data frame to only include rows where the store_cd column starts with "S14", and then selecting the first 10 rows of the resulting data frame using the head() function.

R-011: From the customer data (df_customer), extract all items whose customer ID (customer_id) ends with 1, and display 10 items.

In [ ]:

head(df_customer %>% 
filter(endsWith(customer_id, "1")), n = 10)

A data.frame: 10 × 11

	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	<chr>	<chr>	<chr>	<chr>	<date>	<int>	<chr>	<chr>	<chr>	<chr>	<chr>
1	CS037613000071	六角雅彦	9	不明	1952-04-01	66	136-0076	東京都江東区南砂**********	S13037	20150414	0-00000000-0
2	CS028811000001	堀井かおり	1	女性	1933-03-27	86	245-0016	神奈川県横浜市泉区和泉町**********	S14028	20160115	0-00000000-0
3	CS040412000191	川井郁恵	1	女性	1977-01-05	42	226-0021	神奈川県横浜市緑区北八朔町**********	S14040	20151101	1-20091025-4
4	CS028314000011	小菅あおい	1	女性	1983-11-26	35	246-0038	神奈川県横浜市瀬谷区宮沢**********	S14028	20151123	1-20080426-5
5	CS039212000051	藤島恵梨香	1	女性	1997-02-03	22	166-0001	東京都杉並区阿佐谷北**********	S13039	20171121	1-20100215-4
6	CS015412000111	松居奈月	1	女性	1972-10-04	46	136-0071	東京都江東区亀戸**********	S13015	20150629	0-00000000-0
7	CS004702000041	野島洋	0	男性	1943-08-24	75	176-0022	東京都練馬区向山**********	S13004	20170218	0-00000000-0
8	CS041515000001	栗田千夏	1	女性	1967-01-02	52	206-0001	東京都多摩市和田**********	S13041	20160422	E-20100803-F
9	CS029313000221	北条ひかり	1	女性	1987-06-19	31	279-0011	千葉県浦安市美浜**********	S12029	20180810	0-00000000-0
10	CS034312000071	望月奈央	1	女性	1980-09-20	38	213-0026	神奈川県川崎市高津区久末**********	S14034	20160106	0-00000000-0

Commentary :

This code is using the dplyr package in R to manipulate a data frame called df_customer. Here's a breakdown of the code:

df_customer: This is the data frame being used.

%>%: This is the pipe operator in dplyr, which allows you to chain multiple data manipulation functions together.

filter(endsWith(customer_id, "1")): This is filtering the data frame to only include rows where the customer_id column ends with "1".

head(., n = 10): This is selecting the first 10 rows of the resulting data frame using the head() function.

So, putting it all together, the code is filtering the df_customer data frame to only include rows where the customer_id column ends with "1", and then selecting the first 10 rows of the resulting data frame using the head() function.

R-012: From shop data (df_store), display all items with “横浜市(Yokohama-city)” in the address.

In [ ]:

df_store %>% 
filter(grepl("横浜市", address))

A data.frame: 11 × 10

store_cd	store_name	prefecture_cd	prefecture	address	address_kana	tel_no	longitude	latitude	floor_area
<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>
S14010	菊名店	14	神奈川県	神奈川県横浜市港北区菊名一丁目	カナガワケンヨコハマシコウホククキクナイッチョウメ	045-123-4032	139.6326	35.50049	1732
S14033	阿久和店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4043	139.4961	35.45918	1495
S14040	長津田店	14	神奈川県	神奈川県横浜市緑区長津田みなみ台五丁目	カナガワケンヨコハマシミドリクナガツタミナミダイゴチョウメ	045-123-4046	139.4994	35.52398	1548
S14050	阿久和西店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4053	139.4961	35.45918	1830
S14028	二ツ橋店	14	神奈川県	神奈川県横浜市瀬谷区二ツ橋町	カナガワケンヨコハマシセヤクフタツバシチョウ	045-123-4042	139.4963	35.46304	1574
S14012	本牧和田店	14	神奈川県	神奈川県横浜市中区本牧和田	カナガワケンヨコハマシナカクホンモクワダ	045-123-4034	139.6582	35.42156	1341
S14046	北山田店	14	神奈川県	神奈川県横浜市都筑区北山田一丁目	カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ	045-123-4049	139.5916	35.56189	831
S14011	日吉本町店	14	神奈川県	神奈川県横浜市港北区日吉本町四丁目	カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ	045-123-4033	139.6316	35.54655	890
S14048	中川中央店	14	神奈川県	神奈川県横浜市都筑区中川中央二丁目	カナガワケンヨコハマシツヅキクナカガワチュウオウニチョウメ	045-123-4051	139.5758	35.54912	1657
S14042	新山下店	14	神奈川県	神奈川県横浜市中区新山下二丁目	カナガワケンヨコハマシナカクシンヤマシタニチョウメ	045-123-4047	139.6593	35.43894	1044
S14006	葛が谷店	14	神奈川県	神奈川県横浜市都筑区葛が谷	カナガワケンヨコハマシツヅキククズガヤ	045-123-4031	139.5633	35.53573	1886

Commentary :

This code is using the dplyr package in R to manipulate a data frame called df_store. Here's a breakdown of the code:

df_store: This is the data frame being used.

%>%: This is the pipe operator in dplyr, which allows you to chain multiple data manipulation functions together.

filter(grepl("横浜市", address)): This is filtering the data frame to only include rows where the address column contains the string "横浜市". The grepl() function returns a logical vector indicating whether the pattern "横浜市" is found in each element of the address column.

So, putting it all together, the code is filtering the df_store data frame to only include rows where the address column contains the string "横浜市".

R-013: From the customer data (df_customer), extract all data whose status code (status_cd) begins with the letters A to F and display 10 items.

In [ ]:

head(df_customer %>% 
filter(grepl("^[A-F]", status_cd)), n = 10)

A data.frame: 10 × 11

	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	<chr>	<chr>	<chr>	<chr>	<date>	<int>	<chr>	<chr>	<chr>	<chr>	<chr>
1	CS031415000172	宇多田貴美子	1	女性	1976-10-04	42	151-0053	東京都渋谷区代々木**********	S13031	20150529	D-20100325-C
2	CS015414000103	奥野陽子	1	女性	1977-08-09	41	136-0073	東京都江東区北砂**********	S13015	20150722	B-20100609-B
3	CS011215000048	芦田沙耶	1	女性	1992-02-01	27	223-0062	神奈川県横浜市港北区日吉本町**********	S14011	20150228	C-20100421-9
4	CS029415000023	梅田里穂	1	女性	1976-01-17	43	279-0043	千葉県浦安市富士見**********	S12029	20150610	D-20100918-E
5	CS035415000029	寺沢真希	9	不明	1977-09-27	41	158-0096	東京都世田谷区玉川台**********	S13035	20141220	F-20101029-F
6	CS031415000106	宇野由美子	1	女性	1970-02-26	49	151-0053	東京都渋谷区代々木**********	S13031	20150201	F-20100511-E
7	CS029215000025	石倉美帆	1	女性	1993-09-28	25	279-0022	千葉県浦安市今川**********	S12029	20150708	B-20100820-C
8	CS033605000005	猪股雄太	0	男性	1955-12-05	63	246-0031	神奈川県横浜市瀬谷区瀬谷**********	S14033	20150425	F-20100917-E
9	CS033415000229	板垣菜々美	1	女性	1977-11-07	41	246-0021	神奈川県横浜市瀬谷区二ツ橋町**********	S14033	20150712	F-20100326-E
10	CS008415000145	黒谷麻緒	1	女性	1977-06-27	41	157-0067	東京都世田谷区喜多見**********	S13008	20150829	F-20100622-F

Commentary :

This code is using the dplyr package in R to manipulate a data frame called df_customer. Here's a breakdown of the code:

df_customer: This is the data frame being used.

%>%: This is the pipe operator in dplyr, which allows you to chain multiple data manipulation functions together.

filter(grepl("^[A-F]", status_cd)): This is filtering the data frame to only include rows where the status_cd column starts with the letters A, B, C, D, E or F. The grepl() function returns a logical vector indicating whether the regular expression "^[A-F]" matches each element of the status_cd column. The regular expression "^[A-F]" matches any string that starts with the letters A, B, C, D, E or F.

head(., n = 10): This is selecting the first 10 rows of the resulting data frame using the head() function.

So, putting it all together, the code is filtering the df_customer data frame to only include rows where the status_cd column starts with the letters A, B, C, D, E or F, and then selecting the first 10 rows of the resulting data frame using the head() function.

R-014: From the customer data (df_customer), extract all data items whose status code (status_cd) ends with the numbers 1 to 9, and display 10 items.

In [ ]:

head(df_customer %>% 
filter(grepl("[1-9]$", status_cd)), n = 10)

A data.frame: 10 × 11

	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	<chr>	<chr>	<chr>	<chr>	<date>	<int>	<chr>	<chr>	<chr>	<chr>	<chr>
1	CS001215000145	田崎美紀	1	女性	1995-03-29	24	144-0055	東京都大田区仲六郷**********	S13001	20170605	6-20090929-2
2	CS033513000180	安斎遥	1	女性	1962-07-11	56	241-0823	神奈川県横浜市旭区善部町**********	S14033	20150728	6-20080506-5
3	CS011215000048	芦田沙耶	1	女性	1992-02-01	27	223-0062	神奈川県横浜市港北区日吉本町**********	S14011	20150228	C-20100421-9
4	CS040412000191	川井郁恵	1	女性	1977-01-05	42	226-0021	神奈川県横浜市緑区北八朔町**********	S14040	20151101	1-20091025-4
5	CS009315000023	皆川文世	1	女性	1980-04-15	38	154-0012	東京都世田谷区駒沢**********	S13009	20150319	5-20080322-1
6	CS015315000033	福士璃奈子	1	女性	1983-03-17	36	135-0043	東京都江東区塩浜**********	S13015	20141024	4-20080219-3
7	CS023513000066	神戸そら	1	女性	1961-12-17	57	210-0005	神奈川県川崎市川崎区東田町**********	S14023	20150915	5-20100524-9
8	CS035513000134	市川美帆	1	女性	1960-03-27	59	156-0053	東京都世田谷区桜**********	S13035	20150227	8-20100711-9
9	CS001515000263	高松夏空	1	女性	1962-11-09	56	144-0051	東京都大田区西蒲田**********	S13001	20160812	1-20100804-1
10	CS040314000027	鶴田きみまろ	9	不明	1986-03-26	33	226-0027	神奈川県横浜市緑区長津田**********	S14040	20150122	2-20080426-4

Commentary :

This code is using the dplyr package in R to manipulate a data frame called df_customer. Here's a breakdown of the code:

df_customer: This is the data frame being used.

%>%: This is the pipe operator in dplyr, which allows you to chain multiple data manipulation functions together.

filter(grepl("[1-9]$", status_cd)): This is filtering the data frame to only include rows where the status_cd column ends with a number from 1 to 9. The grepl() function returns a logical vector indicating whether the regular expression "[1-9]$" matches each element of the status_cd column. The regular expression "[1-9]$" matches any string that ends with a number from 1 to 9.

head(., n = 10): This is selecting the first 10 rows of the resulting data frame using the head() function.

So, putting it all together, the code is filtering the df_customer data frame to only include rows where the status_cd column ends with a number from 1 to 9, and then selecting the first 10 rows of the resulting data frame using the head() function.

R-015: From the customer data (df_customer), extract all data items whose status code (status_cd) begins with the letters A to F and ends with the numbers 1 to 9, and display 10 items.

In [ ]:

head(df_customer %>% 
filter(grepl("^[A-F].*[1-9]$", status_cd)), n = 10)

A data.frame: 10 × 11

	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	<chr>	<chr>	<chr>	<chr>	<date>	<int>	<chr>	<chr>	<chr>	<chr>	<chr>
1	CS011215000048	芦田沙耶	1	女性	1992-02-01	27	223-0062	神奈川県横浜市港北区日吉本町**********	S14011	20150228	C-20100421-9
2	CS022513000105	島村貴美子	1	女性	1962-03-12	57	249-0002	神奈川県逗子市山の根**********	S14022	20150320	A-20091115-7
3	CS001515000096	水野陽子	9	不明	1960-11-29	58	144-0053	東京都大田区蒲田本町**********	S13001	20150614	A-20100724-7
4	CS013615000053	西脇季衣	1	女性	1953-10-18	65	261-0026	千葉県千葉市美浜区幕張西**********	S12013	20150128	B-20100329-6
5	CS020412000161	小宮薫	1	女性	1974-05-21	44	174-0042	東京都板橋区東坂下**********	S13020	20150822	B-20081021-3
6	CS001215000097	竹中あさみ	1	女性	1990-07-25	28	146-0095	東京都大田区多摩川**********	S13001	20170315	A-20100211-2
7	CS035212000007	内村恵梨香	1	女性	1990-12-04	28	152-0023	東京都目黒区八雲**********	S13035	20151013	B-20101018-6
8	CS002515000386	野田コウ	1	女性	1963-05-30	55	185-0013	東京都国分寺市西恋ケ窪**********	S13002	20160410	C-20100127-8
9	CS001615000372	稲垣寿々花	1	女性	1956-10-29	62	144-0035	東京都大田区南蒲田**********	S13001	20170403	A-20100104-1
10	CS032512000121	松井知世	1	女性	1962-09-04	56	210-0011	神奈川県川崎市川崎区富士見**********	S13032	20150727	A-20100103-5

Commentary :

This code is using the dplyr package in R to manipulate a data frame called df_customer. Here's a breakdown of the code:

df_customer: This is the data frame being used.

%>%: This is the pipe operator in dplyr, which allows you to chain multiple data manipulation functions together.

filter(grepl("^[A-F].*[1-9]$", status_cd)): This is filtering the data frame to only include rows where the status_cd column starts with the letters A, B, C, D, E or F, and ends with a number from 1 to 9. The grepl() function returns a logical vector indicating whether the regular expression "^[A-F].[1-9]$" matches each element of the status_cd column. The regular expression "^[A-F].[1-9]$" matches any string that starts with the letters A, B, C, D, E or F, followed by zero or more characters of any type, and ends with a number from 1 to 9.

head(., n = 10): This is selecting the first 10 rows of the resulting data frame using the head() function.

So, putting it all together, the code is filtering the df_customer data frame to only include rows where the status_cd column starts with the letters A, B, C, D, E or F, and ends with a number from 1 to 9, and then selecting the first 10 rows of the resulting data frame using the head() function.

R-016: From the shop data (df_store), display all items of data where the telephone number (tel_no) is 3 digits – 3 digits – 4 digits

In [ ]:

df_store %>% 
filter(grepl("^[0-9]{3}-[0-9]{3}-[0-9]{4}$", tel_no))

A data.frame: 34 × 10

store_cd	store_name	prefecture_cd	prefecture	address	address_kana	tel_no	longitude	latitude	floor_area
<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>
S12014	千草台店	12	千葉県	千葉県千葉市稲毛区千草台一丁目	チバケンチバシイナゲクチグサダイイッチョウメ	043-123-4003	140.1180	35.63559	1698
S13002	国分寺店	13	東京都	東京都国分寺市本多二丁目	トウキョウトコクブンジシホンダニチョウメ	042-123-4008	139.4802	35.70566	1735
S14010	菊名店	14	神奈川県	神奈川県横浜市港北区菊名一丁目	カナガワケンヨコハマシコウホククキクナイッチョウメ	045-123-4032	139.6326	35.50049	1732
S14033	阿久和店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4043	139.4961	35.45918	1495
S14036	相模原中央店	14	神奈川県	神奈川県相模原市中央二丁目	カナガワケンサガミハラシチュウオウニチョウメ	042-123-4045	139.3716	35.57327	1679
S14040	長津田店	14	神奈川県	神奈川県横浜市緑区長津田みなみ台五丁目	カナガワケンヨコハマシミドリクナガツタミナミダイゴチョウメ	045-123-4046	139.4994	35.52398	1548
S14050	阿久和西店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4053	139.4961	35.45918	1830
S13052	森野店	13	東京都	東京都町田市森野三丁目	トウキョウトマチダシモリノサンチョウメ	042-123-4030	139.4383	35.55293	1087
S14028	二ツ橋店	14	神奈川県	神奈川県横浜市瀬谷区二ツ橋町	カナガワケンヨコハマシセヤクフタツバシチョウ	045-123-4042	139.4963	35.46304	1574
S14012	本牧和田店	14	神奈川県	神奈川県横浜市中区本牧和田	カナガワケンヨコハマシナカクホンモクワダ	045-123-4034	139.6582	35.42156	1341
S14046	北山田店	14	神奈川県	神奈川県横浜市都筑区北山田一丁目	カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ	045-123-4049	139.5916	35.56189	831
S14022	逗子店	14	神奈川県	神奈川県逗子市逗子一丁目	カナガワケンズシシズシイッチョウメ	046-123-4036	139.5789	35.29642	1838
S14011	日吉本町店	14	神奈川県	神奈川県横浜市港北区日吉本町四丁目	カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ	045-123-4033	139.6316	35.54655	890
S13016	小金井店	13	東京都	東京都小金井市本町一丁目	トウキョウトコガネイシホンチョウイッチョウメ	042-123-4015	139.5094	35.70018	1399
S14034	川崎野川店	14	神奈川県	神奈川県川崎市宮前区野川	カナガワケンカワサキシミヤマエクノガワ	044-123-4044	139.5998	35.57693	1318
S14048	中川中央店	14	神奈川県	神奈川県横浜市都筑区中川中央二丁目	カナガワケンヨコハマシツヅキクナカガワチュウオウニチョウメ	045-123-4051	139.5758	35.54912	1657
S12007	佐倉店	12	千葉県	千葉県佐倉市上志津	チバケンサクラシカミシヅ	043-123-4001	140.1452	35.71872	1895
S14026	辻堂西海岸店	14	神奈川県	神奈川県藤沢市辻堂西海岸二丁目	カナガワケンフジサワシツジドウニシカイガンニチョウメ	046-123-4040	139.4466	35.32464	1732
S13041	八王子店	13	東京都	東京都八王子市大塚	トウキョウトハチオウジシオオツカ	042-123-4026	139.4235	35.63787	810
S14049	川崎大師店	14	神奈川県	神奈川県川崎市川崎区中瀬三丁目	カナガワケンカワサキシカワサキクナカゼサンチョウメ	044-123-4052	139.7327	35.53759	962
S14023	川崎店	14	神奈川県	神奈川県川崎市川崎区本町二丁目	カナガワケンカワサキシカワサキクホンチョウニチョウメ	044-123-4037	139.7028	35.53599	1804
S13018	清瀬店	13	東京都	東京都清瀬市松山一丁目	トウキョウトキヨセシマツヤマイッチョウメ	042-123-4017	139.5178	35.76885	1220
S14027	南藤沢店	14	神奈川県	神奈川県藤沢市南藤沢	カナガワケンフジサワシミナミフジサワ	046-123-4041	139.4896	35.33762	1521
S14021	伊勢原店	14	神奈川県	神奈川県伊勢原市伊勢原四丁目	カナガワケンイセハラシイセハラヨンチョウメ	046-123-4035	139.3129	35.40169	962
S14047	相模原店	14	神奈川県	神奈川県相模原市千代田六丁目	カナガワケンサガミハラシチヨダロクチョウメ	042-123-4050	139.3748	35.55959	1047
S12013	習志野店	12	千葉県	千葉県習志野市芝園一丁目	チバケンナラシノシシバゾノイッチョウメ	047-123-4002	140.0220	35.66122	808
S14042	新山下店	14	神奈川県	神奈川県横浜市中区新山下二丁目	カナガワケンヨコハマシナカクシンヤマシタニチョウメ	045-123-4047	139.6593	35.43894	1044
S12030	八幡店	12	千葉県	千葉県市川市八幡三丁目	チバケンイチカワシヤワタサンチョウメ	047-123-4005	139.9240	35.72318	1162
S14025	大和店	14	神奈川県	神奈川県大和市下和田	カナガワケンヤマトシシモワダ	046-123-4039	139.4680	35.43414	1011
S14045	厚木店	14	神奈川県	神奈川県厚木市中町二丁目	カナガワケンアツギシナカチョウニチョウメ	046-123-4048	139.3651	35.44182	980
S12029	東野店	12	千葉県	千葉県浦安市東野一丁目	チバケンウラヤスシヒガシノイッチョウメ	047-123-4004	139.8968	35.65086	1101
S12053	高洲店	12	千葉県	千葉県浦安市高洲五丁目	チバケンウラヤスシタカスゴチョウメ	047-123-4006	139.9176	35.63755	1555
S14024	三田店	14	神奈川県	神奈川県川崎市多摩区三田四丁目	カナガワケンカワサキシタマクミタヨンチョウメ	044-123-4038	139.5424	35.60770	972
S14006	葛が谷店	14	神奈川県	神奈川県横浜市都筑区葛が谷	カナガワケンヨコハマシツヅキククズガヤ	045-123-4031	139.5633	35.53573	1886

Commentary :

This code is using the dplyr package in R to manipulate a data frame called df_store. Here's a breakdown of the code:

df_store: This is the data frame being used.

%>%: This is the pipe operator in dplyr, which allows you to chain multiple data manipulation functions together.

filter(grepl("^[0-9]{3}-[0-9]{3}-[0-9]{4}$", tel_no)): This is filtering the data frame to only include rows where the tel_no column matches a specific phone number format. The grepl() function returns a logical vector indicating whether the regular expression "^[0-9]{3}-[0-9]{3}-[0-9]{4}$" matches each element of the tel_no column. The regular expression "^[0-9]{3}-[0-9]{3}-[0-9]{4}$" matches any string that starts with three digits, followed by a hyphen, then three more digits, another hyphen, and finally four more digits.

head(.): This is selecting the first 6 rows of the resulting data frame using the head() function.

So, putting it all together, the code is filtering the df_store data frame to only include rows where the tel_no column matches a specific phone number format, and then selecting the first 6 rows of the resulting data frame using the head() function.

R-017: Sort customer data (df_customer) by birth_day(birth_day) in order of age and display all 10 items from the top.

In [ ]:

head( df_customer[order(df_customer$birth_day), ], n = 10 )

A data.frame: 10 × 11

	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	<chr>	<chr>	<chr>	<chr>	<date>	<int>	<chr>	<chr>	<chr>	<chr>	<chr>
18818	CS003813000014	村山菜々美	1	女性	1928-11-26	90	182-0007	東京都調布市菊野台**********	S13003	20160214	0-00000000-0
12329	CS026813000004	吉村朝陽	1	女性	1928-12-14	90	251-0043	神奈川県藤沢市辻堂元町**********	S14026	20150723	0-00000000-0
15683	CS018811000003	熊沢美里	1	女性	1929-01-07	90	204-0004	東京都清瀬市野塩**********	S13018	20150403	0-00000000-0
15303	CS027803000004	内村拓郎	0	男性	1929-01-12	90	251-0031	神奈川県藤沢市鵠沼藤が谷**********	S14027	20151227	0-00000000-0
1682	CS013801000003	天野拓郎	0	男性	1929-01-15	90	274-0824	千葉県船橋市前原東**********	S12013	20160120	0-00000000-0
7512	CS001814000022	鶴田里穂	1	女性	1929-01-28	90	144-0045	東京都大田区南六郷**********	S13001	20161012	A-20090415-7
2379	CS016815000002	山元美紀	1	女性	1929-02-22	90	184-0005	東京都小金井市桜町**********	S13016	20150629	C-20090923-C
4681	CS009815000003	中田里穂	1	女性	1929-04-08	89	154-0014	東京都世田谷区新町**********	S13009	20150421	D-20091021-E
6306	CS012813000013	宇野南朋	1	女性	1929-04-09	89	231-0806	神奈川県横浜市中区本牧町**********	S14012	20150712	0-00000000-0
16071	CS005813000015	金谷恵梨香	1	女性	1929-04-09	89	165-0032	東京都中野区鷺宮**********	S13005	20150506	0-00000000-0

Commentary :

This code is sorting the df_customer data frame by the birth_day column in ascending order and selecting the first 10 rows of the resulting data frame using the head() function. Here's a breakdown of the code:

df_customer: This is the data frame being used.

[order(df_customer$birth_day), ]: This is subsetting the df_customer data frame to include all columns, but sorting the rows by the birth_day column in ascending order. The order() function returns a vector of indices that sort the birth_day column in ascending order. By applying these indices to the rows of the data frame with the [ operator, the data frame is sorted by the birth_day column.

head(., n = 10): This is selecting the first 10 rows of the resulting data frame using the head() function.

So, putting it all together, the code is sorting the df_customer data frame by the birth_day column in ascending order, and then selecting the first 10 rows of the resulting data frame using the head() function. This code is useful when you want to quickly see the oldest customers in the df_customer data frame.

R-018: Sort customer data (df_customer) by birth_day(birth_day) in descending order, displaying all 10 items from the top.

In [ ]:

head( df_customer[order(df_customer$birth_day, decreasing = TRUE), ], n = 10 )

A data.frame: 10 × 11

	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
	<chr>	<chr>	<chr>	<chr>	<date>	<int>	<chr>	<chr>	<chr>	<chr>	<chr>
15640	CS035114000004	大村美里	1	女性	2007-11-25	11	156-0053	東京都世田谷区桜**********	S13035	20150619	6-20091205-6
7469	CS022103000002	福山はじめ	9	不明	2007-10-02	11	249-0006	神奈川県逗子市逗子**********	S14022	20160909	0-00000000-0
10746	CS002113000009	柴田真悠子	1	女性	2007-09-17	11	184-0014	東京都小金井市貫井南町**********	S13002	20160304	0-00000000-0
19812	CS004115000014	松井京子	1	女性	2007-08-09	11	165-0031	東京都中野区上鷺宮**********	S13004	20161120	1-20081231-1
7040	CS002114000010	山内遥	1	女性	2007-06-03	11	184-0015	東京都小金井市貫井北町**********	S13002	20160920	6-20100510-1
3671	CS025115000002	小柳夏希	1	女性	2007-04-18	11	245-0018	神奈川県横浜市泉区上飯田町**********	S14025	20160116	D-20100913-D
12494	CS002113000025	広末まなみ	1	女性	2007-03-30	12	184-0015	東京都小金井市貫井北町**********	S13002	20171030	0-00000000-0
15978	CS033112000003	長野美紀	1	女性	2007-03-22	12	245-0051	神奈川県横浜市戸塚区名瀬町**********	S14033	20150606	0-00000000-0
5717	CS007115000006	福岡瞬	1	女性	2007-03-10	12	285-0845	千葉県佐倉市西志津**********	S12007	20151118	F-20101016-F
15098	CS014113000008	矢口莉緒	1	女性	2007-03-05	12	260-0041	千葉県千葉市中央区東千葉**********	S12014	20150622	3-20091108-6

Commentary :

This code is sorting the df_customer data frame by the birth_day column in descending order and selecting the first 10 rows of the resulting data frame using the head() function. Here's a breakdown of the code:

df_customer: This is the data frame being used.

[order(df_customer$birth_day, decreasing = TRUE), ]: This is subsetting the df_customer data frame to include all columns, but sorting the rows by the birth_day column in descending order. The order() function returns a vector of indices that sort the birth_day column in ascending order by default, but by setting decreasing = TRUE, it sorts the column in descending order instead. By applying these indices to the rows of the data frame with the [ operator, the data frame is sorted by the birth_day column in descending order.

head(., n = 10): This is selecting the first 10 rows of the resulting data frame using the head() function.

So, putting it all together, the code is sorting the df_customer data frame by the birth_day column in descending order, and then selecting the first 10 rows of the resulting data frame using the head() function. This code is useful when you want to quickly see the youngest customers in the df_customer data frame.

R-019: Rank the receipt details data (df_receipt) in descending order of sales amount per item (amount) and display 10 items from the top. The items shall display the customer ID (customer_id), sales amount (amount) and the rank assigned. If the sales amount (amount) is equal, the same rank shall be assigned.

In [ ]:

df_receipt[c("customer_id", "amount")] %>% 
mutate(ranking = min_rank(desc(amount))) %>% 
arrange(ranking) %>% 
slice(1:10)

A data.frame: 10 × 3

customer_id	amount	ranking
<chr>	<int>	<int>
CS011415000006	10925	1
ZZ000000000000	6800	2
CS028605000002	5780	3
CS015515000034	5480	4
ZZ000000000000	5480	4
ZZ000000000000	5480	4
ZZ000000000000	5440	7
CS021515000089	5440	7
CS015515000083	5280	9
CS017414000114	5280	9

Commentary :

This code is selecting the columns customer_id and amount from the df_receipt data frame, adding a new column called ranking that indicates the ranking of each customer by their transaction amount (with the highest amount having a ranking of 1), sorting the data frame by the ranking column, and selecting the top 10 rows of the resulting data frame using the slice() function. Here's a breakdown of the code:

df_receipt[c("customer_id", "amount")]: This is subsetting the df_receipt data frame to include only the customer_id and amount columns.

%>%: This is the pipe operator, which passes the output of the previous operation as the input to the next operation.

mutate(ranking = min_rank(desc(amount))): This is adding a new column to the data frame called ranking that indicates the ranking of each customer by their transaction amount. The min_rank() function from the dplyr package calculates the minimum rank of each value in a vector, with ties being broken arbitrarily. By applying this function to the amount column after taking the negative (desc() function) so that the highest value has a rank of 1, we obtain a ranking of the customers by their transaction amount.

arrange(ranking): This is sorting the data frame by the ranking column in ascending order.

slice(1:10): This is selecting the top 10 rows of the resulting data frame using the slice() function.

So, putting it all together, the code is selecting the customer_id and amount columns from the df_receipt data frame, adding a new column called ranking that indicates the ranking of each customer by their transaction amount, sorting the data frame by the ranking column, and selecting the top 10 rows of the resulting data frame. This code is useful when you want to quickly see the top 10 customers by transaction amount in the df_receipt data frame.

R-020: For receipt details data (df_receipt), assign ranks in descending order of sales amount (amount) per item and display the first 10 items. Items shall display the customer ID (customer_id), sales amount (amount) and the rank assigned. Even if the sales amount (amount) is equal, a different rank should be assigned.

In [ ]:

df_receipt[c("customer_id", "amount")] %>% 
mutate(ranking = row_number(desc(amount))) %>% 
arrange(ranking) %>% 
slice(1:10)

A data.frame: 10 × 3

customer_id	amount	ranking
<chr>	<int>	<int>
CS011415000006	10925	1
ZZ000000000000	6800	2
CS028605000002	5780	3
CS015515000034	5480	4
ZZ000000000000	5480	5
ZZ000000000000	5480	6
ZZ000000000000	5440	7
CS021515000089	5440	8
CS015515000083	5280	9
CS017414000114	5280	10

Commentary :

This code performs a series of operations on a data frame called df_receipt containing columns "customer_id" and "amount".

Here's a step-by-step explanation of what the code does:

df_receipt[c("customer_id", "amount")]: This selects only the "customer_id" and "amount" columns from the df_receipt data frame. The resulting data frame contains only these two columns.

%>%: This is the pipe operator in R, which is used to chain together multiple operations on a data frame. The output of one operation is passed as input to the next operation.

mutate(ranking = row_number(desc(amount))): This creates a new column called "ranking" in the data frame using the mutate function. The "ranking" column is based on the "amount" column, with the highest value of "amount" receiving a rank of 1, the second-highest value receiving a rank of 2, and so on. The desc function is used to sort the "amount" column in descending order.

arrange(ranking): This sorts the data frame by the "ranking" column in ascending order (i.e., from 1 to n).

slice(1:10): This selects the first 10 rows of the sorted data frame (i.e., the 10 rows with the highest "amount" values and their corresponding "customer_id" values). The slice function is used to subset rows from the data frame.

In summary, the code selects the "customer_id" and "amount" columns from a data frame, ranks the "amount" values in descending order, sorts the data frame by the rank, and selects the top 10 rows (i.e., the 10 customers with the highest "amount" values).

Data Science 100 Knocks (Structured Data Processing) - R

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

Data Science 100 Knocks (Structured Data Processing) - R Part1 (Q1 to Q20)

Data Science 100 Knocks (Structured Data Processing) - R Part2 (Q21 to Q40)

Data Science 100 Knocks (Structured Data Processing) - R Part3 (Q41 to Q60)

Data Science 100 Knocks (Structured Data Processing) - R Part4 (Q61 to Q80)

Data Science 100 Knocks (Structured Data Processing) - R Part5 (Q81 to Q100)

Data Science 100 Knocks (Structured Data Processing) - SQL

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice SQL, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.