参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック(構造化データ加工編)」
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
-
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers.
-
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
You can get preprocess_knock files to practice python, SQL, R from my github account
Introduction
- To begin with, run the following cells
- Import the required libraries and read the data from the database (PostgreSQL)
- The libraries that are expected to be used are imported in the following cells
- If there are other libraries you wish to use, install them as appropriate using install.packages().
- Names, addresses, etc. are dummy data and are not real.
require("RPostgreSQL")
require("tidyr")
require("dplyr")
require("stringr")
require("caret")
require("lubridate")
require("rsample")
require("recipes")
require("themis")
host <- "db"
port <- Sys.getenv()["PG_PORT"]
dbname <- Sys.getenv()["PG_DATABASE"]
user <- Sys.getenv()["PG_USER"]
password <- Sys.getenv()["PG_PASSWORD"]
con <- dbConnect(PostgreSQL(), host=host, port=port, dbname=dbname, user=user, password=password)
df_customer <- dbGetQuery(con,"SELECT * FROM customer")
df_category <- dbGetQuery(con,"SELECT * FROM category")
df_product <- dbGetQuery(con,"SELECT * FROM product")
df_receipt <- dbGetQuery(con,"SELECT * FROM receipt")
df_store <- dbGetQuery(con,"SELECT * FROM store")
df_geocode <- dbGetQuery(con,"SELECT * FROM geocode")
Loading required package: RPostgreSQL Loading required package: DBI Loading required package: tidyr Loading required package: dplyr Attaching package: ‘dplyr’ The following objects are masked from ‘package:stats’: filter, lag The following objects are masked from ‘package:base’: intersect, setdiff, setequal, union Loading required package: stringr Loading required package: caret Loading required package: ggplot2 Loading required package: lattice Loading required package: lubridate Attaching package: ‘lubridate’ The following objects are masked from ‘package:base’: date, intersect, setdiff, union Loading required package: rsample Loading required package: recipes Attaching package: ‘recipes’ The following object is masked from ‘package:stringr’: fixed The following object is masked from ‘package:stats’: step Loading required package: themis Registered S3 methods overwritten by 'themis': method from bake.step_downsample recipes bake.step_upsample recipes prep.step_downsample recipes prep.step_upsample recipes tidy.step_downsample recipes tidy.step_upsample recipes tunable.step_downsample recipes tunable.step_upsample recipes Attaching package: ‘themis’ The following objects are masked from ‘package:recipes’: step_downsample, step_upsample
Practice Questions
R-001: Display the first 10 items of all items from the receipt details data (df_receipt) and visually check what data is held.
head(df_receipt, n = 10)
sales_ymd | sales_epoch | store_cd | receipt_no | receipt_sub_no | customer_id | product_cd | quantity | amount | |
---|---|---|---|---|---|---|---|---|---|
<int> | <int> | <chr> | <int> | <int> | <chr> | <chr> | <int> | <int> | |
1 | 20181103 | 1541203200 | S14006 | 112 | 1 | CS006214000001 | P070305012 | 1 | 158 |
2 | 20181118 | 1542499200 | S13008 | 1132 | 2 | CS008415000097 | P070701017 | 1 | 81 |
3 | 20170712 | 1499817600 | S14028 | 1102 | 1 | CS028414000014 | P060101005 | 1 | 170 |
4 | 20190205 | 1549324800 | S14042 | 1132 | 1 | ZZ000000000000 | P050301001 | 1 | 25 |
5 | 20180821 | 1534809600 | S14025 | 1102 | 2 | CS025415000050 | P060102007 | 1 | 90 |
6 | 20190605 | 1559692800 | S13003 | 1112 | 1 | CS003515000195 | P050102002 | 1 | 138 |
7 | 20181205 | 1543968000 | S14024 | 1102 | 2 | CS024514000042 | P080101005 | 1 | 30 |
8 | 20190922 | 1569110400 | S14040 | 1102 | 1 | CS040415000178 | P070501004 | 1 | 128 |
9 | 20170504 | 1493856000 | S13020 | 1112 | 2 | ZZ000000000000 | P071302010 | 1 | 770 |
10 | 20191010 | 1570665600 | S14027 | 1102 | 1 | CS027514000015 | P071101003 | 1 | 680 |
Commentary :
The codehead(df_receipt, n = 10)
is used to display the first 10 rows of a data frame calleddf_receipt
.
In R, a data frame is a table-like structure where each column can have a different data type, such as numeric, character, or factor. It is often used to store and manipulate data in a structured format.
Thehead()
function in R is used to display the first few rows of a data frame. By default, it displays the first 6 rows of the data frame. However, then
argument can be used to specify the number of rows to display. In this case,n = 10
is used to display the first 10 rows of thedf_receipt
data frame.
Therefore, when this code is executed, the output will show the first 10 rows of thedf_receipt
data frame.
R-002: Display 10 items from the receipt details data (df_receipt), specifying the columns in the order of sales_date (sales_ymd), customer_id (customer_id), product_code (product_cd) and sales amount (amount).
head(df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")], n = 10)
sales_ymd | customer_id | product_cd | amount | |
---|---|---|---|---|
<int> | <chr> | <chr> | <int> | |
1 | 20181103 | CS006214000001 | P070305012 | 158 |
2 | 20181118 | CS008415000097 | P070701017 | 81 |
3 | 20170712 | CS028414000014 | P060101005 | 170 |
4 | 20190205 | ZZ000000000000 | P050301001 | 25 |
5 | 20180821 | CS025415000050 | P060102007 | 90 |
6 | 20190605 | CS003515000195 | P050102002 | 138 |
7 | 20181205 | CS024514000042 | P080101005 | 30 |
8 | 20190922 | CS040415000178 | P070501004 | 128 |
9 | 20170504 | ZZ000000000000 | P071302010 | 770 |
10 | 20191010 | CS027514000015 | P071101003 | 680 |
Commentary :
The codehead(df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")], n = 10)
is used to display the first 10 rows of a data frame calleddf_receipt
, but only showing the columns "sales_ymd", "customer_id", "product_cd", and "amount".
In R, a data frame is a table-like structure where each column can have a different data type, such as numeric, character, or factor. It is often used to store and manipulate data in a structured format.
Thehead()
function in R is used to display the first few rows of a data frame. By default, it displays the first 6 rows of the data frame. However, then
argument can be used to specify the number of rows to display.
In this case,df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")]
is used to subset thedf_receipt
data frame and select only the columns "sales_ymd", "customer_id", "product_cd", and "amount". Thec()
function is used to concatenate the column names into a vector.
Therefore, when this code is executed, the output will show the first 10 rows of thedf_receipt
data frame, but only showing the columns "sales_ymd", "customer_id", "product_cd", and "amount". This can be useful when working with large data frames and only needing to focus on a subset of the columns.
R-003: From the receipt details data (df_receipt), specify the columns in order of sales_date (sales_ymd), customer_id (customer_id), product_code (product_cd) and sales_amount (amount) and display 10 items. However, extract while renaming the items from sales_ymd to sales_date.
head(rename(df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")], sales_date = sales_ymd), n = 10)
sales_date | customer_id | product_cd | amount | |
---|---|---|---|---|
<int> | <chr> | <chr> | <int> | |
1 | 20181103 | CS006214000001 | P070305012 | 158 |
2 | 20181118 | CS008415000097 | P070701017 | 81 |
3 | 20170712 | CS028414000014 | P060101005 | 170 |
4 | 20190205 | ZZ000000000000 | P050301001 | 25 |
5 | 20180821 | CS025415000050 | P060102007 | 90 |
6 | 20190605 | CS003515000195 | P050102002 | 138 |
7 | 20181205 | CS024514000042 | P080101005 | 30 |
8 | 20190922 | CS040415000178 | P070501004 | 128 |
9 | 20170504 | ZZ000000000000 | P071302010 | 770 |
10 | 20191010 | CS027514000015 | P071101003 | 680 |
Commentary :
The codehead(rename(df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")], sales_date = sales_ymd), n = 10)
is used to display the first 10 rows of a data frame calleddf_receipt
, but only showing the columns "sales_ymd", "customer_id", "product_cd", and "amount", with the column "sales_ymd" renamed to "sales_date".
In R, a data frame is a table-like structure where each column can have a different data type, such as numeric, character, or factor. It is often used to store and manipulate data in a structured format.
Thehead()
function in R is used to display the first few rows of a data frame. By default, it displays the first 6 rows of the data frame. However, then
argument can be used to specify the number of rows to display.
Therename()
function in R is used to rename columns in a data frame. In this case, it is used to rename the column "sales_ymd" to "sales_date".
In this code,df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")]
is used to subset thedf_receipt
data frame and select only the columns "sales_ymd", "customer_id", "product_cd", and "amount". Thec()
function is used to concatenate the column names into a vector.
Then, therename()
function is applied to the subsetted data frame using the syntaxrename(data_frame, new_column_name = old_column_name)
. The result is a data frame with the same rows and columns asdf_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")]
, but with the column "sales_ymd" renamed to "sales_date".
Finally, thehead()
function is applied to the renamed data frame to show the first 10 rows.
Therefore, when this code is executed, the output will show the first 10 rows of thedf_receipt
data frame, but only showing the columns "sales_date", "customer_id", "product_cd", and "amount", with the column "sales_ymd" renamed to "sales_date".
R-004: From receipt details data (df_receipt), specify columns in the order of sales_date (sales_ymd), customer ID (customer_id), product code (product_cd) and sales amount (amount) and extract data satisfying the following conditions.
- Customer ID (customer_id) is “CS018205000001”
df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")] %>%
filter(customer_id == "CS018205000001")
sales_ymd | customer_id | product_cd | amount |
---|---|---|---|
<int> | <chr> | <chr> | <int> |
20180911 | CS018205000001 | P071401012 | 2200 |
20180414 | CS018205000001 | P060104007 | 600 |
20170614 | CS018205000001 | P050206001 | 990 |
20170614 | CS018205000001 | P060702015 | 108 |
20190216 | CS018205000001 | P071005024 | 102 |
20180414 | CS018205000001 | P071101002 | 278 |
20190226 | CS018205000001 | P070902035 | 168 |
20190924 | CS018205000001 | P060805001 | 495 |
20190226 | CS018205000001 | P071401020 | 2200 |
20180911 | CS018205000001 | P071401005 | 1100 |
20190216 | CS018205000001 | P040101002 | 218 |
20190924 | CS018205000001 | P091503001 | 280 |
Commentary :
The codedf_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")] %>% filter(customer_id == "CS018205000001")
is used to subset thedf_receipt
data frame and select only the columns "sales_ymd", "customer_id", "product_cd", and "amount", and then filter the resulting data frame to only include rows where the "customer_id" column is equal to "CS018205000001".
In R, a data frame is a table-like structure where each column can have a different data type, such as numeric, character, or factor. It is often used to store and manipulate data in a structured format.
The%>%
operator in R is called the pipe operator and is used to chain together multiple functions into a single expression. In this case, it is used to pass the output ofdf_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")]
into thefilter()
function.
Thefilter()
function in R is used to select rows from a data frame that meet certain conditions. In this case, it is used to filter the data frame to only include rows where the "customer_id" column is equal to "CS018205000001".
Therefore, when this code is executed, the output will be a data frame that contains only the columns "sales_ymd", "customer_id", "product_cd", and "amount", and only includes rows where the "customer_id" column is equal to "CS018205000001". This can be useful when analyzing a large data frame and only needing to focus on a subset of the rows that meet certain criteria.
R-005: Extract data from receipt details data (df_receipt), specifying columns in the order of sales_date (sales_ymd), customer ID (customer_id), product code (product_cd) and sales amount (amount) and satisfying all the following conditions.
- Customer ID (customer_id) is “CS018205000001”.
- Sales amount (amount) is 1,000 or more.
df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")] %>%
filter(customer_id == "CS018205000001" & amount >= 1000)
sales_ymd | customer_id | product_cd | amount |
---|---|---|---|
<int> | <chr> | <chr> | <int> |
20180911 | CS018205000001 | P071401012 | 2200 |
20190226 | CS018205000001 | P071401020 | 2200 |
20180911 | CS018205000001 | P071401005 | 1100 |
Commentary :
The codedf_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")] %>% filter(customer_id == "CS018205000001" & amount >= 1000)
is used to subset thedf_receipt
data frame and select only the columns "sales_ymd", "customer_id", "product_cd", and "amount", and then filter the resulting data frame to only include rows where the "customer_id" column is equal to "CS018205000001" AND the "amount" column is greater than or equal to 1000.
In R, a data frame is a table-like structure where each column can have a different data type, such as numeric, character, or factor. It is often used to store and manipulate data in a structured format.
The%>%
operator in R is called the pipe operator and is used to chain together multiple functions into a single expression. In this case, it is used to pass the output ofdf_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")]
into thefilter()
function.
Thefilter()
function in R is used to select rows from a data frame that meet certain conditions. In this case, it is used to filter the data frame to only include rows where the "customer_id" column is equal to "CS018205000001" AND the "amount" column is greater than or equal to 1000. The&
operator is used to combine the two conditions.
Therefore, when this code is executed, the output will be a data frame that contains only the columns "sales_ymd", "customer_id", "product_cd", and "amount", and only includes rows where the "customer_id" column is equal to "CS018205000001" AND the "amount" column is greater than or equal to 1000. This can be useful when analyzing a large data frame and only needing to focus on a subset of the rows that meet certain criteria.
R-006: Extract data from receipt details data (df_receipt), specifying the columns in the order of sales_date (sales_ymd), customer ID (customer_id), product code (product_cd), sales quantity (quantity) and sales amount (amount) and satisfying all the following conditions.
- Customer ID (customer_id) is “CS018205000001”.
- Sales amount (amount) is 1,000 or more or sales quantity (quantity) is 5 or more.
df_receipt[c("sales_ymd", "customer_id", "product_cd", "quantity", "amount")] %>% filter(customer_id == "CS018205000001" & (amount >= 1000 | quantity >= 5))
sales_ymd | customer_id | product_cd | quantity | amount |
---|---|---|---|---|
<int> | <chr> | <chr> | <int> | <int> |
20180911 | CS018205000001 | P071401012 | 1 | 2200 |
20180414 | CS018205000001 | P060104007 | 6 | 600 |
20170614 | CS018205000001 | P050206001 | 5 | 990 |
20190226 | CS018205000001 | P071401020 | 1 | 2200 |
20180911 | CS018205000001 | P071401005 | 1 | 1100 |
Commentary :
The codedf_receipt[c("sales_ymd", "customer_id", "product_cd", "quantity", "amount")] %>% filter(customer_id == "CS018205000001" & (amount >= 1000 | quantity >= 5))
is used to subset thedf_receipt
data frame and select only the columns "sales_ymd", "customer_id", "product_cd", "quantity", and "amount", and then filter the resulting data frame to only include rows where the "customer_id" column is equal to "CS018205000001" AND either the "amount" column is greater than or equal to 1000 OR the "quantity" column is greater than or equal to 5.
In R, a data frame is a table-like structure where each column can have a different data type, such as numeric, character, or factor. It is often used to store and manipulate data in a structured format.
The%>%
operator in R is called the pipe operator and is used to chain together multiple functions into a single expression. In this case, it is used to pass the output ofdf_receipt[c("sales_ymd", "customer_id", "product_cd", "quantity", "amount")]
into thefilter()
function.
Thefilter()
function in R is used to select rows from a data frame that meet certain conditions. In this case, it is used to filter the data frame to only include rows where the "customer_id" column is equal to "CS018205000001" AND either the "amount" column is greater than or equal to 1000 OR the "quantity" column is greater than or equal to 5. The&
operator is used to combine the two conditions, and the|
operator is used to specify an OR condition.
Therefore, when this code is executed, the output will be a data frame that contains only the columns "sales_ymd", "customer_id", "product_cd", "quantity", and "amount", and only includes rows where the "customer_id" column is equal to "CS018205000001" AND either the "amount" column is greater than or equal to 1000 OR the "quantity" column is greater than or equal to 5. This can be useful when analyzing a large data frame and only needing to focus on a subset of the rows that meet certain criteria.
R-007: From the receipt details data (df_receipt), specify the columns in order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd) and sales amount (amount) and extract data that satisfy all the following conditions.
- Customer ID (customer_id) is “CS018205000001”.
- Sales amount (amount) is between 1,000 and 2,000.
df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")] %>%
filter(customer_id == "CS018205000001" & between(amount, 1000, 2000))
sales_ymd | customer_id | product_cd | amount |
---|---|---|---|
<int> | <chr> | <chr> | <int> |
20180911 | CS018205000001 | P071401005 | 1100 |
Commentary :
This code is using thedplyr
package in R to manipulate a data frame calleddf_receipt
. Here's a breakdown of the code:df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")]
: This is subsetting thedf_receipt
data frame to only include the columns "sales_ymd", "customer_id", "product_cd", and "amount".%>%
: This is the pipe operator indplyr
, which allows you to chain multiple data manipulation functions together.filter(customer_id == "CS018205000001" & between(amount, 1000, 2000))
: This is filtering the data frame to only include rows where the customer ID is "CS018205000001" and the amount is between 1000 and 2000.
So, putting it all together, the code is subsetting thedf_receipt
data frame to only include the columns "sales_ymd", "customer_id", "product_cd", and "amount", and then filtering the resulting data frame to only include rows where the customer ID is "CS018205000001" and the amount is between 1000 and 2000.
R-008: From the receipt details data (df_receipt), specify the columns in order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd) and sales amount (amount), and extract data satisfying all the following conditions.
- Customer ID (customer_id) is “CS018205000001”.
- Product code (product_cd) is other than “P071401019”.
df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")] %>%
filter(customer_id == "CS018205000001" & product_cd != "P071401019")
sales_ymd | customer_id | product_cd | amount |
---|---|---|---|
<int> | <chr> | <chr> | <int> |
20180911 | CS018205000001 | P071401012 | 2200 |
20180414 | CS018205000001 | P060104007 | 600 |
20170614 | CS018205000001 | P050206001 | 990 |
20170614 | CS018205000001 | P060702015 | 108 |
20190216 | CS018205000001 | P071005024 | 102 |
20180414 | CS018205000001 | P071101002 | 278 |
20190226 | CS018205000001 | P070902035 | 168 |
20190924 | CS018205000001 | P060805001 | 495 |
20190226 | CS018205000001 | P071401020 | 2200 |
20180911 | CS018205000001 | P071401005 | 1100 |
20190216 | CS018205000001 | P040101002 | 218 |
20190924 | CS018205000001 | P091503001 | 280 |
Commentary :
This code is using thedplyr
package in R to manipulate a data frame calleddf_receipt
. Here's a breakdown of the code:df_receipt[c("sales_ymd", "customer_id", "product_cd", "amount")]
: This is subsetting thedf_receipt
data frame to only include the columns "sales_ymd", "customer_id", "product_cd", and "amount".%>%
: This is the pipe operator indplyr
, which allows you to chain multiple data manipulation functions together.filter(customer_id == "CS018205000001" & product_cd != "P071401019")
: This is filtering the data frame to only include rows where the customer ID is "CS018205000001" and the product code is not equal to "P071401019".
So, putting it all together, the code is subsetting thedf_receipt
data frame to only include the columns "sales_ymd", "customer_id", "product_cd", and "amount", and then filtering the resulting data frame to only include rows where the customer ID is "CS018205000001" and the product code is not equal to "P071401019".
R-009: In the following process, rewrite OR to AND without changing the output result.
df_store %>%
filter(!(prefecture_cd == "13" | floor_area > 900))
df_store %>%
filter(prefecture_cd != "13" & floor_area <= 900)
store_cd | store_name | prefecture_cd | prefecture | address | address_kana | tel_no | longitude | latitude | floor_area |
---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> |
S14046 | 北山田店 | 14 | 神奈川県 | 神奈川県横浜市都筑区北山田一丁目 | カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ | 045-123-4049 | 139.5916 | 35.56189 | 831 |
S14011 | 日吉本町店 | 14 | 神奈川県 | 神奈川県横浜市港北区日吉本町四丁目 | カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ | 045-123-4033 | 139.6316 | 35.54655 | 890 |
S12013 | 習志野店 | 12 | 千葉県 | 千葉県習志野市芝園一丁目 | チバケンナラシノシシバゾノイッチョウメ | 047-123-4002 | 140.0220 | 35.66122 | 808 |
Commentary :
This code is using thedplyr
package in R to manipulate a data frame calleddf_store
. Here's a breakdown of the code:%>%
: This is the pipe operator indplyr
, which allows you to chain multiple data manipulation functions together.filter(prefecture_cd != "13" & floor_area <= 900)
: This is filtering the data frame to only include rows where the prefecture code is not equal to "13" and the floor area is less than or equal to 900.
So, putting it all together, the code is filtering thedf_store
data frame to only include rows where the prefecture code is not equal to "13" and the floor area is less than or equal to 900.
R-010: From the shop data (df_store), extract all items whose shop code (store_cd) starts with “S14” and display 10 items.
head(df_store %>%
filter(startsWith(store_cd, "S14")), n = 10)
store_cd | store_name | prefecture_cd | prefecture | address | address_kana | tel_no | longitude | latitude | floor_area | |
---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | |
1 | S14010 | 菊名店 | 14 | 神奈川県 | 神奈川県横浜市港北区菊名一丁目 | カナガワケンヨコハマシコウホククキクナイッチョウメ | 045-123-4032 | 139.6326 | 35.50049 | 1732 |
2 | S14033 | 阿久和店 | 14 | 神奈川県 | 神奈川県横浜市瀬谷区阿久和西一丁目 | カナガワケンヨコハマシセヤクアクワニシイッチョウメ | 045-123-4043 | 139.4961 | 35.45918 | 1495 |
3 | S14036 | 相模原中央店 | 14 | 神奈川県 | 神奈川県相模原市中央二丁目 | カナガワケンサガミハラシチュウオウニチョウメ | 042-123-4045 | 139.3716 | 35.57327 | 1679 |
4 | S14040 | 長津田店 | 14 | 神奈川県 | 神奈川県横浜市緑区長津田みなみ台五丁目 | カナガワケンヨコハマシミドリクナガツタミナミダイゴチョウメ | 045-123-4046 | 139.4994 | 35.52398 | 1548 |
5 | S14050 | 阿久和西店 | 14 | 神奈川県 | 神奈川県横浜市瀬谷区阿久和西一丁目 | カナガワケンヨコハマシセヤクアクワニシイッチョウメ | 045-123-4053 | 139.4961 | 35.45918 | 1830 |
6 | S14028 | 二ツ橋店 | 14 | 神奈川県 | 神奈川県横浜市瀬谷区二ツ橋町 | カナガワケンヨコハマシセヤクフタツバシチョウ | 045-123-4042 | 139.4963 | 35.46304 | 1574 |
7 | S14012 | 本牧和田店 | 14 | 神奈川県 | 神奈川県横浜市中区本牧和田 | カナガワケンヨコハマシナカクホンモクワダ | 045-123-4034 | 139.6582 | 35.42156 | 1341 |
8 | S14046 | 北山田店 | 14 | 神奈川県 | 神奈川県横浜市都筑区北山田一丁目 | カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ | 045-123-4049 | 139.5916 | 35.56189 | 831 |
9 | S14022 | 逗子店 | 14 | 神奈川県 | 神奈川県逗子市逗子一丁目 | カナガワケンズシシズシイッチョウメ | 046-123-4036 | 139.5789 | 35.29642 | 1838 |
10 | S14011 | 日吉本町店 | 14 | 神奈川県 | 神奈川県横浜市港北区日吉本町四丁目 | カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ | 045-123-4033 | 139.6316 | 35.54655 | 890 |
Commentary :
This code is using thedplyr
package in R to manipulate a data frame calleddf_store
. Here's a breakdown of the code:df_store
: This is the data frame being used.%>%
: This is the pipe operator indplyr
, which allows you to chain multiple data manipulation functions together.filter(startsWith(store_cd, "S14"))
: This is filtering the data frame to only include rows where thestore_cd
column starts with "S14".head(., n = 10)
: This is selecting the first 10 rows of the resulting data frame using thehead()
function.
So, putting it all together, the code is filtering thedf_store
data frame to only include rows where thestore_cd
column starts with "S14", and then selecting the first 10 rows of the resulting data frame using thehead()
function.
R-011: From the customer data (df_customer), extract all items whose customer ID (customer_id) ends with 1, and display 10 items.
head(df_customer %>%
filter(endsWith(customer_id, "1")), n = 10)
customer_id | customer_name | gender_cd | gender | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | |
---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <date> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | |
1 | CS037613000071 | 六角 雅彦 | 9 | 不明 | 1952-04-01 | 66 | 136-0076 | 東京都江東区南砂********** | S13037 | 20150414 | 0-00000000-0 |
2 | CS028811000001 | 堀井 かおり | 1 | 女性 | 1933-03-27 | 86 | 245-0016 | 神奈川県横浜市泉区和泉町********** | S14028 | 20160115 | 0-00000000-0 |
3 | CS040412000191 | 川井 郁恵 | 1 | 女性 | 1977-01-05 | 42 | 226-0021 | 神奈川県横浜市緑区北八朔町********** | S14040 | 20151101 | 1-20091025-4 |
4 | CS028314000011 | 小菅 あおい | 1 | 女性 | 1983-11-26 | 35 | 246-0038 | 神奈川県横浜市瀬谷区宮沢********** | S14028 | 20151123 | 1-20080426-5 |
5 | CS039212000051 | 藤島 恵梨香 | 1 | 女性 | 1997-02-03 | 22 | 166-0001 | 東京都杉並区阿佐谷北********** | S13039 | 20171121 | 1-20100215-4 |
6 | CS015412000111 | 松居 奈月 | 1 | 女性 | 1972-10-04 | 46 | 136-0071 | 東京都江東区亀戸********** | S13015 | 20150629 | 0-00000000-0 |
7 | CS004702000041 | 野島 洋 | 0 | 男性 | 1943-08-24 | 75 | 176-0022 | 東京都練馬区向山********** | S13004 | 20170218 | 0-00000000-0 |
8 | CS041515000001 | 栗田 千夏 | 1 | 女性 | 1967-01-02 | 52 | 206-0001 | 東京都多摩市和田********** | S13041 | 20160422 | E-20100803-F |
9 | CS029313000221 | 北条 ひかり | 1 | 女性 | 1987-06-19 | 31 | 279-0011 | 千葉県浦安市美浜********** | S12029 | 20180810 | 0-00000000-0 |
10 | CS034312000071 | 望月 奈央 | 1 | 女性 | 1980-09-20 | 38 | 213-0026 | 神奈川県川崎市高津区久末********** | S14034 | 20160106 | 0-00000000-0 |
Commentary :
This code is using thedplyr
package in R to manipulate a data frame calleddf_customer
. Here's a breakdown of the code:df_customer
: This is the data frame being used.%>%
: This is the pipe operator indplyr
, which allows you to chain multiple data manipulation functions together.filter(endsWith(customer_id, "1"))
: This is filtering the data frame to only include rows where thecustomer_id
column ends with "1".head(., n = 10)
: This is selecting the first 10 rows of the resulting data frame using thehead()
function.
So, putting it all together, the code is filtering thedf_customer
data frame to only include rows where thecustomer_id
column ends with "1", and then selecting the first 10 rows of the resulting data frame using thehead()
function.
R-012: From shop data (df_store), display all items with “横浜市(Yokohama-city)” in the address.
df_store %>%
filter(grepl("横浜市", address))
store_cd | store_name | prefecture_cd | prefecture | address | address_kana | tel_no | longitude | latitude | floor_area |
---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> |
S14010 | 菊名店 | 14 | 神奈川県 | 神奈川県横浜市港北区菊名一丁目 | カナガワケンヨコハマシコウホククキクナイッチョウメ | 045-123-4032 | 139.6326 | 35.50049 | 1732 |
S14033 | 阿久和店 | 14 | 神奈川県 | 神奈川県横浜市瀬谷区阿久和西一丁目 | カナガワケンヨコハマシセヤクアクワニシイッチョウメ | 045-123-4043 | 139.4961 | 35.45918 | 1495 |
S14040 | 長津田店 | 14 | 神奈川県 | 神奈川県横浜市緑区長津田みなみ台五丁目 | カナガワケンヨコハマシミドリクナガツタミナミダイゴチョウメ | 045-123-4046 | 139.4994 | 35.52398 | 1548 |
S14050 | 阿久和西店 | 14 | 神奈川県 | 神奈川県横浜市瀬谷区阿久和西一丁目 | カナガワケンヨコハマシセヤクアクワニシイッチョウメ | 045-123-4053 | 139.4961 | 35.45918 | 1830 |
S14028 | 二ツ橋店 | 14 | 神奈川県 | 神奈川県横浜市瀬谷区二ツ橋町 | カナガワケンヨコハマシセヤクフタツバシチョウ | 045-123-4042 | 139.4963 | 35.46304 | 1574 |
S14012 | 本牧和田店 | 14 | 神奈川県 | 神奈川県横浜市中区本牧和田 | カナガワケンヨコハマシナカクホンモクワダ | 045-123-4034 | 139.6582 | 35.42156 | 1341 |
S14046 | 北山田店 | 14 | 神奈川県 | 神奈川県横浜市都筑区北山田一丁目 | カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ | 045-123-4049 | 139.5916 | 35.56189 | 831 |
S14011 | 日吉本町店 | 14 | 神奈川県 | 神奈川県横浜市港北区日吉本町四丁目 | カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ | 045-123-4033 | 139.6316 | 35.54655 | 890 |
S14048 | 中川中央店 | 14 | 神奈川県 | 神奈川県横浜市都筑区中川中央二丁目 | カナガワケンヨコハマシツヅキクナカガワチュウオウニチョウメ | 045-123-4051 | 139.5758 | 35.54912 | 1657 |
S14042 | 新山下店 | 14 | 神奈川県 | 神奈川県横浜市中区新山下二丁目 | カナガワケンヨコハマシナカクシンヤマシタニチョウメ | 045-123-4047 | 139.6593 | 35.43894 | 1044 |
S14006 | 葛が谷店 | 14 | 神奈川県 | 神奈川県横浜市都筑区葛が谷 | カナガワケンヨコハマシツヅキククズガヤ | 045-123-4031 | 139.5633 | 35.53573 | 1886 |
Commentary :
This code is using thedplyr
package in R to manipulate a data frame calleddf_store
. Here's a breakdown of the code:df_store
: This is the data frame being used.%>%
: This is the pipe operator indplyr
, which allows you to chain multiple data manipulation functions together.filter(grepl("横浜市", address))
: This is filtering the data frame to only include rows where theaddress
column contains the string "横浜市". Thegrepl()
function returns a logical vector indicating whether the pattern "横浜市" is found in each element of theaddress
column.
So, putting it all together, the code is filtering thedf_store
data frame to only include rows where theaddress
column contains the string "横浜市".
R-013: From the customer data (df_customer), extract all data whose status code (status_cd) begins with the letters A to F and display 10 items.
head(df_customer %>%
filter(grepl("^[A-F]", status_cd)), n = 10)
customer_id | customer_name | gender_cd | gender | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | |
---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <date> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | |
1 | CS031415000172 | 宇多田 貴美子 | 1 | 女性 | 1976-10-04 | 42 | 151-0053 | 東京都渋谷区代々木********** | S13031 | 20150529 | D-20100325-C |
2 | CS015414000103 | 奥野 陽子 | 1 | 女性 | 1977-08-09 | 41 | 136-0073 | 東京都江東区北砂********** | S13015 | 20150722 | B-20100609-B |
3 | CS011215000048 | 芦田 沙耶 | 1 | 女性 | 1992-02-01 | 27 | 223-0062 | 神奈川県横浜市港北区日吉本町********** | S14011 | 20150228 | C-20100421-9 |
4 | CS029415000023 | 梅田 里穂 | 1 | 女性 | 1976-01-17 | 43 | 279-0043 | 千葉県浦安市富士見********** | S12029 | 20150610 | D-20100918-E |
5 | CS035415000029 | 寺沢 真希 | 9 | 不明 | 1977-09-27 | 41 | 158-0096 | 東京都世田谷区玉川台********** | S13035 | 20141220 | F-20101029-F |
6 | CS031415000106 | 宇野 由美子 | 1 | 女性 | 1970-02-26 | 49 | 151-0053 | 東京都渋谷区代々木********** | S13031 | 20150201 | F-20100511-E |
7 | CS029215000025 | 石倉 美帆 | 1 | 女性 | 1993-09-28 | 25 | 279-0022 | 千葉県浦安市今川********** | S12029 | 20150708 | B-20100820-C |
8 | CS033605000005 | 猪股 雄太 | 0 | 男性 | 1955-12-05 | 63 | 246-0031 | 神奈川県横浜市瀬谷区瀬谷********** | S14033 | 20150425 | F-20100917-E |
9 | CS033415000229 | 板垣 菜々美 | 1 | 女性 | 1977-11-07 | 41 | 246-0021 | 神奈川県横浜市瀬谷区二ツ橋町********** | S14033 | 20150712 | F-20100326-E |
10 | CS008415000145 | 黒谷 麻緒 | 1 | 女性 | 1977-06-27 | 41 | 157-0067 | 東京都世田谷区喜多見********** | S13008 | 20150829 | F-20100622-F |
Commentary :
This code is using thedplyr
package in R to manipulate a data frame calleddf_customer
. Here's a breakdown of the code:df_customer
: This is the data frame being used.%>%
: This is the pipe operator indplyr
, which allows you to chain multiple data manipulation functions together.filter(grepl("^[A-F]", status_cd))
: This is filtering the data frame to only include rows where thestatus_cd
column starts with the letters A, B, C, D, E or F. Thegrepl()
function returns a logical vector indicating whether the regular expression "^[A-F]" matches each element of thestatus_cd
column. The regular expression "^[A-F]" matches any string that starts with the letters A, B, C, D, E or F.head(., n = 10)
: This is selecting the first 10 rows of the resulting data frame using thehead()
function.
So, putting it all together, the code is filtering thedf_customer
data frame to only include rows where thestatus_cd
column starts with the letters A, B, C, D, E or F, and then selecting the first 10 rows of the resulting data frame using thehead()
function.
R-014: From the customer data (df_customer), extract all data items whose status code (status_cd) ends with the numbers 1 to 9, and display 10 items.
head(df_customer %>%
filter(grepl("[1-9]$", status_cd)), n = 10)
customer_id | customer_name | gender_cd | gender | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | |
---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <date> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | |
1 | CS001215000145 | 田崎 美紀 | 1 | 女性 | 1995-03-29 | 24 | 144-0055 | 東京都大田区仲六郷********** | S13001 | 20170605 | 6-20090929-2 |
2 | CS033513000180 | 安斎 遥 | 1 | 女性 | 1962-07-11 | 56 | 241-0823 | 神奈川県横浜市旭区善部町********** | S14033 | 20150728 | 6-20080506-5 |
3 | CS011215000048 | 芦田 沙耶 | 1 | 女性 | 1992-02-01 | 27 | 223-0062 | 神奈川県横浜市港北区日吉本町********** | S14011 | 20150228 | C-20100421-9 |
4 | CS040412000191 | 川井 郁恵 | 1 | 女性 | 1977-01-05 | 42 | 226-0021 | 神奈川県横浜市緑区北八朔町********** | S14040 | 20151101 | 1-20091025-4 |
5 | CS009315000023 | 皆川 文世 | 1 | 女性 | 1980-04-15 | 38 | 154-0012 | 東京都世田谷区駒沢********** | S13009 | 20150319 | 5-20080322-1 |
6 | CS015315000033 | 福士 璃奈子 | 1 | 女性 | 1983-03-17 | 36 | 135-0043 | 東京都江東区塩浜********** | S13015 | 20141024 | 4-20080219-3 |
7 | CS023513000066 | 神戸 そら | 1 | 女性 | 1961-12-17 | 57 | 210-0005 | 神奈川県川崎市川崎区東田町********** | S14023 | 20150915 | 5-20100524-9 |
8 | CS035513000134 | 市川 美帆 | 1 | 女性 | 1960-03-27 | 59 | 156-0053 | 東京都世田谷区桜********** | S13035 | 20150227 | 8-20100711-9 |
9 | CS001515000263 | 高松 夏空 | 1 | 女性 | 1962-11-09 | 56 | 144-0051 | 東京都大田区西蒲田********** | S13001 | 20160812 | 1-20100804-1 |
10 | CS040314000027 | 鶴田 きみまろ | 9 | 不明 | 1986-03-26 | 33 | 226-0027 | 神奈川県横浜市緑区長津田********** | S14040 | 20150122 | 2-20080426-4 |
Commentary :
This code is using thedplyr
package in R to manipulate a data frame calleddf_customer
. Here's a breakdown of the code:df_customer
: This is the data frame being used.%>%
: This is the pipe operator indplyr
, which allows you to chain multiple data manipulation functions together.filter(grepl("[1-9]$", status_cd))
: This is filtering the data frame to only include rows where thestatus_cd
column ends with a number from 1 to 9. Thegrepl()
function returns a logical vector indicating whether the regular expression "[1-9]$" matches each element of thestatus_cd
column. The regular expression "[1-9]$" matches any string that ends with a number from 1 to 9.head(., n = 10)
: This is selecting the first 10 rows of the resulting data frame using thehead()
function.
So, putting it all together, the code is filtering thedf_customer
data frame to only include rows where thestatus_cd
column ends with a number from 1 to 9, and then selecting the first 10 rows of the resulting data frame using thehead()
function.
R-015: From the customer data (df_customer), extract all data items whose status code (status_cd) begins with the letters A to F and ends with the numbers 1 to 9, and display 10 items.
head(df_customer %>%
filter(grepl("^[A-F].*[1-9]$", status_cd)), n = 10)
customer_id | customer_name | gender_cd | gender | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | |
---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <date> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | |
1 | CS011215000048 | 芦田 沙耶 | 1 | 女性 | 1992-02-01 | 27 | 223-0062 | 神奈川県横浜市港北区日吉本町********** | S14011 | 20150228 | C-20100421-9 |
2 | CS022513000105 | 島村 貴美子 | 1 | 女性 | 1962-03-12 | 57 | 249-0002 | 神奈川県逗子市山の根********** | S14022 | 20150320 | A-20091115-7 |
3 | CS001515000096 | 水野 陽子 | 9 | 不明 | 1960-11-29 | 58 | 144-0053 | 東京都大田区蒲田本町********** | S13001 | 20150614 | A-20100724-7 |
4 | CS013615000053 | 西脇 季衣 | 1 | 女性 | 1953-10-18 | 65 | 261-0026 | 千葉県千葉市美浜区幕張西********** | S12013 | 20150128 | B-20100329-6 |
5 | CS020412000161 | 小宮 薫 | 1 | 女性 | 1974-05-21 | 44 | 174-0042 | 東京都板橋区東坂下********** | S13020 | 20150822 | B-20081021-3 |
6 | CS001215000097 | 竹中 あさみ | 1 | 女性 | 1990-07-25 | 28 | 146-0095 | 東京都大田区多摩川********** | S13001 | 20170315 | A-20100211-2 |
7 | CS035212000007 | 内村 恵梨香 | 1 | 女性 | 1990-12-04 | 28 | 152-0023 | 東京都目黒区八雲********** | S13035 | 20151013 | B-20101018-6 |
8 | CS002515000386 | 野田 コウ | 1 | 女性 | 1963-05-30 | 55 | 185-0013 | 東京都国分寺市西恋ケ窪********** | S13002 | 20160410 | C-20100127-8 |
9 | CS001615000372 | 稲垣 寿々花 | 1 | 女性 | 1956-10-29 | 62 | 144-0035 | 東京都大田区南蒲田********** | S13001 | 20170403 | A-20100104-1 |
10 | CS032512000121 | 松井 知世 | 1 | 女性 | 1962-09-04 | 56 | 210-0011 | 神奈川県川崎市川崎区富士見********** | S13032 | 20150727 | A-20100103-5 |
Commentary :
This code is using thedplyr
package in R to manipulate a data frame calleddf_customer
. Here's a breakdown of the code:df_customer
: This is the data frame being used.%>%
: This is the pipe operator indplyr
, which allows you to chain multiple data manipulation functions together.filter(grepl("^[A-F].*[1-9]$", status_cd))
: This is filtering the data frame to only include rows where thestatus_cd
column starts with the letters A, B, C, D, E or F, and ends with a number from 1 to 9. Thegrepl()
function returns a logical vector indicating whether the regular expression "^[A-F].[1-9]$" matches each element of thestatus_cd
column. The regular expression "^[A-F].[1-9]$" matches any string that starts with the letters A, B, C, D, E or F, followed by zero or more characters of any type, and ends with a number from 1 to 9.head(., n = 10)
: This is selecting the first 10 rows of the resulting data frame using thehead()
function.
So, putting it all together, the code is filtering thedf_customer
data frame to only include rows where thestatus_cd
column starts with the letters A, B, C, D, E or F, and ends with a number from 1 to 9, and then selecting the first 10 rows of the resulting data frame using thehead()
function.
R-016: From the shop data (df_store), display all items of data where the telephone number (tel_no) is 3 digits – 3 digits – 4 digits
df_store %>%
filter(grepl("^[0-9]{3}-[0-9]{3}-[0-9]{4}$", tel_no))
store_cd | store_name | prefecture_cd | prefecture | address | address_kana | tel_no | longitude | latitude | floor_area |
---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> |
S12014 | 千草台店 | 12 | 千葉県 | 千葉県千葉市稲毛区千草台一丁目 | チバケンチバシイナゲクチグサダイイッチョウメ | 043-123-4003 | 140.1180 | 35.63559 | 1698 |
S13002 | 国分寺店 | 13 | 東京都 | 東京都国分寺市本多二丁目 | トウキョウトコクブンジシホンダニチョウメ | 042-123-4008 | 139.4802 | 35.70566 | 1735 |
S14010 | 菊名店 | 14 | 神奈川県 | 神奈川県横浜市港北区菊名一丁目 | カナガワケンヨコハマシコウホククキクナイッチョウメ | 045-123-4032 | 139.6326 | 35.50049 | 1732 |
S14033 | 阿久和店 | 14 | 神奈川県 | 神奈川県横浜市瀬谷区阿久和西一丁目 | カナガワケンヨコハマシセヤクアクワニシイッチョウメ | 045-123-4043 | 139.4961 | 35.45918 | 1495 |
S14036 | 相模原中央店 | 14 | 神奈川県 | 神奈川県相模原市中央二丁目 | カナガワケンサガミハラシチュウオウニチョウメ | 042-123-4045 | 139.3716 | 35.57327 | 1679 |
S14040 | 長津田店 | 14 | 神奈川県 | 神奈川県横浜市緑区長津田みなみ台五丁目 | カナガワケンヨコハマシミドリクナガツタミナミダイゴチョウメ | 045-123-4046 | 139.4994 | 35.52398 | 1548 |
S14050 | 阿久和西店 | 14 | 神奈川県 | 神奈川県横浜市瀬谷区阿久和西一丁目 | カナガワケンヨコハマシセヤクアクワニシイッチョウメ | 045-123-4053 | 139.4961 | 35.45918 | 1830 |
S13052 | 森野店 | 13 | 東京都 | 東京都町田市森野三丁目 | トウキョウトマチダシモリノサンチョウメ | 042-123-4030 | 139.4383 | 35.55293 | 1087 |
S14028 | 二ツ橋店 | 14 | 神奈川県 | 神奈川県横浜市瀬谷区二ツ橋町 | カナガワケンヨコハマシセヤクフタツバシチョウ | 045-123-4042 | 139.4963 | 35.46304 | 1574 |
S14012 | 本牧和田店 | 14 | 神奈川県 | 神奈川県横浜市中区本牧和田 | カナガワケンヨコハマシナカクホンモクワダ | 045-123-4034 | 139.6582 | 35.42156 | 1341 |
S14046 | 北山田店 | 14 | 神奈川県 | 神奈川県横浜市都筑区北山田一丁目 | カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ | 045-123-4049 | 139.5916 | 35.56189 | 831 |
S14022 | 逗子店 | 14 | 神奈川県 | 神奈川県逗子市逗子一丁目 | カナガワケンズシシズシイッチョウメ | 046-123-4036 | 139.5789 | 35.29642 | 1838 |
S14011 | 日吉本町店 | 14 | 神奈川県 | 神奈川県横浜市港北区日吉本町四丁目 | カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ | 045-123-4033 | 139.6316 | 35.54655 | 890 |
S13016 | 小金井店 | 13 | 東京都 | 東京都小金井市本町一丁目 | トウキョウトコガネイシホンチョウイッチョウメ | 042-123-4015 | 139.5094 | 35.70018 | 1399 |
S14034 | 川崎野川店 | 14 | 神奈川県 | 神奈川県川崎市宮前区野川 | カナガワケンカワサキシミヤマエクノガワ | 044-123-4044 | 139.5998 | 35.57693 | 1318 |
S14048 | 中川中央店 | 14 | 神奈川県 | 神奈川県横浜市都筑区中川中央二丁目 | カナガワケンヨコハマシツヅキクナカガワチュウオウニチョウメ | 045-123-4051 | 139.5758 | 35.54912 | 1657 |
S12007 | 佐倉店 | 12 | 千葉県 | 千葉県佐倉市上志津 | チバケンサクラシカミシヅ | 043-123-4001 | 140.1452 | 35.71872 | 1895 |
S14026 | 辻堂西海岸店 | 14 | 神奈川県 | 神奈川県藤沢市辻堂西海岸二丁目 | カナガワケンフジサワシツジドウニシカイガンニチョウメ | 046-123-4040 | 139.4466 | 35.32464 | 1732 |
S13041 | 八王子店 | 13 | 東京都 | 東京都八王子市大塚 | トウキョウトハチオウジシオオツカ | 042-123-4026 | 139.4235 | 35.63787 | 810 |
S14049 | 川崎大師店 | 14 | 神奈川県 | 神奈川県川崎市川崎区中瀬三丁目 | カナガワケンカワサキシカワサキクナカゼサンチョウメ | 044-123-4052 | 139.7327 | 35.53759 | 962 |
S14023 | 川崎店 | 14 | 神奈川県 | 神奈川県川崎市川崎区本町二丁目 | カナガワケンカワサキシカワサキクホンチョウニチョウメ | 044-123-4037 | 139.7028 | 35.53599 | 1804 |
S13018 | 清瀬店 | 13 | 東京都 | 東京都清瀬市松山一丁目 | トウキョウトキヨセシマツヤマイッチョウメ | 042-123-4017 | 139.5178 | 35.76885 | 1220 |
S14027 | 南藤沢店 | 14 | 神奈川県 | 神奈川県藤沢市南藤沢 | カナガワケンフジサワシミナミフジサワ | 046-123-4041 | 139.4896 | 35.33762 | 1521 |
S14021 | 伊勢原店 | 14 | 神奈川県 | 神奈川県伊勢原市伊勢原四丁目 | カナガワケンイセハラシイセハラヨンチョウメ | 046-123-4035 | 139.3129 | 35.40169 | 962 |
S14047 | 相模原店 | 14 | 神奈川県 | 神奈川県相模原市千代田六丁目 | カナガワケンサガミハラシチヨダロクチョウメ | 042-123-4050 | 139.3748 | 35.55959 | 1047 |
S12013 | 習志野店 | 12 | 千葉県 | 千葉県習志野市芝園一丁目 | チバケンナラシノシシバゾノイッチョウメ | 047-123-4002 | 140.0220 | 35.66122 | 808 |
S14042 | 新山下店 | 14 | 神奈川県 | 神奈川県横浜市中区新山下二丁目 | カナガワケンヨコハマシナカクシンヤマシタニチョウメ | 045-123-4047 | 139.6593 | 35.43894 | 1044 |
S12030 | 八幡店 | 12 | 千葉県 | 千葉県市川市八幡三丁目 | チバケンイチカワシヤワタサンチョウメ | 047-123-4005 | 139.9240 | 35.72318 | 1162 |
S14025 | 大和店 | 14 | 神奈川県 | 神奈川県大和市下和田 | カナガワケンヤマトシシモワダ | 046-123-4039 | 139.4680 | 35.43414 | 1011 |
S14045 | 厚木店 | 14 | 神奈川県 | 神奈川県厚木市中町二丁目 | カナガワケンアツギシナカチョウニチョウメ | 046-123-4048 | 139.3651 | 35.44182 | 980 |
S12029 | 東野店 | 12 | 千葉県 | 千葉県浦安市東野一丁目 | チバケンウラヤスシヒガシノイッチョウメ | 047-123-4004 | 139.8968 | 35.65086 | 1101 |
S12053 | 高洲店 | 12 | 千葉県 | 千葉県浦安市高洲五丁目 | チバケンウラヤスシタカスゴチョウメ | 047-123-4006 | 139.9176 | 35.63755 | 1555 |
S14024 | 三田店 | 14 | 神奈川県 | 神奈川県川崎市多摩区三田四丁目 | カナガワケンカワサキシタマクミタヨンチョウメ | 044-123-4038 | 139.5424 | 35.60770 | 972 |
S14006 | 葛が谷店 | 14 | 神奈川県 | 神奈川県横浜市都筑区葛が谷 | カナガワケンヨコハマシツヅキククズガヤ | 045-123-4031 | 139.5633 | 35.53573 | 1886 |
Commentary :
This code is using thedplyr
package in R to manipulate a data frame calleddf_store
. Here's a breakdown of the code:df_store
: This is the data frame being used.%>%
: This is the pipe operator indplyr
, which allows you to chain multiple data manipulation functions together.filter(grepl("^[0-9]{3}-[0-9]{3}-[0-9]{4}$", tel_no))
: This is filtering the data frame to only include rows where thetel_no
column matches a specific phone number format. Thegrepl()
function returns a logical vector indicating whether the regular expression "^[0-9]{3}-[0-9]{3}-[0-9]{4}$" matches each element of thetel_no
column. The regular expression "^[0-9]{3}-[0-9]{3}-[0-9]{4}$" matches any string that starts with three digits, followed by a hyphen, then three more digits, another hyphen, and finally four more digits.head(.)
: This is selecting the first 6 rows of the resulting data frame using thehead()
function.
So, putting it all together, the code is filtering thedf_store
data frame to only include rows where thetel_no
column matches a specific phone number format, and then selecting the first 6 rows of the resulting data frame using thehead()
function.
R-017: Sort customer data (df_customer) by birth_day(birth_day) in order of age and display all 10 items from the top.
head( df_customer[order(df_customer$birth_day), ], n = 10 )
customer_id | customer_name | gender_cd | gender | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | |
---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <date> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | |
18818 | CS003813000014 | 村山 菜々美 | 1 | 女性 | 1928-11-26 | 90 | 182-0007 | 東京都調布市菊野台********** | S13003 | 20160214 | 0-00000000-0 |
12329 | CS026813000004 | 吉村 朝陽 | 1 | 女性 | 1928-12-14 | 90 | 251-0043 | 神奈川県藤沢市辻堂元町********** | S14026 | 20150723 | 0-00000000-0 |
15683 | CS018811000003 | 熊沢 美里 | 1 | 女性 | 1929-01-07 | 90 | 204-0004 | 東京都清瀬市野塩********** | S13018 | 20150403 | 0-00000000-0 |
15303 | CS027803000004 | 内村 拓郎 | 0 | 男性 | 1929-01-12 | 90 | 251-0031 | 神奈川県藤沢市鵠沼藤が谷********** | S14027 | 20151227 | 0-00000000-0 |
1682 | CS013801000003 | 天野 拓郎 | 0 | 男性 | 1929-01-15 | 90 | 274-0824 | 千葉県船橋市前原東********** | S12013 | 20160120 | 0-00000000-0 |
7512 | CS001814000022 | 鶴田 里穂 | 1 | 女性 | 1929-01-28 | 90 | 144-0045 | 東京都大田区南六郷********** | S13001 | 20161012 | A-20090415-7 |
2379 | CS016815000002 | 山元 美紀 | 1 | 女性 | 1929-02-22 | 90 | 184-0005 | 東京都小金井市桜町********** | S13016 | 20150629 | C-20090923-C |
4681 | CS009815000003 | 中田 里穂 | 1 | 女性 | 1929-04-08 | 89 | 154-0014 | 東京都世田谷区新町********** | S13009 | 20150421 | D-20091021-E |
6306 | CS012813000013 | 宇野 南朋 | 1 | 女性 | 1929-04-09 | 89 | 231-0806 | 神奈川県横浜市中区本牧町********** | S14012 | 20150712 | 0-00000000-0 |
16071 | CS005813000015 | 金谷 恵梨香 | 1 | 女性 | 1929-04-09 | 89 | 165-0032 | 東京都中野区鷺宮********** | S13005 | 20150506 | 0-00000000-0 |
Commentary :
This code is sorting thedf_customer
data frame by thebirth_day
column in ascending order and selecting the first 10 rows of the resulting data frame using thehead()
function. Here's a breakdown of the code:df_customer
: This is the data frame being used.[order(df_customer$birth_day), ]
: This is subsetting thedf_customer
data frame to include all columns, but sorting the rows by thebirth_day
column in ascending order. Theorder()
function returns a vector of indices that sort thebirth_day
column in ascending order. By applying these indices to the rows of the data frame with the[
operator, the data frame is sorted by thebirth_day
column.head(., n = 10)
: This is selecting the first 10 rows of the resulting data frame using thehead()
function.
So, putting it all together, the code is sorting thedf_customer
data frame by thebirth_day
column in ascending order, and then selecting the first 10 rows of the resulting data frame using thehead()
function. This code is useful when you want to quickly see the oldest customers in thedf_customer
data frame.
R-018: Sort customer data (df_customer) by birth_day(birth_day) in descending order, displaying all 10 items from the top.
head( df_customer[order(df_customer$birth_day, decreasing = TRUE), ], n = 10 )
customer_id | customer_name | gender_cd | gender | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | |
---|---|---|---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <date> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | |
15640 | CS035114000004 | 大村 美里 | 1 | 女性 | 2007-11-25 | 11 | 156-0053 | 東京都世田谷区桜********** | S13035 | 20150619 | 6-20091205-6 |
7469 | CS022103000002 | 福山 はじめ | 9 | 不明 | 2007-10-02 | 11 | 249-0006 | 神奈川県逗子市逗子********** | S14022 | 20160909 | 0-00000000-0 |
10746 | CS002113000009 | 柴田 真悠子 | 1 | 女性 | 2007-09-17 | 11 | 184-0014 | 東京都小金井市貫井南町********** | S13002 | 20160304 | 0-00000000-0 |
19812 | CS004115000014 | 松井 京子 | 1 | 女性 | 2007-08-09 | 11 | 165-0031 | 東京都中野区上鷺宮********** | S13004 | 20161120 | 1-20081231-1 |
7040 | CS002114000010 | 山内 遥 | 1 | 女性 | 2007-06-03 | 11 | 184-0015 | 東京都小金井市貫井北町********** | S13002 | 20160920 | 6-20100510-1 |
3671 | CS025115000002 | 小柳 夏希 | 1 | 女性 | 2007-04-18 | 11 | 245-0018 | 神奈川県横浜市泉区上飯田町********** | S14025 | 20160116 | D-20100913-D |
12494 | CS002113000025 | 広末 まなみ | 1 | 女性 | 2007-03-30 | 12 | 184-0015 | 東京都小金井市貫井北町********** | S13002 | 20171030 | 0-00000000-0 |
15978 | CS033112000003 | 長野 美紀 | 1 | 女性 | 2007-03-22 | 12 | 245-0051 | 神奈川県横浜市戸塚区名瀬町********** | S14033 | 20150606 | 0-00000000-0 |
5717 | CS007115000006 | 福岡 瞬 | 1 | 女性 | 2007-03-10 | 12 | 285-0845 | 千葉県佐倉市西志津********** | S12007 | 20151118 | F-20101016-F |
15098 | CS014113000008 | 矢口 莉緒 | 1 | 女性 | 2007-03-05 | 12 | 260-0041 | 千葉県千葉市中央区東千葉********** | S12014 | 20150622 | 3-20091108-6 |
Commentary :
This code is sorting thedf_customer
data frame by thebirth_day
column in descending order and selecting the first 10 rows of the resulting data frame using thehead()
function. Here's a breakdown of the code:df_customer
: This is the data frame being used.[order(df_customer$birth_day, decreasing = TRUE), ]
: This is subsetting thedf_customer
data frame to include all columns, but sorting the rows by thebirth_day
column in descending order. Theorder()
function returns a vector of indices that sort thebirth_day
column in ascending order by default, but by settingdecreasing = TRUE
, it sorts the column in descending order instead. By applying these indices to the rows of the data frame with the[
operator, the data frame is sorted by thebirth_day
column in descending order.head(., n = 10)
: This is selecting the first 10 rows of the resulting data frame using thehead()
function.
So, putting it all together, the code is sorting thedf_customer
data frame by thebirth_day
column in descending order, and then selecting the first 10 rows of the resulting data frame using thehead()
function. This code is useful when you want to quickly see the youngest customers in thedf_customer
data frame.
R-019: Rank the receipt details data (df_receipt) in descending order of sales amount per item (amount) and display 10 items from the top. The items shall display the customer ID (customer_id), sales amount (amount) and the rank assigned. If the sales amount (amount) is equal, the same rank shall be assigned.
df_receipt[c("customer_id", "amount")] %>%
mutate(ranking = min_rank(desc(amount))) %>%
arrange(ranking) %>%
slice(1:10)
customer_id | amount | ranking |
---|---|---|
<chr> | <int> | <int> |
CS011415000006 | 10925 | 1 |
ZZ000000000000 | 6800 | 2 |
CS028605000002 | 5780 | 3 |
CS015515000034 | 5480 | 4 |
ZZ000000000000 | 5480 | 4 |
ZZ000000000000 | 5480 | 4 |
ZZ000000000000 | 5440 | 7 |
CS021515000089 | 5440 | 7 |
CS015515000083 | 5280 | 9 |
CS017414000114 | 5280 | 9 |
Commentary :
This code is selecting the columnscustomer_id
andamount
from thedf_receipt
data frame, adding a new column calledranking
that indicates the ranking of each customer by their transaction amount (with the highest amount having a ranking of 1), sorting the data frame by theranking
column, and selecting the top 10 rows of the resulting data frame using theslice()
function. Here's a breakdown of the code:df_receipt[c("customer_id", "amount")]
: This is subsetting thedf_receipt
data frame to include only thecustomer_id
andamount
columns.%>%
: This is the pipe operator, which passes the output of the previous operation as the input to the next operation.mutate(ranking = min_rank(desc(amount)))
: This is adding a new column to the data frame calledranking
that indicates the ranking of each customer by their transaction amount. Themin_rank()
function from thedplyr
package calculates the minimum rank of each value in a vector, with ties being broken arbitrarily. By applying this function to theamount
column after taking the negative (desc()
function) so that the highest value has a rank of 1, we obtain a ranking of the customers by their transaction amount.arrange(ranking)
: This is sorting the data frame by theranking
column in ascending order.slice(1:10)
: This is selecting the top 10 rows of the resulting data frame using theslice()
function.
So, putting it all together, the code is selecting thecustomer_id
andamount
columns from thedf_receipt
data frame, adding a new column calledranking
that indicates the ranking of each customer by their transaction amount, sorting the data frame by theranking
column, and selecting the top 10 rows of the resulting data frame. This code is useful when you want to quickly see the top 10 customers by transaction amount in thedf_receipt
data frame.
R-020: For receipt details data (df_receipt), assign ranks in descending order of sales amount (amount) per item and display the first 10 items. Items shall display the customer ID (customer_id), sales amount (amount) and the rank assigned. Even if the sales amount (amount) is equal, a different rank should be assigned.
df_receipt[c("customer_id", "amount")] %>%
mutate(ranking = row_number(desc(amount))) %>%
arrange(ranking) %>%
slice(1:10)
customer_id | amount | ranking |
---|---|---|
<chr> | <int> | <int> |
CS011415000006 | 10925 | 1 |
ZZ000000000000 | 6800 | 2 |
CS028605000002 | 5780 | 3 |
CS015515000034 | 5480 | 4 |
ZZ000000000000 | 5480 | 5 |
ZZ000000000000 | 5480 | 6 |
ZZ000000000000 | 5440 | 7 |
CS021515000089 | 5440 | 8 |
CS015515000083 | 5280 | 9 |
CS017414000114 | 5280 | 10 |
Commentary :
This code performs a series of operations on a data frame calleddf_receipt
containing columns "customer_id" and "amount".
Here's a step-by-step explanation of what the code does:df_receipt[c("customer_id", "amount")]
: This selects only the "customer_id" and "amount" columns from thedf_receipt
data frame. The resulting data frame contains only these two columns.%>%
: This is the pipe operator in R, which is used to chain together multiple operations on a data frame. The output of one operation is passed as input to the next operation.mutate(ranking = row_number(desc(amount)))
: This creates a new column called "ranking" in the data frame using themutate
function. The "ranking" column is based on the "amount" column, with the highest value of "amount" receiving a rank of 1, the second-highest value receiving a rank of 2, and so on. Thedesc
function is used to sort the "amount" column in descending order.arrange(ranking)
: This sorts the data frame by the "ranking" column in ascending order (i.e., from 1 to n).slice(1:10)
: This selects the first 10 rows of the sorted data frame (i.e., the 10 rows with the highest "amount" values and their corresponding "customer_id" values). Theslice
function is used to subset rows from the data frame.
In summary, the code selects the "customer_id" and "amount" columns from a data frame, ranks the "amount" values in descending order, sorts the data frame by the rank, and selects the top 10 rows (i.e., the 10 customers with the highest "amount" values).
Comment