Data Science 100 Knocks (Structured Data Processing) – R Part3 (Q41 to Q60)

Commentary :

This code also manipulates a data frame called df_receipt using the dplyr package in R. Here is a step-by-step explanation of what it does:

df_sum <- df_receipt %>% group_by(sales_ymd) %>% summarise(sum_amount = sum(amount), .groups = "drop") - This is similar to the first line of code in the previous example: it groups the df_receipt data frame by the sales_ymd column, calculates the sum of the amount column for each unique sales_ymd group, and stores the resulting data frame in a new variable called df_sum.

for (i in 1:3) { - This initiates a for loop that will iterate 3 times.

df_tmp <- df_sum %>% mutate(lag_ymd = lag(sales_ymd, n = i), lag_amount = lag(sum_amount, n = i)) - This creates a temporary data frame called df_tmp by lagging the sales_ymd and sum_amount columns of df_sum by i rows (i.e., i preceding sales dates) using the lag() function from dplyr. The resulting data frame has four columns: sales_ymd, sum_amount, lag_ymd, and lag_amount.

if (i == 1) { df_lag <- df_tmp } - If this is the first iteration of the loop, the df_lag data frame is set equal to df_tmp.

else { df_lag <- rbind(df_lag, df_tmp) } - If this is not the first iteration of the loop, df_tmp is appended to the existing df_lag data frame using the base R rbind() function.

} - The for loop ends here.

df_lag %>% drop_na(everything()) %>% arrange(sales_ymd, lag_ymd) %>% slice(1:10) - This takes the df_lag data frame built in the loop, drops any rows with missing values using drop_na() from the tidyr package, sorts the rows by sales_ymd and lag_ymd using arrange(), and then selects the first 10 rows of the result using slice().

The purpose of this code is to build a data frame called df_lag that pairs each sales date's total with the totals from the previous one, two, and three sales dates (lags of one to three rows, not calendar days). The result can be used to examine recent sales trends and may be useful as input for forecasting future sales. The final line simply selects the first 10 rows for display, as reconstructed in the sketch below.
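Putting the lines quoted above back together, the pipeline looks roughly like this (a minimal sketch, assuming dplyr and tidyr are loaded and that df_receipt has sales_ymd and amount columns):

library(dplyr)
library(tidyr)

# Daily sales totals
df_sum <- df_receipt %>%
  group_by(sales_ymd) %>%
  summarise(sum_amount = sum(amount), .groups = "drop")

# Stack lags of 1, 2 and 3 rows (preceding sales dates)
for (i in 1:3) {
  df_tmp <- df_sum %>%
    mutate(lag_ymd = lag(sales_ymd, n = i),
           lag_amount = lag(sum_amount, n = i))
  if (i == 1) {
    df_lag <- df_tmp
  } else {
    df_lag <- rbind(df_lag, df_tmp)  # rbind() is base R, not dplyr
  }
}

df_lag %>%
  drop_na(everything()) %>%   # drop_na() comes from tidyr
  arrange(sales_ymd, lag_ymd) %>%
  slice(1:10)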
 
 
Commentary :

This code creates a new data frame called df_tmp by selecting the customer_id and application_date columns from the df_customer data frame.

The df_customer["customer_id"] syntax selects the customer_id column as a data frame (not as a vector), which is necessary for combining it with another column using cbind().

The strptime() function is used to convert the application_date column, which is a character string in the %Y%m%d format representing the year, month, and day of application as an 8-digit number, to a POSIXlt class object that represents date and time in R. The %Y%m%d argument specifies the format of the character string to be converted.

The colnames() function is then used to assign new column names to the data frame, so that the columns are named "customer_id" and "application_date".

Finally, head() is used to show the first 10 rows of the df_tmp data frame (head() defaults to 6 rows, so here it is presumably called as head(df_tmp, 10)).

Overall, this code creates a new data frame with two columns that can be used to merge with other data frames based on customer ID or application date.
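A minimal sketch of the code described above, assuming df_customer has customer_id and application_date columns; the name df_tmp comes from the commentary, and the rest follows the steps it lists:

# Keep customer_id as a one-column data frame so cbind() returns a data frame
df_tmp <- cbind(df_customer["customer_id"],
                strptime(df_customer$application_date, "%Y%m%d"))

# Name the two columns explicitly
colnames(df_tmp) <- c("customer_id", "application_date")

# head() defaults to 6 rows, so pass n = 10 to show the first 10
head(df_tmp, 10)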
 
 
Commentary :

This code performs the following actions:

It takes a data frame df_customer as input.

It creates a new column prefecture_cd in the df_customer data frame.

The values of the prefecture_cd column are based on the values of the address column.

If the first three characters of the address column are "埼玉県", the value of the prefecture_cd column is set to "11".

If the first three characters of the address column are "千葉県", the value of the prefecture_cd column is set to "12".

If the first three characters of the address column are "東京都", the value of the prefecture_cd column is set to "13".

If the first three characters of the address column are "神奈川" (the prefecture name 神奈川県 has four characters, so only its first three are compared), the value of the prefecture_cd column is set to "14".

The customer_id, address, and prefecture_cd columns are selected for the output.

Finally, slice(1:10) is used to show the first 10 rows of the resulting data frame; a sketch of the full expression follows below.
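One way to express this mapping is with case_when() and str_sub(); the original code may use a different string helper, but the logic matches the rules listed above (a sketch, assuming the dplyr and stringr packages):

library(dplyr)
library(stringr)

df_customer %>%
  mutate(prefecture_cd = case_when(
    str_sub(address, 1, 3) == "埼玉県" ~ "11",
    str_sub(address, 1, 3) == "千葉県" ~ "12",
    str_sub(address, 1, 3) == "東京都" ~ "13",
    str_sub(address, 1, 3) == "神奈川" ~ "14"
    # addresses outside these four prefectures would become NA
  )) %>%
  select(customer_id, address, prefecture_cd) %>%
  slice(1:10)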
 
 
Commentary :

This code manipulates a data frame df_customer containing customer data. The code selects three columns: customer_id, birth_day, and age. Then, it calculates the era of each customer based on their age.

The first mutate function calculates the era of each customer by dividing their age by 10, truncating the result to the nearest integer, and then multiplying the integer by 10. For example, if a customer is 43 years old, their era will be 40. If a customer is 25 years old, their era will be 20.

The second mutate function replaces era values greater than or equal to 60 with 60. This is done to group all customers aged 60 or older into a single era category.

Finally, the code selects the first 10 rows of the resulting data frame, which contains the customer ID, birth date, age, and calculated era for each customer.
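A minimal sketch of the two mutate steps described above, assuming dplyr; trunc() and ifelse() are one plausible way to write the flooring and capping:

library(dplyr)

df_customer %>%
  select(customer_id, birth_day, age) %>%
  # Floor the age to its decade: 43 -> 40, 25 -> 20
  mutate(era = trunc(age / 10) * 10) %>%
  # Collapse everyone aged 60 or over into a single 60 category
  mutate(era = ifelse(era >= 60, 60, era)) %>%
  slice(1:10)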
 
 
Commentary :

This code creates dummy variables for the categorical variable gender_cd in the data frame df_customer. It first creates a model object using the dummyVars function from the "caret" package. The fullRank argument is set to FALSE so that a dummy column is produced for every level of the variable; setting it to TRUE would instead drop one level (which can be inferred from the others).

The second line applies the model to the original data frame df_customer to create a new data frame with the dummy variables. The predict function with the dummy_gender_model as the first argument and df_customer as the second argument applies the model to the data frame to create the dummy variables.

Finally, the code combines the customer_id column from df_customer with the dummy variables using the cbind function and displays the first 10 rows using head().
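A minimal sketch of these three steps, assuming the caret package and that gender_cd is a factor (or character) column; the object name dummy_gender_model is taken from the commentary:

library(caret)

# Build the encoder; fullRank = FALSE keeps one dummy column per level
dummy_gender_model <- dummyVars(~ gender_cd, data = df_customer, fullRank = FALSE)

# Apply it to the same data frame; predict() returns a matrix of dummy columns
df_dummies <- predict(dummy_gender_model, newdata = df_customer)

# Attach customer_id and show the first 10 rows
head(cbind(df_customer["customer_id"], df_dummies), 10)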
 
 
Commentary :

This code performs the following operations on the df_receipt data frame:

Filter out any rows where the customer_id starts with the letter "Z".

Group the remaining rows by customer_id.

Calculate the sum of amount for each group of customer_id.

Mutate the resulting data frame to add a new column scale_amount, a min-max scaled version of sum_amount. Note that scale() with its default arguments standardises to mean 0 and standard deviation 1, so producing a 0-to-1 range requires calling it with center = min(sum_amount) and scale = max(sum_amount) - min(sum_amount) (or an equivalent formula).

Slice the resulting data frame to keep only the first 10 rows.

In summary, this code calculates the total amount of money spent by each customer in df_receipt, and then scales those values so that the minimum value becomes 0 and the maximum value becomes 1. The resulting scale_amount values can be useful for certain types of analysis, such as when you want to compare the relative spending patterns of different customers.
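A minimal sketch of the whole pipeline, assuming dplyr and the min-max form of the scale() call discussed in the fourth step:

library(dplyr)

df_receipt %>%
  # Exclude customer IDs that start with "Z"
  filter(!grepl("^Z", customer_id)) %>%
  group_by(customer_id) %>%
  summarise(sum_amount = sum(amount), .groups = "drop") %>%
  # Min-max scaling: minimum maps to 0, maximum to 1
  mutate(scale_amount = scale(sum_amount,
                              center = min(sum_amount),
                              scale = max(sum_amount) - min(sum_amount))) %>%
  slice(1:10)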
 