Data Science 100 Knocks (Structured Data Processing) – R Part2 (Q21 to Q40)

 
 
Commentary :

This code uses the dplyr package to manipulate a data frame called df_receipt. Here is a step-by-step explanation of what it does:

%>% is a pipe operator used to chain functions together, so the output of one function becomes the input of the next function.

The group_by function groups the data frame by store_cd and product_cd.

The summarise function calculates a summary statistic for each group. In this case, it counts the number of rows in each group with n() and stores the result in a new column called cnt.

The .groups = "drop_last" argument tells summarise() to drop the last grouping variable, product_cd, from the result, so the summarised data frame remains grouped by store_cd only. This is what allows the next step to find the maximum count within each store.

The filter function selects rows that meet a condition. Because the data is still grouped by store_cd at this point, cnt == max(cnt) keeps, within each store, the rows whose count equals that store's maximum.

The ungroup function removes all grouping levels from the data frame.

The slice function selects a subset of rows based on their position. In this case, it is selecting the first 10 rows of the data frame.

So overall, this code counts the rows for each store and product combination, keeps the best-selling product(s) within each store, and then returns the first 10 of those rows.
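Putting the steps above together, the pipeline described in this commentary can be reconstructed as follows (a sketch based on the explanation; df_receipt and the columns store_cd and product_cd come from the exercise data set):

```r
library(dplyr)

df_receipt %>%
  group_by(store_cd, product_cd) %>%                # group by store and product
  summarise(cnt = n(), .groups = "drop_last") %>%   # count rows; result stays grouped by store_cd
  filter(cnt == max(cnt)) %>%                       # keep each store's best-selling product(s)
  ungroup() %>%                                     # remove the remaining grouping
  slice(1:10)                                       # take the first 10 rows
```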
 
 
Commentary :

This code builds two summary tables from the receipt data and then full-joins them on the "customer_id" column: the top 20 customers by number of distinct sales days, and the top 20 customers by total amount spent.

Let's break down the code step by step:

df_data <- df_receipt %>% filter(!grepl("^Z", customer_id)) %>% group_by(customer_id): This filters df_receipt to exclude any rows where the "customer_id" column starts with the letter "Z", then groups the result by "customer_id". The grouped data frame is stored as df_data.

df_cnt <- df_data %>% summarise(come_days = n_distinct(sales_ymd), .groups = "drop") %>% arrange(desc(come_days), customer_id) %>% slice(1:20): This summarises df_data by counting the number of distinct "sales_ymd" values per customer with n_distinct(), producing a "come_days" column. The result is sorted in descending order of "come_days" (with "customer_id" as a tie-breaker), and slice(1:20) keeps the first 20 rows.

df_sum <- df_data %>% summarise(sum_amount = sum(amount), .groups = "drop") %>% arrange(desc(sum_amount)) %>% slice(1:20): This summarises df_data by totalling the "amount" column per customer into a "sum_amount" column. The result is sorted in descending order of "sum_amount", and slice(1:20) keeps the first 20 rows.

full_join(df_cnt, df_sum, by = "customer_id"): This performs a full join between df_cnt and df_sum on the "customer_id" column, so every customer appearing in either top-20 list is kept. Customers present in only one of the two tables get NA in the columns contributed by the other table.

In summary, the code ranks customers in two ways, by number of distinct visit days and by total amount spent, and combines the two top-20 lists into a single table.
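The full pipeline quoted in this commentary, assembled in one place (df_receipt, sales_ymd, and amount are names from the exercise data set):

```r
library(dplyr)

# Exclude customer IDs starting with "Z" and group by customer
df_data <- df_receipt %>%
  filter(!grepl("^Z", customer_id)) %>%
  group_by(customer_id)

# Top 20 customers by number of distinct sales days
df_cnt <- df_data %>%
  summarise(come_days = n_distinct(sales_ymd), .groups = "drop") %>%
  arrange(desc(come_days), customer_id) %>%
  slice(1:20)

# Top 20 customers by total amount spent
df_sum <- df_data %>%
  summarise(sum_amount = sum(amount), .groups = "drop") %>%
  arrange(desc(sum_amount)) %>%
  slice(1:20)

# Keep every customer that appears in either top-20 list;
# a customer missing from one list gets NA in that list's columns
full_join(df_cnt, df_sum, by = "customer_id")
```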
 
 
Commentary :

This code performs a full join operation between the df_store and df_product data frames based on a common "key" column that is added to both data frames. The resulting data frame is then used to calculate the number of rows using the nrow() function.

Here is a breakdown of the code:

df_store_tmp <- df_store: This code creates a new data frame called df_store_tmp that is a copy of the df_store data frame. This is done to avoid modifying the original data frame.

df_product_tmp <- df_product: This code creates a new data frame called df_product_tmp that is a copy of the df_product data frame. This is done to avoid modifying the original data frame.

df_store_tmp["key"] <- 0: This code adds a new column called "key" to the df_store_tmp data frame and initializes all the values to zero. This column is added to provide a common column for the join operation.

df_product_tmp["key"] <- 0: This code adds a new column called "key" to the df_product_tmp data frame and initializes all the values to zero. This column is added to provide a common column for the join operation.

full_join(df_store_tmp, df_product_tmp, by = "key"): This code performs a full join operation between the df_store_tmp and df_product_tmp data frames based on the common "key" column. Since both data frames have the same value in this column for all rows, a full join will result in a Cartesian product of the two data frames, meaning that every row from df_store_tmp will be paired with every row from df_product_tmp.

nrow(full_join(df_store_tmp, df_product_tmp, by = "key")): This code calculates the number of rows in the resulting data frame from the full join operation using the nrow() function. The result is the total number of combinations of rows between df_store and df_product.

In summary, this code creates copies of the df_store and df_product data frames, adds a common "key" column to both data frames with all values set to zero, performs a full join operation between the two data frames based on the common "key" column, and calculates the number of resulting rows. The purpose of this code is to determine the total number of combinations of rows between the df_store and df_product data frames.
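The cross-join technique described above, reconstructed as a single snippet (df_store and df_product are the exercise data frames):

```r
library(dplyr)

# Work on copies so the original data frames are untouched
df_store_tmp <- df_store
df_product_tmp <- df_product

# A constant "key" column gives every row the same join value
df_store_tmp["key"] <- 0
df_product_tmp["key"] <- 0

# Joining on the constant key pairs every store row with every
# product row, i.e. a Cartesian product; the row count equals
# nrow(df_store) * nrow(df_product)
nrow(full_join(df_store_tmp, df_product_tmp, by = "key"))
```

In recent dplyr versions the dummy key is unnecessary: cross_join(df_store, df_product) (dplyr 1.1.0+) or tidyr::crossing() produce the same Cartesian product directly, but the dummy-key approach works in any version.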
 
 
Data Science 100 Knocks (Structured Data Processing) - R
This is an ipynb file originally created by The Data Scientist Society (データサイエンティスト協会スキル定義委員, the society's Skill Definition Committee) and translated from Japanese to English by DeepL. I updated this file to spread this practice, which is useful for everyone who wants to practice R, from beginners to advanced engineers. Since the data was created for a Japanese audience, you may run into language issues while practicing, but they will not affect the exercises much.
Data Science 100 Knocks (Structured Data Processing) - R Part1 (Q1 to Q20)
Data Science 100 Knocks (Structured Data Processing) - R Part2 (Q21 to Q40)
Data Science 100 Knocks (Structured Data Processing) - R Part3 (Q41 to Q60)
Data Science 100 Knocks (Structured Data Processing) - R Part4 (Q61 to Q80)
Data Science 100 Knocks (Structured Data Processing) - R Part5 (Q81 to Q100)
Data Science 100 Knocks (Structured Data Processing) - SQL

 
