Data Science 100 Knocks (Structured Data Processing) – Python Part4 (Q61 to Q80)

Articles in English
 
You can get preprocess_knock files to practice python, SQL, R from my github account
 
ChanhiYasutomi - Repositories
ChanhiYasutomi has 19 repositories available. Follow their code on GitHub.
 

 

 

 
Commentary :

This code also manipulats pandas DataFrame objects, df_receipt and df_product.

Here's a breakdown of the code line-by-line:

df_tmp_1 = df_receipt.groupby('customer_id').agg({'amount':'sum'}).reset_index().rename(columns={'amount':'sum_all'}): This line groups the df_receipt DataFrame by "customer_id", calculates the sum of the "amount" column for each group, and then resets the index of the resulting DataFrame. The resulting DataFrame is then renamed to "sum_all" using the .rename() method.

df_tmp_2 = pd.merge(df_receipt, df_product.query('category_major_cd == "07"'), how='inner', on='product_cd').groupby('customer_id').agg({'amount':'sum'}).reset_index().rename(columns={'amount':'sum_07'}): This line first filters df_product to include only rows where the "category_major_cd" column equals "07". It then merges this filtered DataFrame with df_receipt on the "product_cd" column using an inner join. The resulting DataFrame is then grouped by "customer_id", the sum of the "amount" column is calculated for each group, and the index is reset. The resulting DataFrame is then renamed to "sum_07" using the .rename() method.

df_tmp_3 = pd.merge(df_tmp_1, df_tmp_2, how='inner', on='customer_id'): This line merges df_tmp_1 and df_tmp_2 on the "customer_id" column using an inner join. This creates a new DataFrame that contains the "sum_all" and "sum_07" columns for each customer.

df_tmp_3['sales_rate'] = df_tmp_3['sum_07'] / df_tmp_3['sum_all']: This line calculates a new column in df_tmp_3 called "sales_rate", which is the result of dividing the "sum_07" column by the "sum_all" column. This calculates the percentage of a customer's total purchases that are from the "07" category.

df_tmp_3.head(10): This line prints out the first 10 rows of the df_tmp_3 DataFrame. This uses to check that the calculations in the previous lines have been performed correctly.
 
Commentary :

This code is performing data manipulation on a DataFrame named df_receipt and another DataFrame named df_customer. It calculates the number of days between a customer's application date and their purchase date.

Here's a step-by-step breakdown of the code:

df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates(): creates a new DataFrame named df_tmp by selecting only the columns "customer_id" and "sales_ymd" from the df_receipt DataFrame, and then dropping any duplicate rows. The resulting DataFrame will have a unique combination of customer IDs and sales dates.

df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']], how='inner', on='customer_id'): performs an inner join on df_tmp and the df_customer DataFrame, selecting only the "customer_id" and "application_date" columns from the df_customer DataFrame. The resulting DataFrame will only contain rows with customer IDs that are present in both df_tmp and df_customer.

df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str')): converts the "sales_ymd" column of df_tmp to a datetime format using the pd.to_datetime() function. This function converts the column from a string format to a datetime format that can be used for calculations.

df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date']): converts the "application_date" column of df_tmp to a datetime format using the pd.to_datetime() function. This function converts the column from a string format to a datetime format that can be used for calculations.

df_tmp['elapsed_days'] = df_tmp['sales_ymd'] - df_tmp['application_date']: calculates the difference between the "sales_ymd" and "application_date" columns and assigns the result to a new column named "elapsed_days" in df_tmp. This calculation will result in a timedelta object, which represents the difference between two dates in days, hours, minutes, and seconds.

df_tmp['elapsed_days'] = df_tmp['elapsed_days'].dt.days: extracts only the days from the "elapsed_days" column and assigns the result back to the same column. This step is necessary because the "elapsed_days" column currently contains a timedelta object, but we only want to keep the number of days.

df_tmp.head(10): displays the first 10 rows of df_tmp to verify that the calculations were performed correctly.

Overall, this code is useful for analyzing customer behavior and determining how long it takes for customers to make a purchase after applying for a membership or account.
 
 

 

Commentary :

This code is used to remove all rows containing any null (missing) values from a pandas DataFrame df_product and create a new DataFrame df_product_1 that contains only the non-null rows. Here is a step-by-step explanation:

df_product: This is the name of the original DataFrame that is being used.

df_product.copy(): This method creates a copy of the original DataFrame df_product so that the original DataFrame is not modified.

df_product_1: This is the name of the new DataFrame that is being created.

dropna(): This method is used to drop all rows that contain any null values from df_product_1. The inplace=True argument is used to modify df_product_1 in place rather than returning a new DataFrame.

print('Before deletion:', len(df_product)): This line prints the number of rows in the original DataFrame df_product before the null values are dropped.

print('After deletion:', len(df_product_1)): This line prints the number of rows in the new DataFrame df_product_1 after the null values are dropped.

Therefore, this code first creates a copy of the original DataFrame df_product, drops all rows that contain any null values from the copied DataFrame to create a new DataFrame df_product_1, and finally prints the number of rows in the original and new DataFrames before and after the deletion of null values. This code can be used to identify the number of rows that are removed due to null values and to create a clean DataFrame that does not contain any null values.
 
Data Science 100 Knocks (Structured Data Processing) - Python
This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
Data Science 100 Knocks (Structured Data Processing) - Python Part1 (Q1 to Q20)
This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
Data Science 100 Knocks (Structured Data Processing) - Python Part2 (Q21 to Q40)
This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
Data Science 100 Knocks (Structured Data Processing) - Python Part3 (Q41 to Q60)
This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
Data Science 100 Knocks (Structured Data Processing) - Python Part4 (Q61 to Q80)
This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
Data Science 100 Knocks (Structured Data Processing) - Python Part5 (Q81 to Q100)
This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
Data Science 100 Knocks (Structured Data Processing)
This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, SQL, R, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

Comment