Data Science 100 Knocks (Structured Data Processing) – Python Part3 (Q41 to Q60)

Articles in English
 
You can get the preprocess_knock files to practice Python, SQL, and R from my GitHub account
 
ChanhiYasutomi - Repositories
 
Commentary :

This code is performing a lag analysis on the sales data.

First, the code groups the sales data by date and sums the sales amount for each date to get the total sales amount for each day. This is stored in df_sales_amount_by_date.

Then, the code enters a for loop that iterates three times, from 1 to 3 (inclusive).

Within each iteration of the for loop, the code creates a new DataFrame called df_tmp by concatenating the df_sales_amount_by_date DataFrame with itself shifted by i rows. This results in a DataFrame where each row contains the sales data for a specific date and the sales data for the ith previous date.

If i is equal to 1, then the resulting DataFrame is stored in df_lag. Otherwise, the DataFrame is appended to df_lag.

The columns of df_lag are then renamed to sales_ymd, amount, lag_ymd, and lag_amount to indicate the date and amount of sales for the current date and the lagged date.

Finally, the code drops any rows containing NaN values (which would occur for the first i rows) and converts the sales amounts to integers before sorting the DataFrame by sales_ymd and lag_ymd and displaying the first 10 rows.

This provides a summary of the total sales amount for each date and the total sales amounts for the 1st, 2nd, and 3rd previous dates, allowing the user to perform lag analysis on the sales data.
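The steps above can be sketched as follows. This is a minimal sketch using a tiny made-up df_receipt; it assumes only that the receipt data has sales_ymd and amount columns, as in the exercise dataset:

```python
import pandas as pd

# Made-up receipt data standing in for the exercise's df_receipt
df_receipt = pd.DataFrame({
    'sales_ymd': [20181103, 20181103, 20181104, 20181105, 20181106],
    'amount': [100, 50, 200, 300, 150],
})

# Total sales amount per date
df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']] \
    .groupby('sales_ymd').sum().reset_index()

# Pair each date with the 1st, 2nd, and 3rd previous dates
for i in range(1, 4):
    df_tmp = pd.concat(
        [df_sales_amount_by_date, df_sales_amount_by_date.shift(i)], axis=1)
    if i == 1:
        df_lag = df_tmp
    else:
        df_lag = pd.concat([df_lag, df_tmp], axis=0)

df_lag.columns = ['sales_ymd', 'amount', 'lag_ymd', 'lag_amount']
# Drop the rows where no lagged date exists, then tidy up and sort
df_lag = df_lag.dropna().astype(int).sort_values(['sales_ymd', 'lag_ymd'])
print(df_lag.head(10))
```

Each surviving row pairs one date's total with the total from i days of data earlier, which is exactly the shape needed for lag analysis.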
 
 
Commentary :

This code does the following:

Merges two dataframes df_receipt and df_customer on the common column customer_id, with an inner join, and saves the result to a new dataframe df_tmp.

Creates a new column era in the df_tmp dataframe by flooring each customer's age to the nearest multiple of 10 (for example, an age of 37 becomes an era of 30).

Calculates the sum of the amount column of the df_tmp dataframe for each unique combination of era and gender_cd values, and saves the result as a new dataframe df_sales_summary using the pd.pivot_table() function. The resulting dataframe has era values as the index and gender_cd values as the columns.

Renames the columns of the df_sales_summary dataframe to be more descriptive, with era as the first column, followed by male, female, and unknown, which represent the sum of the amount column for each respective gender_cd value.
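A minimal sketch of these steps, using made-up data, is shown below. It assumes the gender_cd coding used in the exercise dataset ('0' = male, '1' = female, '9' = unknown), which determines the column order after the pivot:

```python
import pandas as pd

# Made-up stand-ins for the exercise's df_receipt and df_customer
df_receipt = pd.DataFrame({'customer_id': ['A', 'B', 'C', 'A'],
                           'amount': [100, 200, 300, 400]})
df_customer = pd.DataFrame({'customer_id': ['A', 'B', 'C'],
                            'gender_cd': ['1', '9', '0'],
                            'age': [34, 45, 57]})

# Inner join on the shared customer_id column
df_tmp = pd.merge(df_receipt, df_customer, how='inner', on='customer_id')

# Floor each age to its decade (34 -> 30)
df_tmp['era'] = df_tmp['age'].apply(lambda x: x // 10 * 10)

# Sum of amount for each (era, gender_cd) combination
df_sales_summary = pd.pivot_table(
    df_tmp, index='era', columns='gender_cd',
    values='amount', aggfunc='sum').reset_index()

# Columns come out sorted as '0', '1', '9', i.e. male, female, unknown
df_sales_summary.columns = ['era', 'male', 'female', 'unknown']
print(df_sales_summary)
```

Cells with no matching customers come out as NaN, since pivot_table only aggregates combinations that actually occur.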
 
 
 
 
 
Commentary :

The code performs the following operations:

Selects rows from df_receipt where customer_id does not start with the letter 'Z' using the query method and assigns it to df_sales_amount.

Selects the customer_id and amount columns from df_sales_amount and calculates the sum of amount for each customer_id using the groupby method. The result is assigned to df_sales_amount.

Adds a new column sales_flg to df_sales_amount which contains 1 if the amount is greater than 2000 and 0 otherwise, using the apply method with a lambda function.

Displays the first 10 rows of the resulting DataFrame df_sales_amount.

So, the resulting DataFrame df_sales_amount contains the customer_id, total amount spent by each customer and a binary flag sales_flg indicating whether the customer's spending is high (greater than 2000) or not.
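These steps can be sketched as follows with made-up receipt rows; the 'Z' prefix marking non-member customers is taken from the exercise dataset's convention:

```python
import pandas as pd

# Made-up receipt data; IDs starting with 'Z' are non-member customers
df_receipt = pd.DataFrame({'customer_id': ['CS001', 'CS002', 'ZZ000', 'CS001'],
                           'amount': [1500, 2500, 900, 800]})

# Drop non-member rows; engine='python' is needed for str.startswith in query
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
                                   engine='python')

# Total amount per customer
df_sales_amount = df_sales_amount[['customer_id', 'amount']] \
    .groupby('customer_id').sum().reset_index()

# Flag customers whose total exceeds 2,000
df_sales_amount['sales_flg'] = df_sales_amount['amount'] \
    .apply(lambda x: 1 if x > 2000 else 0)
print(df_sales_amount.head(10))
```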
 
 
Commentary :

This code performs the following tasks:

Creates a copy of the "customer_id" and "postal_cd" columns from the "df_customer" DataFrame and assigns it to a new DataFrame called "df_tmp".

Creates a new column "postal_flg" in "df_tmp" based on the first 3 digits of the "postal_cd" column. If the first 3 digits are between 100 and 209 (inclusive), "postal_flg" is set to 1, otherwise it is set to 0.

Merges "df_tmp" and "df_receipt" on the "customer_id" column using an inner join.

Groups the resulting DataFrame by the "postal_flg" column and counts the number of unique customer IDs in each group using the "nunique()" function.

Returns the resulting DataFrame showing the count of unique customer IDs grouped by the "postal_flg" column.

In summary, this code counts how many customers with purchase history live inside versus outside the postal-code range 100 to 209.
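A sketch of these operations, assuming made-up customer and receipt tables with the exercise dataset's customer_id, postal_cd, and amount columns:

```python
import pandas as pd

# Made-up stand-ins for df_customer and df_receipt
df_customer = pd.DataFrame({'customer_id': ['A', 'B', 'C'],
                            'postal_cd': ['160-0021', '279-0013', '100-0005']})
df_receipt = pd.DataFrame({'customer_id': ['A', 'B', 'B', 'C'],
                           'amount': [100, 200, 300, 400]})

df_tmp = df_customer[['customer_id', 'postal_cd']].copy()

# 1 if the first three digits fall in the range 100-209, else 0
df_tmp['postal_flg'] = df_tmp['postal_cd'] \
    .apply(lambda x: 1 if 100 <= int(x[0:3]) <= 209 else 0)

# Inner join keeps only customers with purchase history,
# then count unique customer IDs per flag
result = pd.merge(df_tmp, df_receipt, how='inner', on='customer_id') \
    .groupby('postal_flg').agg({'customer_id': 'nunique'})
print(result)
```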
 
 
Commentary :

This code is extracting the prefecture code from the address column of the df_customer DataFrame and creating a new DataFrame df_customer_tmp with columns 'customer_id', 'address', and 'prefecture_cd'.

The code first creates a copy of the 'customer_id' and 'address' columns of the df_customer DataFrame using the copy() method and assigns it to the df_customer_tmp variable.

```python
df_customer_tmp = df_customer[['customer_id', 'address']].copy()
```
The code then extracts the first 3 characters of the 'address' column using the str accessor and converts them to prefecture codes with the map() method, storing the result in a new column 'prefecture_cd'.

```python
df_customer_tmp['prefecture_cd'] = df_customer['address'].str[0:3].map({'埼玉県': '11', '千葉県':'12', '東京都':'13', '神奈川':'14'})
```

The map() method maps each three-character prefix to its corresponding prefecture code (note that '神奈川' is only the first three characters of 神奈川県, Kanagawa, because the slice is three characters long). The resulting DataFrame df_customer_tmp has columns 'customer_id', 'address', and 'prefecture_cd'.
 
Commentary :

This code computes quartiles of sales amount for each customer, and then assigns each customer to one of four groups based on their sales amount quartile.

Here is a step-by-step breakdown of the code:

df_sales_amount = df_receipt[['customer_id', 'amount']].groupby('customer_id').sum().reset_index() computes the total sales amount for each customer by grouping the df_receipt dataframe by customer_id and summing up the amount column. The resulting dataframe has two columns: customer_id and amount.

pct25 = np.quantile(df_sales_amount['amount'], 0.25), pct50 = np.quantile(df_sales_amount['amount'], 0.5), and pct75 = np.quantile(df_sales_amount['amount'], 0.75) compute the 25th, 50th, and 75th percentiles of the sales amount distribution.

def pct_group(x): ... defines a function pct_group that takes a sales amount value x and returns an integer from 1 to 4 representing the quartile that x belongs to.

df_sales_amount['pct_group'] = df_sales_amount['amount'].apply(pct_group) applies the pct_group function to each value in the amount column of df_sales_amount, and assigns the resulting quartile number to a new column called pct_group.

df_sales_amount.head(10) displays the first 10 rows of the resulting dataframe, which now has three columns: customer_id, amount, and pct_group.
 
 
Commentary :

This code is creating a new column 'pct_group' in a DataFrame 'df_temp' that contains the total sales amount of each customer and the percentile group they belong to based on their total sales amount.

The first line groups the 'df_receipt' DataFrame by customer_id and calculates the sum of sales amount for each customer, and then resets the index to make the customer_id a column again.

The next four lines calculate the 25th, 50th, and 75th percentiles and the maximum value of the sales amount column of this grouped DataFrame.

The next line creates a new column 'quantile' in 'df_temp' by cutting the sales amount into four bins with the defined percentile values.

Finally, the last line groups the data by 'quantile' and assigns each customer a group number (1-4) in a new column 'pct_group', according to which percentile bin they fall into.

Note that the right endpoint of the last bin is defined as 'pct_max+0.1' to ensure that the maximum sales amount value is included in the last percentile group.
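A sketch of this pd.cut-based variant on made-up data; the use of groupby(...).ngroup() + 1 to number the bins is one way to implement the final step described above, not necessarily the exact code from the exercise:

```python
import numpy as np
import pandas as pd

# Made-up receipt data
df_receipt = pd.DataFrame({'customer_id': ['A', 'B', 'C', 'D'],
                           'amount': [200, 400, 900, 1600]})

# Total sales amount per customer
df_temp = df_receipt[['customer_id', 'amount']] \
    .groupby('customer_id').sum().reset_index()

pct25 = np.quantile(df_temp['amount'], 0.25)
pct50 = np.quantile(df_temp['amount'], 0.5)
pct75 = np.quantile(df_temp['amount'], 0.75)
pct_max = df_temp['amount'].max()

# Cut the totals into four bins; pct_max + 0.1 keeps the maximum
# inside the last half-open interval
df_temp['quantile'] = pd.cut(df_temp['amount'],
                             bins=[0, pct25, pct50, pct75, pct_max + 0.1],
                             right=False)

# Number the bins 1-4 in interval order
df_temp['pct_group'] = df_temp.groupby('quantile', observed=False).ngroup() + 1
print(df_temp.head(10))
```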
 
Commentary :

This code creates a new DataFrame df_customer_era containing the columns customer_id, birth_day, and era.

The customer_id and birth_day columns are taken from the original df_customer DataFrame.

The era column is computed from the age column in the df_customer DataFrame: each age is floored to its decade (a multiple of 10), and ages of 60 or greater are all assigned an era of 60.

The apply method applies a function (in this case, a lambda function) to each element in the age column, and the min and math.floor functions compute the decade. The resulting value is assigned to the era column in df_customer_era.

Finally, the head method is called to display the first 10 rows of the resulting DataFrame.
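These steps can be sketched as follows, using made-up ages and birth dates:

```python
import math
import pandas as pd

# Made-up customer data
df_customer = pd.DataFrame({'customer_id': ['A', 'B', 'C'],
                            'birth_day': ['1981-04-29', '1952-04-01', '2001-09-14'],
                            'age': [37, 66, 17]})

df_customer_era = df_customer[['customer_id', 'birth_day']].copy()

# Floor each age to its decade, capping at 60 (66 -> 60, 37 -> 30)
df_customer_era['era'] = df_customer['age'] \
    .apply(lambda x: min(math.floor(x / 10) * 10, 60))
print(df_customer_era.head(10))
```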
 
 
Commentary :

This code creates dummy variables for the categorical variable 'gender_cd' in the DataFrame 'df_customer'. The resulting DataFrame will have a column for each category in 'gender_cd' (presumably male and female), and each row will have a 1 in the appropriate column if the original customer record had that value for 'gender_cd', and 0s in all other columns.

The 'get_dummies' function from pandas is used for this purpose. It takes in a DataFrame and a list of columns to apply the dummy encoding to. In this case, it is called on the 'df_customer' DataFrame, and the 'gender_cd' column is specified in the 'columns' parameter.

As noted above, which dummy column corresponds to which gender depends on the encoding used for 'gender_cd' in the original data. The first 10 rows of the resulting DataFrame are returned.
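A minimal sketch, assuming the '0'/'1'/'9' gender codes of the exercise dataset:

```python
import pandas as pd

# Made-up customer data with string gender codes
df_customer = pd.DataFrame({'customer_id': ['A', 'B', 'C'],
                            'gender_cd': ['0', '1', '9']})

# One dummy column per distinct gender_cd value;
# columns not listed in `columns` pass through unchanged
df_dummies = pd.get_dummies(df_customer, columns=['gender_cd'])
print(df_dummies.head(10))
```

Each row has exactly one truthy value among the gender_cd_* columns, in the column matching its original code.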
 
 
Commentary :

This code does the following:

Filters out the rows in df_receipt where customer_id starts with the letter "Z".

Groups the remaining rows by customer_id and calculates the sum of the amount column for each group using the agg method.

Resets the index of the resulting DataFrame to make customer_id a regular column.

Applies the preprocessing.scale function from scikit-learn to the amount column to calculate the z-score of each customer's total sales.

Adds a new column to the DataFrame called std_amount that contains the z-scores calculated in step 4.

The resulting DataFrame df_sales_amount contains one row for each customer, with columns for their customer_id, total sales amount, and std_amount, which is the z-score of their sales.
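The steps above can be sketched like this on made-up data (note that sklearn's preprocessing.scale standardizes to zero mean and unit variance across all customers' totals):

```python
import pandas as pd
from sklearn import preprocessing

# Made-up receipt data; IDs starting with 'Z' are non-members
df_receipt = pd.DataFrame({'customer_id': ['CS001', 'CS002', 'CS003', 'ZZ000'],
                           'amount': [1000, 2000, 3000, 500]})

# Filter out non-members, then total the amount per customer
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
                                   engine='python') \
    .groupby('customer_id').agg({'amount': 'sum'}).reset_index()

# z-score of each total: (x - mean) / std
df_sales_amount['std_amount'] = preprocessing.scale(df_sales_amount['amount'])
print(df_sales_amount.head(10))
```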
 
 
Commentary :

This code performs data preprocessing on a pandas DataFrame df_receipt. Here's what each line of the code does:

Filters out rows where the customer ID starts with "Z" using the query() method with engine='python' (required because the expression calls the str.startswith() string method). The resulting DataFrame is assigned to df_sales_amount.

Groups the DataFrame df_sales_amount by customer ID using the groupby() method and calculates the sum of the amount column for each group using the agg() method. The resulting DataFrame is assigned to df_sales_amount.

Resets the index of df_sales_amount to turn the customer_id column into a regular column using the reset_index() method.

Applies the minmax_scale() function from the preprocessing module of scikit-learn to the amount column of df_sales_amount to scale the values between 0 and 1. The resulting scaled values are stored in a new column called scale_amount.

Therefore, this code computes the total amount of sales made by each customer, filters out rows for certain customers (whose IDs start with "Z"), and then scales the sales amount for each customer between 0 and 1 using min-max scaling.
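A minimal sketch of the min-max variant, again on made-up data:

```python
import pandas as pd
from sklearn import preprocessing

# Made-up receipt data; IDs starting with 'Z' are non-members
df_receipt = pd.DataFrame({'customer_id': ['CS001', 'CS002', 'CS003', 'ZZ000'],
                           'amount': [1000, 2000, 3000, 500]})

# Filter out non-members, then total the amount per customer
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
                                   engine='python') \
    .groupby('customer_id').agg({'amount': 'sum'}).reset_index()

# Rescale the totals linearly onto [0, 1]
df_sales_amount['scale_amount'] = preprocessing.minmax_scale(
    df_sales_amount['amount'])
print(df_sales_amount.head(10))
```

minmax_scale is the one-shot functional form; the next commentary covers the equivalent MinMaxScaler object, whose fit/transform split lets you reuse the learned range on new data.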
 

 

Commentary :

This code performs feature scaling on the sales amount data for each customer in a retail store. Here's what each line does:

df_sales_amount: Creates a new DataFrame that includes the customer_id and total amount spent by each customer, obtained by grouping the df_receipt DataFrame by customer_id and summing the amount for each group.

scaler: Initializes a MinMaxScaler object from the preprocessing module of scikit-learn, which will scale the data in the range [0,1].

scaler.fit(): Fits the scaler object to the amount column of df_sales_amount, computing the minimum and maximum values of the data.

df_sales_amount['scale_amount']: Creates a new column in the df_sales_amount DataFrame to store the scaled amount data.

scaler.transform(): Scales the amount data for each customer using the minimum and maximum values learned by the scaler object.

df_sales_amount.head(10): Displays the first 10 rows of the updated df_sales_amount DataFrame, including the scaled amount data in the new scale_amount column.

Overall, this code performs feature scaling to bring the values of the amount data for each customer into the same range, making them comparable and more useful for analysis.
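A sketch of the scaler-object version on made-up data; the .ravel() call (flattening the 2-D transform output back to one column) is an implementation detail added here, not necessarily present in the original code:

```python
import pandas as pd
from sklearn import preprocessing

# Made-up receipt data
df_receipt = pd.DataFrame({'customer_id': ['CS001', 'CS002', 'CS003'],
                           'amount': [1000, 2000, 3000]})

# Total amount per customer
df_sales_amount = df_receipt[['customer_id', 'amount']] \
    .groupby('customer_id').sum().reset_index()

# fit() learns the min and max; transform() applies the scaling.
# scikit-learn expects 2-D input, hence the double brackets.
scaler = preprocessing.MinMaxScaler()
scaler.fit(df_sales_amount[['amount']])
df_sales_amount['scale_amount'] = scaler.transform(
    df_sales_amount[['amount']]).ravel()
print(df_sales_amount.head(10))
```

Keeping the fitted scaler object around is the advantage over minmax_scale: the same learned range can later be applied to new customers' totals.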
 
Data Science 100 Knocks (Structured Data Processing) - Python
This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
Data Science 100 Knocks (Structured Data Processing) - Python Part1 (Q1 to Q20)
Data Science 100 Knocks (Structured Data Processing) - Python Part2 (Q21 to Q40)
Data Science 100 Knocks (Structured Data Processing) - Python Part3 (Q41 to Q60)
Data Science 100 Knocks (Structured Data Processing) - Python Part4 (Q61 to Q80)
Data Science 100 Knocks (Structured Data Processing) - Python Part5 (Q81 to Q100)
Data Science 100 Knocks (Structured Data Processing)
