参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック（構造化データ加工編）」

The Data Scientist Society Github :

GitHub - The-Japan-DataScientist-Society/100knocks-preprocess: データサイエンス100本ノック（構造化データ加工編）

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Data Science 100 Knocks (Structured Data Processing) URL :

100knocks-preprocess/docker/work/answer/ans_preprocess_knock_Python.ipynb at master · The-Japan-DataScientist-Society/100knocks-preprocess

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers.
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

You can get preprocess_knock files to practice python, SQL, R from my github account

ChanhiYasutomi - Repositories

ChanhiYasutomi has 20 repositories available. Follow their code on GitHub.

Getting Started

First, run the following cells
Import the necessary libraries and read data from the database (PostgreSQL)
The following cell imports libraries that are expected to be used, such as pandas.
If there are other libraries you wish to use, install them as needed (you can also install them by “!pip install library name”).
The process can be divided into multiple times.
Names, addresses, etc. are dummy data and are not real.

In [ ]:

import os
import pandas as pd
import numpy as np
from datetime import datetime, date
from dateutil.relativedelta import relativedelta
import math
import psycopg2
from sqlalchemy import create_engine
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from imblearn.under_sampling import RandomUnderSampler


if 'PG_PORT' in os.environ:
    pgconfig = {
        'host': 'db',
        'port': os.environ['PG_PORT'],
        'database': os.environ['PG_DATABASE'],
        'user': os.environ['PG_USER'],
        'password': os.environ['PG_PASSWORD'],
    }


    # Connector for pd.read_sql
    conn = psycopg2.connect(**pgconfig)

    df_customer = pd.read_sql(sql='select * from customer', con=conn)
    df_category = pd.read_sql(sql='select * from category', con=conn)
    df_product = pd.read_sql(sql='select * from product', con=conn)
    df_receipt = pd.read_sql(sql='select * from receipt', con=conn)
    df_store = pd.read_sql(sql='select * from store', con=conn)
    df_geocode = pd.read_sql(sql='select * from geocode', con=conn)

else:
    if not os.path.exists('data/'):
        !git clone https://github.com/The-Japan-DataScientist-Society/100knocks-preprocess
        os.chdir('100knocks-preprocess/docker/work/')

    dtype = {
        'customer_id': str,
        'gender_cd': str,
        'postal_cd': str,
        'application_store_cd': str,
        'status_cd': str,
        'category_major_cd': str,
        'category_medium_cd': str,
        'category_small_cd': str,
        'product_cd': str,
        'store_cd': str,
        'prefecture_cd': str,
        'tel_no': str,
        'postal_cd': str,
        'street': str
    }

    df_customer = pd.read_csv("data/customer.csv", dtype=dtype)
    df_category = pd.read_csv("data/category.csv", dtype=dtype)
    df_product = pd.read_csv("data/product.csv", dtype=dtype)
    df_receipt = pd.read_csv("data/receipt.csv", dtype=dtype)
    df_store = pd.read_csv("data/store.csv", dtype=dtype)
    df_geocode = pd.read_csv("data/geocode.csv", dtype=dtype)

fatal: destination path '100knocks-preprocess' already exists and is not an empty directory.

Exercise

P-061: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), convert the total sales amount to ordinary logarithm (bottom 10) and display 10 items with customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.

In [ ]:

df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', 
                                   engine='python'). \
                                   groupby('customer_id'). \
                                   agg({'amount':'sum'}).reset_index()

df_sales_amount['log_amount'] = np.log10(df_sales_amount['amount'] + 0.5)

df_sales_amount.head(10)

Out[ ]:

	customer_id	amount	log_amount
0	CS001113000004	1298	3.113442
1	CS001114000005	626	2.796921
2	CS001115000010	3044	3.483516
3	CS001205000004	1988	3.298526
4	CS001205000006	3337	3.523421
5	CS001211000025	456	2.659441
6	CS001212000027	448	2.651762
7	CS001212000031	296	2.472025
8	CS001212000046	228	2.358886
9	CS001212000070	456	2.659441

Commentary :

The code performs the following actions:

df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python')
It selects all rows from a DataFrame called df_receipt where the customer_id column does not start with the letter "Z". The resulting DataFrame is saved as df_sales_amount.

df_sales_amount = df_sales_amount.groupby('customer_id').agg({'amount':'sum'}).reset_index()
It groups the df_sales_amount DataFrame by customer_id and calculates the sum of the amount column for each group using the agg method. The resulting DataFrame is saved as df_sales_amount.

df_sales_amount['log_amount'] = np.log10(df_sales_amount['amount'] + 0.5)
It calculates the logarithm (base 10) of the amount column in the df_sales_amount DataFrame, adding 0.5 to the amount values first to avoid taking the logarithm of 0 or negative values. The resulting logarithmic values are saved in a new column called log_amount.

df_sales_amount.head(10)
It prints the first 10 rows of the resulting df_sales_amount DataFrame using the head method.

In summary, the code selects rows from a DataFrame where the customer ID does not start with "Z", groups those rows by customer ID, calculates the sum of sales amounts for each customer, takes the logarithm (base 10) of the sales amounts, and prints the first 10 rows of the resulting DataFrame.

P-062: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), convert the total sales amount to the natural logarithm (bottom e) and display 10 items with the customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.

In [ ]:

df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', 
                                   engine='python'). \
                                   groupby('customer_id'). \
                                   agg({'amount':'sum'}).reset_index()

df_sales_amount['log_amount'] = np.log(df_sales_amount['amount'] + 0.5)

df_sales_amount.head(10)

Out[ ]:

	customer_id	amount	log_amount
0	CS001113000004	1298	7.168965
1	CS001114000005	626	6.440149
2	CS001115000010	3044	8.021092
3	CS001205000004	1988	7.595136
4	CS001205000006	3337	8.112977
5	CS001211000025	456	6.123589
6	CS001212000027	448	6.105909
7	CS001212000031	296	5.692047
8	CS001212000046	228	5.431536
9	CS001212000070	456	6.123589

Commentary :

This code performs some data processing on a pandas DataFrame called df_receipt.

First, it filters out any rows where the customer_id column starts with the letter "Z". This is done using the query() method with a condition that uses the str.startswith() method. The engine='python' argument is provided to use the Python parser instead of the default pandas parser for string operations.

The resulting DataFrame is then grouped by the customer_id column using the groupby() method.

The agg() method is used to apply a function to the amount column for each group. In this case, the sum() function is used to calculate the total sales amount for each customer.

The resulting DataFrame is then reset to have a new index using reset_index().

Finally, a new column called log_amount is created by taking the natural logarithm of the amount column plus 0.5 (to avoid taking the logarithm of zero or negative values).

The first 10 rows of the resulting DataFrame are displayed using the head() method.

P-063: Calculate the profit amount of each product from the unit price(unit price) and cost of the product(unit_cost) data (df_product) and display 10 results.

In [ ]:

df_tmp = df_product.copy()

df_tmp['unit_profit'] = df_tmp['unit_price'] - df_tmp['unit_cost']

df_tmp.head(10)

Out[ ]:

	product_cd	category_major_cd	category_medium_cd	category_small_cd	unit_price	unit_cost	unit_profit
0	P040101001	04	0401	040101	198.0	149.0	49.0
1	P040101002	04	0401	040101	218.0	164.0	54.0
2	P040101003	04	0401	040101	230.0	173.0	57.0
3	P040101004	04	0401	040101	248.0	186.0	62.0
4	P040101005	04	0401	040101	268.0	201.0	67.0
5	P040101006	04	0401	040101	298.0	224.0	74.0
6	P040101007	04	0401	040101	338.0	254.0	84.0
7	P040101008	04	0401	040101	420.0	315.0	105.0
8	P040101009	04	0401	040101	498.0	374.0	124.0
9	P040101010	04	0401	040101	580.0	435.0	145.0

Commentary :

The code performs the following actions:

df_tmp = df_product.copy()
It creates a copy of a DataFrame called df_product and assigns it to a new variable called df_tmp. The copy is made to avoid modifying the original DataFrame.

df_tmp['unit_profit'] = df_tmp['unit_price'] - df_tmp['unit_cost']
It calculates the profit per unit for each product by subtracting the unit cost from the unit price, and saves the result in a new column called unit_profit. The resulting DataFrame is saved as df_tmp.

df_tmp.head(10)
It prints the first 10 rows of the df_tmp DataFrame using the head method.

In summary, the code creates a copy of a DataFrame, calculates the profit per unit for each product by subtracting the unit cost from the unit price, adds a new column with the profit per unit to the copied DataFrame, and prints the first 10 rows of the resulting DataFrame.

P-064: Calculate the overall average profit margin for each product from the unit price (unit_price) and cost (unit_cost) of the product data (df_product). Note, however, that unit_price and cost are missing.

In [ ]:

df_tmp = df_product.copy()

df_tmp['unit_profit_rate'] = \
            (df_tmp['unit_price'] - df_tmp['unit_cost']) / df_tmp['unit_price']

df_tmp['unit_profit_rate'].mean(skipna=True)

Out[ ]:

0.24911389885177

Commentary :

The code performs the following actions:

df_tmp = df_product.copy()
It creates a copy of a DataFrame called df_product and assigns it to a new variable called df_tmp. The copy is made to avoid modifying the original DataFrame.

df_tmp['unit_profit_rate'] = (df_tmp['unit_price'] - df_tmp['unit_cost']) / df_tmp['unit_price']
It calculates the profit rate per unit for each product by dividing the profit (unit price minus unit cost) by the unit price, and saves the result in a new column called unit_profit_rate. The resulting DataFrame is saved as df_tmp.

df_tmp['unit_profit_rate'].mean(skipna=True)
It calculates the mean of the unit_profit_rate column in the df_tmp DataFrame using the mean method, skipping missing values (NaN) if any, and prints the result

In summary, the code creates a copy of a DataFrame, calculates the profit rate per unit for each product by dividing the profit (unit price minus unit cost) by the unit price, adds a new column with the profit rate per unit to the copied DataFrame, calculates the mean of the profit rate column, and prints the resulting mean value.

P-065: For each product in the product data (df_product), find the new unit price that gives a profit margin of 30%. However, round down to the nearest yen. Then display 10 results and confirm that the profit margin is approximately around 30%. Note that there is a deficit in unit_price（unit_price）and unit_cost（unit_cost）.

In [ ]:

df_tmp = df_product[['product_cd', 'unit_price', 'unit_cost']].copy()

df_tmp['new_price'] = np.floor(df_tmp['unit_cost'] / 0.7)

df_tmp['new_profit_rate'] = \
            (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']

df_tmp.head(10)

Out[ ]:

	product_cd	unit_price	unit_cost	new_price	new_profit_rate
0	P040101001	198.0	149.0	212.0	0.297170
1	P040101002	218.0	164.0	234.0	0.299145
2	P040101003	230.0	173.0	247.0	0.299595
3	P040101004	248.0	186.0	265.0	0.298113
4	P040101005	268.0	201.0	287.0	0.299652
5	P040101006	298.0	224.0	320.0	0.300000
6	P040101007	338.0	254.0	362.0	0.298343
7	P040101008	420.0	315.0	450.0	0.300000
8	P040101009	498.0	374.0	534.0	0.299625
9	P040101010	580.0	435.0	621.0	0.299517

Commentary :

The code performs the following actions:

df_tmp = df_product[['product_cd', 'unit_price', 'unit_cost']].copy()
It creates a new DataFrame called df_tmp that includes only three columns of interest: product_cd, unit_price, and unit_cost, and makes a copy of these columns from the original df_product DataFrame.

df_tmp['new_price'] = np.floor(df_tmp['unit_cost'] / 0.7)
It calculates a new price for each product by dividing the unit cost by 0.7 (a markup factor of 1.43), rounding down to the nearest integer using the floor method, and saves the result in a new column called new_price. The resulting DataFrame is saved as df_tmp.

df_tmp['new_profit_rate'] = (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']
It calculates the profit rate per unit for each product using the new price calculated in the previous step, and saves the result in a new column called new_profit_rate. The resulting DataFrame is saved as df_tmp.

df_tmp.head(10)
It prints the first 10 rows of the df_tmp DataFrame using the head method.

In summary, the code creates a new DataFrame with three columns of interest, calculates a new price for each product based on a markup factor of 1.43, calculates the profit rate per unit for each product using the new price, adds two new columns to the copied DataFrame, and prints the first 10 rows of the resulting DataFrame.

P-066: For each product in the product data (df_product), find a new unit price that gives a profit margin of 30%. This time, round to the nearest yen (rounding or rounding to even numbers is fine). Then display 10 results and check that the profit margin is around 30%. Note, however, that there are deficiencies in unit_price（unit_price） and unit_cost（unit_cost）.

In [ ]:

df_tmp = df_product[['product_cd', 'unit_price', 'unit_cost']].copy()

df_tmp['new_price'] = np.round(df_tmp['unit_cost'] / 0.7)

df_tmp['new_profit_rate'] = \
            (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']

df_tmp.head(10)

Out[ ]:

	product_cd	unit_price	unit_cost	new_price	new_profit_rate
0	P040101001	198.0	149.0	213.0	0.300469
1	P040101002	218.0	164.0	234.0	0.299145
2	P040101003	230.0	173.0	247.0	0.299595
3	P040101004	248.0	186.0	266.0	0.300752
4	P040101005	268.0	201.0	287.0	0.299652
5	P040101006	298.0	224.0	320.0	0.300000
6	P040101007	338.0	254.0	363.0	0.300275
7	P040101008	420.0	315.0	450.0	0.300000
8	P040101009	498.0	374.0	534.0	0.299625
9	P040101010	580.0	435.0	621.0	0.299517

Commentary :

The code performs the following actions:

df_tmp = df_product[['product_cd', 'unit_price', 'unit_cost']].copy()
It creates a new DataFrame called df_tmp that includes only three columns of interest: product_cd, unit_price, and unit_cost, and makes a copy of these columns from the original df_product DataFrame.

df_tmp['new_price'] = np.round(df_tmp['unit_cost'] / 0.7)
It calculates a new price for each product by dividing the unit cost by 0.7 (a markup factor of 1.43), rounding to the nearest integer using the round method, and saves the result in a new column called new_price. The resulting DataFrame is saved as df_tmp.

df_tmp['new_profit_rate'] = (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']
It calculates the profit rate per unit for each product using the new price calculated in the previous step, and saves the result in a new column called new_profit_rate. The resulting DataFrame is saved as df_tmp.

df_tmp.head(10)
It prints the first 10 rows of the df_tmp DataFrame using the head method.

In summary, the code creates a new DataFrame with three columns of interest, calculates a new price for each product based on a markup factor of 1.43 and rounding to the nearest integer, calculates the profit rate per unit for each product using the new price, adds two new columns to the copied DataFrame, and prints the first 10 rows of the resulting DataFrame.

P-067: For each product in the product data (df_product), find a new unit price that gives a profit margin of 30%. This time, round up to the nearest yen. Then display 10 results and check that the profit margin is around 30%. Note, however, that there is a deficit in the unit price (unit_price) and cost (unit_cost).

In [ ]:

df_tmp = df_product[['product_cd', 'unit_price', 'unit_cost']].copy()

df_tmp['new_price'] = np.ceil(df_tmp['unit_cost'] / 0.7)

df_tmp['new_profit_rate'] = \
            (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']

df_tmp.head(10)

Out[ ]:

	product_cd	unit_price	unit_cost	new_price	new_profit_rate
0	P040101001	198.0	149.0	213.0	0.300469
1	P040101002	218.0	164.0	235.0	0.302128
2	P040101003	230.0	173.0	248.0	0.302419
3	P040101004	248.0	186.0	266.0	0.300752
4	P040101005	268.0	201.0	288.0	0.302083
5	P040101006	298.0	224.0	320.0	0.300000
6	P040101007	338.0	254.0	363.0	0.300275
7	P040101008	420.0	315.0	451.0	0.301552
8	P040101009	498.0	374.0	535.0	0.300935
9	P040101010	580.0	435.0	622.0	0.300643

Commentary :

This code manipulats a pandas DataFrame object named df_product.

Here's a breakdown of the code line-by-line:

df_tmp = df_product[['product_cd', 'unit_price', 'unit_cost']].copy(): This line creates a new DataFrame object named df_tmp that is a subset of df_product and includes only the columns "product_cd", "unit_price", and "unit_cost". The .copy() method is used to create a new copy of the DataFrame instead of referencing the original df_product.

df_tmp['new_price'] = np.ceil(df_tmp['unit_cost'] / 0.7): This line calculates a new column called "new_price" in df_tmp, which is the result of dividing the "unit_cost" column by 0.7 and then taking the ceiling of the result using numpy's ceil() function. This calculats a new price based on a profit margin of 30%.

df_tmp['new_profit_rate'] = \ (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']: This line calculates a new column called "new_profit_rate" in df_tmp, which is the result of subtracting the "unit_cost" column from the "new_price" column and then dividing by the "new_price" column. This calculats the profit rate for the new price.

df_tmp.head(10): This line prints out the first 10 rows of the df_tmp DataFrame. This uses to check that the calculations in the previous lines have been performed correctly.

P-068: For each product in the product data (df_product), find the amount including tax at a consumption tax rate of 10%, rounding down fractions of a yen and displaying 10 results. Note that the unit price (unit_price) is missing.

In [ ]:

df_tmp = df_tmp = df_product[['product_cd', 'unit_price']].copy()

df_tmp['tax_price'] = np.floor(df_tmp['unit_price'] * 1.1)

df_tmp.head(10)

Out[ ]:

	product_cd	unit_price	tax_price
0	P040101001	198.0	217.0
1	P040101002	218.0	239.0
2	P040101003	230.0	253.0
3	P040101004	248.0	272.0
4	P040101005	268.0	294.0
5	P040101006	298.0	327.0
6	P040101007	338.0	371.0
7	P040101008	420.0	462.0
8	P040101009	498.0	547.0
9	P040101010	580.0	638.0

Commentary :

This code also manipulats a pandas DataFrame object named df_product.

Here's a breakdown of the code line-by-line:

df_tmp = df_tmp = df_product[['product_cd', 'unit_price']].copy(): This line creates a new DataFrame object named df_tmp that is a subset of df_product and includes only the columns "product_cd" and "unit_price". The .copy() method is used to create a new copy of the DataFrame instead of referencing the original df_product.

df_tmp['tax_price'] = np.floor(df_tmp['unit_price'] * 1.1): This line calculates a new column called "tax_price" in df_tmp, which is the result of multiplying the "unit_price" column by 1.1 and then taking the floor of the result using numpy's floor() function. This calculates a new price with a 10% tax added.

df_tmp.head(10): This line prints out the first 10 rows of the df_tmp DataFrame. This uses to check that the calculations in the previous lines have been performed correctly.

P-069: Combine receipt details data (df_receipt) and product data (df_product), calculate the total sales value of all products for each customer and the total sales value of products whose category major classification code (category_major_cd) is “07” (bottled canned food), and find the ratio between the two. Only customers with a sales record for category major category code “07” (bottled canned food) should be selected, and 10 results should be displayed.

In [ ]:

# Code example 1
df_tmp_1 = df_receipt.groupby('customer_id').agg({'amount':'sum'}). \
                reset_index().rename(columns={'amount':'sum_all'})

df_tmp_2 = pd.merge(df_receipt, df_product.query('category_major_cd == "07"'), 
                how='inner', on='product_cd').groupby('customer_id').\
                agg({'amount':'sum'}).reset_index().\
                rename(columns={'amount':'sum_07'})

df_tmp_3 = pd.merge(df_tmp_1, df_tmp_2, how='inner', on='customer_id')

df_tmp_3['sales_rate'] = df_tmp_3['sum_07'] / df_tmp_3['sum_all']

df_tmp_3.head(10)

Out[ ]:

	customer_id	sum_all	sum_07	sales_rate
0	CS001113000004	1298	1298	1.000000
1	CS001114000005	626	486	0.776358
2	CS001115000010	3044	2694	0.885020
3	CS001205000004	1988	346	0.174044
4	CS001205000006	3337	2004	0.600539
5	CS001212000027	448	200	0.446429
6	CS001212000031	296	296	1.000000
7	CS001212000046	228	108	0.473684
8	CS001212000070	456	308	0.675439
9	CS001213000018	243	145	0.596708

Commentary :

This code also manipulats pandas DataFrame objects, df_receipt and df_product.

Here's a breakdown of the code line-by-line:

df_tmp_1 = df_receipt.groupby('customer_id').agg({'amount':'sum'}).reset_index().rename(columns={'amount':'sum_all'}): This line groups the df_receipt DataFrame by "customer_id", calculates the sum of the "amount" column for each group, and then resets the index of the resulting DataFrame. The resulting DataFrame is then renamed to "sum_all" using the .rename() method.

df_tmp_2 = pd.merge(df_receipt, df_product.query('category_major_cd == "07"'), how='inner', on='product_cd').groupby('customer_id').agg({'amount':'sum'}).reset_index().rename(columns={'amount':'sum_07'}): This line first filters df_product to include only rows where the "category_major_cd" column equals "07". It then merges this filtered DataFrame with df_receipt on the "product_cd" column using an inner join. The resulting DataFrame is then grouped by "customer_id", the sum of the "amount" column is calculated for each group, and the index is reset. The resulting DataFrame is then renamed to "sum_07" using the .rename() method.

df_tmp_3 = pd.merge(df_tmp_1, df_tmp_2, how='inner', on='customer_id'): This line merges df_tmp_1 and df_tmp_2 on the "customer_id" column using an inner join. This creates a new DataFrame that contains the "sum_all" and "sum_07" columns for each customer.

df_tmp_3['sales_rate'] = df_tmp_3['sum_07'] / df_tmp_3['sum_all']: This line calculates a new column in df_tmp_3 called "sales_rate", which is the result of dividing the "sum_07" column by the "sum_all" column. This calculates the percentage of a customer's total purchases that are from the "07" category.

df_tmp_3.head(10): This line prints out the first 10 rows of the df_tmp_3 DataFrame. This uses to check that the calculations in the previous lines have been performed correctly.

In [ ]:

# Code example 2 (for reference, using unstack and horizontal sum)
df_temp = df_receipt.merge(df_product, how='left', on='product_cd'). \
        groupby(['customer_id', 'category_major_cd'])['amount'].sum().unstack()

df_temp = df_temp[df_temp['07'] > 0]

df_temp['sum_all'] = df_temp.sum(axis=1)

df_temp['sales_rate'] = df_temp['07'] / df_temp['sum_all']

# The rest of the process is for formatting and displaying the data frame
df_temp.columns.name = ''

df_temp = df_temp.reset_index()

df_temp.head(10)

Out[ ]:

	customer_id	04	05	06	07	08	09	sum_all	sales_rate
0	CS001113000004	NaN	NaN	NaN	1298.0	NaN	NaN	1298.0	1.000000
1	CS001114000005	NaN	40.0	NaN	486.0	100.0	NaN	626.0	0.776358
2	CS001115000010	NaN	NaN	NaN	2694.0	NaN	350.0	3044.0	0.885020
3	CS001205000004	100.0	128.0	286.0	346.0	368.0	760.0	1988.0	0.174044
4	CS001205000006	635.0	60.0	198.0	2004.0	80.0	360.0	3337.0	0.600539
5	CS001212000027	248.0	NaN	NaN	200.0	NaN	NaN	448.0	0.446429
6	CS001212000031	NaN	NaN	NaN	296.0	NaN	NaN	296.0	1.000000
7	CS001212000046	NaN	NaN	NaN	108.0	NaN	120.0	228.0	0.473684
8	CS001212000070	NaN	NaN	148.0	308.0	NaN	NaN	456.0	0.675439
9	CS001213000018	NaN	NaN	NaN	145.0	98.0	NaN	243.0	0.596708

Commentary :

This code also manipulates pandas DataFrame objects, df_receipt and df_product.

Here's a breakdown of the code line-by-line:

df_temp = df_receipt.merge(df_product, how='left', on='product_cd').groupby(['customer_id', 'category_major_cd'])['amount'].sum().unstack(): This line first merges df_receipt and df_product on the "product_cd" column using a left join. It then groups the resulting DataFrame by "customer_id" and "category_major_cd", calculates the sum of the "amount" column for each group, and unstacks the resulting Series to create a DataFrame with "customer_id" as the index, "category_major_cd" as the column names, and the sum of the "amount" column as the values.

df_temp = df_temp[df_temp['07'] > 0]: This line filters the df_temp DataFrame to only include rows where the value in the "07" column is greater than 0. This removes customers who have not purchased anything from the "07" category.

df_temp['sum_all'] = df_temp.sum(axis=1): This line calculates a new column in df_temp called "sum_all", which is the sum of all the columns in each row. This calculats the total amount spent by each customer across all categories.

df_temp['sales_rate'] = df_temp['07'] / df_temp['sum_all']: This line calculates a new column in df_temp called "sales_rate", which is the result of dividing the "07" column by the "sum_all" column. This calculates the percentage of a customer's total purchases that are from the "07" category.

df_temp.columns.name = '': This line removes the column name from the df_temp DataFrame.

df_temp = df_temp.reset_index(): This line resets the index of df_temp to create a new column called "index" with a range of integers starting from 0.

df_temp.head(10): This line prints out the first 10 rows of the df_temp DataFrame. This uses to check that the calculations in the previous lines have been performed correctly.

P-070: Calculate the number of days elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt) and display 10 items with the customer ID (customer_id), sales date and membership application date. (Note that sales_ymd is numeric and application_date is a string).

In [ ]:

df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates()

df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']],
                  how='inner', on='customer_id')

df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))

df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date'])

df_tmp['elapsed_days'] = df_tmp['sales_ymd'] - df_tmp['application_date']

df_tmp['elapsed_days'] = df_tmp['elapsed_days'].dt.days

df_tmp.head(10)

Out[ ]:

	customer_id	sales_ymd	application_date	elapsed_days
0	CS006214000001	2018-11-03	1970-01-01 00:00:00.020150201	17837
1	CS006214000001	2017-05-09	1970-01-01 00:00:00.020150201	17294
2	CS006214000001	2017-06-08	1970-01-01 00:00:00.020150201	17324
3	CS006214000001	2018-10-28	1970-01-01 00:00:00.020150201	17831
4	CS006214000001	2019-09-08	1970-01-01 00:00:00.020150201	18146
5	CS006214000001	2018-01-31	1970-01-01 00:00:00.020150201	17561
6	CS006214000001	2017-07-05	1970-01-01 00:00:00.020150201	17351
7	CS006214000001	2018-11-10	1970-01-01 00:00:00.020150201	17844
8	CS006214000001	2019-04-10	1970-01-01 00:00:00.020150201	17995
9	CS006214000001	2019-06-01	1970-01-01 00:00:00.020150201	18047

Commentary :

This code is performing data manipulation on a DataFrame named df_receipt and another DataFrame named df_customer. It calculates the number of days between a customer's application date and their purchase date.

Here's a step-by-step breakdown of the code:

df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates(): creates a new DataFrame named df_tmp by selecting only the columns "customer_id" and "sales_ymd" from the df_receipt DataFrame, and then dropping any duplicate rows. The resulting DataFrame will have a unique combination of customer IDs and sales dates.

df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']], how='inner', on='customer_id'): performs an inner join on df_tmp and the df_customer DataFrame, selecting only the "customer_id" and "application_date" columns from the df_customer DataFrame. The resulting DataFrame will only contain rows with customer IDs that are present in both df_tmp and df_customer.

df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str')): converts the "sales_ymd" column of df_tmp to a datetime format using the pd.to_datetime() function. This function converts the column from a string format to a datetime format that can be used for calculations.

df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date']): converts the "application_date" column of df_tmp to a datetime format using the pd.to_datetime() function. This function converts the column from a string format to a datetime format that can be used for calculations.

df_tmp['elapsed_days'] = df_tmp['sales_ymd'] - df_tmp['application_date']: calculates the difference between the "sales_ymd" and "application_date" columns and assigns the result to a new column named "elapsed_days" in df_tmp. This calculation will result in a timedelta object, which represents the difference between two dates in days, hours, minutes, and seconds.

df_tmp['elapsed_days'] = df_tmp['elapsed_days'].dt.days: extracts only the days from the "elapsed_days" column and assigns the result back to the same column. This step is necessary because the "elapsed_days" column currently contains a timedelta object, but we only want to keep the number of days.

df_tmp.head(10): displays the first 10 rows of df_tmp to verify that the calculations were performed correctly.

Overall, this code is useful for analyzing customer behavior and determining how long it takes for customers to make a purchase after applying for a membership or account.

P-071: Calculate the number of months elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and membership application date (Note that sales_ymd is a number and application_date is a string).

In [ ]:

df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates()

df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']],
                  how='inner', on='customer_id')

df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))

df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date'])

df_tmp['elapsed_months'] = df_tmp[['sales_ymd', 'application_date']]. \
    apply(lambda x: relativedelta(x[0], x[1]).years * 12 + \
                    relativedelta(x[0], x[1]).months, axis=1)

df_tmp.head(10)

Out[ ]:

	customer_id	sales_ymd	application_date	elapsed_months
0	CS006214000001	2018-11-03	1970-01-01 00:00:00.020150201	586
1	CS006214000001	2017-05-09	1970-01-01 00:00:00.020150201	568
2	CS006214000001	2017-06-08	1970-01-01 00:00:00.020150201	569
3	CS006214000001	2018-10-28	1970-01-01 00:00:00.020150201	585
4	CS006214000001	2019-09-08	1970-01-01 00:00:00.020150201	596
5	CS006214000001	2018-01-31	1970-01-01 00:00:00.020150201	576
6	CS006214000001	2017-07-05	1970-01-01 00:00:00.020150201	570
7	CS006214000001	2018-11-10	1970-01-01 00:00:00.020150201	586
8	CS006214000001	2019-04-10	1970-01-01 00:00:00.020150201	591
9	CS006214000001	2019-06-01	1970-01-01 00:00:00.020150201	592

Commentary :

This code is similar to the previous code we discussed, but instead of calculating the number of days between a customer's application date and their purchase date, it calculates the number of elapsed months.

Here's a step-by-step breakdown of the code:

df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates(): creates a new DataFrame named df_tmp by selecting only the columns "customer_id" and "sales_ymd" from the df_receipt DataFrame, and then dropping any duplicate rows. The resulting DataFrame will have a unique combination of customer IDs and sales dates.

df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']], how='inner', on='customer_id'): performs an inner join on df_tmp and the df_customer DataFrame, selecting only the "customer_id" and "application_date" columns from the df_customer DataFrame. The resulting DataFrame will only contain rows with customer IDs that are present in both df_tmp and df_customer.

df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str')): converts the "sales_ymd" column of df_tmp to a datetime format using the pd.to_datetime() function.

df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date']): converts the "application_date" column of df_tmp to a datetime format using the pd.to_datetime() function.

df_tmp['elapsed_months'] = df_tmp[['sales_ymd', 'application_date']]. \ apply(lambda x: relativedelta(x[0], x[1]).years * 12 + \ relativedelta(x[0], x[1]).months, axis=1): calculates the difference between the "sales_ymd" and "application_date" columns and assigns the result to a new column named "elapsed_months" in df_tmp. This calculation uses the relativedelta function from the dateutil library to compute the difference between two datetime values in terms of years and months. The apply() method applies this function to each row of the "sales_ymd" and "application_date" columns, and returns the total number of elapsed months.

df_tmp.head(10): displays the first 10 rows of df_tmp to verify that the calculations were performed correctly.

Overall, this code is useful for analyzing customer behavior and determining how long it takes for customers to make a purchase after applying for a membership or account, in terms of elapsed months.

P-072: Calculate the number of years elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (df_customer) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and membership application date (Note that sales_ymd is a number and application_date is a string).

In [ ]:

df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates()

df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']],
                  how='inner', on='customer_id')

df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))

df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date'])

df_tmp['elapsed_years'] = df_tmp[['sales_ymd', 'application_date']]. \
                    apply(lambda x: relativedelta(x[0], x[1]).years, axis=1)

df_tmp.head(10)

Out[ ]:

	customer_id	sales_ymd	application_date	elapsed_years
0	CS006214000001	2018-11-03	1970-01-01 00:00:00.020150201	48
1	CS006214000001	2017-05-09	1970-01-01 00:00:00.020150201	47
2	CS006214000001	2017-06-08	1970-01-01 00:00:00.020150201	47
3	CS006214000001	2018-10-28	1970-01-01 00:00:00.020150201	48
4	CS006214000001	2019-09-08	1970-01-01 00:00:00.020150201	49
5	CS006214000001	2018-01-31	1970-01-01 00:00:00.020150201	48
6	CS006214000001	2017-07-05	1970-01-01 00:00:00.020150201	47
7	CS006214000001	2018-11-10	1970-01-01 00:00:00.020150201	48
8	CS006214000001	2019-04-10	1970-01-01 00:00:00.020150201	49
9	CS006214000001	2019-06-01	1970-01-01 00:00:00.020150201	49

Commentary :

This code is similar to the previous examples we discussed, but instead of calculating the number of elapsed days or months between a customer's application date and purchase date, it calculates the number of elapsed years.

Here's a step-by-step breakdown of the code:

df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates(): creates a new DataFrame named df_tmp by selecting only the columns "customer_id" and "sales_ymd" from the df_receipt DataFrame, and then dropping any duplicate rows. The resulting DataFrame will have a unique combination of customer IDs and sales dates.

df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']], how='inner', on='customer_id'): performs an inner join on df_tmp and the df_customer DataFrame, selecting only the "customer_id" and "application_date" columns from the df_customer DataFrame. The resulting DataFrame will only contain rows with customer IDs that are present in both df_tmp and df_customer.

df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str')): converts the "sales_ymd" column of df_tmp to a datetime format using the pd.to_datetime() function.

df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date']): converts the "application_date" column of df_tmp to a datetime format using the pd.to_datetime() function.

df_tmp['elapsed_years'] = df_tmp[['sales_ymd', 'application_date']]. \ apply(lambda x: relativedelta(x[0], x[1]).years, axis=1): calculates the difference between the "sales_ymd" and "application_date" columns and assigns the result to a new column named "elapsed_years" in df_tmp. This calculation uses the relativedelta function from the dateutil library to compute the difference between two datetime values in terms of years. The apply() method applies this function to each row of the "sales_ymd" and "application_date" columns, and returns the total number of elapsed years.

df_tmp.head(10): displays the first 10 rows of df_tmp to verify that the calculations were performed correctly.

Overall, this code is useful for analyzing customer behavior and determining how long it takes for customers to make a purchase after applying for a membership or account, in terms of elapsed years.

P-073: Calculate the elapsed time in epoch seconds from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and Note that sales_ymd is held as a number and application_date as a string. Note that no time information is held, so each date should represent 0:00:00:00.

In [ ]:

df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates()

df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']],
                  how='inner', on='customer_id')

df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))

df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date'])

df_tmp['elapsed_epoch'] = df_tmp['sales_ymd'].view(np.int64) - \
                                    df_tmp['application_date'].view(np.int64)

df_tmp['elapsed_epoch'] = df_tmp['elapsed_epoch'] / 10**9

df_tmp.head(10)

Out[ ]:

	customer_id	sales_ymd	application_date	elapsed_epoch
0	CS006214000001	2018-11-03	1970-01-01 00:00:00.020150201	1.541203e+09
1	CS006214000001	2017-05-09	1970-01-01 00:00:00.020150201	1.494288e+09
2	CS006214000001	2017-06-08	1970-01-01 00:00:00.020150201	1.496880e+09
3	CS006214000001	2018-10-28	1970-01-01 00:00:00.020150201	1.540685e+09
4	CS006214000001	2019-09-08	1970-01-01 00:00:00.020150201	1.567901e+09
5	CS006214000001	2018-01-31	1970-01-01 00:00:00.020150201	1.517357e+09
6	CS006214000001	2017-07-05	1970-01-01 00:00:00.020150201	1.499213e+09
7	CS006214000001	2018-11-10	1970-01-01 00:00:00.020150201	1.541808e+09
8	CS006214000001	2019-04-10	1970-01-01 00:00:00.020150201	1.554854e+09
9	CS006214000001	2019-06-01	1970-01-01 00:00:00.020150201	1.559347e+09

Commentary :

This code calculates the elapsed time in seconds between a customer's application date and purchase date by converting the datetime values to Unix timestamps.

Here's a step-by-step breakdown of the code:

df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates(): creates a new DataFrame named df_tmp by selecting only the columns "customer_id" and "sales_ymd" from the df_receipt DataFrame, and then dropping any duplicate rows. The resulting DataFrame will have a unique combination of customer IDs and sales dates.

df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']], how='inner', on='customer_id'): performs an inner join on df_tmp and the df_customer DataFrame, selecting only the "customer_id" and "application_date" columns from the df_customer DataFrame. The resulting DataFrame will only contain rows with customer IDs that are present in both df_tmp and df_customer.

df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str')): converts the "sales_ymd" column of df_tmp to a datetime format using the pd.to_datetime() function.

df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date']): converts the "application_date" column of df_tmp to a datetime format using the pd.to_datetime() function.

df_tmp['elapsed_epoch'] = df_tmp['sales_ymd'].view(np.int64) - \ df_tmp['application_date'].view(np.int64): calculates the difference between the "sales_ymd" and "application_date" columns and assigns the result to a new column named "elapsed_epoch" in df_tmp. This calculation uses the view() method to convert the datetime values to an int64 format (Unix timestamp), and then subtracts the two values element-wise to obtain the elapsed time in nanoseconds.

df_tmp['elapsed_epoch'] = df_tmp['elapsed_epoch'] / 10**9: converts the elapsed time from nanoseconds to seconds by dividing the "elapsed_epoch" column by 10^9.

df_tmp.head(10): displays the first 10 rows of df_tmp to verify that the calculations were performed correctly.

Overall, this code is useful for analyzing customer behavior and determining the length of time it takes for customers to make a purchase after applying for a membership or account, in terms of elapsed seconds.

P-074: For the sales date (sales_ymd) of the receipt details data (df_receipt), calculate the number of days elapsed since Monday of the week in question and display 10 items together with the sales date and the previous Monday (note that sales_ymd holds data in numerical form).

In [ ]:

df_tmp = df_receipt[['sales_ymd']].copy()

df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))

df_tmp['elapsed_days']  = df_tmp['sales_ymd'].apply(lambda x:x.weekday())

df_tmp['monday'] = \
    df_tmp['sales_ymd'].apply(lambda x: x - relativedelta(days=x.weekday()))

df_tmp.head(10)

Out[ ]:

	sales_ymd	elapsed_days	monday
0	2018-11-03	5	2018-10-29
1	2018-11-18	6	2018-11-12
2	2017-07-12	2	2017-07-10
3	2019-02-05	1	2019-02-04
4	2018-08-21	1	2018-08-20
5	2019-06-05	2	2019-06-03
6	2018-12-05	2	2018-12-03
7	2019-09-22	6	2019-09-16
8	2017-05-04	3	2017-05-01
9	2019-10-10	3	2019-10-07

Commentary :

The code is performing the following steps:

A new DataFrame called df_tmp is created by selecting a single column sales_ymd from an existing DataFrame df_receipt, and then making a copy of that column.

The sales_ymd column in df_tmp is then converted to a pandas datetime format using the pd.to_datetime function. This allows the code to work with dates in a more flexible way, such as being able to extract information like the day of the week.

A new column called elapsed_days is created in df_tmp using the apply method. This method applies a lambda function to each row of the sales_ymd column, which in this case returns the day of the week as an integer (with Monday being 0 and Sunday being 6).

Another new column called monday is created in df_tmp using the apply method with another lambda function. This function takes each row in the sales_ymd column, subtracts the number of days elapsed since the beginning of the week (which is obtained from x.weekday()), and returns the resulting date rounded down to the nearest Monday. This effectively groups all the dates in df_tmp into weekly intervals starting on a Monday.

Finally, the head method is called on df_tmp to display the first 10 rows of the resulting DataFrame.

Overall, this code processes a column of sales dates to extract useful information for analysis or aggregation purposes. Specifically, it creates new columns in the DataFrame that represent the day of the week and the start date of the corresponding week, which could be used for grouping and aggregating sales data by week or day of the week.

P-075: Randomly extract 1% of the data from the customer data (df_customer) and display the first 10 items.

In [ ]:

df_customer.sample(frac=0.01).head(10)

Out[ ]:

	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
9589	CS014703000006	長谷部扶樹	0	男性	1944-05-15	74	264-0037	千葉県千葉市若葉区源町**********	S12014	20150707	0-00000000-0
13962	CS024612000073	柴田奈月	1	女性	1949-06-11	69	216-0015	神奈川県川崎市宮前区菅生**********	S14024	20150927	0-00000000-0
14351	CS006504000007	小島貴嶺	0	男性	1964-06-30	54	224-0062	神奈川県横浜市都筑区葛が谷**********	S14006	20150821	0-00000000-0
6092	CS003513000071	菊地瞬	1	女性	1962-12-27	56	182-0022	東京都調布市国領町**********	S13003	20150616	0-00000000-0
18415	CS037415000043	市川路子	1	女性	1976-03-17	43	136-0072	東京都江東区大島**********	S13037	20151130	D-20101020-D
6023	CS019603000011	秋本隼士	0	男性	1952-02-24	67	171-0051	東京都豊島区長崎**********	S13019	20150413	0-00000000-0
17698	CS015515000283	上杉淳子	1	女性	1963-02-20	56	136-0076	東京都江東区南砂**********	S13015	20180918	0-00000000-0
9157	CS017415000099	小柳光	1	女性	1976-03-20	43	166-0002	東京都杉並区高円寺北**********	S13017	20150724	E-20100528-D
18939	CS003613000341	益田京子	1	女性	1955-02-11	64	201-0013	東京都狛江市元和泉**********	S13003	20171205	0-00000000-0
3321	CS003401000072	伊集院三省	0	男性	1971-09-25	47	206-0812	東京都稲城市矢野口**********	S13003	20180422	0-00000000-0

Commentary :

The code is sampling a subset of rows from a pandas DataFrame called df_customer, and then displaying the first 10 rows of that subset.

Here's what each part of the code does:

df_customer is a pandas DataFrame that contains customer data, presumably with many rows and columns.

The sample method is called on df_customer with the argument frac=0.01. This method randomly selects a fraction of the rows in the DataFrame, where frac represents the fraction of rows to sample. In this case, frac=0.01 means that 1% of the rows in df_customer will be selected at random.

The head method is called on the resulting subset of df_customer, with the argument 10. This method returns the first 10 rows of the subset, which are then displayed.

So overall, this code is selecting a random sample of 1% of the rows from df_customer, and then displaying the first 10 rows of that sample. This could be useful for exploring the data or performing quick analyses on a smaller subset of the data without having to work with the entire dataset.

P-076: Extract 10% of the data stratified randomly from the customer data (df_customer) based on the percentage of gender code (gender_cd) and count the number of cases by gender code.

In [ ]:

# Example using sklearn.model_selection.train_test_split
_, df_tmp = train_test_split(df_customer, test_size=0.1, 
                             stratify=df_customer['gender_cd'])

df_tmp.groupby('gender_cd').agg({'customer_id' : 'count'})

Out[ ]:

	customer_id
gender_cd
0	298
1	1793
9	107

Commentary :

The code performs the following steps:

train_test_split is a function from the scikit-learn library that randomly splits a dataset into two subsets: a "training" subset and a "testing" subset. In this code, it is applied to a pandas DataFrame called df_customer, with the argument test_size=0.1 specifying that 10% of the data should be reserved for testing, and the argument stratify=df_customer['gender_cd'] specifying that the split should be stratified by the values in the gender_cd column. The function returns two DataFrames, but the first one is ignored (assigned to _, which is a convention in Python for a variable that is not going to be used).

The second DataFrame returned by train_test_split is assigned to a new variable called df_tmp.

The groupby method is called on df_tmp with the argument gender_cd. This method groups the rows of df_tmp by the values in the gender_cd column.

The agg method is called on the result of the groupby operation, with the argument {'customer_id' : 'count'}. This method applies a function (in this case, count) to the customer_id column of each group, and returns a DataFrame with the resulting counts for each group.

The resulting DataFrame is displayed.

So overall, this code randomly splits a pandas DataFrame into training and testing subsets, stratifying the split by the values in a specific column (gender_cd). It then creates a new DataFrame from the testing subset and groups it by the gender_cd column, and calculates the count of unique values of the customer_id column for each group. This can be useful for understanding the distribution of customers by gender in the testing subset of the data.

P-077: Sum the sales amounts in the receipt details data (df_receipt) by customer unit and extract outliers of the summed sales amounts. The outliers should be calculated by logarithmising the total sales amount, calculating the mean and standard deviation, and then deviating from the mean by more than 3σ (either the natural logarithm or the common logarithm is acceptable). Display 10 results.

In [ ]:

df_sales_amount = df_receipt.groupby('customer_id'). \
                                agg({'amount':'sum'}).reset_index()

df_sales_amount['log_sum_amount'] = np.log(df_sales_amount['amount'] + 0.5)

df_sales_amount['log_sum_amount_ss'] = preprocessing.scale(df_sales_amount['log_sum_amount'])

df_sales_amount.query('abs(log_sum_amount_ss) > 3').head(10)

Out[ ]:

	customer_id	amount	log_sum_amount	log_sum_amount_ss
8306	ZZ000000000000	12395003	16.332804	7.967822

Commentary :

This code is performing data preprocessing and filtering on a pandas DataFrame object named df_receipt.

Here is a step-by-step explanation of the code:

df_sales_amount = df_receipt.groupby('customer_id').agg({'amount':'sum'}).reset_index(): This line groups the rows of df_receipt by the values in the customer_id column and calculates the sum of the amount column for each group. The resulting DataFrame df_sales_amount has one row per unique customer_id value and two columns: customer_id and amount. The reset_index() method is used to reset the index to integer values after the grouping.

df_sales_amount['log_sum_amount'] = np.log(df_sales_amount['amount'] + 0.5): This line creates a new column in df_sales_amount named log_sum_amount that contains the natural logarithm of the sum of amount for each customer, plus 0.5 to avoid taking the log of 0.

df_sales_amount['log_sum_amount_ss'] = preprocessing.scale(df_sales_amount['log_sum_amount']): This line creates another new column in df_sales_amount named log_sum_amount_ss that contains the scaled values of log_sum_amount using the preprocessing.scale function from the scikit-learn library. This function subtracts the mean and divides by the standard deviation of the input array, resulting in a new array with zero mean and unit variance.

df_sales_amount.query('abs(log_sum_amount_ss) > 3').head(10): This line filters df_sales_amount to only keep the rows where the absolute value of log_sum_amount_ss is greater than 3, and then selects the first 10 rows of the resulting DataFrame. This filter selects customers whose total sales amount is more than 3 standard deviations away from the mean, indicating that they may be outliers. The query method is used to filter the rows based on a string expression, which in this case is 'abs(log_sum_amount_ss) > 3'.

P-078: Sum the sales amount (amount) of the receipt details data (df_receipt) in customer units and extract outliers of the summed sales amount. However, exclude those whose customer ID starts with “Z”, as they represent non-members. Outliers are defined as the difference between the first and third quartiles using IQR, and are defined as being below “first quartile – 1.5 x IQR” or above “third quartile + 1.5 x IQR”. Show 10 results.

In [ ]:

df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', 
                                   engine='python'). \
                                   groupby('customer_id'). \
                                   agg({'amount':'sum'}).reset_index()

pct25 = np.percentile(df_sales_amount['amount'], q=25)
pct75 = np.percentile(df_sales_amount['amount'], q=75)

iqr = pct75 - pct25
amount_low = pct25 - (iqr * 1.5)
amount_hight = pct75 + (iqr * 1.5)

df_sales_amount.query('amount < @amount_low or @amount_hight < amount').head(10)

Out[ ]:

	customer_id	amount
98	CS001414000048	8584
332	CS001605000009	18925
549	CS002415000594	9568
1180	CS004414000181	9584
1558	CS005415000137	8734
1733	CS006414000001	9156
1736	CS006414000029	9179
1752	CS006415000105	10042
1755	CS006415000147	12723
1757	CS006415000157	10648

Commentary :

This code is performing data preprocessing and filtering on a pandas DataFrame object named df_receipt.

Here is a step-by-step explanation of the code:

df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python').groupby('customer_id').agg({'amount':'sum'}).reset_index(): This line first filters df_receipt to exclude rows where the customer_id column starts with the letter "Z". The query method is used to filter the rows based on a string expression. The engine='python' argument is used to specify the use of Python engine for the query, which allows for the use of the str.startswith method. The resulting DataFrame is then grouped by customer_id and the sum of the amount column is calculated for each group. The resulting DataFrame df_sales_amount has one row per unique customer_id value and two columns: customer_id and amount. The reset_index() method is used to reset the index to integer values after the grouping.

pct25 = np.percentile(df_sales_amount['amount'], q=25): This line calculates the 25th percentile of the amount column in df_sales_amount using the numpy.percentile function and assigns the result to the variable pct25.

pct75 = np.percentile(df_sales_amount['amount'], q=75): This line calculates the 75th percentile of the amount column in df_sales_amount using the numpy.percentile function and assigns the result to the variable pct75.

iqr = pct75 - pct25: This line calculates the interquartile range (IQR) of the amount column in df_sales_amount by subtracting the 25th percentile (pct25) from the 75th percentile (pct75) and assigns the result to the variable iqr.

amount_low = pct25 - (iqr * 1.5): This line calculates the lower bound of the "normal" range of sales amounts using the formula amount_low = pct25 - (iqr * 1.5). This is commonly known as the lower "whisker" of the box plot.

amount_high = pct75 + (iqr * 1.5): This line calculates the upper bound of the "normal" range of sales amounts using the formula amount_high = pct75 + (iqr * 1.5). This is commonly known as the upper "whisker" of the box plot.

df_sales_amount.query('amount < @amount_low or @amount_high < amount').head(10): This line filters df_sales_amount to only keep the rows where the amount column is less than amount_low or greater than amount_high, and then selects the first 10 rows of the resulting DataFrame. This filter selects customers whose total sales amount falls outside the "normal" range of sales amounts, which is defined as the range between the lower and upper whiskers of the box plot. The query method is used to filter the rows based on a string expression that uses the @ symbol to refer to the Python variable amount_low and amount_high.

P-079: For each item of product data (df_product), check the number of missing items.

In [ ]:

df_product.isnull().sum()

Out[ ]:

product_cd            0
category_major_cd     0
category_medium_cd    0
category_small_cd     0
unit_price            7
unit_cost             7
dtype: int64

Commentary :

This code is used to calculate the number of missing (null) values in each column of a pandas DataFrame df_product. Here is a step-by-step explanation:

df_product: This is the name of the DataFrame being used.

isnull(): This method returns a DataFrame of the same shape as df_product where each element is a boolean value indicating whether the corresponding element in df_product is null or not.

sum(): This method is applied to the boolean DataFrame returned by isnull() and returns the sum of the boolean values for each column. Since boolean values are treated as integers, this effectively counts the number of null values in each column of df_product.

Therefore, df_product.isnull().sum() returns a Series object where the index is the column names of df_product and the values are the number of null values in each column. This can be used to identify which columns have missing values and how many missing values there are in each column.

P-080: Create a new product data, deleting all records where any item in the product data (df_product) is missing. Note that the number of items before and after the deletion should be displayed, and also confirm that the number of items has decreased by the number of items checked in 079.

In [ ]:

df_product_1 = df_product.copy()

df_product_1.dropna(inplace=True)

print('Before deletion:', len(df_product))
print('After deletion:', len(df_product_1))

Before deletion: 10030
After deletion: 10023

Commentary :

This code is used to remove all rows containing any null (missing) values from a pandas DataFrame df_product and create a new DataFrame df_product_1 that contains only the non-null rows. Here is a step-by-step explanation:

df_product: This is the name of the original DataFrame that is being used.

df_product.copy(): This method creates a copy of the original DataFrame df_product so that the original DataFrame is not modified.

df_product_1: This is the name of the new DataFrame that is being created.

dropna(): This method is used to drop all rows that contain any null values from df_product_1. The inplace=True argument is used to modify df_product_1 in place rather than returning a new DataFrame.

print('Before deletion:', len(df_product)): This line prints the number of rows in the original DataFrame df_product before the null values are dropped.

print('After deletion:', len(df_product_1)): This line prints the number of rows in the new DataFrame df_product_1 after the null values are dropped.

Therefore, this code first creates a copy of the original DataFrame df_product, drops all rows that contain any null values from the copied DataFrame to create a new DataFrame df_product_1, and finally prints the number of rows in the original and new DataFrames before and after the deletion of null values. This code can be used to identify the number of rows that are removed due to null values and to create a clean DataFrame that does not contain any null values.

Data Science 100 Knocks (Structured Data Processing) - Python

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

Data Science 100 Knocks (Structured Data Processing) - Python Part1 (Q1 to Q20)

Data Science 100 Knocks (Structured Data Processing) - Python Part2 (Q21 to Q40)

Data Science 100 Knocks (Structured Data Processing) - Python Part3 (Q41 to Q60)

Data Science 100 Knocks (Structured Data Processing) - Python Part4 (Q61 to Q80)

Data Science 100 Knocks (Structured Data Processing) - Python Part5 (Q81 to Q100)

Data Science 100 Knocks (Structured Data Processing)

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, SQL, R, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.