Reference: "Data Science 100 Knocks (Structured Data Processing)" by the Data Scientist Society Skill Definition Committee (データサイエンティスト協会スキル定義委員会)
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society (データサイエンティスト協会) and translated from Japanese to English with DeepL.
-
I updated this file to help spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers.
-
Since this data was created for a Japanese audience, you may run into Japanese text while practicing, but it will not affect the exercises much.
Getting Started
- First, run the following cells
- Import the necessary libraries and read data from the database (PostgreSQL)
- The following cell imports libraries that are expected to be used, such as pandas.
- If there are other libraries you wish to use, install them as needed (you can also install them with `!pip install <library name>`).
- The exercises can be worked through over multiple sessions.
- Names, addresses, etc. are dummy data and are not real.
import os
import pandas as pd
import numpy as np
from datetime import datetime, date
from dateutil.relativedelta import relativedelta
import math
import psycopg2
from sqlalchemy import create_engine
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from imblearn.under_sampling import RandomUnderSampler
if 'PG_PORT' in os.environ:
    pgconfig = {
        'host': 'db',
        'port': os.environ['PG_PORT'],
        'database': os.environ['PG_DATABASE'],
        'user': os.environ['PG_USER'],
        'password': os.environ['PG_PASSWORD'],
    }
    # Connector for pd.read_sql
    conn = psycopg2.connect(**pgconfig)
    df_customer = pd.read_sql(sql='select * from customer', con=conn)
    df_category = pd.read_sql(sql='select * from category', con=conn)
    df_product = pd.read_sql(sql='select * from product', con=conn)
    df_receipt = pd.read_sql(sql='select * from receipt', con=conn)
    df_store = pd.read_sql(sql='select * from store', con=conn)
    df_geocode = pd.read_sql(sql='select * from geocode', con=conn)
else:
    if not os.path.exists('data/'):
        !git clone https://github.com/The-Japan-DataScientist-Society/100knocks-preprocess
        os.chdir('100knocks-preprocess/docker/work/')
    dtype = {
        'customer_id': str,
        'gender_cd': str,
        'postal_cd': str,
        'application_store_cd': str,
        'status_cd': str,
        'category_major_cd': str,
        'category_medium_cd': str,
        'category_small_cd': str,
        'product_cd': str,
        'store_cd': str,
        'prefecture_cd': str,
        'tel_no': str,
        'street': str
    }
    df_customer = pd.read_csv("data/customer.csv", dtype=dtype)
    df_category = pd.read_csv("data/category.csv", dtype=dtype)
    df_product = pd.read_csv("data/product.csv", dtype=dtype)
    df_receipt = pd.read_csv("data/receipt.csv", dtype=dtype)
    df_store = pd.read_csv("data/store.csv", dtype=dtype)
    df_geocode = pd.read_csv("data/geocode.csv", dtype=dtype)
Exercise
P-041: Sum up the sales amount (amount) of the receipt details data (df_receipt) by date (sales_ymd) and calculate the increase or decrease in sales amount from the date of the last sale. And display 10 results.
df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']]. \
    groupby('sales_ymd').sum().reset_index()
df_sales_amount_by_date = pd.concat([df_sales_amount_by_date,
                                     df_sales_amount_by_date.shift()], axis=1)
df_sales_amount_by_date.columns = ['sales_ymd', 'amount', 'lag_ymd', 'lag_amount']
df_sales_amount_by_date['diff_amount'] = \
    df_sales_amount_by_date['amount'] - df_sales_amount_by_date['lag_amount']
df_sales_amount_by_date.head(10)
| | sales_ymd | amount | lag_ymd | lag_amount | diff_amount |
|---|---|---|---|---|---|
| 0 | 20170101 | 33723 | NaN | NaN | NaN |
| 1 | 20170102 | 24165 | 20170101.0 | 33723.0 | -9558.0 |
| 2 | 20170103 | 27503 | 20170102.0 | 24165.0 | 3338.0 |
| 3 | 20170104 | 36165 | 20170103.0 | 27503.0 | 8662.0 |
| 4 | 20170105 | 37830 | 20170104.0 | 36165.0 | 1665.0 |
| 5 | 20170106 | 32387 | 20170105.0 | 37830.0 | -5443.0 |
| 6 | 20170107 | 23415 | 20170106.0 | 32387.0 | -8972.0 |
| 7 | 20170108 | 24737 | 20170107.0 | 23415.0 | 1322.0 |
| 8 | 20170109 | 26718 | 20170108.0 | 24737.0 | 1981.0 |
| 9 | 20170110 | 20143 | 20170109.0 | 26718.0 | -6575.0 |
Commentary :
This code calculates the daily sales amount, shifts it by one day to obtain the previous day's figures, and then takes the difference to produce a new column 'diff_amount'.

Here's a line-by-line explanation:

`df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']].groupby('sales_ymd').sum().reset_index()`: selects the columns 'sales_ymd' and 'amount' from the `df_receipt` dataframe, groups by 'sales_ymd', sums the 'amount' column for each group, and resets the index to create a new dataframe `df_sales_amount_by_date` with the columns 'sales_ymd' and 'amount'.

`df_sales_amount_by_date = pd.concat([df_sales_amount_by_date, df_sales_amount_by_date.shift()], axis=1)`: concatenates the dataframe with a shifted version of itself along the columns axis (i.e., horizontally), producing a dataframe with four columns.

`df_sales_amount_by_date.columns = ['sales_ymd','amount','lag_ymd','lag_amount']`: renames the columns to more meaningful names.

`df_sales_amount_by_date['diff_amount'] = df_sales_amount_by_date['amount'] - df_sales_amount_by_date['lag_amount']`: creates a new column 'diff_amount' by subtracting the 'lag_amount' column from the 'amount' column.

`df_sales_amount_by_date.head(10)`: displays the first 10 rows of the resulting dataframe.
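The same previous-day difference can be computed more directly with `Series.diff()`; a minimal sketch on made-up daily totals:

```python
import pandas as pd

# Made-up daily totals standing in for df_sales_amount_by_date.
df = pd.DataFrame({'sales_ymd': [20170101, 20170102, 20170103],
                   'amount': [33723, 24165, 27503]})

# diff() subtracts the previous row's amount, so no concat/shift/rename is needed.
df['diff_amount'] = df['amount'].diff()
print(df)
```

The first row's difference is NaN, matching the concat/shift approach.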
P-042: Tabulate the sales amount (amount) of the receipt details data (df_receipt) by date (sales_ymd), and for each date, combine the data of the last, previous and three previous sales dates. Then display 10 results.
# Code example 1: Vertical holding case
df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']]. \
    groupby('sales_ymd').sum().reset_index()
for i in range(1, 4):
    df_tmp = pd.concat([df_sales_amount_by_date,
                        df_sales_amount_by_date.shift(i)], axis=1)
    if i == 1:
        df_lag = df_tmp
    else:
        # DataFrame.append was removed in pandas 2.0; use pd.concat instead
        df_lag = pd.concat([df_lag, df_tmp])
df_lag.columns = ['sales_ymd', 'amount', 'lag_ymd', 'lag_amount']
df_lag.dropna().astype(int).sort_values(['sales_ymd', 'lag_ymd']).head(10)
| | sales_ymd | amount | lag_ymd | lag_amount |
|---|---|---|---|---|
| 1 | 20170102 | 24165 | 20170101 | 33723 |
| 2 | 20170103 | 27503 | 20170101 | 33723 |
| 2 | 20170103 | 27503 | 20170102 | 24165 |
| 3 | 20170104 | 36165 | 20170101 | 33723 |
| 3 | 20170104 | 36165 | 20170102 | 24165 |
| 3 | 20170104 | 36165 | 20170103 | 27503 |
| 4 | 20170105 | 37830 | 20170102 | 24165 |
| 4 | 20170105 | 37830 | 20170103 | 27503 |
| 4 | 20170105 | 37830 | 20170104 | 36165 |
| 5 | 20170106 | 32387 | 20170103 | 27503 |
Commentary :
This code performs a lag analysis on the sales data.

First, it groups the sales data by date and sums the sales amount for each date; the daily totals are stored in `df_sales_amount_by_date`.

Then, a for loop iterates three times, with `i` running from 1 to 3 (inclusive).

In each iteration, the code creates a new DataFrame called `df_tmp` by concatenating `df_sales_amount_by_date` with itself shifted by `i` rows, so each row contains the sales data for a specific date alongside the sales data for the `i`-th previous date.

If `i` equals 1, the resulting DataFrame is stored in `df_lag`; otherwise, it is appended to `df_lag`.

The columns of `df_lag` are then renamed to `sales_ymd`, `amount`, `lag_ymd`, and `lag_amount` to indicate the date and amount of sales for the current date and the lagged date.

Finally, the code drops any rows containing NaN values (which occur for the first `i` rows), converts the sales amounts to integers, sorts by `sales_ymd` and `lag_ymd`, and displays the first 10 rows.

This provides the total sales amount for each date together with the totals for the 1st, 2nd, and 3rd previous dates, allowing the user to perform lag analysis on the sales data.
# Code example 2: Horizontal holding case
df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']]. \
    groupby('sales_ymd').sum().reset_index()
df_lag = df_sales_amount_by_date
for i in range(1, 4):
    df_lag = pd.concat([df_lag, df_sales_amount_by_date.shift(i)], axis=1)
    columns = [f'lag_ymd_{i}', f'lag_amount_{i}']
    df_lag.columns = list(df_lag.columns)[:-len(columns)] + columns
df_lag.dropna().astype(int).sort_values(['sales_ymd']).head(10)
| | sales_ymd | amount | lag_ymd_1 | lag_amount_1 | lag_ymd_2 | lag_amount_2 | lag_ymd_3 | lag_amount_3 |
|---|---|---|---|---|---|---|---|---|
| 3 | 20170104 | 36165 | 20170103 | 27503 | 20170102 | 24165 | 20170101 | 33723 |
| 4 | 20170105 | 37830 | 20170104 | 36165 | 20170103 | 27503 | 20170102 | 24165 |
| 5 | 20170106 | 32387 | 20170105 | 37830 | 20170104 | 36165 | 20170103 | 27503 |
| 6 | 20170107 | 23415 | 20170106 | 32387 | 20170105 | 37830 | 20170104 | 36165 |
| 7 | 20170108 | 24737 | 20170107 | 23415 | 20170106 | 32387 | 20170105 | 37830 |
| 8 | 20170109 | 26718 | 20170108 | 24737 | 20170107 | 23415 | 20170106 | 32387 |
| 9 | 20170110 | 20143 | 20170109 | 26718 | 20170108 | 24737 | 20170107 | 23415 |
| 10 | 20170111 | 24287 | 20170110 | 20143 | 20170109 | 26718 | 20170108 | 24737 |
| 11 | 20170112 | 23526 | 20170111 | 24287 | 20170110 | 20143 | 20170109 | 26718 |
| 12 | 20170113 | 28004 | 20170112 | 23526 | 20170111 | 24287 | 20170110 | 20143 |
Commentary :
The code provided performs the following operations:

The `df_receipt` DataFrame is used to select the `sales_ymd` and `amount` columns, which are grouped by `sales_ymd`; `sum()` is applied to the `amount` column to calculate the total sales amount for each day. The result is stored in `df_sales_amount_by_date`.

A new DataFrame called `df_lag` is created as a copy of `df_sales_amount_by_date`.

A loop runs three times (`i` from 1 to 3); in each iteration, `shift()` shifts `df_sales_amount_by_date` by `i` periods (days), and the shifted DataFrame is concatenated horizontally onto `df_lag` using `pd.concat()`.

The `columns` variable stores the names of the new columns added to `df_lag` in that iteration, and the column names of `df_lag` are updated accordingly.

`dropna()` removes any rows with missing values, `astype()` converts the DataFrame to integer type, `sort_values()` sorts in ascending order by `sales_ymd`, and the first 10 rows are returned with `head()`.

In summary, the code builds a DataFrame `df_lag` containing the sales amount for each day alongside the sales amounts for the previous 3 days, sorted by sales date, to help identify trends in sales performance.
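The same wide lag layout can be built in one pass with `assign()` and a dict comprehension; a sketch on made-up totals (only the amount lags are shown here):

```python
import pandas as pd

# Made-up daily totals (hypothetical values).
df = pd.DataFrame({'sales_ymd': [20170101, 20170102, 20170103, 20170104],
                   'amount': [33723, 24165, 27503, 36165]})

# Each lag_amount_i column is amount shifted by i rows.
df_lag = df.assign(**{f'lag_amount_{i}': df['amount'].shift(i)
                      for i in range(1, 4)})
print(df_lag.dropna())
```

Only rows with all three lags available survive `dropna()`, just as in the loop version.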
P-043: Combine receipt details data (df_receipt) and customer data (df_customer) to create sales summary data, summing the sales amount (amount) by gender code (gender_cd) and age (calculated from age). Gender_cd should be 0 for male, 1 for female and 9 for unknown.
However, the item structure should consist of four items: age, female sales amount, male sales amount and sales amount of unknown gender (vertical cross tabulation of age and horizontal cross tabulation of gender). In addition, ages shall be grouped into 10-year brackets.
# Code example 1
df_tmp = pd.merge(df_receipt, df_customer, how='inner', on='customer_id')
df_tmp['era'] = df_tmp['age'].apply(lambda x: math.floor(x / 10) * 10)
df_sales_summary = pd.pivot_table(
    df_tmp, index='era',
    columns='gender_cd',
    values='amount',
    aggfunc='sum'
).reset_index()
df_sales_summary.columns = ['era', 'male', 'female', 'unknown']
df_sales_summary
| | era | male | female | unknown |
|---|---|---|---|---|
| 0 | 10 | 1591.0 | 149836.0 | 4317.0 |
| 1 | 20 | 72940.0 | 1363724.0 | 44328.0 |
| 2 | 30 | 177322.0 | 693047.0 | 50441.0 |
| 3 | 40 | 19355.0 | 9320791.0 | 483512.0 |
| 4 | 50 | 54320.0 | 6685192.0 | 342923.0 |
| 5 | 60 | 272469.0 | 987741.0 | 71418.0 |
| 6 | 70 | 13435.0 | 29764.0 | 2427.0 |
| 7 | 80 | 46360.0 | 262923.0 | 5111.0 |
| 8 | 90 | NaN | 6260.0 | NaN |
Commentary :
This code does the following:

Merges the dataframes `df_receipt` and `df_customer` on the common column `customer_id` with an inner join, saving the result to a new dataframe `df_tmp`.

Creates a new column `era` in `df_tmp`, calculated by dividing each customer's age by 10, rounding down, and multiplying by 10 to get the decade.

Calculates the sum of the `amount` column of `df_tmp` for each unique combination of `era` and `gender_cd` values using `pd.pivot_table()`, saving the result as a new dataframe `df_sales_summary`. The resulting dataframe has `era` values as the index and `gender_cd` values as the columns.

Renames the columns of `df_sales_summary` to be more descriptive: `era` first, followed by `male`, `female`, and `unknown`, which represent the sum of the `amount` column for each respective `gender_cd` value.
# Code example 2
df_tmp = pd.merge(df_receipt, df_customer, how='inner', on='customer_id')
df_tmp['era'] = np.floor(df_tmp['age'] / 10).astype(int) * 10
df_sales_summary = pd.pivot_table(df_tmp, index='era', columns='gender_cd',
                                  values='amount', aggfunc='sum').reset_index()
df_sales_summary.columns = ['era', 'male', 'female', 'unknown']
df_sales_summary
| | era | male | female | unknown |
|---|---|---|---|---|
| 0 | 10 | 1591.0 | 149836.0 | 4317.0 |
| 1 | 20 | 72940.0 | 1363724.0 | 44328.0 |
| 2 | 30 | 177322.0 | 693047.0 | 50441.0 |
| 3 | 40 | 19355.0 | 9320791.0 | 483512.0 |
| 4 | 50 | 54320.0 | 6685192.0 | 342923.0 |
| 5 | 60 | 272469.0 | 987741.0 | 71418.0 |
| 6 | 70 | 13435.0 | 29764.0 | 2427.0 |
| 7 | 80 | 46360.0 | 262923.0 | 5111.0 |
| 8 | 90 | NaN | 6260.0 | NaN |
Commentary :
This code performs the following operations:

It merges two dataframes, `df_receipt` and `df_customer`, on the common column 'customer_id' using an inner join, storing the result in a new variable called `df_tmp`.

It creates a new column 'era' in `df_tmp` by dividing the 'age' column by 10, taking the floor, and multiplying by 10 to get the decade. This column represents each customer's age group in decades.

It creates a pivot table called `df_sales_summary` using `pd.pivot_table()`, based on the `df_tmp` dataframe, with rows indexed by 'era', columns grouped by 'gender_cd', and values computed from 'amount' using the sum aggregation function. The resulting table shows the total sales amount for each gender in each age group.

It renames the columns in `df_sales_summary` to be more descriptive, with 'male', 'female', and 'unknown' representing the total sales amounts for each gender in each age group.
P-044: The sales summary data (df_sales_summary) created in 043 held sales by gender horizontally. From this data, convert gender to a vertical layout of three items: age, gender code and sales amount. The gender code should be "00" for male, "01" for female and "99" for unknown.
df_sales_summary.set_index('era'). \
    stack().reset_index().replace({'female': '01', 'male': '00', 'unknown': '99'}). \
    rename(columns={'level_1': 'gender_cd', 0: 'amount'})
| | era | gender_cd | amount |
|---|---|---|---|
| 0 | 10 | 00 | 1591.0 |
| 1 | 10 | 01 | 149836.0 |
| 2 | 10 | 99 | 4317.0 |
| 3 | 20 | 00 | 72940.0 |
| 4 | 20 | 01 | 1363724.0 |
| 5 | 20 | 99 | 44328.0 |
| 6 | 30 | 00 | 177322.0 |
| 7 | 30 | 01 | 693047.0 |
| 8 | 30 | 99 | 50441.0 |
| 9 | 40 | 00 | 19355.0 |
| 10 | 40 | 01 | 9320791.0 |
| 11 | 40 | 99 | 483512.0 |
| 12 | 50 | 00 | 54320.0 |
| 13 | 50 | 01 | 6685192.0 |
| 14 | 50 | 99 | 342923.0 |
| 15 | 60 | 00 | 272469.0 |
| 16 | 60 | 01 | 987741.0 |
| 17 | 60 | 99 | 71418.0 |
| 18 | 70 | 00 | 13435.0 |
| 19 | 70 | 01 | 29764.0 |
| 20 | 70 | 99 | 2427.0 |
| 21 | 80 | 00 | 46360.0 |
| 22 | 80 | 01 | 262923.0 |
| 23 | 80 | 99 | 5111.0 |
| 24 | 90 | 01 | 6260.0 |
Commentary :
This code transforms the pandas DataFrame `df_sales_summary` by setting the index to `era`, stacking the columns `female`, `male`, and `unknown` into rows, and resetting the index. It then replaces the gender labels with codes: `01` for `female`, `00` for `male`, and `99` for `unknown`. Finally, it renames the columns `level_1` to `gender_cd` and `0` to `amount`.

Here is a step-by-step explanation:

`df_sales_summary.set_index('era')`: sets the index of `df_sales_summary` to the column `era`, which makes it easier to reshape the data by `era`.

`.stack()`: stacks the columns `female`, `male`, and `unknown` into a single column, with a new index level added for the stacked column names.

`.reset_index()`: resets the index of the DataFrame to the default integer index.

`.replace({'female':'01','male':'00','unknown':'99'})`: replaces the gender labels with the corresponding codes.

`.rename(columns={'level_1':'gender_cd', 0: 'amount'})`: renames the columns `level_1` to `gender_cd` and `0` to `amount`. The `level_1` column was created by the `stack()` method and contains the original column names of the stacked columns.
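`DataFrame.melt` performs the same wide-to-long reshape as `set_index().stack()`; a sketch on two made-up rows:

```python
import pandas as pd

# Made-up slice of the wide summary (amounts are hypothetical).
df = pd.DataFrame({'era': [10, 20],
                   'male': [1591.0, 72940.0],
                   'female': [149836.0, 1363724.0]})

# melt turns each gender column into rows, then the labels are recoded.
out = (df.melt(id_vars='era', var_name='gender_cd', value_name='amount')
         .replace({'male': '00', 'female': '01'}))
print(out)
```

`melt` names the new columns directly, so the `rename` step is not needed.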
P-045: The date of birth (birth_day) in the customer data (df_customer) is held in date type. Convert this to a string in YYYYMMDD format and display 10 cases together with the customer ID (customer_id).
# You can convert to a YYYYMMDD format string by writing
# pd.to_datetime(df_customer['birth_day']).dt.strftime('%Y%m%d')
pd.concat([df_customer['customer_id'],
           pd.to_datetime(df_customer['birth_day']).dt.strftime('%Y%m%d')],
          axis=1).head(10)
| | customer_id | birth_day |
|---|---|---|
| 0 | CS021313000114 | 19810429 |
| 1 | CS037613000071 | 19520401 |
| 2 | CS031415000172 | 19761004 |
| 3 | CS028811000001 | 19330327 |
| 4 | CS001215000145 | 19950329 |
| 5 | CS020401000016 | 19740915 |
| 6 | CS015414000103 | 19770809 |
| 7 | CS029403000008 | 19730817 |
| 8 | CS015804000004 | 19310502 |
| 9 | CS033513000180 | 19620711 |
Commentary :
This code creates a new DataFrame by concatenating two columns of the existing `df_customer` DataFrame: the `customer_id` column and the `birth_day` column, which contains date information.

The `pd.to_datetime()` function is applied to the `birth_day` column to convert it to datetime format, and the `dt.strftime('%Y%m%d')` method then formats each datetime as a string in `'%Y%m%d'` form (i.e., year, month, day).

This produces a column holding a string representation of each customer's birth date. Finally, the `head()` method displays the first 10 rows of the new DataFrame.
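The `strftime` round trip can be checked on a single made-up date:

```python
import pandas as pd

# One made-up birth date, formatted back to a YYYYMMDD string.
s = pd.to_datetime(pd.Series(['1981-04-29']))
out = s.dt.strftime('%Y%m%d')
print(out.iloc[0])  # → 19810429
```

Note that `strftime` returns strings, so the result is no longer a datetime column.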
P-046: The application date (application_date) of the customer data (df_customer) holds data in a string type in YYYYMMDD format. Convert this to a date type and display 10 cases together with the customer ID (customer_id).
# astype('str') is needed because the CSV loads application_date as an integer;
# without it, pd.to_datetime would interpret the values as nanosecond timestamps.
pd.concat([df_customer['customer_id'],
           pd.to_datetime(df_customer['application_date'].astype('str'))],
          axis=1).head(10)

| | customer_id | application_date |
|---|---|---|
| 0 | CS021313000114 | 2015-09-05 |
| 1 | CS037613000071 | 2015-04-14 |
| 2 | CS031415000172 | 2015-05-29 |
| 3 | CS028811000001 | 2016-01-15 |
| 4 | CS001215000145 | 2017-06-05 |
| 5 | CS020401000016 | 2015-02-25 |
| 6 | CS015414000103 | 2015-07-22 |
| 7 | CS029403000008 | 2015-05-15 |
| 8 | CS015804000004 | 2015-06-07 |
| 9 | CS033513000180 | 2015-07-28 |
Commentary :
This code snippet concatenates two columns from the pandas DataFrame `df_customer`, `'customer_id'` and `'application_date'`, into a new DataFrame using the `pd.concat()` method.

The `pd.to_datetime()` method is used to convert the `'application_date'` column to datetime objects. This method is a convenient way to convert date strings in various formats to standard pandas datetimes.

The resulting DataFrame has two columns, `'customer_id'` and `'application_date'`, with the dates in datetime format.
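When the strings are known to follow a fixed YYYYMMDD layout, passing an explicit `format` to `pd.to_datetime` is both safer and faster than letting pandas guess; a sketch with made-up values:

```python
import pandas as pd

# Made-up YYYYMMDD application-date strings.
s = pd.Series(['20150905', '20150414'])

# format='%Y%m%d' parses each string deterministically.
out = pd.to_datetime(s, format='%Y%m%d')
print(out)
```

With a mismatched format, parsing raises an error instead of silently producing wrong dates.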
P-047: The sales date (sales_ymd) of the receipt details data (df_receipt) holds data in numeric type in YYYYMMDD format. Convert this to a date type and display 10 items together with the receipt number (receipt_no) and receipt sub-number (receipt_sub_no).
pd.concat([df_receipt[['receipt_no', 'receipt_sub_no']],
           pd.to_datetime(df_receipt['sales_ymd'].astype('str'))],
          axis=1).head(10)
| | receipt_no | receipt_sub_no | sales_ymd |
|---|---|---|---|
| 0 | 112 | 1 | 2018-11-03 |
| 1 | 1132 | 2 | 2018-11-18 |
| 2 | 1102 | 1 | 2017-07-12 |
| 3 | 1132 | 1 | 2019-02-05 |
| 4 | 1102 | 2 | 2018-08-21 |
| 5 | 1112 | 1 | 2019-06-05 |
| 6 | 1102 | 2 | 2018-12-05 |
| 7 | 1102 | 1 | 2019-09-22 |
| 8 | 1112 | 2 | 2017-05-04 |
| 9 | 1102 | 1 | 2019-10-10 |
Commentary :
This code uses the `pd.concat` function to combine two pieces of data:

`df_receipt[['receipt_no', 'receipt_sub_no']]`: selects the columns `receipt_no` and `receipt_sub_no` from the `df_receipt` dataframe.

`pd.to_datetime(df_receipt['sales_ymd'].astype('str'))`: converts the `sales_ymd` column of the `df_receipt` dataframe to datetime format, after first casting the numeric values to strings.

The `axis=1` argument in `pd.concat` indicates that they should be concatenated horizontally, column-wise. The resulting dataframe has the columns `receipt_no`, `receipt_sub_no`, and `sales_ymd` (in datetime format), and the first 10 rows are displayed using the `.head(10)` method.
P-048: The sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) holds data in UNIX seconds of numeric type. Convert it to date type and display 10 cases together with receipt number (receipt_no) and receipt sub-number (receipt_sub_no).
pd.concat([df_receipt[['receipt_no', 'receipt_sub_no']],
           pd.to_datetime(df_receipt['sales_epoch'], unit='s').rename('sales_ymd')],
          axis=1).head(10)
| | receipt_no | receipt_sub_no | sales_ymd |
|---|---|---|---|
| 0 | 112 | 1 | 2018-11-03 |
| 1 | 1132 | 2 | 2018-11-18 |
| 2 | 1102 | 1 | 2017-07-12 |
| 3 | 1132 | 1 | 2019-02-05 |
| 4 | 1102 | 2 | 2018-08-21 |
| 5 | 1112 | 1 | 2019-06-05 |
| 6 | 1102 | 2 | 2018-12-05 |
| 7 | 1102 | 1 | 2019-09-22 |
| 8 | 1112 | 2 | 2017-05-04 |
| 9 | 1102 | 1 | 2019-10-10 |
Commentary :
This code performs the following operations:

It selects two columns from the `df_receipt` DataFrame: `receipt_no` and `receipt_sub_no`.

It converts the `sales_epoch` column to datetime format using the `to_datetime()` function of the pandas library. The `unit` parameter is set to `'s'` to indicate that the input is in seconds.

The resulting datetime column is renamed to `'sales_ymd'` using the `rename()` function.

Finally, the two pieces are concatenated along the columns axis (`axis=1`) using the `concat()` function.

The resulting DataFrame has three columns: `receipt_no`, `receipt_sub_no`, and `sales_ymd`, where `sales_ymd` is the sales date and time in datetime format. The `head(10)` function displays the first 10 rows.
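The effect of `unit='s'` can be checked against the Unix epoch itself:

```python
import pandas as pd

# 0 seconds is the epoch; 86400 seconds is exactly one day later.
s = pd.Series([0, 86400])
out = pd.to_datetime(s, unit='s')
print(out)
```

Without `unit='s'`, pandas would treat these integers as nanoseconds, which is the bug to watch for with epoch columns.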
P-049: Convert the sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) to date type, extract only the “年(year)” and display 10 items with receipt number (receipt_no) and receipt sub-number (receipt_sub_no).
pd.concat([df_receipt[['receipt_no', 'receipt_sub_no']],
           pd.to_datetime(df_receipt['sales_epoch'],
                          unit='s').dt.year.rename('sales_year')],
          axis=1).head(10)
| | receipt_no | receipt_sub_no | sales_year |
|---|---|---|---|
| 0 | 112 | 1 | 2018 |
| 1 | 1132 | 2 | 2018 |
| 2 | 1102 | 1 | 2017 |
| 3 | 1132 | 1 | 2019 |
| 4 | 1102 | 2 | 2018 |
| 5 | 1112 | 1 | 2019 |
| 6 | 1102 | 2 | 2018 |
| 7 | 1102 | 1 | 2019 |
| 8 | 1112 | 2 | 2017 |
| 9 | 1102 | 1 | 2019 |
Commentary :
This code creates a new DataFrame by concatenating the columns `receipt_no` and `receipt_sub_no` from `df_receipt` with the year extracted from the `sales_epoch` column, which is converted to datetimes using the `to_datetime` method with the `unit` parameter set to `s`, because the values in `sales_epoch` are Unix timestamps (i.e., the number of seconds since January 1, 1970).

Here's a breakdown of the code:

`pd.concat`: concatenates the two pieces along the columns axis.

`df_receipt[['receipt_no', 'receipt_sub_no']]`: selects the columns `receipt_no` and `receipt_sub_no` from the `df_receipt` DataFrame.

`pd.to_datetime(df_receipt['sales_epoch'], unit='s')`: converts the values in the `sales_epoch` column to datetime objects.

`dt.year`: extracts the year component of each datetime.

`rename('sales_year')`: renames the resulting column to `sales_year`.

`axis=1`: specifies that the concatenation should be performed along the columns axis.

`head(10)`: returns the first 10 rows of the resulting DataFrame.
P-050: Convert the sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) to date type, extract only the “月(month)” and display 10 items with receipt number (receipt_no) and receipt sub number (receipt_sub_no). Note that the “月(month)” should be retrieved with two digits filled with zero.
# dt.month can also be used to get the month, but here strftime is used to
# retrieve the month as two digits with a leading zero.
df_datetime = pd.to_datetime(df_receipt['sales_epoch'],
                             unit='s').rename('sales_month')
pd.concat([df_receipt[['receipt_no', 'receipt_sub_no']],
           df_datetime.dt.strftime('%m')], axis=1).head(10)
| | receipt_no | receipt_sub_no | sales_month |
|---|---|---|---|
| 0 | 112 | 1 | 11 |
| 1 | 1132 | 2 | 11 |
| 2 | 1102 | 1 | 07 |
| 3 | 1132 | 1 | 02 |
| 4 | 1102 | 2 | 08 |
| 5 | 1112 | 1 | 06 |
| 6 | 1102 | 2 | 12 |
| 7 | 1102 | 1 | 09 |
| 8 | 1112 | 2 | 05 |
| 9 | 1102 | 1 | 10 |
Commentary :
The given code performs the following operations:
Extracts the 'sales_epoch' column from the 'df_receipt' dataframe.
Converts the 'sales_epoch' values into datetime format using the 'to_datetime' function with the 'unit' parameter set to 's' for seconds.
Renames the resultant series to 'sales_month' using the 'rename' function.
Extracts the month value from the 'sales_month' column using the 'strftime' function with the '%m' format.
Concatenates the resultant series with the 'receipt_no' and 'receipt_sub_no' columns of the 'df_receipt' dataframe using the 'concat' function.
The resulting dataframe will contain the 'receipt_no', 'receipt_sub_no', and 'sales_month' columns where 'sales_month' will have the month values extracted from the 'sales_epoch' column.
P-051: Convert the sales epoch seconds of the receipt details data (df_receipt) to date type, extract only the “日(day)” and display 10 items with receipt number (receipt_no) and receipt sub-number (receipt_sub_no). Note that the “日(day)” should be extracted with two digits filled with zero.
# dt.day can also be used to get the day, but here strftime is used to
# retrieve the day as two digits with a leading zero.
df_datetime = pd.to_datetime(df_receipt['sales_epoch'],
                             unit='s').rename('sales_day')
pd.concat([df_receipt[['receipt_no', 'receipt_sub_no']],
           df_datetime.dt.strftime('%d')], axis=1).head(10)
| | receipt_no | receipt_sub_no | sales_day |
|---|---|---|---|
| 0 | 112 | 1 | 03 |
| 1 | 1132 | 2 | 18 |
| 2 | 1102 | 1 | 12 |
| 3 | 1132 | 1 | 05 |
| 4 | 1102 | 2 | 21 |
| 5 | 1112 | 1 | 05 |
| 6 | 1102 | 2 | 05 |
| 7 | 1102 | 1 | 22 |
| 8 | 1112 | 2 | 04 |
| 9 | 1102 | 1 | 10 |
Commentary :
This code takes the `sales_epoch` column from the `df_receipt` DataFrame, which contains the number of seconds since January 1, 1970 (also known as Unix time), and converts it into a pandas datetime series using `pd.to_datetime()`. The resulting series is renamed `sales_day`.

The `strftime()` method is then applied to extract the day component of each date as two digits with a leading zero, denoted by `%d`. The resulting strings are concatenated with the `receipt_no` and `receipt_sub_no` columns using `pd.concat()`, resulting in a new DataFrame with the three columns.

Finally, the `head()` method is called on the resulting DataFrame to display the first 10 rows. This code essentially extracts the day of the month from the `sales_epoch` column of `df_receipt` and combines it with the receipt numbers to produce a new DataFrame.
P-052: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), binarise the total sales amount to 0 for amounts of 2,000 yen or less and 1 for amounts over 2,000 yen, and display 10 items together with the customer ID and the total sales amount. However, exclude customer IDs starting with "Z", as they represent non-members.
# Code example 1
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
                                   engine='python')
df_sales_amount = df_sales_amount[['customer_id', 'amount']]. \
    groupby('customer_id').sum().reset_index()
df_sales_amount['sales_flg'] = df_sales_amount['amount']. \
    apply(lambda x: 1 if x > 2000 else 0)
df_sales_amount.head(10)
| | customer_id | amount | sales_flg |
|---|---|---|---|
| 0 | CS001113000004 | 1298 | 0 |
| 1 | CS001114000005 | 626 | 0 |
| 2 | CS001115000010 | 3044 | 1 |
| 3 | CS001205000004 | 1988 | 0 |
| 4 | CS001205000006 | 3337 | 1 |
| 5 | CS001211000025 | 456 | 0 |
| 6 | CS001212000027 | 448 | 0 |
| 7 | CS001212000031 | 296 | 0 |
| 8 | CS001212000046 | 228 | 0 |
| 9 | CS001212000070 | 456 | 0 |
Commentary :
The code performs the following operations:

Selects rows from `df_receipt` where `customer_id` does not start with the letter 'Z' using the `query` method, and assigns the result to `df_sales_amount`.

Selects the `customer_id` and `amount` columns from `df_sales_amount` and calculates the sum of `amount` for each `customer_id` using the `groupby` method. The result is assigned back to `df_sales_amount`.

Adds a new column `sales_flg` to `df_sales_amount`, which contains 1 if `amount` is greater than 2000 and 0 otherwise, using the `apply` method with a lambda function.

Displays the first 10 rows of the resulting DataFrame `df_sales_amount`.

So, the resulting DataFrame `df_sales_amount` contains the `customer_id`, the total `amount` spent by each customer, and a binary flag `sales_flg` indicating whether the customer's spending is high (greater than 2000) or not.
# Code example 2 (use of np.where)
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
                                   engine='python')
df_sales_amount = df_sales_amount[['customer_id', 'amount']]. \
    groupby('customer_id').sum().reset_index()
df_sales_amount['sales_flg'] = np.where(df_sales_amount['amount'] > 2000, 1, 0)
df_sales_amount.head(10)
| | customer_id | amount | sales_flg |
|---|---|---|---|
| 0 | CS001113000004 | 1298 | 0 |
| 1 | CS001114000005 | 626 | 0 |
| 2 | CS001115000010 | 3044 | 1 |
| 3 | CS001205000004 | 1988 | 0 |
| 4 | CS001205000006 | 3337 | 1 |
| 5 | CS001211000025 | 456 | 0 |
| 6 | CS001212000027 | 448 | 0 |
| 7 | CS001212000031 | 296 | 0 |
| 8 | CS001212000046 | 228 | 0 |
| 9 | CS001212000070 | 456 | 0 |
Commentary :
This code generates a summary table of customer sales information. Here's what each line does:

`df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python')`: filters the receipt data to exclude customers whose IDs start with "Z". The `query` method applies the filter, and `engine='python'` is required because string methods such as `str.startswith` are not supported by the default numexpr engine. The filtered data is assigned to a new dataframe called `df_sales_amount`.

`df_sales_amount = df_sales_amount[['customer_id', 'amount']].groupby('customer_id').sum().reset_index()`: groups the `df_sales_amount` dataframe by `customer_id`, calculates the sum of the `amount` column for each group, and resets the index to make `customer_id` a column again. The result is a dataframe showing the total sales amount for each customer.

`df_sales_amount['sales_flg'] = np.where(df_sales_amount['amount'] > 2000, 1, 0)`: adds a new column called `sales_flg` to the `df_sales_amount` dataframe. Its values are set based on whether the customer's total `amount` is greater than 2000. The `np.where` function applies the condition and assigns either 1 or 0 to the new column.

`df_sales_amount.head(10)`: displays the first 10 rows of the `df_sales_amount` dataframe. The resulting table shows the `customer_id`, the total `amount` of sales, and a `sales_flg` of 1 (sales amount greater than 2000) or 0 (less than or equal to 2000).
P-053: Binarise the postcode (postal_cd) of the customer data (df_customer) to 1 for Tokyo (the first three digits are 100 to 209) and 0 for all others. Further, combine it with the receipt details data (df_receipt) and count the number of customers with sales results for the entire period, for each binary value created.
# Code example 1
df_tmp = df_customer[['customer_id', 'postal_cd']].copy()
df_tmp['postal_flg'] = df_tmp['postal_cd']. \
apply(lambda x: 1 if 100 <= int(x[0:3]) <= 209 else 0)
pd.merge(df_tmp, df_receipt, how='inner', on='customer_id'). \
groupby('postal_flg').agg({'customer_id':'nunique'})
customer_id | |
---|---|
postal_flg | |
0 | 3906 |
1 | 4400 |
Commentary :
This code performs the following tasks:
Creates a copy of the "customer_id" and "postal_cd" columns from the "df_customer" DataFrame and assigns it to a new DataFrame called "df_tmp".
Creates a new column "postal_flg" in "df_tmp" based on the first 3 digits of the "postal_cd" column. If the first 3 digits are between 100 and 209 (inclusive), "postal_flg" is set to 1, otherwise it is set to 0.
Merges "df_tmp" and "df_receipt" on the "customer_id" column using an inner join.
Groups the resulting DataFrame by the "postal_flg" column and counts the number of unique customer IDs in each group using the "nunique()" function.
Returns the resulting DataFrame showing the count of unique customer IDs grouped by the "postal_flg" column.
In summary, this code is used to analyze the number of customers living in specific postal code regions based on their purchase history.
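The effect of the inner join plus `nunique` can be seen on a toy pair of frames (hypothetical IDs and values): a customer with several receipts is counted once, and a customer with no receipts is dropped by the join.

```python
import pandas as pd

# Toy frames (hypothetical values).
cust = pd.DataFrame({'customer_id': ['A', 'B', 'C'], 'postal_flg': [1, 1, 0]})
rcpt = pd.DataFrame({'customer_id': ['A', 'A', 'B'], 'amount': [10, 20, 30]})

# Inner join keeps only customers with sales; nunique deduplicates receipts.
out = pd.merge(cust, rcpt, how='inner', on='customer_id') \
        .groupby('postal_flg').agg({'customer_id': 'nunique'})
print(out['customer_id'].tolist())  # [2]  (customer C has no sales rows)
```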
# Code example 2 (np.where, use of between)
df_tmp = df_customer[['customer_id', 'postal_cd']].copy()
df_tmp['postal_flg'] = np.where(df_tmp['postal_cd'].str[0:3].astype(int)
.between(100, 209), 1, 0)
pd.merge(df_tmp, df_receipt, how='inner', on='customer_id'). \
groupby('postal_flg').agg({'customer_id':'nunique'})
customer_id | |
---|---|
postal_flg | |
0 | 3906 |
1 | 4400 |
Commentary :
This code creates a new data frame `df_tmp` with a `postal_flg` column that indicates whether the customer's postal code is within a certain range. Here is the step-by-step explanation:

`df_tmp = df_customer[['customer_id', 'postal_cd']].copy()` creates a new data frame `df_tmp` containing only the `customer_id` and `postal_cd` columns from `df_customer`.

`df_tmp['postal_flg'] = np.where(df_tmp['postal_cd'].str[0:3].astype(int).between(100, 209), 1, 0)` creates a new column `postal_flg` in `df_tmp` using the `np.where` function. `str[0:3]` extracts the first three characters of the postal code as a string, and `astype(int)` converts them to an integer. The `between` function checks whether the integer falls within the range 100 to 209 (inclusive). If it does, 1 is assigned to `postal_flg`, otherwise 0.

`pd.merge(df_tmp, df_receipt, how='inner', on='customer_id')` merges `df_tmp` with `df_receipt` on the `customer_id` column, keeping only the rows where the customer appears in both data frames.

`groupby('postal_flg').agg({'customer_id':'nunique'})` groups the merged data frame by `postal_flg` and counts the number of unique `customer_id` values in each group using the `nunique` function. This returns a data frame with one row for each `postal_flg` value, showing the number of customers in each group.
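The slicing/conversion/range-test chain is easy to verify in isolation. A minimal sketch with hypothetical postal codes:

```python
import pandas as pd
import numpy as np

# Hypothetical postal codes. str[0:3] slices the first three characters,
# astype(int) converts them, and between(100, 209) tests the inclusive range.
s = pd.Series(['100-0001', '210-0001', '064-0941'])
flg = np.where(s.str[0:3].astype(int).between(100, 209), 1, 0)
print(flg.tolist())  # [1, 0, 0]
```

Note that `int('064')` is 64, so leading zeros drop out of the comparison, which is exactly what the 100-209 range test needs.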
P-054: The address of the customer data (df_customer) is one of 埼玉県(Saitama-prefecture), 千葉県(Chiba-prefecture), 東京都(Tokyo-Metropolitan) or 神奈川県(Kanagawa-prefecture). Create a code value for each prefecture, and display 10 cases together with the customer ID and address. The values should be 11 for 埼玉県(Saitama-prefecture), 12 for 千葉県(Chiba-prefecture), 13 for 東京都(Tokyo-Metropolitan) and 14 for 神奈川県(Kanagawa-prefecture).
# Code example 1 (fixed-position substring extraction)
df_customer_tmp = df_customer[['customer_id', 'address']].copy()
df_customer_tmp['prefecture_cd'] = \
df_customer['address'].str[0:3].map({'埼玉県': '11',
'千葉県':'12',
'東京都':'13',
'神奈川':'14'})
df_customer_tmp.head(10)
customer_id | address | prefecture_cd | |
---|---|---|---|
0 | CS021313000114 | 神奈川県伊勢原市粟窪********** | 14 |
1 | CS037613000071 | 東京都江東区南砂********** | 13 |
2 | CS031415000172 | 東京都渋谷区代々木********** | 13 |
3 | CS028811000001 | 神奈川県横浜市泉区和泉町********** | 14 |
4 | CS001215000145 | 東京都大田区仲六郷********** | 13 |
5 | CS020401000016 | 東京都板橋区若木********** | 13 |
6 | CS015414000103 | 東京都江東区北砂********** | 13 |
7 | CS029403000008 | 千葉県浦安市海楽********** | 12 |
8 | CS015804000004 | 東京都江東区北砂********** | 13 |
9 | CS033513000180 | 神奈川県横浜市旭区善部町********** | 14 |
Commentary :
This code extracts the prefecture from the address column of the `df_customer` DataFrame and creates a new DataFrame `df_customer_tmp` with columns `customer_id`, `address`, and `prefecture_cd`.

The code first copies the `customer_id` and `address` columns of the `df_customer` DataFrame using the `copy()` method and assigns the result to the `df_customer_tmp` variable.

`df_customer_tmp = df_customer[['customer_id', 'address']].copy()`

It then extracts the first 3 characters of the `address` column using the `str` accessor and assigns the result to a new column `prefecture_cd` via the `map()` method. Note that the dictionary key for Kanagawa is '神奈川' (without 県) because only the first three characters are compared.

`df_customer_tmp['prefecture_cd'] = df_customer['address'].str[0:3].map({'埼玉県': '11', '千葉県':'12', '東京都':'13', '神奈川':'14'})`

The `map()` method maps each prefecture name to its corresponding prefecture code. The resulting DataFrame `df_customer_tmp` has columns `customer_id`, `address`, and `prefecture_cd`.
# Code example 2 (using regular expressions)
df_customer_tmp = df_customer[['customer_id', 'address']].copy()
df_customer_tmp['prefecture_cd'] = \
df_customer['address'].str.extract(r'(^.*?[都道府県])')[0].\
map({'埼玉県': '11',
'千葉県':'12',
'東京都':'13',
'神奈川県':'14'})
df_customer_tmp.head(10)
customer_id | address | prefecture_cd | |
---|---|---|---|
0 | CS021313000114 | 神奈川県伊勢原市粟窪********** | 14 |
1 | CS037613000071 | 東京都江東区南砂********** | 13 |
2 | CS031415000172 | 東京都渋谷区代々木********** | 13 |
3 | CS028811000001 | 神奈川県横浜市泉区和泉町********** | 14 |
4 | CS001215000145 | 東京都大田区仲六郷********** | 13 |
5 | CS020401000016 | 東京都板橋区若木********** | 13 |
6 | CS015414000103 | 東京都江東区北砂********** | 13 |
7 | CS029403000008 | 千葉県浦安市海楽********** | 12 |
8 | CS015804000004 | 東京都江東区北砂********** | 13 |
9 | CS033513000180 | 神奈川県横浜市旭区善部町********** | 14 |
Commentary :
This code creates a new DataFrame `df_customer_tmp` by selecting the columns `customer_id` and `address` from the existing DataFrame `df_customer`.

Next, it extracts the prefecture name from the `address` column using a regular expression. Specifically, the `str.extract()` method extracts the substring matching the pattern `(^.*?[都道府県])`, which means "match any characters from the start of the string up to and including the first occurrence of 都, 道, 府 or 県".

Then, it maps the extracted prefecture names to the corresponding prefecture codes and stores the result in a new column named `prefecture_cd` in the `df_customer_tmp` DataFrame. The mapping is done with the `map()` method and a dictionary whose keys are prefecture names and whose values are the codes.

The resulting DataFrame `df_customer_tmp` has three columns: `customer_id`, `address`, and `prefecture_cd`, where `prefecture_cd` is a two-digit string representing the prefecture code. The `head(10)` method displays the first 10 rows.
P-055: Sum the sales amount (amount) of the receipt details (df_receipt) data for each customer ID (customer_id) and find the quartile points of the total amount. Then, create category values for each customer's total sales amount based on the following criteria, and display 10 results together with the customer ID and the total sales amount. The category values shall be 1-4 in order.
- From the minimum up to, but not including, the first quartile – assign 1
- From the first quartile up to, but not including, the second quartile – assign 2
- From the second quartile up to, but not including, the third quartile – assign 3
- Third quartile and above – assign 4
# Code example 1
df_sales_amount = df_receipt[['customer_id', 'amount']]. \
groupby('customer_id').sum().reset_index()
pct25 = np.quantile(df_sales_amount['amount'], 0.25)
pct50 = np.quantile(df_sales_amount['amount'], 0.5)
pct75 = np.quantile(df_sales_amount['amount'], 0.75)
def pct_group(x):
if x < pct25:
return 1
elif pct25 <= x < pct50:
return 2
elif pct50 <= x < pct75:
return 3
elif pct75 <= x:
return 4
df_sales_amount['pct_group'] = df_sales_amount['amount'].apply(pct_group)
df_sales_amount.head(10)
customer_id | amount | pct_group | |
---|---|---|---|
0 | CS001113000004 | 1298 | 2 |
1 | CS001114000005 | 626 | 2 |
2 | CS001115000010 | 3044 | 3 |
3 | CS001205000004 | 1988 | 3 |
4 | CS001205000006 | 3337 | 3 |
5 | CS001211000025 | 456 | 1 |
6 | CS001212000027 | 448 | 1 |
7 | CS001212000031 | 296 | 1 |
8 | CS001212000046 | 228 | 1 |
9 | CS001212000070 | 456 | 1 |
Commentary :
This code computes quartiles of the total sales amount per customer, and then assigns each customer to one of four groups based on which quartile their total falls into.

Here is a step-by-step breakdown of the code:

`df_sales_amount = df_receipt[['customer_id', 'amount']].groupby('customer_id').sum().reset_index()` computes the total sales amount for each customer by grouping the `df_receipt` dataframe by `customer_id` and summing the `amount` column. The resulting dataframe has two columns: `customer_id` and `amount`.

`pct25 = np.quantile(df_sales_amount['amount'], 0.25)`, `pct50 = np.quantile(df_sales_amount['amount'], 0.5)`, and `pct75 = np.quantile(df_sales_amount['amount'], 0.75)` compute the 25th, 50th, and 75th percentiles of the sales amount distribution.

`def pct_group(x): ...` defines a function `pct_group` that takes a sales amount value `x` and returns an integer from 1 to 4 representing the quartile that `x` belongs to.

`df_sales_amount['pct_group'] = df_sales_amount['amount'].apply(pct_group)` applies the `pct_group` function to each value in the `amount` column of `df_sales_amount` and assigns the resulting quartile number to a new column called `pct_group`.

`df_sales_amount.head(10)` displays the first 10 rows of the resulting dataframe, which now has three columns: `customer_id`, `amount`, and `pct_group`.
# code for confirmation
print('pct25:', pct25)
print('pct50:', pct50)
print('pct75:', pct75)
pct25: 548.5
pct50: 1478.0
pct75: 3651.0
# Code example 2 (example using cut, with quartile range additionally displayed for reference)
df_temp = df_receipt[['customer_id', 'amount']]. \
groupby('customer_id').sum().reset_index()
pct25 = np.quantile(df_temp['amount'], 0.25)
pct50 = np.quantile(df_temp['amount'], 0.5)
pct75 = np.quantile(df_temp['amount'], 0.75)
pct_max = df_temp['amount'].max()
df_temp['quantile'] = pd.cut(df_temp['amount'],
                             [0.0, pct25, pct50, pct75, pct_max + 0.1],
                             right=False)
df_temp['pct_group'] = df_temp.groupby('quantile').ngroup() + 1
df_temp.head(10)
customer_id | amount | quantile | pct_group | |
---|---|---|---|---|
0 | CS001113000004 | 1298 | [548.5, 1478.0) | 2 |
1 | CS001114000005 | 626 | [548.5, 1478.0) | 2 |
2 | CS001115000010 | 3044 | [1478.0, 3651.0) | 3 |
3 | CS001205000004 | 1988 | [1478.0, 3651.0) | 3 |
4 | CS001205000006 | 3337 | [1478.0, 3651.0) | 3 |
5 | CS001211000025 | 456 | [0.0, 548.5) | 1 |
6 | CS001212000027 | 448 | [0.0, 548.5) | 1 |
7 | CS001212000031 | 296 | [0.0, 548.5) | 1 |
8 | CS001212000046 | 228 | [0.0, 548.5) | 1 |
9 | CS001212000070 | 456 | [0.0, 548.5) | 1 |
Commentary :
This code creates a new column 'pct_group' in a DataFrame 'df_temp' that contains the total sales amount of each customer and the quartile group they belong to based on that total.

The first two lines group the 'df_receipt' DataFrame by customer_id, calculate the sum of the sales amount for each customer, and reset the index to make customer_id a column again.

The next four lines calculate the 25th, 50th and 75th percentiles and the maximum value of the per-customer sales totals.

The next line creates a new column 'quantile' in 'df_temp' by cutting the sales amount into four bins at the computed percentile values, with right=False so each bin is left-closed and right-open.

Finally, the last line creates a new column 'pct_group' in 'df_temp' that groups the data by 'quantile' and assigns a group number (1-4) to each customer with ngroup().

Note that the right endpoint of the last bin is defined as pct_max + 0.1 to ensure that the maximum sales amount is included in the last group.
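The `right=False` edge behaviour is the key detail here. A minimal sketch on a toy series:

```python
import pandas as pd

# right=False makes each bin left-closed / right-open, e.g. [0, 10),
# so a value sitting exactly on an edge falls into the *upper* bin.
s = pd.Series([0, 9, 10, 19, 20])
cats = pd.cut(s, bins=[0, 10, 20, 30], right=False)

# Bin membership as category codes: 0 and 9 share the first bin,
# while the edge values 10 and 20 open the next bins.
print(cats.cat.codes.tolist())  # [0, 0, 1, 1, 2]
```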
# Reference code (example using qcut, not intended to meet the subject matter as the inclusion/exclusion of boundary values is reversed, but included for reference only).
df_temp = df_receipt.groupby('customer_id')[['amount']].sum()
df_temp['quantile'], bins = \
pd.qcut(df_receipt.groupby('customer_id')['amount'].sum(), 4, retbins=True)
df_temp['pct_group'] = df_temp.groupby('quantile').ngroup() + 1
df_temp.reset_index(inplace=True)
display(df_temp.head(10))
print('quantiles:', bins)
customer_id | amount | quantile | pct_group | |
---|---|---|---|---|
0 | CS001113000004 | 1298 | (548.5, 1478.0] | 2 |
1 | CS001114000005 | 626 | (548.5, 1478.0] | 2 |
2 | CS001115000010 | 3044 | (1478.0, 3651.0] | 3 |
3 | CS001205000004 | 1988 | (1478.0, 3651.0] | 3 |
4 | CS001205000006 | 3337 | (1478.0, 3651.0] | 3 |
5 | CS001211000025 | 456 | (69.999, 548.5] | 1 |
6 | CS001212000027 | 448 | (69.999, 548.5] | 1 |
7 | CS001212000031 | 296 | (69.999, 548.5] | 1 |
8 | CS001212000046 | 228 | (69.999, 548.5] | 1 |
9 | CS001212000070 | 456 | (69.999, 548.5] | 1 |
quantiles: [7.0000000e+01 5.4850000e+02 1.4780000e+03 3.6510000e+03 1.2395003e+07]
Commentary :
This code performs the following operations:

It groups the `df_receipt` dataframe by `customer_id` and calculates the sum of `amount` for each group using the `sum()` method of the grouped object. The result is stored in a new dataframe called `df_temp`.

The `qcut()` function divides the `amount` values into 4 equal-sized bins based on their values. The `retbins=True` option returns the edges of the bins in the `bins` variable.

A new column called `quantile` is created in `df_temp` that indicates which bin each `amount` value falls into. Note that `qcut` produces right-closed intervals, which is why the boundary handling is the reverse of what the exercise asks for.

The `ngroup()` method assigns a group number to each bin in `df_temp['quantile']`. The `+ 1` is added because `ngroup()` returns 0-indexed values.

The `reset_index()` method is called on `df_temp` to reset the index to a new range index.

Finally, the first 10 rows of `df_temp` and the edges of the bins are displayed using the `display()` and `print()` functions, respectively.
P-056: Calculate the age in 10-year age increments based on the age (age) of the customer data (df_customer) and display 10 cases together with the customer ID (customer_id) and date of birth (birth_day). However, all age groups above 60 years should be assumed to be in the 60s. The name of the category representing the age is optional.
# Code example 1
df_customer_era = df_customer[['customer_id', 'birth_day']].copy()
df_customer_era['era'] = df_customer['age']. \
apply(lambda x: min(math.floor(x / 10) * 10, 60))
df_customer_era.head(10)
customer_id | birth_day | era | |
---|---|---|---|
0 | CS021313000114 | 1981-04-29 | 30 |
1 | CS037613000071 | 1952-04-01 | 60 |
2 | CS031415000172 | 1976-10-04 | 40 |
3 | CS028811000001 | 1933-03-27 | 60 |
4 | CS001215000145 | 1995-03-29 | 20 |
5 | CS020401000016 | 1974-09-15 | 40 |
6 | CS015414000103 | 1977-08-09 | 40 |
7 | CS029403000008 | 1973-08-17 | 40 |
8 | CS015804000004 | 1931-05-02 | 60 |
9 | CS033513000180 | 1962-07-11 | 50 |
Commentary :
This code creates a new DataFrame `df_customer_era` containing the columns `customer_id`, `birth_day`, and `era`.

The `customer_id` and `birth_day` columns are taken from the original `df_customer` DataFrame.

The `era` column is computed from the `age` column in the `df_customer` DataFrame. For each value of `age`, the code divides by 10, takes the floor, and multiplies by 10 again to get the decade. If the result is 60 or greater, the era is capped at 60.

The `apply` method applies a function (in this case, a lambda) to each element of the `age` column, and the `math.floor` and `min` functions compute the capped decade. The resulting value is assigned to the `era` column in `df_customer_era`.

Finally, the `head` method is called to display the first 10 rows of the resulting DataFrame.
# Code example 2 (using cut; categories are output as ranges)
df_customer_era = df_customer[['customer_id', 'birth_day']].copy()
df_customer_era['era'] = pd.cut(df_customer['age'],
bins=[0, 10, 20, 30, 40, 50, 60, np.inf],
right=False)
df_customer_era[['customer_id', 'birth_day', 'era']].head(10)
customer_id | birth_day | era | |
---|---|---|---|
0 | CS021313000114 | 1981-04-29 | [30.0, 40.0) |
1 | CS037613000071 | 1952-04-01 | [60.0, inf) |
2 | CS031415000172 | 1976-10-04 | [40.0, 50.0) |
3 | CS028811000001 | 1933-03-27 | [60.0, inf) |
4 | CS001215000145 | 1995-03-29 | [20.0, 30.0) |
5 | CS020401000016 | 1974-09-15 | [40.0, 50.0) |
6 | CS015414000103 | 1977-08-09 | [40.0, 50.0) |
7 | CS029403000008 | 1973-08-17 | [40.0, 50.0) |
8 | CS015804000004 | 1931-05-02 | [60.0, inf) |
9 | CS033513000180 | 1962-07-11 | [50.0, 60.0) |
Commentary :
This code creates a new DataFrame `df_customer_era` with columns `customer_id`, `birth_day`, and `era`. The `customer_id` and `birth_day` columns are copied over from the original `df_customer` DataFrame.

The `era` column is created by binning the customers' ages into age ranges using `pd.cut()`. The `bins` parameter specifies the edges of each bin, where `np.inf` specifies an open-ended bin for customers who are 60 or older. The `right` parameter is set to `False` so that the intervals are left-inclusive, i.e., a customer whose age is exactly 30 is placed in the `[30, 40)` bin.

For each row, the customer's age is binned into the corresponding era. The resulting DataFrame contains the `customer_id`, `birth_day`, and `era` columns, with `era` holding the age-range bin for each customer.
P-057: Using the extraction results of 056 and the gender code (gender_cd), create new category data representing the combination of gender x age and display 10 cases. The value of the category representing the combination shall be arbitrary.
# Generate a sex and age code by concatenating a one-digit sex code and a two-digit age code.
df_customer_era = df_customer[['customer_id', 'birth_day']].copy()
df_customer_era['era'] = df_customer['age']. \
apply(lambda x: min(math.floor(x / 10) * 10, 60))
df_customer_era['gender_era'] = \
df_customer['gender_cd'] + df_customer_era['era'].astype('str').str.zfill(2)
df_customer_era.head(10)
customer_id | birth_day | era | gender_era | |
---|---|---|---|---|
0 | CS021313000114 | 1981-04-29 | 30 | 130 |
1 | CS037613000071 | 1952-04-01 | 60 | 960 |
2 | CS031415000172 | 1976-10-04 | 40 | 140 |
3 | CS028811000001 | 1933-03-27 | 60 | 160 |
4 | CS001215000145 | 1995-03-29 | 20 | 120 |
5 | CS020401000016 | 1974-09-15 | 40 | 040 |
6 | CS015414000103 | 1977-08-09 | 40 | 140 |
7 | CS029403000008 | 1973-08-17 | 40 | 040 |
8 | CS015804000004 | 1931-05-02 | 60 | 060 |
9 | CS033513000180 | 1962-07-11 | 50 | 150 |
Commentary :
This code creates a new column "era" in the DataFrame "df_customer_era", which indicates the decade of a customer's age. The decade is obtained by dividing the age by 10, taking the floor, and multiplying by 10 again; any result above 60 is capped at 60.
Then, the code creates a new column "gender_era" by concatenating the gender code and the "era" column values. The "gender_cd" column contains gender codes (e.g., "0" for male, "1" for female), and the "era" column values are cast as strings and zero-padded to a width of 2.
Finally, the code displays the first 10 rows of the resulting DataFrame.
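The concatenation step can be checked with a few hypothetical codes. `zfill(2)` left-pads the era string with zeros so every combined code has a fixed one-digit-plus-two-digit layout:

```python
import pandas as pd

# Hypothetical gender codes and eras.
gender_cd = pd.Series(['1', '9', '0'])
era = pd.Series([30, 60, 40])

# String concatenation: 1-digit gender code + zero-padded 2-digit era.
gender_era = gender_cd + era.astype('str').str.zfill(2)
print(gender_era.tolist())  # ['130', '960', '040']
```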
P-058: Create a dummy variable for the gender code (gender_cd) of the customer data (df_customer) and display 10 cases together with the customer ID (customer_id).
# Code example 1 (all code values itemised)
pd.get_dummies(df_customer[['customer_id', 'gender_cd']],
columns=['gender_cd']).head(10)
customer_id | gender_cd_0 | gender_cd_1 | gender_cd_9 | |
---|---|---|---|---|
0 | CS021313000114 | 0 | 1 | 0 |
1 | CS037613000071 | 0 | 0 | 1 |
2 | CS031415000172 | 0 | 1 | 0 |
3 | CS028811000001 | 0 | 1 | 0 |
4 | CS001215000145 | 0 | 1 | 0 |
5 | CS020401000016 | 1 | 0 | 0 |
6 | CS015414000103 | 0 | 1 | 0 |
7 | CS029403000008 | 1 | 0 | 0 |
8 | CS015804000004 | 1 | 0 | 0 |
9 | CS033513000180 | 0 | 1 | 0 |
Commentary :
This code creates dummy variables for the categorical variable 'gender_cd' in the DataFrame 'df_customer'. The resulting DataFrame has one column for each code value present in 'gender_cd' (here 0, 1 and 9), and each row has a 1 in the column corresponding to that customer's 'gender_cd' value and 0s in all other columns.

The 'get_dummies' function from pandas is used for this purpose. It takes a DataFrame and a list of columns to apply the dummy encoding to. Here it is called on the 'df_customer' DataFrame with the 'gender_cd' column specified in the 'columns' parameter, and the first 10 rows of the result are returned.
# Code example 2 (one item can be deleted or the delimiter changed)
pd.get_dummies(df_customer[['customer_id', 'gender_cd']],
columns=['gender_cd'],
drop_first=True, prefix='gen', prefix_sep='#').head(10)
customer_id | gen#1 | gen#9 | |
---|---|---|---|
0 | CS021313000114 | 1 | 0 |
1 | CS037613000071 | 0 | 1 |
2 | CS031415000172 | 1 | 0 |
3 | CS028811000001 | 1 | 0 |
4 | CS001215000145 | 1 | 0 |
5 | CS020401000016 | 0 | 0 |
6 | CS015414000103 | 1 | 0 |
7 | CS029403000008 | 0 | 0 |
8 | CS015804000004 | 0 | 0 |
9 | CS033513000180 | 1 | 0 |
Commentary :
This code creates dummy variables for the categorical variable `gender_cd` in the `df_customer` DataFrame. It uses the `pd.get_dummies()` function from the pandas library to create a binary indicator variable for each unique category in `gender_cd`.

The `drop_first=True` argument drops the first category ('0'), which becomes the reference category. This is done to prevent multicollinearity between the dummy variables.

The `prefix='gen'` and `prefix_sep='#'` arguments control the column names of the dummy variables: here the prefix is 'gen' and the separator is '#'.

The resulting DataFrame has one column for each remaining value of `gender_cd`, with 1 indicating that the customer is in that category and 0 that they are not. The column names take the form `gen#<gender_cd>`, where `<gender_cd>` is the value from the original `gender_cd` column.
P-059: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), standardise the total sales amount to mean 0 and standard deviation 1 and display 10 cases with the customer ID and total sales amount. The standard deviation used for standardisation may be either the square root of the variance or the square root of the unbiased variance. However, exclude customer IDs starting with “Z”, as they represent non-members.
TIPS:
The query() argument engine allows you to choose between 'python' and 'numexpr', defaulting to numexpr if it is installed, otherwise python. Furthermore, string methods cannot be used in query() unless engine='python'.
# Standardise with the population standard deviation of the data, using sklearn's preprocessing.scale
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
engine='python'). \
groupby('customer_id'). \
agg({'amount':'sum'}).reset_index()
df_sales_amount['std_amount'] = preprocessing.scale(df_sales_amount['amount'])
df_sales_amount.head(10)
customer_id | amount | std_amount | |
---|---|---|---|
0 | CS001113000004 | 1298 | -0.459378 |
1 | CS001114000005 | 626 | -0.706390 |
2 | CS001115000010 | 3044 | 0.182413 |
3 | CS001205000004 | 1988 | -0.205749 |
4 | CS001205000006 | 3337 | 0.290114 |
5 | CS001211000025 | 456 | -0.768879 |
6 | CS001212000027 | 448 | -0.771819 |
7 | CS001212000031 | 296 | -0.827691 |
8 | CS001212000046 | 228 | -0.852686 |
9 | CS001212000070 | 456 | -0.768879 |
Commentary :
This code does the following:

Filters out the rows in `df_receipt` where `customer_id` starts with the letter "Z".

Groups the remaining rows by `customer_id` and calculates the sum of the `amount` column for each group using the `agg` method.

Resets the index of the resulting DataFrame to make `customer_id` a regular column.

Applies the `preprocessing.scale` function from scikit-learn to the `amount` column to calculate the z-score of each customer's total sales, and stores the result in a new column called `std_amount`.

The resulting DataFrame `df_sales_amount` contains one row per customer, with columns for `customer_id`, the total sales `amount`, and `std_amount`, the z-score of their sales.
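The z-score computed by `preprocessing.scale` can be reproduced by hand, which also shows which standard deviation it uses (a sketch on toy values):

```python
import numpy as np
from sklearn import preprocessing

# preprocessing.scale standardises with the population standard
# deviation (ddof=0), i.e. the square root of the biased variance.
x = np.array([1.0, 2.0, 3.0, 4.0])
scaled = preprocessing.scale(x)
manual = (x - x.mean()) / x.std(ddof=0)
print(np.allclose(scaled, manual))  # True
```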
# Code example 2 (by doing a fit, you can standardise different data with the same mean and standard deviation).
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
engine='python'). \
groupby('customer_id'). \
agg({'amount':'sum'}).reset_index()
scaler = preprocessing.StandardScaler()
scaler.fit(df_sales_amount[['amount']])
df_sales_amount['std_amount'] = scaler.transform(df_sales_amount[['amount']])
df_sales_amount.head(10)
customer_id | amount | std_amount | |
---|---|---|---|
0 | CS001113000004 | 1298 | -0.459378 |
1 | CS001114000005 | 626 | -0.706390 |
2 | CS001115000010 | 3044 | 0.182413 |
3 | CS001205000004 | 1988 | -0.205749 |
4 | CS001205000006 | 3337 | 0.290114 |
5 | CS001211000025 | 456 | -0.768879 |
6 | CS001212000027 | 448 | -0.771819 |
7 | CS001212000031 | 296 | -0.827691 |
8 | CS001212000046 | 228 | -0.852686 |
9 | CS001212000070 | 456 | -0.768879 |
Commentary :
This code performs standardization (also called Z-score normalization) on the 'amount' column of the 'df_sales_amount' DataFrame.
The code first filters out the transactions made by customers whose IDs start with the letter 'Z' using the 'query' method of pandas DataFrame. It then groups the remaining transactions by 'customer_id' and calculates the total amount spent by each customer using the 'agg' method. Finally, it resets the index of the resulting DataFrame.
Next, the code creates an instance of the 'StandardScaler' class from the 'preprocessing' module of scikit-learn library. The 'StandardScaler' class is used for standardization, which scales the data so that it has a mean of zero and a standard deviation of one. The 'fit' method of the 'scaler' instance is then used to calculate the mean and standard deviation of the 'amount' column of the 'df_sales_amount' DataFrame.
Finally, the 'transform' method of the 'scaler' instance is used to perform the actual standardization on the 'amount' column of the 'df_sales_amount' DataFrame, and the resulting standardized values are stored in a new column named 'std_amount'.
The resulting DataFrame, 'df_sales_amount', contains the 'customer_id', total 'amount' spent by each customer, and the standardized 'std_amount' column.
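The advantage of the fit/transform split is that the learned statistics can be reused on other data. A minimal sketch with made-up values:

```python
import numpy as np
from sklearn import preprocessing

# Toy data (hypothetical). fit() learns mean and std from one dataset;
# transform() can then reuse them on different data, e.g. applying
# training-set statistics to a test set.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
new = np.array([[2.5], [10.0]])

scaler = preprocessing.StandardScaler()
scaler.fit(train)  # learns mean = 2.5, std = sqrt(1.25)

# 2.5 equals the learned mean, so it maps to exactly 0.
print(scaler.transform(new).ravel()[0])  # 0.0
```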
P-060: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), normalise the total sales amount to a minimum value of 0 and a maximum value of 1 and display 10 items with the customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.
# Code example 1
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
engine='python'). \
groupby('customer_id'). \
agg({'amount':'sum'}).reset_index()
df_sales_amount['scale_amount'] = \
preprocessing.minmax_scale(df_sales_amount['amount'])
df_sales_amount.head(10)
customer_id | amount | scale_amount | |
---|---|---|---|
0 | CS001113000004 | 1298 | 0.053354 |
1 | CS001114000005 | 626 | 0.024157 |
2 | CS001115000010 | 3044 | 0.129214 |
3 | CS001205000004 | 1988 | 0.083333 |
4 | CS001205000006 | 3337 | 0.141945 |
5 | CS001211000025 | 456 | 0.016771 |
6 | CS001212000027 | 448 | 0.016423 |
7 | CS001212000031 | 296 | 0.009819 |
8 | CS001212000046 | 228 | 0.006865 |
9 | CS001212000070 | 456 | 0.016771 |
Commentary :
This code performs data preprocessing on the pandas DataFrame `df_receipt`. Here's what each step does:

Filters out rows where the customer ID starts with "Z" using the `query()` method with `engine='python'` (required for string methods). The resulting DataFrame is assigned to `df_sales_amount`.

Groups `df_sales_amount` by customer ID using the `groupby()` method and calculates the sum of the `amount` column for each group using the `agg()` method.

Resets the index of `df_sales_amount` with `reset_index()` to turn `customer_id` back into a regular column.

Applies the `minmax_scale()` function from scikit-learn's `preprocessing` module to the `amount` column of `df_sales_amount` to scale the values between 0 and 1. The scaled values are stored in a new column called `scale_amount`.

In short, this code computes the total sales per customer, excludes non-members (IDs starting with "Z"), and scales each customer's total between 0 and 1 using min-max scaling.
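The underlying formula is simple enough to verify by hand on toy amounts:

```python
import numpy as np

# Min-max scaling: scaled = (x - min) / (max - min), mapping the
# smallest value to 0 and the largest to 1 (toy amounts below).
x = np.array([228.0, 456.0, 3337.0])
scaled = (x - x.min()) / (x.max() - x.min())
print(scaled[0], scaled[-1])  # 0.0 1.0
```

One caveat of min-max scaling: a single outlier sets the maximum, compressing all other values toward 0.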
# Code example 2 (by doing a fit, you can standardise on the same minimum and maximum values for different data).
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
engine='python'). \
groupby('customer_id'). \
agg({'amount':'sum'}).reset_index()
scaler = preprocessing.MinMaxScaler()
scaler.fit(df_sales_amount[['amount']])
df_sales_amount['scale_amount'] = scaler.transform(df_sales_amount[['amount']])
df_sales_amount.head(10)
customer_id | amount | scale_amount | |
---|---|---|---|
0 | CS001113000004 | 1298 | 0.053354 |
1 | CS001114000005 | 626 | 0.024157 |
2 | CS001115000010 | 3044 | 0.129214 |
3 | CS001205000004 | 1988 | 0.083333 |
4 | CS001205000006 | 3337 | 0.141945 |
5 | CS001211000025 | 456 | 0.016771 |
6 | CS001212000027 | 448 | 0.016423 |
7 | CS001212000031 | 296 | 0.009819 |
8 | CS001212000046 | 228 | 0.006865 |
9 | CS001212000070 | 456 | 0.016771 |
Commentary :
This code performs feature scaling on the per-customer sales totals. Here's what each part does:

`df_sales_amount`: creates a new DataFrame with the `customer_id` and total `amount` spent by each customer, obtained by grouping `df_receipt` by `customer_id` and summing `amount` for each group.

`scaler`: initialises a `MinMaxScaler` object from scikit-learn's `preprocessing` module, which scales data into the range [0, 1].

`scaler.fit()`: fits the scaler to the `amount` column of `df_sales_amount`, computing the minimum and maximum of the data.

`scaler.transform()`: scales the `amount` data using the minimum and maximum learned by the scaler, and the result is stored in a new column `scale_amount`.

`df_sales_amount.head(10)`: displays the first 10 rows of the updated DataFrame, including the scaled values in the new `scale_amount` column.

Overall, this code brings the `amount` values for all customers into the same [0, 1] range, making them directly comparable and more useful for analysis.