Reference: "Data Science 100 Knocks (Structured Data Processing)" by the Data Scientist Society Skill Definition Committee (データサイエンティスト協会スキル定義委員会)
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society (データサイエンティスト協会) and translated from Japanese to English with DeepL.
-
I updated this file to help spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers.
-
Since this data was created for a Japanese audience, you may run into Japanese text while practicing, but it will not affect the exercises much.
Getting Started
- First, run the following cells
- Import the necessary libraries and read data from the database (PostgreSQL)
- The following cell imports libraries that are expected to be used, such as pandas.
- If there are other libraries you wish to use, install them as needed (you can also install them with `!pip install <library name>`).
- The exercises can be worked through over multiple sessions.
- Names, addresses, etc. are dummy data and are not real.
import os
import pandas as pd
import numpy as np
from datetime import datetime, date
from dateutil.relativedelta import relativedelta
import math
import psycopg2
from sqlalchemy import create_engine
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from imblearn.under_sampling import RandomUnderSampler
if 'PG_PORT' in os.environ:
    pgconfig = {
        'host': 'db',
        'port': os.environ['PG_PORT'],
        'database': os.environ['PG_DATABASE'],
        'user': os.environ['PG_USER'],
        'password': os.environ['PG_PASSWORD'],
    }
    # Connector for pd.read_sql
    conn = psycopg2.connect(**pgconfig)
    df_customer = pd.read_sql(sql='select * from customer', con=conn)
    df_category = pd.read_sql(sql='select * from category', con=conn)
    df_product = pd.read_sql(sql='select * from product', con=conn)
    df_receipt = pd.read_sql(sql='select * from receipt', con=conn)
    df_store = pd.read_sql(sql='select * from store', con=conn)
    df_geocode = pd.read_sql(sql='select * from geocode', con=conn)
else:
    if not os.path.exists('data/'):
        !git clone https://github.com/The-Japan-DataScientist-Society/100knocks-preprocess
        os.chdir('100knocks-preprocess/docker/work/')
    dtype = {
        'customer_id': str,
        'gender_cd': str,
        'postal_cd': str,
        'application_store_cd': str,
        'status_cd': str,
        'category_major_cd': str,
        'category_medium_cd': str,
        'category_small_cd': str,
        'product_cd': str,
        'store_cd': str,
        'prefecture_cd': str,
        'tel_no': str,
        'street': str
    }
    df_customer = pd.read_csv("data/customer.csv", dtype=dtype)
    df_category = pd.read_csv("data/category.csv", dtype=dtype)
    df_product = pd.read_csv("data/product.csv", dtype=dtype)
    df_receipt = pd.read_csv("data/receipt.csv", dtype=dtype)
    df_store = pd.read_csv("data/store.csv", dtype=dtype)
    df_geocode = pd.read_csv("data/geocode.csv", dtype=dtype)
Exercise
P-041: Sum up the sales amount (amount) of the receipt details data (df_receipt) by date (sales_ymd) and calculate the increase or decrease in sales amount from the date of the last sale. And display 10 results.
df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']]. \
    groupby('sales_ymd').sum().reset_index()
df_sales_amount_by_date = pd.concat([df_sales_amount_by_date,
                                     df_sales_amount_by_date.shift()], axis=1)
df_sales_amount_by_date.columns = ['sales_ymd', 'amount', 'lag_ymd', 'lag_amount']
df_sales_amount_by_date['diff_amount'] = \
    df_sales_amount_by_date['amount'] - df_sales_amount_by_date['lag_amount']
df_sales_amount_by_date.head(10)
| | sales_ymd | amount | lag_ymd | lag_amount | diff_amount |
|---|---|---|---|---|---|
| 0 | 20170101 | 33723 | NaN | NaN | NaN |
| 1 | 20170102 | 24165 | 20170101.0 | 33723.0 | -9558.0 |
| 2 | 20170103 | 27503 | 20170102.0 | 24165.0 | 3338.0 |
| 3 | 20170104 | 36165 | 20170103.0 | 27503.0 | 8662.0 |
| 4 | 20170105 | 37830 | 20170104.0 | 36165.0 | 1665.0 |
| 5 | 20170106 | 32387 | 20170105.0 | 37830.0 | -5443.0 |
| 6 | 20170107 | 23415 | 20170106.0 | 32387.0 | -8972.0 |
| 7 | 20170108 | 24737 | 20170107.0 | 23415.0 | 1322.0 |
| 8 | 20170109 | 26718 | 20170108.0 | 24737.0 | 1981.0 |
| 9 | 20170110 | 20143 | 20170109.0 | 26718.0 | -6575.0 |
Commentary :
This code calculates the daily sales amount, shifts it by one day to obtain the previous day's figures, and then takes the difference to produce a new column 'diff_amount'.

Here's a line-by-line explanation:

`df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']].groupby('sales_ymd').sum().reset_index()`: selects the columns 'sales_ymd' and 'amount' from the `df_receipt` dataframe, groups by 'sales_ymd', sums the 'amount' column for each group, and resets the index to create a new dataframe `df_sales_amount_by_date` with the columns 'sales_ymd' and 'amount'.

`df_sales_amount_by_date = pd.concat([df_sales_amount_by_date, df_sales_amount_by_date.shift()], axis=1)`: concatenates the dataframe with a shifted version of itself along the columns axis (i.e., horizontally), producing a dataframe with four columns.

`df_sales_amount_by_date.columns = ['sales_ymd','amount','lag_ymd','lag_amount']`: renames the columns to more meaningful names.

`df_sales_amount_by_date['diff_amount'] = df_sales_amount_by_date['amount'] - df_sales_amount_by_date['lag_amount']`: creates a new column 'diff_amount' by subtracting the 'lag_amount' column from the 'amount' column.

`df_sales_amount_by_date.head(10)`: displays the first 10 rows of the resulting dataframe.
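The same previous-day difference can be computed more directly with `Series.diff()`; a minimal sketch on made-up daily totals:

```python
import pandas as pd

# Made-up daily totals standing in for df_sales_amount_by_date.
df = pd.DataFrame({'sales_ymd': [20170101, 20170102, 20170103],
                   'amount': [33723, 24165, 27503]})

# diff() subtracts the previous row's amount, so no concat/shift/rename is needed.
df['diff_amount'] = df['amount'].diff()
print(df)
```

The first row's difference is NaN, matching the concat/shift approach.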
P-042: Tabulate the sales amount (amount) of the receipt details data (df_receipt) by date (sales_ymd), and for each date, combine the data of the last, previous and three previous sales dates. Then display 10 results.
# Code example 1: Vertical holding case
df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']]. \
    groupby('sales_ymd').sum().reset_index()
for i in range(1, 4):
    df_tmp = pd.concat([df_sales_amount_by_date,
                        df_sales_amount_by_date.shift(i)], axis=1)
    if i == 1:
        df_lag = df_tmp
    else:
        # DataFrame.append was removed in pandas 2.0; use pd.concat instead
        df_lag = pd.concat([df_lag, df_tmp])
df_lag.columns = ['sales_ymd', 'amount', 'lag_ymd', 'lag_amount']
df_lag.dropna().astype(int).sort_values(['sales_ymd', 'lag_ymd']).head(10)
| | sales_ymd | amount | lag_ymd | lag_amount |
|---|---|---|---|---|
| 1 | 20170102 | 24165 | 20170101 | 33723 |
| 2 | 20170103 | 27503 | 20170101 | 33723 |
| 2 | 20170103 | 27503 | 20170102 | 24165 |
| 3 | 20170104 | 36165 | 20170101 | 33723 |
| 3 | 20170104 | 36165 | 20170102 | 24165 |
| 3 | 20170104 | 36165 | 20170103 | 27503 |
| 4 | 20170105 | 37830 | 20170102 | 24165 |
| 4 | 20170105 | 37830 | 20170103 | 27503 |
| 4 | 20170105 | 37830 | 20170104 | 36165 |
| 5 | 20170106 | 32387 | 20170103 | 27503 |
Commentary :
This code performs a lag analysis on the sales data.

First, it groups the sales data by date and sums the sales amount for each date; the daily totals are stored in `df_sales_amount_by_date`.

Then, a for loop iterates three times, with `i` running from 1 to 3 (inclusive).

In each iteration, the code creates a new DataFrame called `df_tmp` by concatenating `df_sales_amount_by_date` with itself shifted by `i` rows, so each row contains the sales data for a specific date alongside the sales data for the `i`-th previous date.

If `i` equals 1, the resulting DataFrame is stored in `df_lag`; otherwise, it is appended to `df_lag`.

The columns of `df_lag` are then renamed to `sales_ymd`, `amount`, `lag_ymd`, and `lag_amount` to indicate the date and amount of sales for the current date and the lagged date.

Finally, the code drops any rows containing NaN values (which occur for the first `i` rows), converts the sales amounts to integers, sorts by `sales_ymd` and `lag_ymd`, and displays the first 10 rows.

This provides the total sales amount for each date together with the totals for the 1st, 2nd, and 3rd previous dates, allowing the user to perform lag analysis on the sales data.
# Code example 2: Horizontal holding case
df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']]. \
    groupby('sales_ymd').sum().reset_index()
df_lag = df_sales_amount_by_date
for i in range(1, 4):
    df_lag = pd.concat([df_lag, df_sales_amount_by_date.shift(i)], axis=1)
    columns = [f'lag_ymd_{i}', f'lag_amount_{i}']
    df_lag.columns = list(df_lag.columns)[:-len(columns)] + columns
df_lag.dropna().astype(int).sort_values(['sales_ymd']).head(10)
| | sales_ymd | amount | lag_ymd_1 | lag_amount_1 | lag_ymd_2 | lag_amount_2 | lag_ymd_3 | lag_amount_3 |
|---|---|---|---|---|---|---|---|---|
| 3 | 20170104 | 36165 | 20170103 | 27503 | 20170102 | 24165 | 20170101 | 33723 |
| 4 | 20170105 | 37830 | 20170104 | 36165 | 20170103 | 27503 | 20170102 | 24165 |
| 5 | 20170106 | 32387 | 20170105 | 37830 | 20170104 | 36165 | 20170103 | 27503 |
| 6 | 20170107 | 23415 | 20170106 | 32387 | 20170105 | 37830 | 20170104 | 36165 |
| 7 | 20170108 | 24737 | 20170107 | 23415 | 20170106 | 32387 | 20170105 | 37830 |
| 8 | 20170109 | 26718 | 20170108 | 24737 | 20170107 | 23415 | 20170106 | 32387 |
| 9 | 20170110 | 20143 | 20170109 | 26718 | 20170108 | 24737 | 20170107 | 23415 |
| 10 | 20170111 | 24287 | 20170110 | 20143 | 20170109 | 26718 | 20170108 | 24737 |
| 11 | 20170112 | 23526 | 20170111 | 24287 | 20170110 | 20143 | 20170109 | 26718 |
| 12 | 20170113 | 28004 | 20170112 | 23526 | 20170111 | 24287 | 20170110 | 20143 |
Commentary :
The code provided performs the following operations:

The `df_receipt` DataFrame is used to select the `sales_ymd` and `amount` columns, which are grouped by `sales_ymd`; `sum()` is applied to the `amount` column to calculate the total sales amount for each day. The result is stored in `df_sales_amount_by_date`.

A new DataFrame called `df_lag` is created as a copy of `df_sales_amount_by_date`.

A loop runs three times (`i` from 1 to 3); in each iteration, `shift()` shifts `df_sales_amount_by_date` by `i` periods (days), and the shifted DataFrame is concatenated horizontally onto `df_lag` using `pd.concat()`.

The `columns` variable stores the names of the new columns added to `df_lag` in that iteration, and the column names of `df_lag` are updated accordingly.

`dropna()` removes any rows with missing values, `astype()` converts the DataFrame to integer type, `sort_values()` sorts in ascending order by `sales_ymd`, and the first 10 rows are returned with `head()`.

In summary, the code builds a DataFrame `df_lag` containing the sales amount for each day alongside the sales amounts for the previous 3 days, sorted by sales date, to help identify trends in sales performance.
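The same wide lag layout can be built in one pass with `assign()` and a dict comprehension; a sketch on made-up totals (only the amount lags are shown here):

```python
import pandas as pd

# Made-up daily totals (hypothetical values).
df = pd.DataFrame({'sales_ymd': [20170101, 20170102, 20170103, 20170104],
                   'amount': [33723, 24165, 27503, 36165]})

# Each lag_amount_i column is amount shifted by i rows.
df_lag = df.assign(**{f'lag_amount_{i}': df['amount'].shift(i)
                      for i in range(1, 4)})
print(df_lag.dropna())
```

Only rows with all three lags available survive `dropna()`, just as in the loop version.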
P-043: Combine receipt details data (df_receipt) and customer data (df_customer) to create sales summary data, summing the sales amount (amount) by gender code (gender_cd) and age (calculated from age). Gender_cd should be 0 for male, 1 for female and 9 for unknown.
However, the item structure should consist of four items: age, female sales amount, male sales amount and sales amount of unknown gender (vertical cross tabulation of age and horizontal cross tabulation of gender). In addition, ages shall be grouped into 10-year brackets.
# Code example 1
df_tmp = pd.merge(df_receipt, df_customer, how='inner', on='customer_id')
df_tmp['era'] = df_tmp['age'].apply(lambda x: math.floor(x / 10) * 10)
df_sales_summary = pd.pivot_table(
    df_tmp, index='era',
    columns='gender_cd',
    values='amount',
    aggfunc='sum'
).reset_index()
df_sales_summary.columns = ['era', 'male', 'female', 'unknown']
df_sales_summary
| | era | male | female | unknown |
|---|---|---|---|---|
| 0 | 10 | 1591.0 | 149836.0 | 4317.0 |
| 1 | 20 | 72940.0 | 1363724.0 | 44328.0 |
| 2 | 30 | 177322.0 | 693047.0 | 50441.0 |
| 3 | 40 | 19355.0 | 9320791.0 | 483512.0 |
| 4 | 50 | 54320.0 | 6685192.0 | 342923.0 |
| 5 | 60 | 272469.0 | 987741.0 | 71418.0 |
| 6 | 70 | 13435.0 | 29764.0 | 2427.0 |
| 7 | 80 | 46360.0 | 262923.0 | 5111.0 |
| 8 | 90 | NaN | 6260.0 | NaN |
Commentary :
This code does the following:

Merges the dataframes `df_receipt` and `df_customer` on the common column `customer_id` with an inner join, saving the result to a new dataframe `df_tmp`.

Creates a new column `era` in `df_tmp`, calculated by dividing each customer's age by 10, rounding down, and multiplying by 10 to get the decade.

Calculates the sum of the `amount` column of `df_tmp` for each unique combination of `era` and `gender_cd` values using `pd.pivot_table()`, saving the result as a new dataframe `df_sales_summary`. The resulting dataframe has `era` values as the index and `gender_cd` values as the columns.

Renames the columns of `df_sales_summary` to be more descriptive: `era` first, followed by `male`, `female`, and `unknown`, which represent the sum of the `amount` column for each respective `gender_cd` value.
# Code example 2
df_tmp = pd.merge(df_receipt, df_customer, how='inner', on='customer_id')
df_tmp['era'] = np.floor(df_tmp['age'] / 10).astype(int) * 10
df_sales_summary = pd.pivot_table(df_tmp, index='era', columns='gender_cd',
                                  values='amount', aggfunc='sum').reset_index()
df_sales_summary.columns = ['era', 'male', 'female', 'unknown']
df_sales_summary
| | era | male | female | unknown |
|---|---|---|---|---|
| 0 | 10 | 1591.0 | 149836.0 | 4317.0 |
| 1 | 20 | 72940.0 | 1363724.0 | 44328.0 |
| 2 | 30 | 177322.0 | 693047.0 | 50441.0 |
| 3 | 40 | 19355.0 | 9320791.0 | 483512.0 |
| 4 | 50 | 54320.0 | 6685192.0 | 342923.0 |
| 5 | 60 | 272469.0 | 987741.0 | 71418.0 |
| 6 | 70 | 13435.0 | 29764.0 | 2427.0 |
| 7 | 80 | 46360.0 | 262923.0 | 5111.0 |
| 8 | 90 | NaN | 6260.0 | NaN |
Commentary :
This code performs the following operations:

It merges two dataframes, `df_receipt` and `df_customer`, on the common column 'customer_id' using an inner join, storing the result in a new variable called `df_tmp`.

It creates a new column 'era' in `df_tmp` by dividing the 'age' column by 10, taking the floor, and multiplying by 10 to get the decade. This column represents each customer's age group in decades.

It creates a pivot table called `df_sales_summary` using `pd.pivot_table()`, based on the `df_tmp` dataframe, with rows indexed by 'era', columns grouped by 'gender_cd', and values computed from 'amount' using the sum aggregation function. The resulting table shows the total sales amount for each gender in each age group.

It renames the columns in `df_sales_summary` to be more descriptive, with 'male', 'female', and 'unknown' representing the total sales amounts for each gender in each age group.
P-044: The sales summary data (df_sales_summary) created in 043 held sales by gender horizontally. From this data, convert gender to a vertical layout of three items: age, gender code and sales amount. The gender code should be "00" for male, "01" for female and "99" for unknown.
df_sales_summary.set_index('era'). \
    stack().reset_index().replace({'female': '01', 'male': '00', 'unknown': '99'}). \
    rename(columns={'level_1': 'gender_cd', 0: 'amount'})
| | era | gender_cd | amount |
|---|---|---|---|
| 0 | 10 | 00 | 1591.0 |
| 1 | 10 | 01 | 149836.0 |
| 2 | 10 | 99 | 4317.0 |
| 3 | 20 | 00 | 72940.0 |
| 4 | 20 | 01 | 1363724.0 |
| 5 | 20 | 99 | 44328.0 |
| 6 | 30 | 00 | 177322.0 |
| 7 | 30 | 01 | 693047.0 |
| 8 | 30 | 99 | 50441.0 |
| 9 | 40 | 00 | 19355.0 |
| 10 | 40 | 01 | 9320791.0 |
| 11 | 40 | 99 | 483512.0 |
| 12 | 50 | 00 | 54320.0 |
| 13 | 50 | 01 | 6685192.0 |
| 14 | 50 | 99 | 342923.0 |
| 15 | 60 | 00 | 272469.0 |
| 16 | 60 | 01 | 987741.0 |
| 17 | 60 | 99 | 71418.0 |
| 18 | 70 | 00 | 13435.0 |
| 19 | 70 | 01 | 29764.0 |
| 20 | 70 | 99 | 2427.0 |
| 21 | 80 | 00 | 46360.0 |
| 22 | 80 | 01 | 262923.0 |
| 23 | 80 | 99 | 5111.0 |
| 24 | 90 | 01 | 6260.0 |
Commentary :
This code transforms the pandas DataFrame `df_sales_summary` by setting the index to `era`, stacking the columns `female`, `male`, and `unknown` into rows, and resetting the index. It then replaces the gender labels with codes: `01` for `female`, `00` for `male`, and `99` for `unknown`. Finally, it renames the columns `level_1` to `gender_cd` and `0` to `amount`.

Here is a step-by-step explanation:

`df_sales_summary.set_index('era')`: sets the index of `df_sales_summary` to the column `era`, which makes it easier to reshape the data by `era`.

`.stack()`: stacks the columns `female`, `male`, and `unknown` into a single column, with a new index level added for the stacked column names.

`.reset_index()`: resets the index of the DataFrame to the default integer index.

`.replace({'female':'01','male':'00','unknown':'99'})`: replaces the gender labels with the corresponding codes.

`.rename(columns={'level_1':'gender_cd', 0: 'amount'})`: renames the columns `level_1` to `gender_cd` and `0` to `amount`. The `level_1` column was created by the `stack()` method and contains the original column names of the stacked columns.
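`DataFrame.melt` performs the same wide-to-long reshape as `set_index().stack()`; a sketch on two made-up rows:

```python
import pandas as pd

# Made-up slice of the wide summary (amounts are hypothetical).
df = pd.DataFrame({'era': [10, 20],
                   'male': [1591.0, 72940.0],
                   'female': [149836.0, 1363724.0]})

# melt turns each gender column into rows, then the labels are recoded.
out = (df.melt(id_vars='era', var_name='gender_cd', value_name='amount')
         .replace({'male': '00', 'female': '01'}))
print(out)
```

`melt` names the new columns directly, so the `rename` step is not needed.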
P-045: The date of birth (birth_day) in the customer data (df_customer) is held in date type. Convert this to a string in YYYYMMDD format and display 10 cases together with the customer ID (customer_id).
# You can convert to a YYYYMMDD format string by writing
# pd.to_datetime(df_customer['birth_day']).dt.strftime('%Y%m%d')
pd.concat([df_customer['customer_id'],
           pd.to_datetime(df_customer['birth_day']).dt.strftime('%Y%m%d')],
          axis=1).head(10)
| | customer_id | birth_day |
|---|---|---|
| 0 | CS021313000114 | 19810429 |
| 1 | CS037613000071 | 19520401 |
| 2 | CS031415000172 | 19761004 |
| 3 | CS028811000001 | 19330327 |
| 4 | CS001215000145 | 19950329 |
| 5 | CS020401000016 | 19740915 |
| 6 | CS015414000103 | 19770809 |
| 7 | CS029403000008 | 19730817 |
| 8 | CS015804000004 | 19310502 |
| 9 | CS033513000180 | 19620711 |
Commentary :
This code creates a new DataFrame by concatenating two columns of the existing `df_customer` DataFrame: the `customer_id` column and the `birth_day` column, which contains date information.

The `pd.to_datetime()` function is applied to the `birth_day` column to convert it to datetime format, and the `dt.strftime('%Y%m%d')` method then formats each datetime as a string in `'%Y%m%d'` form (i.e., year, month, day).

This produces a column holding a string representation of each customer's birth date. Finally, the `head()` method displays the first 10 rows of the new DataFrame.
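The `strftime` round trip can be checked on a single made-up date:

```python
import pandas as pd

# One made-up birth date, formatted back to a YYYYMMDD string.
s = pd.to_datetime(pd.Series(['1981-04-29']))
out = s.dt.strftime('%Y%m%d')
print(out.iloc[0])  # → 19810429
```

Note that `strftime` returns strings, so the result is no longer a datetime column.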
P-046: The application date (application_date) of the customer data (df_customer) holds data in a string type in YYYYMMDD format. Convert this to a date type and display 10 cases together with the customer ID (customer_id).
# astype('str') is needed because the CSV loads application_date as an integer;
# without it, pd.to_datetime would interpret the values as nanosecond timestamps.
pd.concat([df_customer['customer_id'],
           pd.to_datetime(df_customer['application_date'].astype('str'))],
          axis=1).head(10)

| | customer_id | application_date |
|---|---|---|
| 0 | CS021313000114 | 2015-09-05 |
| 1 | CS037613000071 | 2015-04-14 |
| 2 | CS031415000172 | 2015-05-29 |
| 3 | CS028811000001 | 2016-01-15 |
| 4 | CS001215000145 | 2017-06-05 |
| 5 | CS020401000016 | 2015-02-25 |
| 6 | CS015414000103 | 2015-07-22 |
| 7 | CS029403000008 | 2015-05-15 |
| 8 | CS015804000004 | 2015-06-07 |
| 9 | CS033513000180 | 2015-07-28 |
Commentary :
This code snippet concatenates two columns from the pandas DataFrame `df_customer`, `'customer_id'` and `'application_date'`, into a new DataFrame using the `pd.concat()` method.

The `pd.to_datetime()` method is used to convert the `'application_date'` column to datetime objects. This method is a convenient way to convert date strings in various formats to standard pandas datetimes.

The resulting DataFrame has two columns, `'customer_id'` and `'application_date'`, with the dates in datetime format.
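When the strings are known to follow a fixed YYYYMMDD layout, passing an explicit `format` to `pd.to_datetime` is both safer and faster than letting pandas guess; a sketch with made-up values:

```python
import pandas as pd

# Made-up YYYYMMDD application-date strings.
s = pd.Series(['20150905', '20150414'])

# format='%Y%m%d' parses each string deterministically.
out = pd.to_datetime(s, format='%Y%m%d')
print(out)
```

With a mismatched format, parsing raises an error instead of silently producing wrong dates.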
P-047: The sales date (sales_ymd) of the receipt details data (df_receipt) holds data in numeric type in YYYYMMDD format. Convert this to a date type and display 10 items together with the receipt number (receipt_no) and receipt sub-number (receipt_sub_no).
pd.concat([df_receipt[['receipt_no', 'receipt_sub_no']],
           pd.to_datetime(df_receipt['sales_ymd'].astype('str'))],
          axis=1).head(10)
| | receipt_no | receipt_sub_no | sales_ymd |
|---|---|---|---|
| 0 | 112 | 1 | 2018-11-03 |
| 1 | 1132 | 2 | 2018-11-18 |
| 2 | 1102 | 1 | 2017-07-12 |
| 3 | 1132 | 1 | 2019-02-05 |
| 4 | 1102 | 2 | 2018-08-21 |
| 5 | 1112 | 1 | 2019-06-05 |
| 6 | 1102 | 2 | 2018-12-05 |
| 7 | 1102 | 1 | 2019-09-22 |
| 8 | 1112 | 2 | 2017-05-04 |
| 9 | 1102 | 1 | 2019-10-10 |
Commentary :
This code uses the `pd.concat` function to combine two pieces of data:

`df_receipt[['receipt_no', 'receipt_sub_no']]`: selects the columns `receipt_no` and `receipt_sub_no` from the `df_receipt` dataframe.

`pd.to_datetime(df_receipt['sales_ymd'].astype('str'))`: converts the `sales_ymd` column of the `df_receipt` dataframe to datetime format, after first casting the numeric values to strings.

The `axis=1` argument in `pd.concat` indicates that they should be concatenated horizontally, column-wise. The resulting dataframe has the columns `receipt_no`, `receipt_sub_no`, and `sales_ymd` (in datetime format), and the first 10 rows are displayed using the `.head(10)` method.
P-048: The sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) holds data in UNIX seconds of numeric type. Convert it to date type and display 10 cases together with receipt number (receipt_no) and receipt sub-number (receipt_sub_no).
pd.concat([df_receipt[['receipt_no', 'receipt_sub_no']],
           pd.to_datetime(df_receipt['sales_epoch'], unit='s').rename('sales_ymd')],
          axis=1).head(10)
| | receipt_no | receipt_sub_no | sales_ymd |
|---|---|---|---|
| 0 | 112 | 1 | 2018-11-03 |
| 1 | 1132 | 2 | 2018-11-18 |
| 2 | 1102 | 1 | 2017-07-12 |
| 3 | 1132 | 1 | 2019-02-05 |
| 4 | 1102 | 2 | 2018-08-21 |
| 5 | 1112 | 1 | 2019-06-05 |
| 6 | 1102 | 2 | 2018-12-05 |
| 7 | 1102 | 1 | 2019-09-22 |
| 8 | 1112 | 2 | 2017-05-04 |
| 9 | 1102 | 1 | 2019-10-10 |
Commentary :
This code performs the following operations:

It selects two columns from the `df_receipt` DataFrame: `receipt_no` and `receipt_sub_no`.

It converts the `sales_epoch` column to datetime format using the `to_datetime()` function of the pandas library. The `unit` parameter is set to `'s'` to indicate that the input is in seconds.

The resulting datetime column is renamed to `'sales_ymd'` using the `rename()` function.

Finally, the two pieces are concatenated along the columns axis (`axis=1`) using the `concat()` function.

The resulting DataFrame has three columns: `receipt_no`, `receipt_sub_no`, and `sales_ymd`, where `sales_ymd` is the sales date and time in datetime format. The `head(10)` function displays the first 10 rows.
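The effect of `unit='s'` can be checked against the Unix epoch itself:

```python
import pandas as pd

# 0 seconds is the epoch; 86400 seconds is exactly one day later.
s = pd.Series([0, 86400])
out = pd.to_datetime(s, unit='s')
print(out)
```

Without `unit='s'`, pandas would treat these integers as nanoseconds, which is the bug to watch for with epoch columns.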
P-049: Convert the sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) to date type, extract only the “年(year)” and display 10 items with receipt number (receipt_no) and receipt sub-number (receipt_sub_no).
pd.concat([df_receipt[['receipt_no', 'receipt_sub_no']],
           pd.to_datetime(df_receipt['sales_epoch'],
                          unit='s').dt.year.rename('sales_year')],
          axis=1).head(10)
| | receipt_no | receipt_sub_no | sales_year |
|---|---|---|---|
| 0 | 112 | 1 | 2018 |
| 1 | 1132 | 2 | 2018 |
| 2 | 1102 | 1 | 2017 |
| 3 | 1132 | 1 | 2019 |
| 4 | 1102 | 2 | 2018 |
| 5 | 1112 | 1 | 2019 |
| 6 | 1102 | 2 | 2018 |
| 7 | 1102 | 1 | 2019 |
| 8 | 1112 | 2 | 2017 |
| 9 | 1102 | 1 | 2019 |
Commentary :
This code creates a new DataFrame by concatenating the columns `receipt_no` and `receipt_sub_no` from `df_receipt` with the year extracted from the `sales_epoch` column, which is converted to datetimes using the `to_datetime` method with the `unit` parameter set to `s`, because the values in `sales_epoch` are Unix timestamps (i.e., the number of seconds since January 1, 1970).

Here's a breakdown of the code:

`pd.concat`: concatenates the two pieces along the columns axis.

`df_receipt[['receipt_no', 'receipt_sub_no']]`: selects the columns `receipt_no` and `receipt_sub_no` from the `df_receipt` DataFrame.

`pd.to_datetime(df_receipt['sales_epoch'], unit='s')`: converts the values in the `sales_epoch` column to datetime objects.

`dt.year`: extracts the year component of each datetime.

`rename('sales_year')`: renames the resulting column to `sales_year`.

`axis=1`: specifies that the concatenation should be performed along the columns axis.

`head(10)`: returns the first 10 rows of the resulting DataFrame.
P-050: Convert the sales epoch seconds (sales_epoch) of the receipt details data (df_receipt) to date type, extract only the “月(month)” and display 10 items with receipt number (receipt_no) and receipt sub number (receipt_sub_no). Note that the “月(month)” should be retrieved with two digits filled with zero.
# dt.month can also be used to get the month, but here strftime is used to
# retrieve the month as two digits with a leading zero.
df_datetime = pd.to_datetime(df_receipt['sales_epoch'],
                             unit='s').rename('sales_month')
pd.concat([df_receipt[['receipt_no', 'receipt_sub_no']],
           df_datetime.dt.strftime('%m')], axis=1).head(10)
| | receipt_no | receipt_sub_no | sales_month |
|---|---|---|---|
| 0 | 112 | 1 | 11 |
| 1 | 1132 | 2 | 11 |
| 2 | 1102 | 1 | 07 |
| 3 | 1132 | 1 | 02 |
| 4 | 1102 | 2 | 08 |
| 5 | 1112 | 1 | 06 |
| 6 | 1102 | 2 | 12 |
| 7 | 1102 | 1 | 09 |
| 8 | 1112 | 2 | 05 |
| 9 | 1102 | 1 | 10 |
Commentary :
The given code performs the following operations:
Extracts the 'sales_epoch' column from the 'df_receipt' dataframe.
Converts the 'sales_epoch' values into datetime format using the 'to_datetime' function with the 'unit' parameter set to 's' for seconds.
Renames the resultant series to 'sales_month' using the 'rename' function.
Extracts the month value from the 'sales_month' column using the 'strftime' function with the '%m' format.
Concatenates the resultant series with the 'receipt_no' and 'receipt_sub_no' columns of the 'df_receipt' dataframe using the 'concat' function.
The resulting dataframe will contain the 'receipt_no', 'receipt_sub_no', and 'sales_month' columns where 'sales_month' will have the month values extracted from the 'sales_epoch' column.
P-051: Convert the sales epoch seconds of the receipt details data (df_receipt) to date type, extract only the “日(day)” and display 10 items with receipt number (receipt_no) and receipt sub-number (receipt_sub_no). Note that the “日(day)” should be extracted with two digits filled with zero.
# dt.day can also be used to get the day, but here strftime is used to
# retrieve the day as two digits with a leading zero.
df_datetime = pd.to_datetime(df_receipt['sales_epoch'],
                             unit='s').rename('sales_day')
pd.concat([df_receipt[['receipt_no', 'receipt_sub_no']],
           df_datetime.dt.strftime('%d')], axis=1).head(10)
| | receipt_no | receipt_sub_no | sales_day |
|---|---|---|---|
| 0 | 112 | 1 | 03 |
| 1 | 1132 | 2 | 18 |
| 2 | 1102 | 1 | 12 |
| 3 | 1132 | 1 | 05 |
| 4 | 1102 | 2 | 21 |
| 5 | 1112 | 1 | 05 |
| 6 | 1102 | 2 | 05 |
| 7 | 1102 | 1 | 22 |
| 8 | 1112 | 2 | 04 |
| 9 | 1102 | 1 | 10 |
Commentary :
This code takes the `sales_epoch` column from the `df_receipt` DataFrame, which contains the number of seconds since January 1, 1970 (also known as Unix time), and converts it into a pandas datetime series using `pd.to_datetime()`. The resulting series is renamed `sales_day`.

The `strftime()` method is then applied to extract the day component of each date as two digits with a leading zero, denoted by `%d`. The resulting strings are concatenated with the `receipt_no` and `receipt_sub_no` columns using `pd.concat()`, resulting in a new DataFrame with the three columns.

Finally, the `head()` method is called on the resulting DataFrame to display the first 10 rows. This code essentially extracts the day of the month from the `sales_epoch` column of `df_receipt` and combines it with the receipt numbers to produce a new DataFrame.
P-052: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), binarise the total sales amount to 0 for amounts of 2,000 yen or less and 1 for amounts over 2,000 yen, and display 10 items together with the customer ID and the total sales amount. However, exclude customer IDs starting with "Z", as they represent non-members.
# Code example 1
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
                                   engine='python')
df_sales_amount = df_sales_amount[['customer_id', 'amount']]. \
    groupby('customer_id').sum().reset_index()
df_sales_amount['sales_flg'] = df_sales_amount['amount']. \
    apply(lambda x: 1 if x > 2000 else 0)
df_sales_amount.head(10)
| | customer_id | amount | sales_flg |
|---|---|---|---|
| 0 | CS001113000004 | 1298 | 0 |
| 1 | CS001114000005 | 626 | 0 |
| 2 | CS001115000010 | 3044 | 1 |
| 3 | CS001205000004 | 1988 | 0 |
| 4 | CS001205000006 | 3337 | 1 |
| 5 | CS001211000025 | 456 | 0 |
| 6 | CS001212000027 | 448 | 0 |
| 7 | CS001212000031 | 296 | 0 |
| 8 | CS001212000046 | 228 | 0 |
| 9 | CS001212000070 | 456 | 0 |
Commentary :
The code performs the following operations:

Selects rows from `df_receipt` where `customer_id` does not start with the letter 'Z' using the `query` method, and assigns the result to `df_sales_amount`.

Selects the `customer_id` and `amount` columns from `df_sales_amount` and calculates the sum of `amount` for each `customer_id` using the `groupby` method. The result is assigned back to `df_sales_amount`.

Adds a new column `sales_flg` to `df_sales_amount`, which contains 1 if `amount` is greater than 2000 and 0 otherwise, using the `apply` method with a lambda function.

Displays the first 10 rows of the resulting DataFrame `df_sales_amount`.

So, the resulting DataFrame `df_sales_amount` contains the `customer_id`, the total `amount` spent by each customer, and a binary flag `sales_flg` indicating whether the customer's spending is high (greater than 2000) or not.
# Code example 2 (use of np.where)
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
                                   engine='python')
df_sales_amount = df_sales_amount[['customer_id', 'amount']]. \
    groupby('customer_id').sum().reset_index()
df_sales_amount['sales_flg'] = np.where(df_sales_amount['amount'] > 2000, 1, 0)
df_sales_amount.head(10)
| | customer_id | amount | sales_flg |
|---|---|---|---|
| 0 | CS001113000004 | 1298 | 0 |
| 1 | CS001114000005 | 626 | 0 |
| 2 | CS001115000010 | 3044 | 1 |
| 3 | CS001205000004 | 1988 | 0 |
| 4 | CS001205000006 | 3337 | 1 |
| 5 | CS001211000025 | 456 | 0 |
| 6 | CS001212000027 | 448 | 0 |
| 7 | CS001212000031 | 296 | 0 |
| 8 | CS001212000046 | 228 | 0 |
| 9 | CS001212000070 | 456 | 0 |
Commentary :
This code generates a summary table of customer sales information. Here's what each line does:

`df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python')`: filters the receipt data to exclude customers whose IDs start with "Z". The `query` method applies the filter, and `engine='python'` is required because string methods such as `str.startswith` are not supported by the default numexpr engine. The filtered data is assigned to a new dataframe called `df_sales_amount`.

`df_sales_amount = df_sales_amount[['customer_id', 'amount']].groupby('customer_id').sum().reset_index()`: groups the `df_sales_amount` dataframe by `customer_id`, calculates the sum of the `amount` column for each group, and resets the index to make `customer_id` a column again. The result is a dataframe showing the total sales amount for each customer.

`df_sales_amount['sales_flg'] = np.where(df_sales_amount['amount'] > 2000, 1, 0)`: adds a new column called `sales_flg` to the `df_sales_amount` dataframe. Its values are set based on whether the customer's total `amount` is greater than 2000. The `np.where` function applies the condition and assigns either 1 or 0 to the new column.

`df_sales_amount.head(10)`: displays the first 10 rows of the `df_sales_amount` dataframe. The resulting table shows the `customer_id`, the total `amount` of sales, and a `sales_flg` of 1 (sales amount greater than 2000) or 0 (less than or equal to 2000).
P-053: Binarise the postcode (postal_cd) of the customer data (df_customer) to 1 for Tokyo (the first three digits are 100 to 209) and 0 for all others. Further, combine it with the receipt details data (df_receipt) and count the number of customers with sales results for the entire period, for each binary value created.
# Code example 1
df_tmp = df_customer[['customer_id', 'postal_cd']].copy()
df_tmp['postal_flg'] = df_tmp['postal_cd']. \
apply(lambda x: 1 if 100 <= int(x[0:3]) <= 209 else 0)
pd.merge(df_tmp, df_receipt, how='inner', on='customer_id'). \
groupby('postal_flg').agg({'customer_id':'nunique'})
customer_id | |
---|---|
postal_flg | |
0 | 3906 |
1 | 4400 |
Commentary :
This code performs the following tasks:
Creates a copy of the "customer_id" and "postal_cd" columns from the "df_customer" DataFrame and assigns it to a new DataFrame called "df_tmp".
Creates a new column "postal_flg" in "df_tmp" based on the first 3 digits of the "postal_cd" column. If the first 3 digits are between 100 and 209 (inclusive), "postal_flg" is set to 1, otherwise it is set to 0.
Merges "df_tmp" and "df_receipt" on the "customer_id" column using an inner join.
Groups the resulting DataFrame by the "postal_flg" column and counts the number of unique customer IDs in each group using the "nunique()" function.
Returns the resulting DataFrame showing the count of unique customer IDs grouped by the "postal_flg" column.
In summary, this code is used to analyze the number of customers living in specific postal code regions based on their purchase history.
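The effect of the inner join plus `nunique` can be seen on a toy pair of frames (hypothetical IDs and values): a customer with several receipts is counted once, and a customer with no receipts is dropped by the join.

```python
import pandas as pd

# Toy frames (hypothetical values).
cust = pd.DataFrame({'customer_id': ['A', 'B', 'C'], 'postal_flg': [1, 1, 0]})
rcpt = pd.DataFrame({'customer_id': ['A', 'A', 'B'], 'amount': [10, 20, 30]})

# Inner join keeps only customers with sales; nunique deduplicates receipts.
out = pd.merge(cust, rcpt, how='inner', on='customer_id') \
        .groupby('postal_flg').agg({'customer_id': 'nunique'})
print(out['customer_id'].tolist())  # [2]  (customer C has no sales rows)
```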
# Code example 2 (np.where, use of between)
df_tmp = df_customer[['customer_id', 'postal_cd']].copy()
df_tmp['postal_flg'] = np.where(df_tmp['postal_cd'].str[0:3].astype(int)
.between(100, 209), 1, 0)
pd.merge(df_tmp, df_receipt, how='inner', on='customer_id'). \
groupby('postal_flg').agg({'customer_id':'nunique'})
customer_id | |
---|---|
postal_flg | |
0 | 3906 |
1 | 4400 |
Commentary :
This code creates a new data frame `df_tmp` with a `postal_flg` column that indicates whether the customer's postal code is within a certain range. Here is the step-by-step explanation:

`df_tmp = df_customer[['customer_id', 'postal_cd']].copy()` creates a new data frame `df_tmp` containing only the `customer_id` and `postal_cd` columns from `df_customer`.

`df_tmp['postal_flg'] = np.where(df_tmp['postal_cd'].str[0:3].astype(int).between(100, 209), 1, 0)` creates a new column `postal_flg` in `df_tmp` using the `np.where` function. `str[0:3]` extracts the first three characters of the postal code as a string, and `astype(int)` converts them to an integer. The `between` function checks whether the integer falls within the range 100 to 209 (inclusive). If it does, 1 is assigned to `postal_flg`, otherwise 0.

`pd.merge(df_tmp, df_receipt, how='inner', on='customer_id')` merges `df_tmp` with `df_receipt` on the `customer_id` column, keeping only the rows where the customer appears in both data frames.

`groupby('postal_flg').agg({'customer_id':'nunique'})` groups the merged data frame by `postal_flg` and counts the number of unique `customer_id` values in each group using the `nunique` function. This returns a data frame with one row for each `postal_flg` value, showing the number of customers in each group.
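The slicing/conversion/range-test chain is easy to verify in isolation. A minimal sketch with hypothetical postal codes:

```python
import pandas as pd
import numpy as np

# Hypothetical postal codes. str[0:3] slices the first three characters,
# astype(int) converts them, and between(100, 209) tests the inclusive range.
s = pd.Series(['100-0001', '210-0001', '064-0941'])
flg = np.where(s.str[0:3].astype(int).between(100, 209), 1, 0)
print(flg.tolist())  # [1, 0, 0]
```

Note that `int('064')` is 64, so leading zeros drop out of the comparison, which is exactly what the 100-209 range test needs.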
P-054: The address of the customer data (df_customer) is one of 埼玉県(Saitama-prefecture), 千葉県(Chiba-prefecture), 東京都(Tokyo-Metropolitan) or 神奈川県(Kanagawa-prefecture). Create a code value for each prefecture, and display 10 cases together with the customer ID and address. The values should be 11 for 埼玉県(Saitama-prefecture), 12 for 千葉県(Chiba-prefecture), 13 for 東京都(Tokyo-Metropolitan) and 14 for 神奈川県(Kanagawa-prefecture).
# Code example 1 (fixed-position substring extraction)
df_customer_tmp = df_customer[['customer_id', 'address']].copy()
df_customer_tmp['prefecture_cd'] = \
df_customer['address'].str[0:3].map({'埼玉県': '11',
'千葉県':'12',
'東京都':'13',
'神奈川':'14'})
df_customer_tmp.head(10)
customer_id | address | prefecture_cd | |
---|---|---|---|
0 | CS021313000114 | 神奈川県伊勢原市粟窪********** | 14 |
1 | CS037613000071 | 東京都江東区南砂********** | 13 |
2 | CS031415000172 | 東京都渋谷区代々木********** | 13 |
3 | CS028811000001 | 神奈川県横浜市泉区和泉町********** | 14 |
4 | CS001215000145 | 東京都大田区仲六郷********** | 13 |
5 | CS020401000016 | 東京都板橋区若木********** | 13 |
6 | CS015414000103 | 東京都江東区北砂********** | 13 |
7 | CS029403000008 | 千葉県浦安市海楽********** | 12 |
8 | CS015804000004 | 東京都江東区北砂********** | 13 |
9 | CS033513000180 | 神奈川県横浜市旭区善部町********** | 14 |
Commentary :
This code extracts the prefecture from the address column of the `df_customer` DataFrame and creates a new DataFrame `df_customer_tmp` with columns `customer_id`, `address`, and `prefecture_cd`.

The code first copies the `customer_id` and `address` columns of the `df_customer` DataFrame using the `copy()` method and assigns the result to the `df_customer_tmp` variable.

`df_customer_tmp = df_customer[['customer_id', 'address']].copy()`

It then extracts the first 3 characters of the `address` column using the `str` accessor and assigns the result to a new column `prefecture_cd` via the `map()` method. Note that the dictionary key for Kanagawa is '神奈川' (without 県) because only the first three characters are compared.

`df_customer_tmp['prefecture_cd'] = df_customer['address'].str[0:3].map({'埼玉県': '11', '千葉県':'12', '東京都':'13', '神奈川':'14'})`

The `map()` method maps each prefecture name to its corresponding prefecture code. The resulting DataFrame `df_customer_tmp` has columns `customer_id`, `address`, and `prefecture_cd`.
# Code example 2 (using regular expressions)
df_customer_tmp = df_customer[['customer_id', 'address']].copy()
df_customer_tmp['prefecture_cd'] = \
df_customer['address'].str.extract(r'(^.*?[都道府県])')[0].\
map({'埼玉県': '11',
'千葉県':'12',
'東京都':'13',
'神奈川県':'14'})
df_customer_tmp.head(10)
customer_id | address | prefecture_cd | |
---|---|---|---|
0 | CS021313000114 | 神奈川県伊勢原市粟窪********** | 14 |
1 | CS037613000071 | 東京都江東区南砂********** | 13 |
2 | CS031415000172 | 東京都渋谷区代々木********** | 13 |
3 | CS028811000001 | 神奈川県横浜市泉区和泉町********** | 14 |
4 | CS001215000145 | 東京都大田区仲六郷********** | 13 |
5 | CS020401000016 | 東京都板橋区若木********** | 13 |
6 | CS015414000103 | 東京都江東区北砂********** | 13 |
7 | CS029403000008 | 千葉県浦安市海楽********** | 12 |
8 | CS015804000004 | 東京都江東区北砂********** | 13 |
9 | CS033513000180 | 神奈川県横浜市旭区善部町********** | 14 |
Commentary :
This code creates a new DataFrame `df_customer_tmp` by selecting the columns `customer_id` and `address` from the existing DataFrame `df_customer`.

Next, it extracts the prefecture name from the `address` column using a regular expression. Specifically, the `str.extract()` method extracts the substring matching the pattern `(^.*?[都道府県])`, which means "match any characters from the start of the string up to and including the first occurrence of 都, 道, 府 or 県".

Then, it maps the extracted prefecture names to the corresponding prefecture codes and stores the result in a new column named `prefecture_cd` in the `df_customer_tmp` DataFrame. The mapping is done with the `map()` method and a dictionary whose keys are prefecture names and whose values are the codes.

The resulting DataFrame `df_customer_tmp` has three columns: `customer_id`, `address`, and `prefecture_cd`, where `prefecture_cd` is a two-digit string representing the prefecture code. The `head(10)` method displays the first 10 rows.
P-055: Sum the sales amount (amount) of the receipt details (df_receipt) data for each customer ID (customer_id) and find the quartile points of the total amount. Then, create category values for each customer's total sales amount based on the following criteria, and display 10 results together with the customer ID and the total sales amount. The category values shall be 1-4 in order.
- From the minimum up to, but not including, the first quartile – assign 1
- From the first quartile up to, but not including, the second quartile – assign 2
- From the second quartile up to, but not including, the third quartile – assign 3
- Third quartile and above – assign 4
# Code example 1
df_sales_amount = df_receipt[['customer_id', 'amount']]. \
groupby('customer_id').sum().reset_index()
pct25 = np.quantile(df_sales_amount['amount'], 0.25)
pct50 = np.quantile(df_sales_amount['amount'], 0.5)
pct75 = np.quantile(df_sales_amount['amount'], 0.75)
def pct_group(x):
if x < pct25:
return 1
elif pct25 <= x < pct50:
return 2
elif pct50 <= x < pct75:
return 3
elif pct75 <= x:
return 4
df_sales_amount['pct_group'] = df_sales_amount['amount'].apply(pct_group)
df_sales_amount.head(10)
customer_id | amount | pct_group | |
---|---|---|---|
0 | CS001113000004 | 1298 | 2 |
1 | CS001114000005 | 626 | 2 |
2 | CS001115000010 | 3044 | 3 |
3 | CS001205000004 | 1988 | 3 |
4 | CS001205000006 | 3337 | 3 |
5 | CS001211000025 | 456 | 1 |
6 | CS001212000027 | 448 | 1 |
7 | CS001212000031 | 296 | 1 |
8 | CS001212000046 | 228 | 1 |
9 | CS001212000070 | 456 | 1 |
Commentary :
This code computes quartiles of the total sales amount per customer, and then assigns each customer to one of four groups based on which quartile their total falls into.

Here is a step-by-step breakdown of the code:

`df_sales_amount = df_receipt[['customer_id', 'amount']].groupby('customer_id').sum().reset_index()` computes the total sales amount for each customer by grouping the `df_receipt` dataframe by `customer_id` and summing the `amount` column. The resulting dataframe has two columns: `customer_id` and `amount`.

`pct25 = np.quantile(df_sales_amount['amount'], 0.25)`, `pct50 = np.quantile(df_sales_amount['amount'], 0.5)`, and `pct75 = np.quantile(df_sales_amount['amount'], 0.75)` compute the 25th, 50th, and 75th percentiles of the sales amount distribution.

`def pct_group(x): ...` defines a function `pct_group` that takes a sales amount value `x` and returns an integer from 1 to 4 representing the quartile that `x` belongs to.

`df_sales_amount['pct_group'] = df_sales_amount['amount'].apply(pct_group)` applies the `pct_group` function to each value in the `amount` column of `df_sales_amount` and assigns the resulting quartile number to a new column called `pct_group`.

`df_sales_amount.head(10)` displays the first 10 rows of the resulting dataframe, which now has three columns: `customer_id`, `amount`, and `pct_group`.
# code for confirmation
print('pct25:', pct25)
print('pct50:', pct50)
print('pct75:', pct75)
pct25: 548.5
pct50: 1478.0
pct75: 3651.0
# Code example 2 (example using cut, with quartile range additionally displayed for reference)
df_temp = df_receipt[['customer_id', 'amount']]. \
groupby('customer_id').sum().reset_index()
pct25 = np.quantile(df_temp['amount'], 0.25)
pct50 = np.quantile(df_temp['amount'], 0.5)
pct75 = np.quantile(df_temp['amount'], 0.75)
pct_max = df_temp['amount'].max()
df_temp['quantile'] = pd.cut(df_temp['amount'],
                             [0.0, pct25, pct50, pct75, pct_max + 0.1],
                             right=False)
df_temp['pct_group'] = df_temp.groupby('quantile').ngroup() + 1
df_temp.head(10)
customer_id | amount | quantile | pct_group | |
---|---|---|---|---|
0 | CS001113000004 | 1298 | [548.5, 1478.0) | 2 |
1 | CS001114000005 | 626 | [548.5, 1478.0) | 2 |
2 | CS001115000010 | 3044 | [1478.0, 3651.0) | 3 |
3 | CS001205000004 | 1988 | [1478.0, 3651.0) | 3 |
4 | CS001205000006 | 3337 | [1478.0, 3651.0) | 3 |
5 | CS001211000025 | 456 | [0.0, 548.5) | 1 |
6 | CS001212000027 | 448 | [0.0, 548.5) | 1 |
7 | CS001212000031 | 296 | [0.0, 548.5) | 1 |
8 | CS001212000046 | 228 | [0.0, 548.5) | 1 |
9 | CS001212000070 | 456 | [0.0, 548.5) | 1 |
Commentary :
This code creates a new column 'pct_group' in a DataFrame 'df_temp' that contains the total sales amount of each customer and the quartile group they belong to based on that total.

The first two lines group the 'df_receipt' DataFrame by customer_id, calculate the sum of the sales amount for each customer, and reset the index to make customer_id a column again.

The next four lines calculate the 25th, 50th and 75th percentiles and the maximum value of the per-customer sales totals.

The next line creates a new column 'quantile' in 'df_temp' by cutting the sales amount into four bins at the computed percentile values, with right=False so each bin is left-closed and right-open.

Finally, the last line creates a new column 'pct_group' in 'df_temp' that groups the data by 'quantile' and assigns a group number (1-4) to each customer with ngroup().

Note that the right endpoint of the last bin is defined as pct_max + 0.1 to ensure that the maximum sales amount is included in the last group.
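The `right=False` edge behaviour is the key detail here. A minimal sketch on a toy series:

```python
import pandas as pd

# right=False makes each bin left-closed / right-open, e.g. [0, 10),
# so a value sitting exactly on an edge falls into the *upper* bin.
s = pd.Series([0, 9, 10, 19, 20])
cats = pd.cut(s, bins=[0, 10, 20, 30], right=False)

# Bin membership as category codes: 0 and 9 share the first bin,
# while the edge values 10 and 20 open the next bins.
print(cats.cat.codes.tolist())  # [0, 0, 1, 1, 2]
```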
# Reference code (example using qcut, not intended to meet the subject matter as the inclusion/exclusion of boundary values is reversed, but included for reference only).
df_temp = df_receipt.groupby('customer_id')[['amount']].sum()
df_temp['quantile'], bins = \
pd.qcut(df_receipt.groupby('customer_id')['amount'].sum(), 4, retbins=True)
df_temp['pct_group'] = df_temp.groupby('quantile').ngroup() + 1
df_temp.reset_index(inplace=True)
display(df_temp.head(10))
print('quantiles:', bins)
customer_id | amount | quantile | pct_group | |
---|---|---|---|---|
0 | CS001113000004 | 1298 | (548.5, 1478.0] | 2 |
1 | CS001114000005 | 626 | (548.5, 1478.0] | 2 |
2 | CS001115000010 | 3044 | (1478.0, 3651.0] | 3 |
3 | CS001205000004 | 1988 | (1478.0, 3651.0] | 3 |
4 | CS001205000006 | 3337 | (1478.0, 3651.0] | 3 |
5 | CS001211000025 | 456 | (69.999, 548.5] | 1 |
6 | CS001212000027 | 448 | (69.999, 548.5] | 1 |
7 | CS001212000031 | 296 | (69.999, 548.5] | 1 |
8 | CS001212000046 | 228 | (69.999, 548.5] | 1 |
9 | CS001212000070 | 456 | (69.999, 548.5] | 1 |
quantiles: [7.0000000e+01 5.4850000e+02 1.4780000e+03 3.6510000e+03 1.2395003e+07]
Commentary :
This code performs the following operations:

It groups the `df_receipt` dataframe by `customer_id` and calculates the sum of `amount` for each group using the `sum()` method of the grouped object. The result is stored in a new dataframe called `df_temp`.

The `qcut()` function divides the `amount` values into 4 equal-sized bins based on their values. The `retbins=True` option returns the edges of the bins in the `bins` variable.

A new column called `quantile` is created in `df_temp` that indicates which bin each `amount` value falls into. Note that `qcut` produces right-closed intervals, which is why the boundary handling is the reverse of what the exercise asks for.

The `ngroup()` method assigns a group number to each bin in `df_temp['quantile']`. The `+ 1` is added because `ngroup()` returns 0-indexed values.

The `reset_index()` method is called on `df_temp` to reset the index to a new range index.

Finally, the first 10 rows of `df_temp` and the edges of the bins are displayed using the `display()` and `print()` functions, respectively.
P-056: Calculate the age in 10-year age increments based on the age (age) of the customer data (df_customer) and display 10 cases together with the customer ID (customer_id) and date of birth (birth_day). However, all age groups above 60 years should be assumed to be in the 60s. The name of the category representing the age is optional.
# Code example 1
df_customer_era = df_customer[['customer_id', 'birth_day']].copy()
df_customer_era['era'] = df_customer['age']. \
apply(lambda x: min(math.floor(x / 10) * 10, 60))
df_customer_era.head(10)
customer_id | birth_day | era | |
---|---|---|---|
0 | CS021313000114 | 1981-04-29 | 30 |
1 | CS037613000071 | 1952-04-01 | 60 |
2 | CS031415000172 | 1976-10-04 | 40 |
3 | CS028811000001 | 1933-03-27 | 60 |
4 | CS001215000145 | 1995-03-29 | 20 |
5 | CS020401000016 | 1974-09-15 | 40 |
6 | CS015414000103 | 1977-08-09 | 40 |
7 | CS029403000008 | 1973-08-17 | 40 |
8 | CS015804000004 | 1931-05-02 | 60 |
9 | CS033513000180 | 1962-07-11 | 50 |
Commentary :
This code creates a new DataFrame `df_customer_era` containing the columns `customer_id`, `birth_day`, and `era`.

The `customer_id` and `birth_day` columns are taken from the original `df_customer` DataFrame.

The `era` column is computed from the `age` column in the `df_customer` DataFrame. For each value of `age`, the code divides by 10, takes the floor, and multiplies by 10 again to get the decade. If the result is 60 or greater, the era is capped at 60.

The `apply` method applies a function (in this case, a lambda) to each element of the `age` column, and the `math.floor` and `min` functions compute the capped decade. The resulting value is assigned to the `era` column in `df_customer_era`.

Finally, the `head` method is called to display the first 10 rows of the resulting DataFrame.
# Code example 2 (using cut; categories are output as ranges)
df_customer_era = df_customer[['customer_id', 'birth_day']].copy()
df_customer_era['era'] = pd.cut(df_customer['age'],
bins=[0, 10, 20, 30, 40, 50, 60, np.inf],
right=False)
df_customer_era[['customer_id', 'birth_day', 'era']].head(10)
customer_id | birth_day | era | |
---|---|---|---|
0 | CS021313000114 | 1981-04-29 | [30.0, 40.0) |
1 | CS037613000071 | 1952-04-01 | [60.0, inf) |
2 | CS031415000172 | 1976-10-04 | [40.0, 50.0) |
3 | CS028811000001 | 1933-03-27 | [60.0, inf) |
4 | CS001215000145 | 1995-03-29 | [20.0, 30.0) |
5 | CS020401000016 | 1974-09-15 | [40.0, 50.0) |
6 | CS015414000103 | 1977-08-09 | [40.0, 50.0) |
7 | CS029403000008 | 1973-08-17 | [40.0, 50.0) |
8 | CS015804000004 | 1931-05-02 | [60.0, inf) |
9 | CS033513000180 | 1962-07-11 | [50.0, 60.0) |
Commentary :
This code creates a new DataFrame `df_customer_era` with columns `customer_id`, `birth_day`, and `era`. The `customer_id` and `birth_day` columns are copied over from the original `df_customer` DataFrame.

The `era` column is created by binning the customers' ages into age ranges using `pd.cut()`. The `bins` parameter specifies the edges of each bin, where `np.inf` specifies an open-ended bin for customers who are 60 or older. The `right` parameter is set to `False` so that the intervals are left-inclusive, i.e., a customer whose age is exactly 30 is placed in the `[30, 40)` bin.

For each row, the customer's age is binned into the corresponding era. The resulting DataFrame contains the `customer_id`, `birth_day`, and `era` columns, with `era` holding the age-range bin for each customer.
P-057: Using the extraction results of 056 and the gender code (gender_cd), create new category data representing the combination of gender x age and display 10 cases. The value of the category representing the combination shall be arbitrary.
# Generate a sex and age code by concatenating a one-digit sex code and a two-digit age code.
df_customer_era = df_customer[['customer_id', 'birth_day']].copy()
df_customer_era['era'] = df_customer['age']. \
apply(lambda x: min(math.floor(x / 10) * 10, 60))
df_customer_era['gender_era'] = \
df_customer['gender_cd'] + df_customer_era['era'].astype('str').str.zfill(2)
df_customer_era.head(10)
customer_id | birth_day | era | gender_era | |
---|---|---|---|---|
0 | CS021313000114 | 1981-04-29 | 30 | 130 |
1 | CS037613000071 | 1952-04-01 | 60 | 960 |
2 | CS031415000172 | 1976-10-04 | 40 | 140 |
3 | CS028811000001 | 1933-03-27 | 60 | 160 |
4 | CS001215000145 | 1995-03-29 | 20 | 120 |
5 | CS020401000016 | 1974-09-15 | 40 | 040 |
6 | CS015414000103 | 1977-08-09 | 40 | 140 |
7 | CS029403000008 | 1973-08-17 | 40 | 040 |
8 | CS015804000004 | 1931-05-02 | 60 | 060 |
9 | CS033513000180 | 1962-07-11 | 50 | 150 |
Commentary :
This code creates a new column "era" in the DataFrame "df_customer_era", which indicates the decade of a customer's age. The decade is obtained by dividing the age by 10, taking the floor, and multiplying by 10 again; any result above 60 is capped at 60.
Then, the code creates a new column "gender_era" by concatenating the gender code and the "era" column values. The "gender_cd" column contains gender codes (e.g., "0" for male, "1" for female), and the "era" column values are cast as strings and zero-padded to a width of 2.
Finally, the code displays the first 10 rows of the resulting DataFrame.
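The concatenation step can be checked with a few hypothetical codes. `zfill(2)` left-pads the era string with zeros so every combined code has a fixed one-digit-plus-two-digit layout:

```python
import pandas as pd

# Hypothetical gender codes and eras.
gender_cd = pd.Series(['1', '9', '0'])
era = pd.Series([30, 60, 40])

# String concatenation: 1-digit gender code + zero-padded 2-digit era.
gender_era = gender_cd + era.astype('str').str.zfill(2)
print(gender_era.tolist())  # ['130', '960', '040']
```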
P-058: Create a dummy variable for the gender code (gender_cd) of the customer data (df_customer) and display 10 cases together with the customer ID (customer_id).
# Code example 1 (all code values itemised)
pd.get_dummies(df_customer[['customer_id', 'gender_cd']],
columns=['gender_cd']).head(10)
customer_id | gender_cd_0 | gender_cd_1 | gender_cd_9 | |
---|---|---|---|---|
0 | CS021313000114 | 0 | 1 | 0 |
1 | CS037613000071 | 0 | 0 | 1 |
2 | CS031415000172 | 0 | 1 | 0 |
3 | CS028811000001 | 0 | 1 | 0 |
4 | CS001215000145 | 0 | 1 | 0 |
5 | CS020401000016 | 1 | 0 | 0 |
6 | CS015414000103 | 0 | 1 | 0 |
7 | CS029403000008 | 1 | 0 | 0 |
8 | CS015804000004 | 1 | 0 | 0 |
9 | CS033513000180 | 0 | 1 | 0 |
Commentary :
This code creates dummy variables for the categorical variable 'gender_cd' in the DataFrame 'df_customer'. The resulting DataFrame has one column for each code value present in 'gender_cd' (here 0, 1 and 9), and each row has a 1 in the column corresponding to that customer's 'gender_cd' value and 0s in all other columns.

The 'get_dummies' function from pandas is used for this purpose. It takes a DataFrame and a list of columns to apply the dummy encoding to. Here it is called on the 'df_customer' DataFrame with the 'gender_cd' column specified in the 'columns' parameter, and the first 10 rows of the result are returned.
# Code example 2 (one item can be deleted or the delimiter changed)
pd.get_dummies(df_customer[['customer_id', 'gender_cd']],
columns=['gender_cd'],
drop_first=True, prefix='gen', prefix_sep='#').head(10)
customer_id | gen#1 | gen#9 | |
---|---|---|---|
0 | CS021313000114 | 1 | 0 |
1 | CS037613000071 | 0 | 1 |
2 | CS031415000172 | 1 | 0 |
3 | CS028811000001 | 1 | 0 |
4 | CS001215000145 | 1 | 0 |
5 | CS020401000016 | 0 | 0 |
6 | CS015414000103 | 1 | 0 |
7 | CS029403000008 | 0 | 0 |
8 | CS015804000004 | 0 | 0 |
9 | CS033513000180 | 1 | 0 |
Commentary :
This code creates dummy variables for the categorical variable `gender_cd` in the `df_customer` DataFrame. It uses the `pd.get_dummies()` function from the pandas library to create a binary indicator variable for each unique category in `gender_cd`.

The `drop_first=True` argument drops the first category ('0'), which becomes the reference category. This is done to prevent multicollinearity between the dummy variables.

The `prefix='gen'` and `prefix_sep='#'` arguments control the column names of the dummy variables: here the prefix is 'gen' and the separator is '#'.

The resulting DataFrame has one column for each remaining value of `gender_cd`, with 1 indicating that the customer is in that category and 0 that they are not. The column names take the form `gen#<gender_cd>`, where `<gender_cd>` is the value from the original `gender_cd` column.
P-059: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), standardise the total sales amount to mean 0 and standard deviation 1 and display 10 cases with the customer ID and total sales amount. The standard deviation used for standardisation may be either the square root of the variance or the square root of the unbiased variance. However, exclude customer IDs starting with “Z”, as they represent non-members.
TIPS:
The query() argument engine allows you to choose between 'python' and 'numexpr', defaulting to numexpr if it is installed, otherwise python. Furthermore, string methods cannot be used in query() unless engine='python'.
# Standardise with the population standard deviation of the data, using sklearn's preprocessing.scale
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
engine='python'). \
groupby('customer_id'). \
agg({'amount':'sum'}).reset_index()
df_sales_amount['std_amount'] = preprocessing.scale(df_sales_amount['amount'])
df_sales_amount.head(10)
customer_id | amount | std_amount | |
---|---|---|---|
0 | CS001113000004 | 1298 | -0.459378 |
1 | CS001114000005 | 626 | -0.706390 |
2 | CS001115000010 | 3044 | 0.182413 |
3 | CS001205000004 | 1988 | -0.205749 |
4 | CS001205000006 | 3337 | 0.290114 |
5 | CS001211000025 | 456 | -0.768879 |
6 | CS001212000027 | 448 | -0.771819 |
7 | CS001212000031 | 296 | -0.827691 |
8 | CS001212000046 | 228 | -0.852686 |
9 | CS001212000070 | 456 | -0.768879 |
Commentary :
This code does the following:

Filters out the rows in `df_receipt` where `customer_id` starts with the letter "Z".

Groups the remaining rows by `customer_id` and calculates the sum of the `amount` column for each group using the `agg` method.

Resets the index of the resulting DataFrame to make `customer_id` a regular column.

Applies the `preprocessing.scale` function from scikit-learn to the `amount` column to calculate the z-score of each customer's total sales, and stores the result in a new column called `std_amount`.

The resulting DataFrame `df_sales_amount` contains one row per customer, with columns for `customer_id`, the total sales `amount`, and `std_amount`, the z-score of their sales.
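The z-score computed by `preprocessing.scale` can be reproduced by hand, which also shows which standard deviation it uses (a sketch on toy values):

```python
import numpy as np
from sklearn import preprocessing

# preprocessing.scale standardises with the population standard
# deviation (ddof=0), i.e. the square root of the biased variance.
x = np.array([1.0, 2.0, 3.0, 4.0])
scaled = preprocessing.scale(x)
manual = (x - x.mean()) / x.std(ddof=0)
print(np.allclose(scaled, manual))  # True
```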
# Code example 2 (by doing a fit, you can standardise different data with the same mean and standard deviation).
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
engine='python'). \
groupby('customer_id'). \
agg({'amount':'sum'}).reset_index()
scaler = preprocessing.StandardScaler()
scaler.fit(df_sales_amount[['amount']])
df_sales_amount['std_amount'] = scaler.transform(df_sales_amount[['amount']])
df_sales_amount.head(10)
customer_id | amount | std_amount | |
---|---|---|---|
0 | CS001113000004 | 1298 | -0.459378 |
1 | CS001114000005 | 626 | -0.706390 |
2 | CS001115000010 | 3044 | 0.182413 |
3 | CS001205000004 | 1988 | -0.205749 |
4 | CS001205000006 | 3337 | 0.290114 |
5 | CS001211000025 | 456 | -0.768879 |
6 | CS001212000027 | 448 | -0.771819 |
7 | CS001212000031 | 296 | -0.827691 |
8 | CS001212000046 | 228 | -0.852686 |
9 | CS001212000070 | 456 | -0.768879 |
Commentary :
This code performs standardization (also called Z-score normalization) on the 'amount' column of the 'df_sales_amount' DataFrame.
The code first filters out the transactions made by customers whose IDs start with the letter 'Z' using the 'query' method of pandas DataFrame. It then groups the remaining transactions by 'customer_id' and calculates the total amount spent by each customer using the 'agg' method. Finally, it resets the index of the resulting DataFrame.
Next, the code creates an instance of the 'StandardScaler' class from the 'preprocessing' module of scikit-learn library. The 'StandardScaler' class is used for standardization, which scales the data so that it has a mean of zero and a standard deviation of one. The 'fit' method of the 'scaler' instance is then used to calculate the mean and standard deviation of the 'amount' column of the 'df_sales_amount' DataFrame.
Finally, the 'transform' method of the 'scaler' instance is used to perform the actual standardization on the 'amount' column of the 'df_sales_amount' DataFrame, and the resulting standardized values are stored in a new column named 'std_amount'.
The resulting DataFrame, 'df_sales_amount', contains the 'customer_id', total 'amount' spent by each customer, and the standardized 'std_amount' column.
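The advantage of the fit/transform split is that the learned statistics can be reused on other data. A minimal sketch with made-up values:

```python
import numpy as np
from sklearn import preprocessing

# Toy data (hypothetical). fit() learns mean and std from one dataset;
# transform() can then reuse them on different data, e.g. applying
# training-set statistics to a test set.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
new = np.array([[2.5], [10.0]])

scaler = preprocessing.StandardScaler()
scaler.fit(train)  # learns mean = 2.5, std = sqrt(1.25)

# 2.5 equals the learned mean, so it maps to exactly 0.
print(scaler.transform(new).ravel()[0])  # 0.0
```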
P-060: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), normalise the total sales amount to a minimum value of 0 and a maximum value of 1 and display 10 items with the customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.
# Code example 1
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
engine='python'). \
groupby('customer_id'). \
agg({'amount':'sum'}).reset_index()
df_sales_amount['scale_amount'] = \
preprocessing.minmax_scale(df_sales_amount['amount'])
df_sales_amount.head(10)
customer_id | amount | scale_amount | |
---|---|---|---|
0 | CS001113000004 | 1298 | 0.053354 |
1 | CS001114000005 | 626 | 0.024157 |
2 | CS001115000010 | 3044 | 0.129214 |
3 | CS001205000004 | 1988 | 0.083333 |
4 | CS001205000006 | 3337 | 0.141945 |
5 | CS001211000025 | 456 | 0.016771 |
6 | CS001212000027 | 448 | 0.016423 |
7 | CS001212000031 | 296 | 0.009819 |
8 | CS001212000046 | 228 | 0.006865 |
9 | CS001212000070 | 456 | 0.016771 |
Commentary :
This code performs data preprocessing on the pandas DataFrame `df_receipt`. Here's what each step does:

Filters out rows where the customer ID starts with "Z" using the `query()` method with `engine='python'` (required for string methods). The resulting DataFrame is assigned to `df_sales_amount`.

Groups `df_sales_amount` by customer ID using the `groupby()` method and calculates the sum of the `amount` column for each group using the `agg()` method.

Resets the index of `df_sales_amount` with `reset_index()` to turn `customer_id` back into a regular column.

Applies the `minmax_scale()` function from scikit-learn's `preprocessing` module to the `amount` column of `df_sales_amount` to scale the values between 0 and 1. The scaled values are stored in a new column called `scale_amount`.

In short, this code computes the total sales per customer, excludes non-members (IDs starting with "Z"), and scales each customer's total between 0 and 1 using min-max scaling.
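The underlying formula is simple enough to verify by hand on toy amounts:

```python
import numpy as np

# Min-max scaling: scaled = (x - min) / (max - min), mapping the
# smallest value to 0 and the largest to 1 (toy amounts below).
x = np.array([228.0, 456.0, 3337.0])
scaled = (x - x.min()) / (x.max() - x.min())
print(scaled[0], scaled[-1])  # 0.0 1.0
```

One caveat of min-max scaling: a single outlier sets the maximum, compressing all other values toward 0.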
# Code example 2 (by doing a fit, you can standardise on the same minimum and maximum values for different data).
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
engine='python'). \
groupby('customer_id'). \
agg({'amount':'sum'}).reset_index()
scaler = preprocessing.MinMaxScaler()
scaler.fit(df_sales_amount[['amount']])
df_sales_amount['scale_amount'] = scaler.transform(df_sales_amount[['amount']])
df_sales_amount.head(10)
customer_id | amount | scale_amount | |
---|---|---|---|
0 | CS001113000004 | 1298 | 0.053354 |
1 | CS001114000005 | 626 | 0.024157 |
2 | CS001115000010 | 3044 | 0.129214 |
3 | CS001205000004 | 1988 | 0.083333 |
4 | CS001205000006 | 3337 | 0.141945 |
5 | CS001211000025 | 456 | 0.016771 |
6 | CS001212000027 | 448 | 0.016423 |
7 | CS001212000031 | 296 | 0.009819 |
8 | CS001212000046 | 228 | 0.006865 |
9 | CS001212000070 | 456 | 0.016771 |
Commentary :
This code performs feature scaling on the per-customer sales totals. Here's what each part does:

`df_sales_amount`: creates a new DataFrame with the `customer_id` and total `amount` spent by each customer, obtained by grouping `df_receipt` by `customer_id` and summing `amount` for each group.

`scaler`: initialises a `MinMaxScaler` object from scikit-learn's `preprocessing` module, which scales data into the range [0, 1].

`scaler.fit()`: fits the scaler to the `amount` column of `df_sales_amount`, computing the minimum and maximum of the data.

`scaler.transform()`: scales the `amount` data using the minimum and maximum learned by the scaler, and the result is stored in a new column `scale_amount`.

`df_sales_amount.head(10)`: displays the first 10 rows of the updated DataFrame, including the scaled values in the new `scale_amount` column.

Overall, this code brings the `amount` values for all customers into the same [0, 1] range, making them directly comparable and more useful for analysis.