Reference: "Data Science 100 Knocks (Structured Data Processing)" by the Data Scientist Society Skill Definition Committee (データサイエンティスト協会スキル定義委員会)
The Data Scientist Society GitHub :
Data Science 100 Knocks (Structured Data Processing) URL :

Note: This is an ipynb file originally created by the Data Scientist Society (データサイエンティスト協会スキル定義委員会) and translated from Japanese to English with DeepL.

I updated this file to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers.

Since this data was created for a Japanese audience, you may run into Japanese-language content while practicing, but it will not affect the exercises much.
Getting Started
 First, run the following cells
 Import the necessary libraries and read the data from the database (PostgreSQL)
 The following cell imports libraries that are expected to be used, such as pandas.
 If there are other libraries you wish to use, install them as needed (you can also install them with "!pip install <library name>").
 The processing may be split into several steps.
 Names, addresses, etc. are dummy data and are not real.
import os
import pandas as pd
import numpy as np
from datetime import datetime, date
from dateutil.relativedelta import relativedelta
import math
import psycopg2
from sqlalchemy import create_engine
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from imblearn.under_sampling import RandomUnderSampler

if 'PG_PORT' in os.environ:
    pgconfig = {
        'host': 'db',
        'port': os.environ['PG_PORT'],
        'database': os.environ['PG_DATABASE'],
        'user': os.environ['PG_USER'],
        'password': os.environ['PG_PASSWORD'],
    }
    # Connector for pd.read_sql
    conn = psycopg2.connect(**pgconfig)
    df_customer = pd.read_sql(sql='select * from customer', con=conn)
    df_category = pd.read_sql(sql='select * from category', con=conn)
    df_product = pd.read_sql(sql='select * from product', con=conn)
    df_receipt = pd.read_sql(sql='select * from receipt', con=conn)
    df_store = pd.read_sql(sql='select * from store', con=conn)
    df_geocode = pd.read_sql(sql='select * from geocode', con=conn)
else:
    if not os.path.exists('data/'):
        !git clone https://github.com/TheJapanDataScientistSociety/100knockspreprocess
        os.chdir('100knockspreprocess/docker/work/')
    # Read the code columns as strings so leading zeros are preserved
    dtype = {
        'customer_id': str,
        'gender_cd': str,
        'postal_cd': str,
        'application_store_cd': str,
        'status_cd': str,
        'category_major_cd': str,
        'category_medium_cd': str,
        'category_small_cd': str,
        'product_cd': str,
        'store_cd': str,
        'prefecture_cd': str,
        'tel_no': str,
        'street': str
    }
    df_customer = pd.read_csv("data/customer.csv", dtype=dtype)
    df_category = pd.read_csv("data/category.csv", dtype=dtype)
    df_product = pd.read_csv("data/product.csv", dtype=dtype)
    df_receipt = pd.read_csv("data/receipt.csv", dtype=dtype)
    df_store = pd.read_csv("data/store.csv", dtype=dtype)
    df_geocode = pd.read_csv("data/geocode.csv", dtype=dtype)
Exercise
P021: Count the number of cases against the receipt details data (df_receipt).
len(df_receipt)
104681
Commentary :
The code `len(df_receipt)` returns the length of the DataFrame `df_receipt`.
Here's an explanation of each component:
`len()` : This is a built-in Python function that returns the length of an object, such as a list or a string.
`df_receipt` : This is a variable that represents a DataFrame object in pandas. The name `df_receipt` could be any valid variable name.
The combination of `len()` and `df_receipt` : By passing the DataFrame `df_receipt` as an argument to the `len()` function, we are asking Python to return the number of rows in `df_receipt`.
In summary, the code simply returns the number of rows in a DataFrame.
P022: Count the number of unique cases against the customer ID (customer_id) of the receipt details data (df_receipt).
len(df_receipt['customer_id'].unique())
8307
Commentary :
The code `len(df_receipt['customer_id'].unique())` returns the number of unique customer IDs in the `customer_id` column of the `df_receipt` DataFrame.
Here's an explanation of each component:
`len()` : This is a built-in Python function that returns the length of an object, such as a list or a string.
`df_receipt` : This is a variable that represents a DataFrame object in pandas. The name `df_receipt` could be any valid variable name.
`['customer_id']` : This is a DataFrame indexing operation that selects the column labeled `customer_id` from `df_receipt`. The resulting object is a pandas Series.
`.unique()` : This is a pandas Series method that returns a NumPy array of the unique values in the Series. In this case, it returns a NumPy array of unique customer IDs.
Putting them together: by passing the NumPy array of unique customer IDs as an argument to the `len()` function, we are asking Python to return the number of unique customer IDs.
In summary, the code returns the number of unique customer IDs in a DataFrame.
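A shorter equivalent is `Series.nunique()`, which counts distinct values in one call. A minimal sketch on a small synthetic frame (the `df` below is an illustrative stand-in, not the actual dataset):

```python
import pandas as pd

# Illustrative stand-in for df_receipt (not the real data)
df = pd.DataFrame({'customer_id': ['A', 'B', 'A', 'C', 'B', 'A']})

# nunique() counts distinct values directly ...
n_unique = df['customer_id'].nunique()
print(n_unique)  # → 3

# ... and agrees with the len(unique()) idiom used above
assert n_unique == len(df['customer_id'].unique())
```

One difference worth knowing: `nunique()` excludes NaN by default (`dropna=True`), while `unique()` includes NaN in its result, so the two can differ on columns with missing values.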
P023: Sum the sales amount (amount) and sales quantity (quantity) for each shop code (store_cd) in the receipt details data (df_receipt).
# Code example 1
df_receipt.groupby('store_cd').agg({'amount':'sum',
                                    'quantity':'sum'}).reset_index()
store_cd  amount  quantity  

0  S12007  638761  2099 
1  S12013  787513  2425 
2  S12014  725167  2358 
3  S12029  794741  2555 
4  S12030  684402  2403 
5  S13001  811936  2347 
6  S13002  727821  2340 
7  S13003  764294  2197 
8  S13004  779373  2390 
9  S13005  629876  2004 
10  S13008  809288  2491 
11  S13009  808870  2486 
12  S13015  780873  2248 
13  S13016  793773  2432 
14  S13017  748221  2376 
15  S13018  790535  2562 
16  S13019  827833  2541 
17  S13020  796383  2383 
18  S13031  705968  2336 
19  S13032  790501  2491 
20  S13035  715869  2219 
21  S13037  693087  2344 
22  S13038  708884  2337 
23  S13039  611888  1981 
24  S13041  728266  2233 
25  S13043  587895  1881 
26  S13044  520764  1729 
27  S13051  107452  354 
28  S13052  100314  250 
29  S14006  712839  2284 
30  S14010  790361  2290 
31  S14011  805724  2434 
32  S14012  720600  2412 
33  S14021  699511  2231 
34  S14022  651328  2047 
35  S14023  727630  2258 
36  S14024  736323  2417 
37  S14025  755581  2394 
38  S14026  824537  2503 
39  S14027  714550  2303 
40  S14028  786145  2458 
41  S14033  725318  2282 
42  S14034  653681  2024 
43  S14036  203694  635 
44  S14040  701858  2233 
45  S14042  534689  1935 
46  S14045  458484  1398 
47  S14046  412646  1354 
48  S14047  338329  1041 
49  S14048  234276  769 
50  S14049  230808  788 
51  S14050  167090  580 
Commentary :
The code `df_receipt.groupby('store_cd').agg({'amount':'sum', 'quantity':'sum'}).reset_index()` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, applies the `sum` aggregation to the `amount` and `quantity` columns of each group, and finally resets the index of the resulting DataFrame.
Here's an explanation of each component:
`df_receipt` : This is a variable that represents a DataFrame object in pandas. The name `df_receipt` could be any valid variable name.
`.groupby('store_cd')` : This is a pandas DataFrame method that groups the rows of the DataFrame by the values in the `store_cd` column. This creates a `GroupBy` object that can be used to apply functions to each group separately.
`.agg({'amount':'sum', 'quantity':'sum'})` : This is a pandas GroupBy method that applies the `sum` aggregation function to the `amount` and `quantity` columns of each group. The result is a DataFrame with two columns, `amount` and `quantity`, containing the sums of these columns for each group.
`.reset_index()` : This is a pandas DataFrame method that resets the index of the resulting DataFrame to a default index, turning `store_cd` back into a regular column.
In summary, the code groups the rows of a DataFrame by the values in a specific column, computes the sum of two other columns for each group, and returns a DataFrame with those sums and a default index.
# Code example 2
df_receipt.groupby('store_cd')[['amount','quantity']].agg('sum').reset_index()
store_cd  amount  quantity  

0  S12007  638761  2099 
1  S12013  787513  2425 
2  S12014  725167  2358 
3  S12029  794741  2555 
4  S12030  684402  2403 
5  S13001  811936  2347 
6  S13002  727821  2340 
7  S13003  764294  2197 
8  S13004  779373  2390 
9  S13005  629876  2004 
10  S13008  809288  2491 
11  S13009  808870  2486 
12  S13015  780873  2248 
13  S13016  793773  2432 
14  S13017  748221  2376 
15  S13018  790535  2562 
16  S13019  827833  2541 
17  S13020  796383  2383 
18  S13031  705968  2336 
19  S13032  790501  2491 
20  S13035  715869  2219 
21  S13037  693087  2344 
22  S13038  708884  2337 
23  S13039  611888  1981 
24  S13041  728266  2233 
25  S13043  587895  1881 
26  S13044  520764  1729 
27  S13051  107452  354 
28  S13052  100314  250 
29  S14006  712839  2284 
30  S14010  790361  2290 
31  S14011  805724  2434 
32  S14012  720600  2412 
33  S14021  699511  2231 
34  S14022  651328  2047 
35  S14023  727630  2258 
36  S14024  736323  2417 
37  S14025  755581  2394 
38  S14026  824537  2503 
39  S14027  714550  2303 
40  S14028  786145  2458 
41  S14033  725318  2282 
42  S14034  653681  2024 
43  S14036  203694  635 
44  S14040  701858  2233 
45  S14042  534689  1935 
46  S14045  458484  1398 
47  S14046  412646  1354 
48  S14047  338329  1041 
49  S14048  234276  769 
50  S14049  230808  788 
51  S14050  167090  580 
Commentary :
This code performs a groupby operation on the `df_receipt` DataFrame using the `store_cd` column, and then aggregates the `amount` and `quantity` columns by summing them.
The resulting DataFrame has one row for each unique value in the `store_cd` column and two columns, `amount` and `quantity`, each containing the sum of its values across all rows that share the same `store_cd` value.
The `.reset_index()` method is then called on the result to convert `store_cd` from an index back to a regular column.
P024: Find the most recent sales date (sales_ymd) for each customer ID (customer_id) for the receipt details data (df_receipt) and display the 10 most recent sales dates (sales_ymd).
df_receipt.groupby('customer_id').agg({'sales_ymd': 'max'}).reset_index().head(10)
customer_id  sales_ymd  

0  CS001113000004  20190308 
1  CS001114000005  20190731 
2  CS001115000010  20190405 
3  CS001205000004  20190625 
4  CS001205000006  20190224 
5  CS001211000025  20190322 
6  CS001212000027  20170127 
7  CS001212000031  20180906 
8  CS001212000046  20170811 
9  CS001212000070  20191018 
Commentary :
The code performs a groupby operation on the `df_receipt` DataFrame based on the `customer_id` column, then aggregates the `sales_ymd` column for each group with the `max` function to find the latest purchase date for each customer.
The result of the aggregation is a new DataFrame with two columns, `customer_id` and `sales_ymd`. The `reset_index()` method resets the index of the result to the default, and `head(10)` displays its first 10 rows.
So, the output of this code is a DataFrame showing the maximum purchase date (`sales_ymd`) for each `customer_id`, with only the first 10 rows displayed.
P025: For receipt details data (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display the 10 results.
# The same style as P024 would also work, but here we deliberately use a different way of writing it.
df_receipt.groupby('customer_id').sales_ymd.min().reset_index().head(10)
customer_id  sales_ymd  

0  CS001113000004  20190308 
1  CS001114000005  20180503 
2  CS001115000010  20171228 
3  CS001205000004  20170914 
4  CS001205000006  20180207 
5  CS001211000025  20190322 
6  CS001212000027  20170127 
7  CS001212000031  20180906 
8  CS001212000046  20170811 
9  CS001212000070  20191018 
Commentary :
The code `df_receipt.groupby('customer_id').sales_ymd.min().reset_index().head(10)` groups the rows of the `df_receipt` DataFrame by the `customer_id` column, takes the minimum of the `sales_ymd` column for each group, resets the index of the result, and returns the first 10 rows.
Here's an explanation of each component:
`df_receipt` : This is a variable that represents a DataFrame object in pandas. The name `df_receipt` could be any valid variable name.
`.groupby('customer_id')` : This is a pandas DataFrame method that groups the rows of the DataFrame by the values in the `customer_id` column. This creates a `GroupBy` object that can be used to apply functions to each group separately.
`.sales_ymd.min()` : This selects the `sales_ymd` column of the `GroupBy` object via attribute access and applies the `min()` aggregation, producing a Series with the earliest `sales_ymd` value for each customer.
`.reset_index()` : This converts the resulting Series back into a DataFrame with a default index and `customer_id` as a regular column.
`.head(10)` : This is a pandas DataFrame method that returns the first 10 rows of the DataFrame.
In summary, the code groups the rows of a DataFrame by customer, finds the minimum sales date for each group, and returns the first 10 rows. This is useful for identifying the first purchase date for each customer.
P026: For receipt details data (df_receipt), find the newest sales date (sales_ymd) and the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 cases where both are different.
df_tmp = df_receipt.groupby('customer_id'). \
    agg({'sales_ymd':['max','min']}).reset_index()
# Flatten the MultiIndex column hierarchy into a single-level index, joining the levels with "_".
# You can also use df_tmp.columns = ['customer_id', 'sales_ymd_max', 'sales_ymd_min'].
df_tmp.columns = ["_".join(pair) for pair in df_tmp.columns]
df_tmp.query('sales_ymd_max != sales_ymd_min').head(10)
customer_id_  sales_ymd_max  sales_ymd_min  

1  CS001114000005  20190731  20180503 
2  CS001115000010  20190405  20171228 
3  CS001205000004  20190625  20170914 
4  CS001205000006  20190224  20180207 
13  CS001214000009  20190902  20170306 
14  CS001214000017  20191006  20180828 
16  CS001214000048  20190929  20171109 
17  CS001214000052  20190617  20180208 
20  CS001215000005  20181021  20170206 
21  CS001215000040  20171022  20170214 
Commentary :
The code first groups the rows of the `df_receipt` DataFrame by the `customer_id` column and applies two aggregation functions (`max` and `min`) to the `sales_ymd` column of each group, producing the latest and earliest purchase date for each customer. After flattening the resulting column names, it selects only the rows where the two dates differ, using the `query()` method, and returns the first 10 rows.
Here's an explanation of each component:
`df_receipt` : This is a variable that represents a DataFrame object in pandas. The name `df_receipt` could be any valid variable name.
`.groupby('customer_id')` : This is a pandas DataFrame method that groups the rows of the DataFrame by the values in the `customer_id` column. This creates a `GroupBy` object that can be used to apply functions to each group separately.
`.agg({'sales_ymd':['max','min']})` : This is a pandas GroupBy method that applies two aggregation functions (`max` and `min`) to the `sales_ymd` column of each group. Because a list of functions is passed, the result has MultiIndex (two-level) column labels such as `('sales_ymd', 'max')`.
`.reset_index()` : This is a pandas DataFrame method that resets the index of the resulting DataFrame to a default index.
`df_tmp.columns = ["_".join(pair) for pair in df_tmp.columns]` : This joins each pair of column labels with `_`, flattening the MultiIndex into single-level names such as `sales_ymd_max` and `sales_ymd_min` (and `customer_id_`, since its second level is empty).
`.query('sales_ymd_max != sales_ymd_min')` : This is a pandas DataFrame method that selects only the rows where the value in the `sales_ymd_max` column is not equal to the value in the `sales_ymd_min` column, using a boolean expression passed as a string.
`.head(10)` : This is a pandas DataFrame method that returns the first 10 rows of the DataFrame.
In summary, the code finds the maximum and minimum sales dates for each customer, keeps only the customers whose two dates differ, and returns the first 10 such rows. This is useful for identifying customers who have made purchases on more than one date.
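The `"_".join` step leaves a trailing underscore on `customer_id_`, because that column's second level is an empty string. One way to flatten the columns while stripping that artifact, sketched on synthetic data (the frame is an illustrative stand-in for `df_receipt`):

```python
import pandas as pd

# Illustrative stand-in for df_receipt
df = pd.DataFrame({
    'customer_id': ['A', 'A', 'B'],
    'sales_ymd':   [20190101, 20190301, 20190215],
})

df_tmp = df.groupby('customer_id').agg({'sales_ymd': ['max', 'min']}).reset_index()

# agg with a list of functions yields MultiIndex columns like
# ('customer_id', '') and ('sales_ymd', 'max'); join the levels with '_'
# and strip the trailing '_' left by empty second levels.
df_tmp.columns = ['_'.join(pair).rstrip('_') for pair in df_tmp.columns]
print(df_tmp.columns.tolist())  # → ['customer_id', 'sales_ymd_max', 'sales_ymd_min']
```

With the clean `customer_id` name, downstream merges back onto the original frame need no renaming.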
P027: For the receipt details data (df_receipt), calculate the average of the sales amount (amount) for each shop code (store_cd) and display the top 5 in descending order.
df_receipt.groupby('store_cd').agg({'amount':'mean'}).reset_index(). \
    sort_values('amount', ascending=False).head(5)
store_cd  amount  

28  S13052  402.867470 
12  S13015  351.111960 
7  S13003  350.915519 
30  S14010  348.791262 
5  S13001  348.470386 
Commentary :
The code `df_receipt.groupby('store_cd').agg({'amount':'mean'}).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, calculates the mean of the `amount` column for each group, resets the index, sorts by the mean `amount` in descending order, and returns the first 5 rows.
Here's an explanation of each component:
`df_receipt` : This is a variable that represents a DataFrame object in pandas. The name `df_receipt` could be any valid variable name.
`.groupby('store_cd')` : This is a pandas DataFrame method that groups the rows of the DataFrame by the values in the `store_cd` column. This creates a `GroupBy` object that can be used to apply functions to each group separately.
`.agg({'amount':'mean'})` : This is a pandas GroupBy method that applies the `mean` aggregation function to the `amount` column of each group. The result is a DataFrame with one column, `amount`, containing the mean value for each group.
`.reset_index()` : This is a pandas DataFrame method that resets the index of the resulting DataFrame to a default index.
`.sort_values('amount', ascending=False)` : This is a pandas DataFrame method that sorts the rows by the values in the `amount` column; `ascending=False` specifies descending order.
`.head(5)` : This is a pandas DataFrame method that returns the first 5 rows of the DataFrame.
In summary, the code computes the mean sales amount per store, sorts the stores by that mean in descending order, and returns the top 5. This is useful for identifying the stores with the highest average sales.
P028: Calculate the median sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display the TOP 5 in descending order.
df_receipt.groupby('store_cd').agg({'amount':'median'}).reset_index(). \
    sort_values('amount', ascending=False).head(5)
store_cd  amount  

28  S13052  190.0 
30  S14010  188.0 
51  S14050  185.0 
44  S14040  180.0 
7  S13003  180.0 
Commentary :
The code `df_receipt.groupby('store_cd').agg({'amount':'median'}).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, calculates the median of the `amount` column for each group, resets the index, sorts by the median `amount` in descending order, and returns the first 5 rows.
Here's an explanation of each component:
`df_receipt` : This is a variable that represents a DataFrame object in pandas. The name `df_receipt` could be any valid variable name.
`.groupby('store_cd')` : This is a pandas DataFrame method that groups the rows of the DataFrame by the values in the `store_cd` column. This creates a `GroupBy` object that can be used to apply functions to each group separately.
`.agg({'amount':'median'})` : This is a pandas GroupBy method that applies the `median` aggregation function to the `amount` column of each group. The result is a DataFrame with one column, `amount`, containing the median value for each group.
`.reset_index()` : This is a pandas DataFrame method that resets the index of the resulting DataFrame to a default index.
`.sort_values('amount', ascending=False)` : This is a pandas DataFrame method that sorts the rows by the values in the `amount` column; `ascending=False` specifies descending order.
`.head(5)` : This is a pandas DataFrame method that returns the first 5 rows of the DataFrame.
In summary, the code computes the median sales amount per store, sorts the stores by that median in descending order, and returns the top 5. This is useful for identifying the stores with the highest median sales.
P029: For receipt details data (df_receipt), calculate the mode value of the product code (product_cd) for each shop code (store_cd) and display 10 cases.
df_receipt.groupby('store_cd').product_cd. \
    apply(lambda x: x.mode()).reset_index().head(10)
store_cd  level_1  product_cd  

0  S12007  0  P060303001 
1  S12013  0  P060303001 
2  S12014  0  P060303001 
3  S12029  0  P060303001 
4  S12030  0  P060303001 
5  S13001  0  P060303001 
6  S13002  0  P060303001 
7  S13003  0  P071401001 
8  S13004  0  P060303001 
9  S13005  0  P040503001 
Commentary :
The code `df_receipt.groupby('store_cd').product_cd.apply(lambda x: x.mode()).reset_index().head(10)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, finds the mode of the `product_cd` column for each group, resets the index of the result, and displays the first 10 rows.
Here's an explanation of each component:
`df_receipt` : This is a variable that represents a DataFrame object in pandas. The name `df_receipt` could be any valid variable name.
`.groupby('store_cd')` : This is a pandas DataFrame method that groups the rows of the DataFrame by the values in the `store_cd` column. This creates a `GroupBy` object that can be used to apply functions to each group separately.
`.product_cd` : This attribute access selects the `product_cd` column of the grouped object.
`.apply(lambda x: x.mode())` : This applies the `mode()` method to each group. `mode()` returns the most frequently occurring value(s) in the group as a Series; if several values are tied, the Series contains one row per tied value, which is why the output has an extra `level_1` column numbering them within each group.
`.reset_index()` : This is a pandas DataFrame method that resets the index of the resulting DataFrame to a default index.
In summary, the code finds the most frequently sold product code for each store. This is useful for identifying the most popular products sold at each store.
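The `level_1` column in the output exists because `mode()` can return more than one value per group. A minimal sketch of that tie behavior on synthetic data (the frame below is illustrative, not the real dataset):

```python
import pandas as pd

# Illustrative stand-in: in store 'S2', P1 and P2 are tied for most frequent
df = pd.DataFrame({
    'store_cd':   ['S1', 'S1', 'S1', 'S2', 'S2'],
    'product_cd': ['P1', 'P1', 'P2', 'P1', 'P2'],
})

out = df.groupby('store_cd').product_cd.apply(lambda x: x.mode()).reset_index()
# S1 contributes one row (P1 is the unique mode); S2 contributes two rows
# (P1 and P2 tie), numbered 0 and 1 in the level_1 column.
print(out)
```

So a store with a clean winner yields one row, while tied stores yield several; the exercise output happens to show only stores with a single mode (every `level_1` is 0).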
P030: Calculate the variance of the sales amount (amount) for each shop code (store_cd) against the receipt details data (df_receipt) and display five cases in descending order.
df_receipt.groupby('store_cd').amount.var(ddof=0).reset_index(). \
    sort_values('amount', ascending=False).head(5)
store_cd  amount  

28  S13052  440088.701311 
31  S14011  306314.558164 
42  S14034  296920.081011 
5  S13001  295431.993329 
12  S13015  295294.361116 
Commentary :
The code `df_receipt.groupby('store_cd').amount.var(ddof=0).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, calculates the variance of the `amount` column for each group with `ddof=0`, resets the index, sorts by the variance in descending order, and returns the first 5 rows.
Here's an explanation of each component:
`df_receipt` : This is a variable that represents a DataFrame object in pandas. The name `df_receipt` could be any valid variable name.
`.groupby('store_cd')` : This is a pandas DataFrame method that groups the rows of the DataFrame by the values in the `store_cd` column. This creates a `GroupBy` object that can be used to apply functions to each group separately.
`.amount.var(ddof=0)` : This selects the `amount` column of each group and calculates its variance with `ddof=0`. The `ddof` parameter ("delta degrees of freedom") controls the divisor used in the calculation; a value of 0 computes the population variance rather than the sample variance. The result is a Series of variance values indexed by `store_cd`.
`.reset_index()` : This is a pandas DataFrame method that resets the index of the resulting DataFrame to a default index.
`.sort_values('amount', ascending=False)` : This is a pandas DataFrame method that sorts the rows by the values in the `amount` column; `ascending=False` specifies descending order.
`.head(5)` : This is a pandas DataFrame method that returns the first 5 rows of the DataFrame.
In summary, the code computes the population variance of the sales amount per store, sorts the stores by that variance in descending order, and returns the top 5. This is useful for identifying the stores with the highest variability in sales.
P031: Calculate the standard deviation of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display 5 cases in descending order.
TIPS:
Note that the default value of ddof differs between pandas and NumPy:
Pandas:
DataFrame.std(self, axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
NumPy:
numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>)
df_receipt.groupby('store_cd').amount.std(ddof=0).reset_index(). \
    sort_values('amount', ascending=False).head(5)
store_cd  amount  

28  S13052  663.391816 
31  S14011  553.456916 
42  S14034  544.903736 
5  S13001  543.536561 
12  S13015  543.409938 
Commentary :
The code `df_receipt.groupby('store_cd').amount.std(ddof=0).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, calculates the standard deviation of the `amount` column for each group with `ddof=0`, resets the index, sorts by the standard deviation in descending order, and returns the first 5 rows.
Here's an explanation of each component:
`df_receipt` : This is a variable that represents a DataFrame object in pandas. The name `df_receipt` could be any valid variable name.
`.groupby('store_cd')` : This is a pandas DataFrame method that groups the rows of the DataFrame by the values in the `store_cd` column. This creates a `GroupBy` object that can be used to apply functions to each group separately.
`.amount.std(ddof=0)` : This selects the `amount` column of each group and calculates its standard deviation with `ddof=0`. The `ddof` parameter ("delta degrees of freedom") controls the divisor used in the calculation; a value of 0 computes the population standard deviation rather than the sample standard deviation. The result is a Series of standard deviation values indexed by `store_cd`.
`.reset_index()` : This is a pandas DataFrame method that resets the index of the resulting DataFrame to a default index.
`.sort_values('amount', ascending=False)` : This is a pandas DataFrame method that sorts the rows by the values in the `amount` column; `ascending=False` specifies descending order.
`.head(5)` : This is a pandas DataFrame method that returns the first 5 rows of the DataFrame.
In summary, the code computes the population standard deviation of the sales amount per store, sorts the stores by it in descending order, and returns the top 5. This is useful for identifying the stores with the highest variability in sales.
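The TIPS point about differing `ddof` defaults can be checked directly. A small sketch with synthetic numbers (illustrative only):

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])

# pandas defaults to ddof=1 (sample std); NumPy defaults to ddof=0 (population std)
print(x.std())     # sample standard deviation
print(np.std(x))   # population standard deviation (smaller)

# They agree once ddof is made explicit
assert np.isclose(x.std(ddof=0), np.std(x))
assert np.isclose(x.std(), np.std(x, ddof=1))
```

This is why the exercise passes `ddof=0` explicitly: it makes the pandas result match NumPy's population definition regardless of either library's default.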
P032: Find the percentile values for the sales amount (amount) in the receipt details data (df_receipt) in 25% increments.
# Code Example 1
np.percentile(df_receipt['amount'], q=np.arange(1, 5) * 25)
array([ 102., 170., 288., 10925.])
Commentary :
The code `np.percentile(df_receipt['amount'], q=np.arange(1, 5) * 25)` computes the quartiles of the values in the `amount` column of the `df_receipt` DataFrame using the `np.percentile()` function from the NumPy library.
Here's an explanation of each component:
- `np.percentile()`: a NumPy function that computes the percentiles of an array. Its first argument is the array for which percentiles are to be calculated, and the `q` parameter specifies the percentiles to compute.
- `df_receipt['amount']`: a pandas Series representing the `amount` column of the `df_receipt` DataFrame, which is assumed to exist in the current environment.
- `q=np.arange(1, 5) * 25`: `np.arange(1, 5)` generates the array `[1, 2, 3, 4]`, so multiplying by 25 yields `[25, 50, 75, 100]`, i.e. the percentile positions at 25%, 50%, 75%, and 100% of the distribution.
The code therefore computes the quartiles of the `amount` column: the values that divide the distribution into four equal parts. The first quartile (Q1) is the 25th percentile, the second quartile (Q2) is the 50th percentile (the median), the third quartile (Q3) is the 75th percentile, and the 100th percentile is the maximum. The output is an array of these values, which summarizes the distribution of the `amount` column.
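A minimal, self-contained sketch of the same call on toy data (`df_receipt` only exists inside the notebook environment):

```python
import numpy as np

data = [1, 2, 3, 4, 5]

# Percentile positions 25, 50, 75, 100 -- the quartiles plus the maximum.
q = np.percentile(data, q=np.arange(1, 5) * 25)
print(q)  # -> [2. 3. 4. 5.]
```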
# Code Example 2
df_receipt.amount.quantile(q=np.arange(1, 5) / 4)
0.25      102.0
0.50      170.0
0.75      288.0
1.00    10925.0
Name: amount, dtype: float64
Commentary :
This code calculates the quartiles of the `amount` column in the `df_receipt` DataFrame using the `quantile()` method from pandas. The `q` parameter specifies the quantiles to compute: `np.arange(1, 5) / 4` evaluates to `[0.25, 0.5, 0.75, 1.0]`, i.e. the 1/4, 2/4 (median), 3/4, and 4/4 (maximum) quantiles.
The code calls `quantile()` on the `amount` column of `df_receipt` with `q` set to this array, which corresponds to quartiles 1, 2, 3, and 4 respectively.
The result is a Series of four values indexed by the quantile fractions, each representing the value of the corresponding quartile.
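For comparison, the same quartiles computed with `Series.quantile()` on toy data (a sketch, not the notebook's actual data):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], name='amount')

# Quantiles at fractions 0.25, 0.50, 0.75, 1.00 (quartiles plus maximum).
quartiles = s.quantile(q=np.arange(1, 5) / 4)
print(quartiles)
```

The result is indexed by the fractions themselves (0.25, 0.50, 0.75, 1.00), matching `np.percentile` on the same data.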
P033: Calculate the average of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and extract those that are 330 or more.
df_receipt.groupby('store_cd').amount.mean(). \
reset_index().query('amount >= 330')
    store_cd      amount
1     S12013  330.194130
5     S13001  348.470386
7     S13003  350.915519
8     S13004  330.943949
12    S13015  351.111960
16    S13019  330.208616
17    S13020  337.879932
28    S13052  402.867470
30    S14010  348.791262
31    S14011  335.718333
38    S14026  332.340588
46    S14045  330.082073
48    S14047  330.077073
Commentary :
The code `df_receipt.groupby('store_cd').amount.mean().reset_index().query('amount >= 330')` groups the `df_receipt` DataFrame by the `store_cd` column, calculates the mean of the `amount` column for each group, resets the index of the result, and filters it to only the rows where the mean `amount` is greater than or equal to 330.
Here's an explanation of each component:
- `df_receipt.groupby('store_cd')`: groups the `df_receipt` DataFrame by the values in the `store_cd` column.
- `.amount.mean()`: calculates the mean of the `amount` column for each group.
- `.reset_index()`: resets the index of the resulting DataFrame.
- `.query('amount >= 330')`: filters the result to only the rows where the mean `amount` is greater than or equal to 330.
In summary, this code selects only the stores whose average transaction amount is greater than or equal to 330 yen.
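The same chain on toy data (a sketch; store codes and amounts are invented):

```python
import pandas as pd

df = pd.DataFrame({
    'store_cd': ['S1', 'S1', 'S2', 'S2'],
    'amount':   [300,  400,  100,  200],
})

# Mean amount per store, keeping only stores averaging 330 or more.
result = (df.groupby('store_cd').amount.mean()
            .reset_index()
            .query('amount >= 330'))
print(result)  # only S1 (mean 350.0) remains
```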
P034: For the receipt details data (df_receipt), sum the amount of sales (amount) for each customer ID (customer_id) and find the average for all customers. However, exclude customer IDs starting with “Z” as they represent nonmembers.
# Code example 1: Writing without query
df_receipt[~df_receipt['customer_id'].str.startswith("Z")]. \
groupby('customer_id').amount.sum().mean()
2547.742234529256
Commentary :
The code `df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()` does the following:
- `df_receipt[~df_receipt['customer_id'].str.startswith("Z")]`: filters the `df_receipt` DataFrame to exclude rows where the `customer_id` column starts with the letter "Z". The tilde (`~`) is a negation operator that inverts the boolean values returned by the `str.startswith()` method.
- `.groupby('customer_id').amount.sum()`: groups the filtered DataFrame by the `customer_id` column and calculates the sum of the `amount` column for each group.
- `.mean()`: calculates the mean of the resulting Series, i.e. the average of the per-customer totals.
In summary, this code computes the average total amount spent per customer, restricted to customers whose `customer_id` does not start with the letter "Z". This may be useful for understanding the purchasing behavior of regular (member) customers.
# Code example 2: Writing with query
df_receipt.query('not customer_id.str.startswith("Z")',
engine='python').groupby('customer_id').amount.sum().mean()
2547.742234529256
Commentary :
This code computes the average total amount spent by non-anonymous (member) customers. Here's how it works:
- `df_receipt` is a DataFrame containing information about transactions.
- The `query` method filters out transactions where the `customer_id` starts with the letter "Z", using the `str.startswith()` method, which checks whether a string starts with a given character or substring. The `engine='python'` argument is needed because string methods are not supported by the default `numexpr` query engine.
- The resulting DataFrame is grouped by `customer_id` with the `groupby` method.
- The `amount` column is summed with `sum` to compute the total amount spent by each customer.
- `mean` then averages those per-customer totals, giving the final result.
P035: For the receipt details data (df_receipt), sum the sales amount (amount) for each customer ID (customer_id) to obtain the average over all customers, then extract the customers who spent at or above that average and display 10 of them. However, exclude customer IDs starting with "Z" as they represent non-members.
df_amount_sum = df_receipt[~df_receipt['customer_id'].str.startswith("Z")].\
groupby('customer_id').amount.sum()
amount_mean = df_amount_sum.mean()
df_amount_sum = df_amount_sum.reset_index()
df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)
       customer_id  amount
2   CS001115000010    3044
4   CS001205000006    3337
13  CS001214000009    4685
14  CS001214000017    4132
17  CS001214000052    5639
21  CS001215000040    3496
30  CS001304000006    3726
32  CS001305000005    3485
33  CS001305000011    4370
53  CS001315000180    3300
Commentary :
The four statements do the following:
- `df_amount_sum = df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum()`: filters the `df_receipt` DataFrame to exclude rows where the `customer_id` column starts with the letter "Z", groups the result by the `customer_id` column, and calculates the sum of the `amount` column for each group. The result is a Series giving the total amount spent by each customer (excluding customers whose `customer_id` starts with "Z").
- `amount_mean = df_amount_sum.mean()`: calculates the mean of the `df_amount_sum` Series, i.e. the average total spending per customer.
- `df_amount_sum = df_amount_sum.reset_index()`: resets the index of the `df_amount_sum` Series, converting it to a DataFrame with two columns: `customer_id` and `amount`.
- `df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)`: filters the `df_amount_sum` DataFrame to only the rows where the `amount` column is greater than or equal to `amount_mean`, and returns the first 10 rows of the result.
In summary, this code selects the customers whose total spending is greater than or equal to the average spending across all customers whose `customer_id` does not start with the letter "Z". The resulting DataFrame contains the `customer_id` and total amount spent for each selected customer, in the original `customer_id` order (note that it is not sorted by spending).
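The same four steps on invented data (a sketch of the P035 pattern):

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': ['C001', 'C002', 'C003', 'Z999'],
    'amount':      [100,    200,    600,    999],
})

# Per-customer totals, excluding non-members starting with "Z".
totals = (df[~df['customer_id'].str.startswith('Z')]
            .groupby('customer_id').amount.sum())
mean_total = totals.mean()   # (100 + 200 + 600) / 3 = 300.0
totals = totals.reset_index()

# Keep only customers spending at or above the average.
above = totals[totals['amount'] >= mean_total]
print(above)  # only C003 (600) meets the 300.0 average
```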
P036: Join the receipt details data (df_receipt) and the shop data (df_store) with an inner join, and display 10 rows containing all columns of the receipt details data plus the store name (store_name) from the shop data.
pd.merge(df_receipt, df_store[['store_cd','store_name']],
how='inner', on='store_cd').head(10)
   sales_ymd  sales_epoch store_cd  receipt_no  receipt_sub_no     customer_id  product_cd  quantity  amount store_name
0   20181103   1541203200   S14006         112               1  CS006214000001  P070305012         1     158   葛が谷店
1   20181116   1542326400   S14006         112               2  ZZ000000000000  P080401001         1      48   葛が谷店
2   20170118   1484697600   S14006        1162               1  CS006815000006  P050406035         1     220   葛が谷店
3   20190524   1558656000   S14006        1192               1  CS006514000034  P060104003         1      80   葛が谷店
4   20190419   1555632000   S14006         112               2  ZZ000000000000  P060501002         1     148   葛が谷店
5   20181119   1542585600   S14006        1152               2  ZZ000000000000  P050701001         1      88   葛が谷店
6   20171211   1512950400   S14006        1132               2  CS006515000175  P090903001         1      80   葛が谷店
7   20191021   1571616000   S14006        1112               2  CS006415000221  P040602001         1     405   葛が谷店
8   20170710   1499644800   S14006        1132               2  CS006411000036  P090301051         1     330   葛が谷店
9   20190805   1564963200   S14006         112               1  CS006211000012  P050104001         1     115   葛が谷店
Commentary :
The code `pd.merge(df_receipt, df_store[['store_cd','store_name']], how='inner', on='store_cd').head(10)` does the following:
- `df_store[['store_cd','store_name']]`: selects a subset of the `df_store` DataFrame with two columns: `store_cd` and `store_name`.
- `pd.merge(..., how='inner', on='store_cd')`: performs an inner join between the `df_receipt` DataFrame and that two-column DataFrame on the `store_cd` column, keeping only rows with matching values in both DataFrames.
- `.head(10)`: returns the first 10 rows of the resulting merged DataFrame.
In summary, this code merges the `df_receipt` DataFrame with the `df_store` DataFrame on the `store_cd` column, adding the `store_name` column from `df_store` to the result. The resulting DataFrame contains all columns from `df_receipt` plus the name of the store where each transaction occurred.
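A small sketch of the inner-join pattern on invented data, including a store (S3) that is dropped because it has no matching receipts:

```python
import pandas as pd

receipts = pd.DataFrame({
    'store_cd': ['S1', 'S2', 'S1'],
    'amount':   [100,  200,  300],
})
stores = pd.DataFrame({
    'store_cd':   ['S1', 'S2', 'S3'],
    'store_name': ['Alpha', 'Beta', 'Gamma'],
})

# Inner join: every receipt row gains its store_name; S3 has no
# receipts and therefore does not appear in the result.
merged = pd.merge(receipts, stores[['store_cd', 'store_name']],
                  how='inner', on='store_cd')
print(merged)
```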
P037: Join the product data (df_product) and the category data (df_category) with an inner join, and display 10 rows containing all columns of the product data plus the subcategory name (category_small_name) from the category data.
pd.merge(df_product
, df_category[['category_small_cd','category_small_name']]
, how='inner', on='category_small_cd').head(10)
   product_cd category_major_cd category_medium_cd category_small_cd  unit_price  unit_cost category_small_name
0  P040101001                04               0401            040101       198.0      149.0              弁当類
1  P040101002                04               0401            040101       218.0      164.0              弁当類
2  P040101003                04               0401            040101       230.0      173.0              弁当類
3  P040101004                04               0401            040101       248.0      186.0              弁当類
4  P040101005                04               0401            040101       268.0      201.0              弁当類
5  P040101006                04               0401            040101       298.0      224.0              弁当類
6  P040101007                04               0401            040101       338.0      254.0              弁当類
7  P040101008                04               0401            040101       420.0      315.0              弁当類
8  P040101009                04               0401            040101       498.0      374.0              弁当類
9  P040101010                04               0401            040101       580.0      435.0              弁当類
Commentary :
The code `pd.merge(df_product, df_category[['category_small_cd','category_small_name']], how='inner', on='category_small_cd').head(10)` does the following:
- `df_category[['category_small_cd','category_small_name']]`: selects a subset of the `df_category` DataFrame with two columns: `category_small_cd` and `category_small_name`.
- `pd.merge(..., how='inner', on='category_small_cd')`: performs an inner join between the `df_product` DataFrame and that two-column DataFrame on the `category_small_cd` column, keeping only rows with matching values in both DataFrames.
- `.head(10)`: returns the first 10 rows of the resulting merged DataFrame.
In summary, this code merges the `df_product` DataFrame with the `df_category` DataFrame on the `category_small_cd` column, adding the `category_small_name` column from `df_category` to the result. The resulting DataFrame contains all columns from `df_product` plus the subcategory name of each product.
P038: Calculate the total sales amount for each customer from the customer data (df_customer) and receipt details data (df_receipt) and display 10 items. However, for customers with no sales records, the sales amount should be displayed as 0. Customers whose gender code (gender_cd) is female (1) should be included, and nonmembers (whose customer ID starts with “Z”) should be excluded.
df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()
df_tmp = df_customer. \
query('gender_cd == "1" and not customer_id.str.startswith("Z")',
engine='python')
pd.merge(df_tmp['customer_id'], df_amount_sum,
how='left', on='customer_id').fillna(0).head(10)
      customer_id  amount
0  CS021313000114     0.0
1  CS031415000172  5088.0
2  CS028811000001     0.0
3  CS001215000145   875.0
4  CS015414000103  3122.0
5  CS033513000180   868.0
6  CS035614000014     0.0
7  CS011215000048  3444.0
8  CS009413000079     0.0
9  CS040412000191   210.0
Commentary :
The code does the following:
- `df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()`: creates a new DataFrame `df_amount_sum` by grouping the `df_receipt` DataFrame by `customer_id` and summing the `amount` column for each customer. The resulting DataFrame has two columns: `customer_id` and `amount`.
- `df_customer.query('gender_cd == "1" and not customer_id.str.startswith("Z")', engine='python')`: selects the subset of the `df_customer` DataFrame where `gender_cd` equals "1" (female) and `customer_id` does not start with "Z". The `engine='python'` argument is needed because string methods like `startswith` are not supported by the default query engine.
- `pd.merge(df_tmp['customer_id'], df_amount_sum, how='left', on='customer_id')`: merges the `customer_id` column from the `df_tmp` DataFrame with the `df_amount_sum` DataFrame on the `customer_id` column using a left join, keeping all rows from `df_tmp` and adding the `amount` column from `df_amount_sum` where available.
- `.fillna(0)`: fills any missing values in the `amount` column with 0 (customers with no sales records).
- `.head(10)`: returns the first 10 rows of the resulting merged DataFrame.
In summary, this code calculates the total amount spent by each customer in `df_receipt`, selects the female member customers from `df_customer`, and left-joins the two on `customer_id` so that every selected customer appears, with a sales total of 0 where no receipts exist. The resulting DataFrame contains the `customer_id` and `amount` columns for the selected female customers.
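The left-join-plus-`fillna(0)` idiom on invented data, with one customer (C002) who has no receipts:

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': ['C001', 'C002', 'C003']})
receipts = pd.DataFrame({
    'customer_id': ['C001', 'C001', 'C003'],
    'amount':      [100,    200,    50],
})

totals = receipts.groupby('customer_id').amount.sum().reset_index()

# Left join keeps every customer; C002 has no receipts, so its
# amount is NaN until fillna(0) replaces it.
result = pd.merge(customers['customer_id'], totals,
                  how='left', on='customer_id').fillna(0)
print(result)
```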
P039: From the receipt details data (df_receipt), create a data set of the top 20 customers by number of distinct sales days and a data set of the top 20 customers by total sales amount, then merge the two with a full outer join. However, exclude non-members (whose customer ID starts with "Z").
df_data = df_receipt \
.query('not customer_id.str.startswith("Z")', engine='python')
df_cnt = df_data[~df_data.duplicated(subset=['customer_id', 'sales_ymd'])] \
.groupby('customer_id').sales_ymd.count().reset_index() \
.sort_values('sales_ymd', ascending=False).head(20)
df_sum = df_data.groupby('customer_id').amount.sum().reset_index() \
.sort_values('amount', ascending=False).head(20)
pd.merge(df_cnt, df_sum, how='outer', on='customer_id')
       customer_id  sales_ymd   amount
0   CS040214000008       23.0      NaN
1   CS015415000185       22.0  20153.0
2   CS010214000010       22.0  18585.0
3   CS010214000002       21.0      NaN
4   CS028415000007       21.0  19127.0
5   CS017415000097       20.0  23086.0
6   CS016415000141       20.0  18372.0
7   CS031414000051       19.0  19202.0
8   CS014214000023       19.0      NaN
9   CS022515000226       19.0      NaN
10  CS021515000172       19.0      NaN
11  CS039414000052       19.0      NaN
12  CS021514000045       19.0      NaN
13  CS022515000028       18.0      NaN
14  CS030214000008       18.0      NaN
15  CS021515000056       18.0      NaN
16  CS014415000077       18.0      NaN
17  CS021515000211       18.0      NaN
18  CS032415000209       18.0      NaN
19  CS031414000073       18.0      NaN
20  CS001605000009        NaN  18925.0
21  CS006515000023        NaN  18372.0
22  CS011414000106        NaN  18338.0
23  CS038415000104        NaN  17847.0
24  CS035414000024        NaN  17615.0
25  CS021515000089        NaN  17580.0
26  CS032414000072        NaN  16563.0
27  CS016415000101        NaN  16348.0
28  CS011415000006        NaN  16094.0
29  CS034415000047        NaN  16083.0
30  CS007514000094        NaN  15735.0
31  CS009414000059        NaN  15492.0
32  CS030415000034        NaN  15468.0
33  CS015515000034        NaN  15300.0
Commentary :
This code performs the following steps:
- Filters the rows of the `df_receipt` DataFrame where the `customer_id` column does not start with "Z" using the `.query()` method, saving the result to `df_data`.
- Removes duplicate rows from `df_data` based on the `customer_id` and `sales_ymd` columns (so each customer-day counts once), groups by `customer_id`, counts the number of sales dates, sorts by that count in descending order, takes the top 20 rows, and saves the result to `df_cnt`.
- Groups `df_data` by `customer_id`, sums the `amount` column, sorts by `amount` in descending order, takes the top 20 rows, and saves the result to `df_sum`.
- Merges the `df_cnt` and `df_sum` DataFrames on the `customer_id` column using `pd.merge()` with `how='outer'`, so that records from both top-20 lists are included. In the result, `NaN` marks the side of the ranking in which a customer does not appear.
P040: You want to create data combining all shops and all products. Take the Cartesian product of the shop data (df_store) and the product data (df_product) and count the number of resulting rows.
df_store_tmp = df_store.copy()
df_product_tmp = df_product.copy()

df_store_tmp['key'] = 0
df_product_tmp['key'] = 0

len(pd.merge(df_store_tmp, df_product_tmp, how='outer', on='key'))
531590
Commentary :
This code computes the Cartesian product (cross join) of the two DataFrames `df_store` and `df_product`.
First, two temporary DataFrames `df_store_tmp` and `df_product_tmp` are created as copies of `df_store` and `df_product`, respectively. Then a new column called `key` is added to both with a constant value of 0. This is done because `pd.merge()` requires a common column to merge on; giving every row the same key makes each row of one DataFrame match every row of the other.
The two DataFrames are then merged with `pd.merge()`, with `how='outer'` indicating an outer join and `on='key'` specifying the common column. Since every row on both sides shares the key value 0, any join type (inner, left, right, or outer) would produce the same result here: the full Cartesian product, with one row per shop-product combination.
Lastly, `len()` returns the number of rows in the merged DataFrame, i.e. the number of shops multiplied by the number of products (531,590 here).