Reference: "Data Science 100 Knocks (Structured Data Processing)" (データサイエンス100本ノック 構造化データ加工編) by the Data Scientist Society Skill Definition Committee (データサイエンティスト協会スキル定義委員)
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society (データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
-
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers.
-
Since this data was created for a Japanese audience, you may run into some language issues while practicing, but do not worry: they will not affect the exercises much.
Getting Started
- First, run the following cells
- Import the necessary libraries and read data from the database (PostgreSQL)
- The following cell imports libraries that are expected to be used, such as pandas.
- If there are other libraries you wish to use, install them as needed (you can also install them by “!pip install library name”).
- The processing may be split across multiple cells.
- Names, addresses, etc. are dummy data and are not real.
import os
import pandas as pd
import numpy as np
from datetime import datetime, date
from dateutil.relativedelta import relativedelta
import math
import psycopg2
from sqlalchemy import create_engine
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from imblearn.under_sampling import RandomUnderSampler
if 'PG_PORT' in os.environ:
    pgconfig = {
        'host': 'db',
        'port': os.environ['PG_PORT'],
        'database': os.environ['PG_DATABASE'],
        'user': os.environ['PG_USER'],
        'password': os.environ['PG_PASSWORD'],
    }
    # Connector for pd.read_sql
    conn = psycopg2.connect(**pgconfig)
    df_customer = pd.read_sql(sql='select * from customer', con=conn)
    df_category = pd.read_sql(sql='select * from category', con=conn)
    df_product = pd.read_sql(sql='select * from product', con=conn)
    df_receipt = pd.read_sql(sql='select * from receipt', con=conn)
    df_store = pd.read_sql(sql='select * from store', con=conn)
    df_geocode = pd.read_sql(sql='select * from geocode', con=conn)
else:
    if not os.path.exists('data/'):
        !git clone https://github.com/The-Japan-DataScientist-Society/100knocks-preprocess
        os.chdir('100knocks-preprocess/docker/work/')
    # Read code and text columns as strings (keeps leading zeros in the code columns)
    dtype = {
        'customer_id': str,
        'gender_cd': str,
        'postal_cd': str,
        'application_store_cd': str,
        'status_cd': str,
        'category_major_cd': str,
        'category_medium_cd': str,
        'category_small_cd': str,
        'product_cd': str,
        'store_cd': str,
        'prefecture_cd': str,
        'tel_no': str,
        'street': str
    }
    df_customer = pd.read_csv("data/customer.csv", dtype=dtype)
    df_category = pd.read_csv("data/category.csv", dtype=dtype)
    df_product = pd.read_csv("data/product.csv", dtype=dtype)
    df_receipt = pd.read_csv("data/receipt.csv", dtype=dtype)
    df_store = pd.read_csv("data/store.csv", dtype=dtype)
    df_geocode = pd.read_csv("data/geocode.csv", dtype=dtype)
Exercise
P-021: Count the number of records in the receipt details data (df_receipt).
len(df_receipt)
104681
Commentary :
The code `len(df_receipt)` returns the length of the DataFrame `df_receipt`.
Here's an explanation of each component:
- `len()`: a built-in Python function that returns the length of an object, such as a list or a string.
- `df_receipt`: a variable that holds a pandas DataFrame. The name `df_receipt` could be any valid variable name.
- `len(df_receipt)`: passing the DataFrame to `len()` asks Python for the number of rows in `df_receipt`.
In summary, the code simply returns the number of rows in a DataFrame.
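As a side note (not part of the original exercise), the same count can be obtained in other ways; a minimal sketch, assuming `df_receipt` has been loaded as above:
# Alternative row counts, equivalent to len(df_receipt)
print(df_receipt.shape[0])    # number of rows from the (rows, columns) shape tuple
print(len(df_receipt.index))  # length of the row index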
P-022: Count the number of unique customer IDs (customer_id) in the receipt details data (df_receipt).
len(df_receipt['customer_id'].unique())
8307
Commentary :
The code `len(df_receipt['customer_id'].unique())` returns the number of unique customer IDs in the `customer_id` column of the `df_receipt` DataFrame.
Here's an explanation of each component:
- `len()`: a built-in Python function that returns the length of an object, such as a list or a string.
- `df_receipt`: a variable that holds a pandas DataFrame.
- `['customer_id']`: an indexing operation that selects the column labelled `customer_id` from `df_receipt`. The result is a pandas Series.
- `.unique()`: a pandas Series method that returns a NumPy array of the unique values in the Series; here, an array of unique customer IDs.
- Combining them: passing that array to `len()` returns the number of unique customer IDs.
In summary, the code returns the number of unique customer IDs in a DataFrame.
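A more direct alternative (not in the original notebook) is `Series.nunique()`, which counts distinct values and ignores NaN by default; a minimal sketch:
# Count unique customer IDs directly; pass dropna=False to also count missing values
print(df_receipt['customer_id'].nunique())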
P-023: Sum the sales amount (amount) and sales quantity (quantity) for each shop code (store_cd) in the receipt details data (df_receipt).
# Code example 1
df_receipt.groupby('store_cd').agg({'amount':'sum',
'quantity':'sum'}).reset_index()
 | store_cd | amount | quantity |
---|---|---|---|
0 | S12007 | 638761 | 2099 |
1 | S12013 | 787513 | 2425 |
2 | S12014 | 725167 | 2358 |
3 | S12029 | 794741 | 2555 |
4 | S12030 | 684402 | 2403 |
5 | S13001 | 811936 | 2347 |
6 | S13002 | 727821 | 2340 |
7 | S13003 | 764294 | 2197 |
8 | S13004 | 779373 | 2390 |
9 | S13005 | 629876 | 2004 |
10 | S13008 | 809288 | 2491 |
11 | S13009 | 808870 | 2486 |
12 | S13015 | 780873 | 2248 |
13 | S13016 | 793773 | 2432 |
14 | S13017 | 748221 | 2376 |
15 | S13018 | 790535 | 2562 |
16 | S13019 | 827833 | 2541 |
17 | S13020 | 796383 | 2383 |
18 | S13031 | 705968 | 2336 |
19 | S13032 | 790501 | 2491 |
20 | S13035 | 715869 | 2219 |
21 | S13037 | 693087 | 2344 |
22 | S13038 | 708884 | 2337 |
23 | S13039 | 611888 | 1981 |
24 | S13041 | 728266 | 2233 |
25 | S13043 | 587895 | 1881 |
26 | S13044 | 520764 | 1729 |
27 | S13051 | 107452 | 354 |
28 | S13052 | 100314 | 250 |
29 | S14006 | 712839 | 2284 |
30 | S14010 | 790361 | 2290 |
31 | S14011 | 805724 | 2434 |
32 | S14012 | 720600 | 2412 |
33 | S14021 | 699511 | 2231 |
34 | S14022 | 651328 | 2047 |
35 | S14023 | 727630 | 2258 |
36 | S14024 | 736323 | 2417 |
37 | S14025 | 755581 | 2394 |
38 | S14026 | 824537 | 2503 |
39 | S14027 | 714550 | 2303 |
40 | S14028 | 786145 | 2458 |
41 | S14033 | 725318 | 2282 |
42 | S14034 | 653681 | 2024 |
43 | S14036 | 203694 | 635 |
44 | S14040 | 701858 | 2233 |
45 | S14042 | 534689 | 1935 |
46 | S14045 | 458484 | 1398 |
47 | S14046 | 412646 | 1354 |
48 | S14047 | 338329 | 1041 |
49 | S14048 | 234276 | 769 |
50 | S14049 | 230808 | 788 |
51 | S14050 | 167090 | 580 |
Commentary :
The code `df_receipt.groupby('store_cd').agg({'amount':'sum', 'quantity':'sum'}).reset_index()` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, applies the `sum` aggregation to the `amount` and `quantity` columns of each group, and finally resets the index of the resulting DataFrame.
Here's an explanation of each component:
- `df_receipt`: a variable that holds a pandas DataFrame.
- `.groupby('store_cd')`: a pandas DataFrame method that groups the rows by the values in the `store_cd` column, creating a `GroupBy` object to which functions can be applied per group.
- `.agg({'amount':'sum', 'quantity':'sum'})`: a GroupBy method that applies the `sum()` aggregation to the `amount` and `quantity` columns of each group. The result is a DataFrame with two columns, `amount` and `quantity`, containing the sums for each group.
- `.reset_index()`: a pandas DataFrame method that resets the index of the resulting DataFrame to a default index, turning `store_cd` back into a regular column.
In summary, the code groups a DataFrame by the values in one column, computes the sum of two other columns for each group, and returns a DataFrame with those sums and a default index.
# Code example 2
df_receipt.groupby('store_cd')[['amount','quantity']].agg('sum').reset_index()
 | store_cd | amount | quantity |
---|---|---|---|
0 | S12007 | 638761 | 2099 |
1 | S12013 | 787513 | 2425 |
2 | S12014 | 725167 | 2358 |
3 | S12029 | 794741 | 2555 |
4 | S12030 | 684402 | 2403 |
5 | S13001 | 811936 | 2347 |
6 | S13002 | 727821 | 2340 |
7 | S13003 | 764294 | 2197 |
8 | S13004 | 779373 | 2390 |
9 | S13005 | 629876 | 2004 |
10 | S13008 | 809288 | 2491 |
11 | S13009 | 808870 | 2486 |
12 | S13015 | 780873 | 2248 |
13 | S13016 | 793773 | 2432 |
14 | S13017 | 748221 | 2376 |
15 | S13018 | 790535 | 2562 |
16 | S13019 | 827833 | 2541 |
17 | S13020 | 796383 | 2383 |
18 | S13031 | 705968 | 2336 |
19 | S13032 | 790501 | 2491 |
20 | S13035 | 715869 | 2219 |
21 | S13037 | 693087 | 2344 |
22 | S13038 | 708884 | 2337 |
23 | S13039 | 611888 | 1981 |
24 | S13041 | 728266 | 2233 |
25 | S13043 | 587895 | 1881 |
26 | S13044 | 520764 | 1729 |
27 | S13051 | 107452 | 354 |
28 | S13052 | 100314 | 250 |
29 | S14006 | 712839 | 2284 |
30 | S14010 | 790361 | 2290 |
31 | S14011 | 805724 | 2434 |
32 | S14012 | 720600 | 2412 |
33 | S14021 | 699511 | 2231 |
34 | S14022 | 651328 | 2047 |
35 | S14023 | 727630 | 2258 |
36 | S14024 | 736323 | 2417 |
37 | S14025 | 755581 | 2394 |
38 | S14026 | 824537 | 2503 |
39 | S14027 | 714550 | 2303 |
40 | S14028 | 786145 | 2458 |
41 | S14033 | 725318 | 2282 |
42 | S14034 | 653681 | 2024 |
43 | S14036 | 203694 | 635 |
44 | S14040 | 701858 | 2233 |
45 | S14042 | 534689 | 1935 |
46 | S14045 | 458484 | 1398 |
47 | S14046 | 412646 | 1354 |
48 | S14047 | 338329 | 1041 |
49 | S14048 | 234276 | 769 |
50 | S14049 | 230808 | 788 |
51 | S14050 | 167090 | 580 |
Commentary :
This code performs a groupby operation on the `df_receipt` DataFrame using the `store_cd` column, and then aggregates the `amount` and `quantity` columns by summing them.
The resulting DataFrame has one row for each unique value in the `store_cd` column and two columns, `amount` and `quantity`, each containing the sum of the respective column across all rows with the same `store_cd` value.
The `.reset_index()` method is then called on the result to convert `store_cd` from the index back into a regular column.
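As an aside, pandas 0.25 and later also support named aggregation, which lets you name the output columns explicitly; a hedged sketch of the same aggregation (assuming a sufficiently recent pandas):
# Named aggregation: the output column name is given on the left of each tuple
df_receipt.groupby('store_cd', as_index=False).agg(
    amount=('amount', 'sum'),
    quantity=('quantity', 'sum')
)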
P-024: For the receipt details data (df_receipt), find the most recent sales date (sales_ymd) for each customer ID (customer_id) and display 10 records.
df_receipt.groupby('customer_id').agg({'sales_ymd': 'max'}).reset_index().head(10)
 | customer_id | sales_ymd |
---|---|---|
0 | CS001113000004 | 20190308 |
1 | CS001114000005 | 20190731 |
2 | CS001115000010 | 20190405 |
3 | CS001205000004 | 20190625 |
4 | CS001205000006 | 20190224 |
5 | CS001211000025 | 20190322 |
6 | CS001212000027 | 20170127 |
7 | CS001212000031 | 20180906 |
8 | CS001212000046 | 20170811 |
9 | CS001212000070 | 20191018 |
Commentary :
The code performs a groupby operation on the DataFrame `df_receipt` based on the `customer_id` column. It then aggregates the values of the `sales_ymd` column for each group using the `max` function to find the latest purchase date for each customer.
The result of the aggregation is a new DataFrame with two columns, `customer_id` and `sales_ymd`. The `reset_index()` function resets the index of the resulting DataFrame to the default, and `head(10)` displays its first 10 rows.
So the output of this code is a DataFrame showing the latest purchase date (`sales_ymd`) for each `customer_id`, with only the first 10 rows displayed.
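A slightly shorter, equivalent form (shown here as an optional variation, not in the original notebook) selects the column first and applies the `max()` reducer directly:
# Equivalent: select sales_ymd, take the per-customer maximum, then tidy up the index
df_receipt.groupby('customer_id').sales_ymd.max().reset_index().head(10)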
P-025: For the receipt details data (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 records.
# The same approach as P-024 would also work, but this example deliberately uses a different style of writing.
df_receipt.groupby('customer_id').sales_ymd.min().reset_index().head(10)
 | customer_id | sales_ymd |
---|---|---|
0 | CS001113000004 | 20190308 |
1 | CS001114000005 | 20180503 |
2 | CS001115000010 | 20171228 |
3 | CS001205000004 | 20170914 |
4 | CS001205000006 | 20180207 |
5 | CS001211000025 | 20190322 |
6 | CS001212000027 | 20170127 |
7 | CS001212000031 | 20180906 |
8 | CS001212000046 | 20170811 |
9 | CS001212000070 | 20191018 |
Commentary :
The code `df_receipt.groupby('customer_id').sales_ymd.min().reset_index().head(10)` groups the rows of the `df_receipt` DataFrame by the `customer_id` column, takes the minimum of the `sales_ymd` column for each group, resets the index of the resulting DataFrame, and returns the first 10 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that holds a pandas DataFrame.
- `.groupby('customer_id')`: a pandas DataFrame method that groups the rows by the values in the `customer_id` column, creating a `GroupBy` object to which functions can be applied per group.
- `.sales_ymd.min()`: selects the `sales_ymd` column of each group and applies the `min()` aggregation, producing a Series containing the smallest (oldest) `sales_ymd` value for each customer.
- `.reset_index()`: converts that Series back into a DataFrame with a default index and `customer_id` as a regular column.
- `.head(10)`: returns the first 10 rows of the DataFrame.
In summary, the code groups a DataFrame by one column, finds the minimum value of another column for each group, and returns the first 10 rows. This is useful for identifying the oldest purchase date for each customer.
P-026: For the receipt details data (df_receipt), find the newest sales date (sales_ymd) and the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 records where the two differ.
df_tmp = df_receipt.groupby('customer_id'). \
agg({'sales_ymd':['max','min']}).reset_index()
# Flatten the MultiIndex columns into a single level, joining the levels with "_".
# You can also use df_tmp.columns = ['customer_id', 'sales_ymd_max', 'sales_ymd_min'].
df_tmp.columns = ["_".join(pair) for pair in df_tmp.columns]
df_tmp.query('sales_ymd_max != sales_ymd_min').head(10)
 | customer_id_ | sales_ymd_max | sales_ymd_min |
---|---|---|---|
1 | CS001114000005 | 20190731 | 20180503 |
2 | CS001115000010 | 20190405 | 20171228 |
3 | CS001205000004 | 20190625 | 20170914 |
4 | CS001205000006 | 20190224 | 20180207 |
13 | CS001214000009 | 20190902 | 20170306 |
14 | CS001214000017 | 20191006 | 20180828 |
16 | CS001214000048 | 20190929 | 20171109 |
17 | CS001214000052 | 20190617 | 20180208 |
20 | CS001215000005 | 20181021 | 20170206 |
21 | CS001215000040 | 20171022 | 20170214 |
Commentary :
The code groups the rows of the `df_receipt` DataFrame by the `customer_id` column, applies two aggregation functions (`max()` and `min()`) to the `sales_ymd` column of each group to get the latest and earliest purchase dates for each customer, flattens the resulting column names, and then uses the `query()` method to keep only the rows where the maximum and minimum purchase dates differ, returning the first 10 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that holds a pandas DataFrame.
- `.groupby('customer_id')`: groups the rows by the values in the `customer_id` column, creating a `GroupBy` object.
- `.agg({'sales_ymd':['max','min']})`: applies both `max()` and `min()` to the `sales_ymd` column of each group. The result has a two-level column index, which the list comprehension joins with "_" into `sales_ymd_max` and `sales_ymd_min` (note that `customer_id` becomes `customer_id_` because its second level is empty).
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.query('sales_ymd_max != sales_ymd_min')`: keeps only the rows where the value in `sales_ymd_max` is not equal to the value in `sales_ymd_min`, using a boolean expression passed as a string.
- `.head(10)`: returns the first 10 rows of the DataFrame.
In summary, the code finds the latest and earliest purchase dates per customer and returns the first 10 customers whose two dates differ. This is useful for identifying customers who have made purchases on more than one date.
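As an optional variation on the column flattening above, `to_flat_index()` can be combined with a join that also strips the trailing underscore left on `customer_id_`; a minimal sketch:
# Flatten the MultiIndex columns and drop the trailing "_" on single-level names
df_tmp = df_receipt.groupby('customer_id').agg({'sales_ymd': ['max', 'min']}).reset_index()
df_tmp.columns = ['_'.join(col).rstrip('_') for col in df_tmp.columns.to_flat_index()]
df_tmp.query('sales_ymd_max != sales_ymd_min').head(10)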
P-027: For the receipt details data (df_receipt), calculate the average of the sales amount (amount) for each shop code (store_cd) and display the top 5 in descending order.
df_receipt.groupby('store_cd').agg({'amount':'mean'}).reset_index(). \
sort_values('amount', ascending=False).head(5)
 | store_cd | amount |
---|---|---|
28 | S13052 | 402.867470 |
12 | S13015 | 351.111960 |
7 | S13003 | 350.915519 |
30 | S14010 | 348.791262 |
5 | S13001 | 348.470386 |
Commentary :
The code `df_receipt.groupby('store_cd').agg({'amount':'mean'}).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, calculates the mean of the `amount` column for each group, resets the index, sorts the result by the mean `amount` in descending order, and returns the first 5 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that holds a pandas DataFrame.
- `.groupby('store_cd')`: groups the rows by the values in the `store_cd` column, creating a `GroupBy` object.
- `.agg({'amount':'mean'})`: applies the `mean()` aggregation to the `amount` column of each group, producing a DataFrame with one column, `amount`, containing the mean for each group.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.sort_values('amount', ascending=False)`: sorts the rows by the values in the `amount` column in descending order (`ascending=False`).
- `.head(5)`: returns the first 5 rows of the DataFrame.
In summary, the code computes the average sales amount per store, sorts the stores by that average in descending order, and returns the top 5. This is useful for identifying the stores with the highest average sales.
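For reference (an optional variation), `Series.nlargest()` can sort and slice in one step:
# nlargest(5) keeps the five largest mean amounts, already sorted in descending order
df_receipt.groupby('store_cd').amount.mean().nlargest(5).reset_index()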
P-028: Calculate the median sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display the TOP 5 in descending order.
df_receipt.groupby('store_cd').agg({'amount':'median'}).reset_index(). \
sort_values('amount', ascending=False).head(5)
 | store_cd | amount |
---|---|---|
28 | S13052 | 190.0 |
30 | S14010 | 188.0 |
51 | S14050 | 185.0 |
44 | S14040 | 180.0 |
7 | S13003 | 180.0 |
Commentary :
The code `df_receipt.groupby('store_cd').agg({'amount':'median'}).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, calculates the median of the `amount` column for each group, resets the index, sorts the result by the median `amount` in descending order, and returns the first 5 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that holds a pandas DataFrame.
- `.groupby('store_cd')`: groups the rows by the values in the `store_cd` column, creating a `GroupBy` object.
- `.agg({'amount':'median'})`: applies the `median()` aggregation to the `amount` column of each group, producing a DataFrame with one column, `amount`, containing the median for each group.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.sort_values('amount', ascending=False)`: sorts the rows by the values in the `amount` column in descending order.
- `.head(5)`: returns the first 5 rows of the DataFrame.
In summary, the code computes the median sales amount per store, sorts the stores by that median in descending order, and returns the top 5. This is useful for identifying the stores with the highest median sales.
P-029: For the receipt details data (df_receipt), calculate the mode of the product code (product_cd) for each shop code (store_cd) and display 10 records.
df_receipt.groupby('store_cd').product_cd. \
apply(lambda x: x.mode()).reset_index().head(10)
 | store_cd | level_1 | product_cd |
---|---|---|---|
0 | S12007 | 0 | P060303001 |
1 | S12013 | 0 | P060303001 |
2 | S12014 | 0 | P060303001 |
3 | S12029 | 0 | P060303001 |
4 | S12030 | 0 | P060303001 |
5 | S13001 | 0 | P060303001 |
6 | S13002 | 0 | P060303001 |
7 | S13003 | 0 | P071401001 |
8 | S13004 | 0 | P060303001 |
9 | S13005 | 0 | P040503001 |
Commentary :
The code `df_receipt.groupby('store_cd').product_cd.apply(lambda x: x.mode()).reset_index().head(10)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, finds the mode of the `product_cd` column for each group, resets the index, and returns the first 10 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that holds a pandas DataFrame.
- `.groupby('store_cd')`: groups the rows by the values in the `store_cd` column, creating a `GroupBy` object.
- `.product_cd`: attribute access that selects the `product_cd` column from the grouped object.
- `.apply(lambda x: x.mode())`: applies the `mode()` function to each group. `mode()` returns the most frequently occurring value(s) in each group; if several values tie for the highest frequency, all of them are returned, which is why the output has an extra `level_1` column numbering the mode values within each store.
- `.reset_index()`: resets the index of the resulting object to a default index.
In summary, the code finds the most frequently occurring product code for each store. This is useful for identifying the most popular product sold at each store.
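If a single value per store is preferred even when there is a tie, `value_counts().idxmax()` keeps only the first most frequent product code; a hedged alternative sketch:
# One product code per store: the (first) most frequent product_cd in each group
df_receipt.groupby('store_cd')['product_cd'] \
    .apply(lambda x: x.value_counts().idxmax()).reset_index().head(10)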
P-030: For the receipt details data (df_receipt), calculate the variance of the sales amount (amount) for each shop code (store_cd) and display 5 records in descending order.
df_receipt.groupby('store_cd').amount.var(ddof=0).reset_index(). \
sort_values('amount', ascending=False).head(5)
 | store_cd | amount |
---|---|---|
28 | S13052 | 440088.701311 |
31 | S14011 | 306314.558164 |
42 | S14034 | 296920.081011 |
5 | S13001 | 295431.993329 |
12 | S13015 | 295294.361116 |
Commentary :
The code `df_receipt.groupby('store_cd').amount.var(ddof=0).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, calculates the variance of the `amount` column for each group with `ddof=0`, resets the index, sorts the result by the variance in descending order, and returns the first 5 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that holds a pandas DataFrame.
- `.groupby('store_cd')`: groups the rows by the values in the `store_cd` column, creating a `GroupBy` object.
- `.amount.var(ddof=0)`: calculates the variance of the `amount` column for each group with `ddof=0`. The `ddof` parameter is the delta degrees of freedom, which determines the divisor used in the calculation; `ddof=0` computes the population variance rather than the sample variance. The result is a Series of variance values indexed by `store_cd`.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.sort_values('amount', ascending=False)`: sorts the rows by the values in the `amount` column in descending order.
- `.head(5)`: returns the first 5 rows of the DataFrame.
In summary, the code computes the population variance of the sales amount per store, sorts the stores by variance in descending order, and returns the top 5. This is useful for identifying the stores with the highest variability in sales.
P-031: For the receipt details data (df_receipt), calculate the standard deviation of the sales amount (amount) for each shop code (store_cd) and display 5 records in descending order.
TIPS:
Note that the default values for ddof are different in Pandas and Numpy
Pandas:
DataFrame.std(self, axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
Numpy:
numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>)
df_receipt.groupby('store_cd').amount.std(ddof=0).reset_index(). \
sort_values('amount', ascending=False).head(5)
 | store_cd | amount |
---|---|---|
28 | S13052 | 663.391816 |
31 | S14011 | 553.456916 |
42 | S14034 | 544.903736 |
5 | S13001 | 543.536561 |
12 | S13015 | 543.409938 |
Commentary :
The code `df_receipt.groupby('store_cd').amount.std(ddof=0).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, calculates the standard deviation of the `amount` column for each group with `ddof=0`, resets the index, sorts the result by the standard deviation in descending order, and returns the first 5 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that holds a pandas DataFrame.
- `.groupby('store_cd')`: groups the rows by the values in the `store_cd` column, creating a `GroupBy` object.
- `.amount.std(ddof=0)`: calculates the standard deviation of the `amount` column for each group with `ddof=0`. The `ddof` parameter is the delta degrees of freedom, which determines the divisor used in the calculation; `ddof=0` computes the population standard deviation rather than the sample standard deviation. The result is a Series of standard deviation values indexed by `store_cd`.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.sort_values('amount', ascending=False)`: sorts the rows by the values in the `amount` column in descending order.
- `.head(5)`: returns the first 5 rows of the DataFrame.
In summary, the code computes the population standard deviation of the sales amount per store, sorts the stores by that value in descending order, and returns the top 5. This is useful for identifying the stores with the highest variability in sales.
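To make the TIPS above concrete, here is a tiny illustration on a toy Series (not part of the exercise data):
# Pandas defaults to the sample statistic (ddof=1), NumPy to the population one (ddof=0)
s = pd.Series([1, 2, 3, 4])
print(s.std())                 # ddof=1 -> about 1.2910
print(np.std(s.to_numpy()))    # ddof=0 -> about 1.1180
print(s.std(ddof=0))           # matches the NumPy result once ddof is aligned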
P-032: Find the percentile values for the sales amount (amount) in the receipt details data (df_receipt) in 25% increments.
# Code Example 1
np.percentile(df_receipt['amount'], q=np.arange(1, 5) * 25)
array([ 102., 170., 288., 10925.])
Commentary :
The code `np.percentile(df_receipt['amount'], q=np.arange(1, 5) * 25)` computes the quartiles of the values in the `amount` column of the `df_receipt` DataFrame using the `np.percentile()` function from the NumPy library.
Here's an explanation of each component:
- `np.percentile()`: a NumPy function that computes percentiles of an array. The first argument is the array, and the `q` parameter specifies which percentiles to compute.
- `df_receipt['amount']`: a pandas Series representing the `amount` column of the `df_receipt` DataFrame.
- `q=np.arange(1, 5) * 25`: `np.arange(1, 5)` produces `[1, 2, 3, 4]`, so multiplying by 25 gives `[25, 50, 75, 100]`, i.e. the 25th, 50th, 75th and 100th percentiles.
The code therefore returns the values that split the distribution of `amount` into four equal parts: the first quartile (Q1, 25th percentile), the median (Q2, 50th percentile), the third quartile (Q3, 75th percentile), and the maximum (100th percentile). The output is an array of these quartile values, which summarizes the distribution of the `amount` column.
# Code Example 2
df_receipt.amount.quantile(q=np.arange(1, 5) / 4)
0.25      102.0
0.50      170.0
0.75      288.0
1.00    10925.0
Name: amount, dtype: float64
Commentary :
This code calculates the quartiles of the `amount` column in the `df_receipt` DataFrame using the pandas `quantile()` method. The `q` parameter specifies the quantiles to compute: 1/4, 2/4 (the median), 3/4 and 4/4 (the maximum).
`quantile()` is called on the `amount` column of `df_receipt` with `q` set to the array `[0.25, 0.5, 0.75, 1.0]`, which corresponds to quartiles 1 through 4.
The result is a Series of four values, one for each quartile.
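As a quick optional cross-check (not part of the exercise), `describe()` also reports the 25%, 50% and 75% points along with other summary statistics:
# describe() includes count, mean, std, min, the quartiles and max for the amount column
df_receipt['amount'].describe()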
P-033: Calculate the average of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and extract those that are 330 or more.
df_receipt.groupby('store_cd').amount.mean(). \
reset_index().query('amount >= 330')
 | store_cd | amount |
---|---|---|
1 | S12013 | 330.194130 |
5 | S13001 | 348.470386 |
7 | S13003 | 350.915519 |
8 | S13004 | 330.943949 |
12 | S13015 | 351.111960 |
16 | S13019 | 330.208616 |
17 | S13020 | 337.879932 |
28 | S13052 | 402.867470 |
30 | S14010 | 348.791262 |
31 | S14011 | 335.718333 |
38 | S14026 | 332.340588 |
46 | S14045 | 330.082073 |
48 | S14047 | 330.077073 |
Commentary :
The code `df_receipt.groupby('store_cd').amount.mean().reset_index().query('amount >= 330')` groups the `df_receipt` DataFrame by the `store_cd` column, calculates the mean of the `amount` column for each group, resets the index, and then filters the result to keep only the rows where the mean `amount` is greater than or equal to 330.
Here's an explanation of each component:
- `df_receipt.groupby('store_cd')`: groups the `df_receipt` DataFrame by the values in the `store_cd` column.
- `.amount.mean()`: calculates the mean of the `amount` column for each group.
- `.reset_index()`: resets the index of the resulting DataFrame.
- `.query('amount >= 330')`: keeps only the rows where the mean `amount` is greater than or equal to 330.
In summary, this code selects only the stores whose average transaction amount is at least 330 yen.
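An equivalent way (shown as an optional variation) is to keep the grouped result as a Series and filter it with a boolean mask before resetting the index:
# Boolean-mask version of the same filter on the per-store mean
s_mean = df_receipt.groupby('store_cd').amount.mean()
s_mean[s_mean >= 330].reset_index()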
P-034: For the receipt details data (df_receipt), sum the amount of sales (amount) for each customer ID (customer_id) and find the average for all customers. However, exclude customer IDs starting with “Z” as they represent non-members.
# Code example 1: Writing without query
df_receipt[~df_receipt['customer_id'].str.startswith("Z")]. \
groupby('customer_id').amount.sum().mean()
2547.742234529256
Commentary :
The code `df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()` does the following:
- `df_receipt[~df_receipt['customer_id'].str.startswith("Z")]`: filters the `df_receipt` DataFrame to exclude rows where the `customer_id` column starts with the letter "Z". The tilde (~) is a negation operator that inverts the boolean values returned by `str.startswith()`.
- `.groupby('customer_id').amount.sum()`: groups the filtered DataFrame by the `customer_id` column and calculates the sum of the `amount` column for each group, giving the total amount spent by each remaining customer.
- `.mean()`: calculates the mean of that resulting Series, i.e. the average total spend per customer.
In summary, this code computes the average total amount spent per customer for customers whose `customer_id` does not start with "Z" in the `df_receipt` DataFrame. This is useful for understanding the purchasing behaviour of member customers.
# Code example 2: Writing with query
df_receipt.query('not customer_id.str.startswith("Z")',
engine='python').groupby('customer_id').amount.sum().mean()
2547.742234529256
Commentary :
This code computes the average total amount spent per member customer. Here's how it works:
- `df_receipt` is a DataFrame containing information about transactions.
- The `query()` method filters out transactions whose `customer_id` starts with the letter "Z", using `str.startswith()`, which checks whether a string starts with a given character or substring.
- The resulting DataFrame is grouped by `customer_id` using the `groupby()` method.
- The `amount` column is selected and summed with `sum()` to compute the total amount spent by each customer.
- `mean()` is then applied to those totals to compute their average, which is the final result of the code.
P-035: For the receipt details data (df_receipt), sum the sales amount (amount) for each customer ID (customer_id), obtain the average across all customers, then extract the customers who spent at least that average and display 10 records. However, exclude customer IDs starting with “Z” as they represent non-members.
df_amount_sum = df_receipt[~df_receipt['customer_id'].str.startswith("Z")].\
groupby('customer_id').amount.sum()
amount_mean = df_amount_sum.mean()
df_amount_sum = df_amount_sum.reset_index()
df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)
 | customer_id | amount |
---|---|---|
2 | CS001115000010 | 3044 |
4 | CS001205000006 | 3337 |
13 | CS001214000009 | 4685 |
14 | CS001214000017 | 4132 |
17 | CS001214000052 | 5639 |
21 | CS001215000040 | 3496 |
30 | CS001304000006 | 3726 |
32 | CS001305000005 | 3485 |
33 | CS001305000011 | 4370 |
53 | CS001315000180 | 3300 |
Commentary :
The code does the following:
- `df_amount_sum = df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum()`: filters the `df_receipt` DataFrame to exclude rows where the `customer_id` column starts with "Z", groups the result by `customer_id`, and sums the `amount` column for each group. The result is a Series giving the total amount spent by each member customer.
- `amount_mean = df_amount_sum.mean()`: calculates the mean of that Series, i.e. the average total spend across all member customers.
- `df_amount_sum = df_amount_sum.reset_index()`: resets the index of the Series and converts it into a DataFrame with two columns, `customer_id` and `amount`.
- `df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)`: filters the DataFrame to keep only rows where `amount` is greater than or equal to `amount_mean`, and returns the first 10 rows.
In summary, this code selects customers whose total spending is at least the average spending across all customers whose `customer_id` does not start with "Z". The resulting DataFrame contains the `customer_id` and total amount spent for each selected customer (the rows keep their original order; they are not sorted by spending).
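The same result can be written more compactly (an optional variation) by filtering the Series before resetting the index:
# Filter the per-customer totals against their own mean, then tidy up the index
s = df_receipt[~df_receipt['customer_id'].str.startswith("Z")] \
        .groupby('customer_id').amount.sum()
s[s >= s.mean()].reset_index().head(10)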
P-036: Inner join the receipt details data (df_receipt) and the store data (df_store), and display all columns of the receipt details data together with the store name (store_name) from the store data, for 10 records.
pd.merge(df_receipt, df_store[['store_cd','store_name']],
how='inner', on='store_cd').head(10)
 | sales_ymd | sales_epoch | store_cd | receipt_no | receipt_sub_no | customer_id | product_cd | quantity | amount | store_name |
---|---|---|---|---|---|---|---|---|---|---|
0 | 20181103 | 1541203200 | S14006 | 112 | 1 | CS006214000001 | P070305012 | 1 | 158 | 葛が谷店 |
1 | 20181116 | 1542326400 | S14006 | 112 | 2 | ZZ000000000000 | P080401001 | 1 | 48 | 葛が谷店 |
2 | 20170118 | 1484697600 | S14006 | 1162 | 1 | CS006815000006 | P050406035 | 1 | 220 | 葛が谷店 |
3 | 20190524 | 1558656000 | S14006 | 1192 | 1 | CS006514000034 | P060104003 | 1 | 80 | 葛が谷店 |
4 | 20190419 | 1555632000 | S14006 | 112 | 2 | ZZ000000000000 | P060501002 | 1 | 148 | 葛が谷店 |
5 | 20181119 | 1542585600 | S14006 | 1152 | 2 | ZZ000000000000 | P050701001 | 1 | 88 | 葛が谷店 |
6 | 20171211 | 1512950400 | S14006 | 1132 | 2 | CS006515000175 | P090903001 | 1 | 80 | 葛が谷店 |
7 | 20191021 | 1571616000 | S14006 | 1112 | 2 | CS006415000221 | P040602001 | 1 | 405 | 葛が谷店 |
8 | 20170710 | 1499644800 | S14006 | 1132 | 2 | CS006411000036 | P090301051 | 1 | 330 | 葛が谷店 |
9 | 20190805 | 1564963200 | S14006 | 112 | 1 | CS006211000012 | P050104001 | 1 | 115 | 葛が谷店 |
Commentary :
The code `pd.merge(df_receipt, df_store[['store_cd','store_name']], how='inner', on='store_cd').head(10)` does the following:
- `df_store[['store_cd','store_name']]`: selects a subset of the `df_store` DataFrame with two columns, `store_cd` and `store_name`.
- `pd.merge(df_receipt, df_store[['store_cd','store_name']], how='inner', on='store_cd')`: performs an inner join between the `df_receipt` DataFrame and that two-column subset on the `store_cd` column, keeping only rows with matching values in both DataFrames.
- `.head(10)`: returns the first 10 rows of the merged DataFrame.
In summary, this code joins `df_receipt` with `df_store` on the `store_cd` column, adding the `store_name` column from `df_store` to the result. The resulting DataFrame contains all columns of `df_receipt` plus the name of the store where each transaction occurred.
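As an optional safety net (not in the original notebook), `pd.merge()` accepts a `validate` argument that raises an error if the join does not have the expected shape; a hedged sketch of the same join with a many-to-one check:
# validate='many_to_one' asserts that each store_cd appears at most once in df_store
pd.merge(df_receipt, df_store[['store_cd', 'store_name']],
         how='inner', on='store_cd', validate='many_to_one').head(10)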
P-037: Inner join the product data (df_product) and the category data (df_category), and display all columns of the product data together with the sub-category name (category_small_name) from the category data, for 10 records.
pd.merge(df_product
, df_category[['category_small_cd','category_small_name']]
, how='inner', on='category_small_cd').head(10)
 | product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_small_name |
---|---|---|---|---|---|---|---|
0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 弁当類 |
1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 弁当類 |
2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 弁当類 |
3 | P040101004 | 04 | 0401 | 040101 | 248.0 | 186.0 | 弁当類 |
4 | P040101005 | 04 | 0401 | 040101 | 268.0 | 201.0 | 弁当類 |
5 | P040101006 | 04 | 0401 | 040101 | 298.0 | 224.0 | 弁当類 |
6 | P040101007 | 04 | 0401 | 040101 | 338.0 | 254.0 | 弁当類 |
7 | P040101008 | 04 | 0401 | 040101 | 420.0 | 315.0 | 弁当類 |
8 | P040101009 | 04 | 0401 | 040101 | 498.0 | 374.0 | 弁当類 |
9 | P040101010 | 04 | 0401 | 040101 | 580.0 | 435.0 | 弁当類 |
Commentary :
The code `pd.merge(df_product, df_category[['category_small_cd','category_small_name']], how='inner', on='category_small_cd').head(10)` does the following:
- `df_category[['category_small_cd','category_small_name']]`: selects a subset of the `df_category` DataFrame with two columns, `category_small_cd` and `category_small_name`.
- `pd.merge(df_product, df_category[['category_small_cd','category_small_name']], how='inner', on='category_small_cd')`: performs an inner join between the `df_product` DataFrame and that two-column subset on the `category_small_cd` column, keeping only rows with matching values in both DataFrames.
- `.head(10)`: returns the first 10 rows of the merged DataFrame.
In summary, this code joins `df_product` with `df_category` on the `category_small_cd` column, adding the `category_small_name` column from `df_category` to the result. The resulting DataFrame contains all columns of `df_product` plus the sub-category name of each product.
P-038: From the customer data (df_customer) and the receipt details data (df_receipt), calculate the total sales amount for each customer and display 10 records. However, customers with no sales record should be shown with a sales amount of 0. Only customers whose gender code (gender_cd) is female (“1”) should be included, and non-members (customer IDs starting with “Z”) should be excluded.
df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()
df_tmp = df_customer. \
query('gender_cd == "1" and not customer_id.str.startswith("Z")',
engine='python')
pd.merge(df_tmp['customer_id'], df_amount_sum,
how='left', on='customer_id').fillna(0).head(10)
 | customer_id | amount |
---|---|---|
0 | CS021313000114 | 0.0 |
1 | CS031415000172 | 5088.0 |
2 | CS028811000001 | 0.0 |
3 | CS001215000145 | 875.0 |
4 | CS015414000103 | 3122.0 |
5 | CS033513000180 | 868.0 |
6 | CS035614000014 | 0.0 |
7 | CS011215000048 | 3444.0 |
8 | CS009413000079 | 0.0 |
9 | CS040412000191 | 210.0 |
Commentary :
The code does the following:
- `df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()`: creates a new DataFrame `df_amount_sum` by grouping `df_receipt` by `customer_id` and summing the `amount` column for each customer. The result has two columns, `customer_id` and `amount`.
- `df_customer.query('gender_cd == "1" and not customer_id.str.startswith("Z")', engine='python')`: selects the subset of `df_customer` where `gender_cd` equals "1" (female) and `customer_id` does not start with "Z". The `engine='python'` argument allows string methods such as `startswith` inside the query expression.
- `pd.merge(df_tmp['customer_id'], df_amount_sum, how='left', on='customer_id')`: left-joins the `customer_id` column of `df_tmp` with `df_amount_sum` on `customer_id`, keeping every selected customer and adding the `amount` column where a matching sales record exists.
- `.fillna(0)`: fills missing values in the `amount` column with 0, so customers with no sales records get a sales amount of 0.
- `.head(10)`: returns the first 10 rows of the merged DataFrame.
In summary, this code calculates the total amount spent by each customer in `df_receipt`, selects female member customers from `df_customer`, left-joins the two on `customer_id`, and fills missing totals with 0. The resulting DataFrame contains the `customer_id` and `amount` columns for the selected female customers.
P-039: From the receipt details data (df_receipt), create one data set containing the top 20 customers by number of distinct sales days and another containing the top 20 customers by total sales amount, and then combine the two with a full outer join. However, exclude non-members (customer IDs starting with “Z”).
df_data = df_receipt \
.query('not customer_id.str.startswith("Z")', engine='python')
df_cnt = df_data[~df_data.duplicated(subset=['customer_id', 'sales_ymd'])] \
.groupby('customer_id').sales_ymd.count().reset_index() \
.sort_values('sales_ymd', ascending=False).head(20)
df_sum = df_data.groupby('customer_id').amount.sum().reset_index() \
.sort_values('amount', ascending=False).head(20)
pd.merge(df_cnt, df_sum, how='outer', on='customer_id')
 | customer_id | sales_ymd | amount |
---|---|---|---|
0 | CS040214000008 | 23.0 | NaN |
1 | CS015415000185 | 22.0 | 20153.0 |
2 | CS010214000010 | 22.0 | 18585.0 |
3 | CS010214000002 | 21.0 | NaN |
4 | CS028415000007 | 21.0 | 19127.0 |
5 | CS017415000097 | 20.0 | 23086.0 |
6 | CS016415000141 | 20.0 | 18372.0 |
7 | CS031414000051 | 19.0 | 19202.0 |
8 | CS014214000023 | 19.0 | NaN |
9 | CS022515000226 | 19.0 | NaN |
10 | CS021515000172 | 19.0 | NaN |
11 | CS039414000052 | 19.0 | NaN |
12 | CS021514000045 | 19.0 | NaN |
13 | CS022515000028 | 18.0 | NaN |
14 | CS030214000008 | 18.0 | NaN |
15 | CS021515000056 | 18.0 | NaN |
16 | CS014415000077 | 18.0 | NaN |
17 | CS021515000211 | 18.0 | NaN |
18 | CS032415000209 | 18.0 | NaN |
19 | CS031414000073 | 18.0 | NaN |
20 | CS001605000009 | NaN | 18925.0 |
21 | CS006515000023 | NaN | 18372.0 |
22 | CS011414000106 | NaN | 18338.0 |
23 | CS038415000104 | NaN | 17847.0 |
24 | CS035414000024 | NaN | 17615.0 |
25 | CS021515000089 | NaN | 17580.0 |
26 | CS032414000072 | NaN | 16563.0 |
27 | CS016415000101 | NaN | 16348.0 |
28 | CS011415000006 | NaN | 16094.0 |
29 | CS034415000047 | NaN | 16083.0 |
30 | CS007514000094 | NaN | 15735.0 |
31 | CS009414000059 | NaN | 15492.0 |
32 | CS030415000034 | NaN | 15468.0 |
33 | CS015515000034 | NaN | 15300.0 |
Commentary :
This code performs the following steps:
- Filter the rows of the `df_receipt` DataFrame whose `customer_id` does not start with "Z" using the `.query()` method, and save the result to `df_data`.
- Drop duplicate rows of `df_data` based on the `customer_id` and `sales_ymd` columns, group by `customer_id`, count the number of distinct sales dates per customer, sort by that count (the `sales_ymd` column) in descending order, take the top 20 rows, and save the result to `df_cnt`.
- Group `df_data` by `customer_id`, sum the `amount` column, sort by `amount` in descending order, take the top 20 rows, and save the result to `df_sum`.
- Merge `df_cnt` and `df_sum` on the `customer_id` column with `pd.merge()` and `how='outer'`, so that records from both DataFrames are kept even when a customer appears in only one of them. The resulting DataFrame contains the top 20 customers by number of sales dates and the top 20 customers by total purchase amount.
P-040: You want to create data that combines every shop with every product. Take the Cartesian (direct) product of the shop data (df_store) and the product data (df_product) and count the number of resulting records.
df_store_tmp = df_store.copy()
df_product_tmp = df_product.copy()
df_store_tmp['key'] = 0
df_product_tmp['key'] = 0
len(pd.merge(df_store_tmp, df_product_tmp, how='outer', on='key'))
531590
Commentary :
This code combines the two DataFrames `df_store` and `df_product` as a Cartesian (cross) product.
First, two temporary DataFrames, `df_store_tmp` and `df_product_tmp`, are created as copies of `df_store` and `df_product`. A new column called `key` is added to both with a constant value of 0, because `pd.merge()` needs a common column to join on.
The two DataFrames are then merged with `pd.merge()`, using `how='outer'` and `on='key'`. Because every row in both DataFrames has the same `key` value, every row of `df_store_tmp` is paired with every row of `df_product_tmp`, producing the full Cartesian product; with this dummy key, an inner, left, right or outer join would all give the same result.
Finally, `len()` returns the number of rows in the merged DataFrame, which equals the number of stores multiplied by the number of products.