Reference: "Data Science 100 Knocks (Structured Data Processing)" by the Data Scientist Society Skill Definition Committee (データサイエンティスト協会スキル定義委員)
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
- 
Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
 - 
I updated this file to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers.
 - 
Since this data was created for a Japanese audience, you may run into Japanese text (names, addresses, category labels) while practicing, but do not worry: it will not affect the exercises much.
 
Getting Started
- First, run the following cells
 - Import the necessary libraries and read data from the database (PostgreSQL)
 - The following cell imports libraries that are expected to be used, such as pandas.
 - If there are other libraries you wish to use, install them as needed (you can also install them with `!pip install <library name>`).
 - You may split the processing across multiple cells.
 - Names, addresses, etc. are dummy data and are not real.
 
import os
import pandas as pd
import numpy as np
from datetime import datetime, date
from dateutil.relativedelta import relativedelta
import math
import psycopg2
from sqlalchemy import create_engine
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from imblearn.under_sampling import RandomUnderSampler
if 'PG_PORT' in os.environ:
    pgconfig = {
        'host': 'db',
        'port': os.environ['PG_PORT'],
        'database': os.environ['PG_DATABASE'],
        'user': os.environ['PG_USER'],
        'password': os.environ['PG_PASSWORD'],
    }
    # Connector for pd.read_sql
    conn = psycopg2.connect(**pgconfig)
    df_customer = pd.read_sql(sql='select * from customer', con=conn)
    df_category = pd.read_sql(sql='select * from category', con=conn)
    df_product = pd.read_sql(sql='select * from product', con=conn)
    df_receipt = pd.read_sql(sql='select * from receipt', con=conn)
    df_store = pd.read_sql(sql='select * from store', con=conn)
    df_geocode = pd.read_sql(sql='select * from geocode', con=conn)
else:
    if not os.path.exists('data/'):
        !git clone https://github.com/The-Japan-DataScientist-Society/100knocks-preprocess
        os.chdir('100knocks-preprocess/docker/work/')
    dtype = {
        'customer_id': str,
        'gender_cd': str,
        'postal_cd': str,
        'application_store_cd': str,
        'status_cd': str,
        'category_major_cd': str,
        'category_medium_cd': str,
        'category_small_cd': str,
        'product_cd': str,
        'store_cd': str,
        'prefecture_cd': str,
        'tel_no': str,
        'street': str
    }
    df_customer = pd.read_csv("data/customer.csv", dtype=dtype)
    df_category = pd.read_csv("data/category.csv", dtype=dtype)
    df_product = pd.read_csv("data/product.csv", dtype=dtype)
    df_receipt = pd.read_csv("data/receipt.csv", dtype=dtype)
    df_store = pd.read_csv("data/store.csv", dtype=dtype)
    df_geocode = pd.read_csv("data/geocode.csv", dtype=dtype)
Exercise
P-021: Count the number of cases against the receipt details data (df_receipt).
len(df_receipt)
104681
Commentary :
The code `len(df_receipt)` returns the length of the DataFrame `df_receipt`.
Here's an explanation of each component:
- `len()`: a built-in Python function that returns the length of an object, such as a list or a string.
- `df_receipt`: a variable that represents a DataFrame object in pandas. The name `df_receipt` could be any valid variable name.
Passing the DataFrame `df_receipt` as an argument to the `len()` function asks Python to return the number of rows in `df_receipt`.
In summary, the code is simply returning the number of rows in a DataFrame.
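As an aside, there are a few equivalent ways to count rows; the following sketch (assuming the same df_receipt loaded above) should return the same 104681 in each case.
# Hedged sketch: equivalent row counts for df_receipt
print(len(df_receipt))        # built-in len() on the DataFrame
print(df_receipt.shape[0])    # first element of the (rows, columns) shape tuple
print(len(df_receipt.index))  # length of the row index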
P-022: Count the number of unique cases against the customer ID (customer_id) of the receipt details data (df_receipt).
len(df_receipt['customer_id'].unique())
8307
Commentary :
The code `len(df_receipt['customer_id'].unique())` returns the number of unique customer IDs in the `customer_id` column of the `df_receipt` DataFrame.
Here's an explanation of each component:
- `len()`: a built-in Python function that returns the length of an object, such as a list or a string.
- `df_receipt`: a variable that represents a DataFrame object in pandas.
- `['customer_id']`: a DataFrame indexing operation that selects the column labelled `customer_id` from `df_receipt`. The resulting object is a pandas Series.
- `.unique()`: a pandas Series method that returns a NumPy array of the unique values in the Series; in this case, an array of unique customer IDs.
Passing the array of unique customer IDs to the `len()` function returns the number of unique customer IDs.
In summary, the code is returning the number of unique customer IDs in a DataFrame.
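pandas also offers Series.nunique(), which counts distinct values directly; a minimal sketch assuming the same df_receipt (note that nunique() excludes NaN by default, while unique() keeps it):
# Hedged alternative: count distinct customer IDs with nunique()
df_receipt['customer_id'].nunique()  # expected to match the 8307 above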
P-023: Sum the sales amount (amount) and sales quantity (quantity) for each shop code (store_cd) in the receipt details data (df_receipt).
# Code example 1
df_receipt.groupby('store_cd').agg({'amount':'sum', 
                                    'quantity':'sum'}).reset_index()
| | store_cd | amount | quantity |
|---|---|---|---|
| 0 | S12007 | 638761 | 2099 | 
| 1 | S12013 | 787513 | 2425 | 
| 2 | S12014 | 725167 | 2358 | 
| 3 | S12029 | 794741 | 2555 | 
| 4 | S12030 | 684402 | 2403 | 
| 5 | S13001 | 811936 | 2347 | 
| 6 | S13002 | 727821 | 2340 | 
| 7 | S13003 | 764294 | 2197 | 
| 8 | S13004 | 779373 | 2390 | 
| 9 | S13005 | 629876 | 2004 | 
| 10 | S13008 | 809288 | 2491 | 
| 11 | S13009 | 808870 | 2486 | 
| 12 | S13015 | 780873 | 2248 | 
| 13 | S13016 | 793773 | 2432 | 
| 14 | S13017 | 748221 | 2376 | 
| 15 | S13018 | 790535 | 2562 | 
| 16 | S13019 | 827833 | 2541 | 
| 17 | S13020 | 796383 | 2383 | 
| 18 | S13031 | 705968 | 2336 | 
| 19 | S13032 | 790501 | 2491 | 
| 20 | S13035 | 715869 | 2219 | 
| 21 | S13037 | 693087 | 2344 | 
| 22 | S13038 | 708884 | 2337 | 
| 23 | S13039 | 611888 | 1981 | 
| 24 | S13041 | 728266 | 2233 | 
| 25 | S13043 | 587895 | 1881 | 
| 26 | S13044 | 520764 | 1729 | 
| 27 | S13051 | 107452 | 354 | 
| 28 | S13052 | 100314 | 250 | 
| 29 | S14006 | 712839 | 2284 | 
| 30 | S14010 | 790361 | 2290 | 
| 31 | S14011 | 805724 | 2434 | 
| 32 | S14012 | 720600 | 2412 | 
| 33 | S14021 | 699511 | 2231 | 
| 34 | S14022 | 651328 | 2047 | 
| 35 | S14023 | 727630 | 2258 | 
| 36 | S14024 | 736323 | 2417 | 
| 37 | S14025 | 755581 | 2394 | 
| 38 | S14026 | 824537 | 2503 | 
| 39 | S14027 | 714550 | 2303 | 
| 40 | S14028 | 786145 | 2458 | 
| 41 | S14033 | 725318 | 2282 | 
| 42 | S14034 | 653681 | 2024 | 
| 43 | S14036 | 203694 | 635 | 
| 44 | S14040 | 701858 | 2233 | 
| 45 | S14042 | 534689 | 1935 | 
| 46 | S14045 | 458484 | 1398 | 
| 47 | S14046 | 412646 | 1354 | 
| 48 | S14047 | 338329 | 1041 | 
| 49 | S14048 | 234276 | 769 | 
| 50 | S14049 | 230808 | 788 | 
| 51 | S14050 | 167090 | 580 | 
Commentary :
The code `df_receipt.groupby('store_cd').agg({'amount':'sum', 'quantity':'sum'}).reset_index()` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, applies the `sum` aggregation to the `amount` and `quantity` columns of each group, and finally resets the index of the resulting DataFrame.
Here's an explanation of each component:
- `df_receipt`: a variable that represents a DataFrame object in pandas. The name `df_receipt` could be any valid variable name.
- `.groupby('store_cd')`: a pandas DataFrame method that groups the rows of the DataFrame by the values in the `store_cd` column, creating a `GroupBy` object that can be used to apply functions to each group separately.
- `.agg({'amount':'sum', 'quantity':'sum'})`: a `GroupBy` method that applies the `sum` aggregation function to the `amount` and `quantity` columns of each group. The result is a DataFrame with two columns, `amount` and `quantity`, containing the sums of these columns for each group.
- `.reset_index()`: a pandas DataFrame method that resets the index of the resulting DataFrame to a default index.
In summary, the code is grouping the rows of a DataFrame by the values in a specific column, and then computing the sum of two other columns for each group, returning a DataFrame with the sums of these columns for each group and a default index.
# Code example 2
df_receipt.groupby('store_cd')[['amount','quantity']].agg('sum').reset_index()
| | store_cd | amount | quantity |
|---|---|---|---|
| 0 | S12007 | 638761 | 2099 | 
| 1 | S12013 | 787513 | 2425 | 
| 2 | S12014 | 725167 | 2358 | 
| 3 | S12029 | 794741 | 2555 | 
| 4 | S12030 | 684402 | 2403 | 
| 5 | S13001 | 811936 | 2347 | 
| 6 | S13002 | 727821 | 2340 | 
| 7 | S13003 | 764294 | 2197 | 
| 8 | S13004 | 779373 | 2390 | 
| 9 | S13005 | 629876 | 2004 | 
| 10 | S13008 | 809288 | 2491 | 
| 11 | S13009 | 808870 | 2486 | 
| 12 | S13015 | 780873 | 2248 | 
| 13 | S13016 | 793773 | 2432 | 
| 14 | S13017 | 748221 | 2376 | 
| 15 | S13018 | 790535 | 2562 | 
| 16 | S13019 | 827833 | 2541 | 
| 17 | S13020 | 796383 | 2383 | 
| 18 | S13031 | 705968 | 2336 | 
| 19 | S13032 | 790501 | 2491 | 
| 20 | S13035 | 715869 | 2219 | 
| 21 | S13037 | 693087 | 2344 | 
| 22 | S13038 | 708884 | 2337 | 
| 23 | S13039 | 611888 | 1981 | 
| 24 | S13041 | 728266 | 2233 | 
| 25 | S13043 | 587895 | 1881 | 
| 26 | S13044 | 520764 | 1729 | 
| 27 | S13051 | 107452 | 354 | 
| 28 | S13052 | 100314 | 250 | 
| 29 | S14006 | 712839 | 2284 | 
| 30 | S14010 | 790361 | 2290 | 
| 31 | S14011 | 805724 | 2434 | 
| 32 | S14012 | 720600 | 2412 | 
| 33 | S14021 | 699511 | 2231 | 
| 34 | S14022 | 651328 | 2047 | 
| 35 | S14023 | 727630 | 2258 | 
| 36 | S14024 | 736323 | 2417 | 
| 37 | S14025 | 755581 | 2394 | 
| 38 | S14026 | 824537 | 2503 | 
| 39 | S14027 | 714550 | 2303 | 
| 40 | S14028 | 786145 | 2458 | 
| 41 | S14033 | 725318 | 2282 | 
| 42 | S14034 | 653681 | 2024 | 
| 43 | S14036 | 203694 | 635 | 
| 44 | S14040 | 701858 | 2233 | 
| 45 | S14042 | 534689 | 1935 | 
| 46 | S14045 | 458484 | 1398 | 
| 47 | S14046 | 412646 | 1354 | 
| 48 | S14047 | 338329 | 1041 | 
| 49 | S14048 | 234276 | 769 | 
| 50 | S14049 | 230808 | 788 | 
| 51 | S14050 | 167090 | 580 | 
Commentary :
This code performs a groupby operation on the `df_receipt` DataFrame using the `store_cd` column, and then aggregates the `amount` and `quantity` columns by summing them.
The resulting DataFrame has one row for each unique value in the `store_cd` column, and two columns, `amount` and `quantity`, each containing the sum of its values across all rows with the same `store_cd`.
The `.reset_index()` method is then called on the result to convert `store_cd` from an index back into a regular column.
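On pandas 0.25 or later, named aggregation gives the same totals with explicit output column names and no nested dictionary; a sketch assuming the same df_receipt:
# Hedged sketch: named aggregation (keyword = (source column, aggregation))
df_receipt.groupby('store_cd', as_index=False) \
          .agg(amount=('amount', 'sum'),
               quantity=('quantity', 'sum'))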
P-024: Find the most recent sales date (sales_ymd) for each customer ID (customer_id) for the receipt details data (df_receipt) and display the 10 most recent sales dates (sales_ymd).
df_receipt.groupby('customer_id').agg({'sales_ymd': 'max'}).reset_index().head(10)
| | customer_id | sales_ymd |
|---|---|---|
| 0 | CS001113000004 | 20190308 | 
| 1 | CS001114000005 | 20190731 | 
| 2 | CS001115000010 | 20190405 | 
| 3 | CS001205000004 | 20190625 | 
| 4 | CS001205000006 | 20190224 | 
| 5 | CS001211000025 | 20190322 | 
| 6 | CS001212000027 | 20170127 | 
| 7 | CS001212000031 | 20180906 | 
| 8 | CS001212000046 | 20170811 | 
| 9 | CS001212000070 | 20191018 | 
Commentary :
The code performs a groupby operation on the `df_receipt` DataFrame based on the `customer_id` column. It then aggregates the values of the `sales_ymd` column for each group using the `max` function to find the latest purchase date for each customer.
The result of the aggregation is a new DataFrame with two columns, `customer_id` and `sales_ymd`. The `reset_index` function resets the index of the resulting DataFrame to the default, and `head(10)` displays its first 10 rows.
So, the output of this code is a DataFrame showing the maximum purchase date (`sales_ymd`) for each `customer_id`, with only the first 10 rows displayed.
P-025: For receipt details data (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display the 10 results.
# The same style as P-024 would also work, but this solution deliberately uses a different way of writing it.
df_receipt.groupby('customer_id').sales_ymd.min().reset_index().head(10)
| | customer_id | sales_ymd |
|---|---|---|
| 0 | CS001113000004 | 20190308 | 
| 1 | CS001114000005 | 20180503 | 
| 2 | CS001115000010 | 20171228 | 
| 3 | CS001205000004 | 20170914 | 
| 4 | CS001205000006 | 20180207 | 
| 5 | CS001211000025 | 20190322 | 
| 6 | CS001212000027 | 20170127 | 
| 7 | CS001212000031 | 20180906 | 
| 8 | CS001212000046 | 20170811 | 
| 9 | CS001212000070 | 20191018 | 
Commentary :
The code `df_receipt.groupby('customer_id').sales_ymd.min().reset_index().head(10)` groups the rows of the `df_receipt` DataFrame by the `customer_id` column, and then applies the `min()` aggregation to the `sales_ymd` column of each group. Finally, it resets the index of the resulting DataFrame and returns the first 10 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that represents a DataFrame object in pandas.
- `.groupby('customer_id')`: a pandas DataFrame method that groups the rows of the DataFrame by the values in the `customer_id` column, creating a `GroupBy` object.
- `.sales_ymd.min()`: selects the `sales_ymd` column of each group and computes its minimum, producing a Series with the earliest sales date per customer.
- `.reset_index()`: turns the grouped result back into a DataFrame with a default index.
- `.head(10)`: returns the first 10 rows of the DataFrame.
In summary, the code groups a DataFrame by one column, finds the minimum value of another column for each group, and returns the first 10 rows of the result. This is useful for identifying the oldest purchase date for each customer.
P-026: For receipt details data (df_receipt), find the newest sales date (sales_ymd) and the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 cases where both are different.
df_tmp = df_receipt.groupby('customer_id'). \
                agg({'sales_ymd':['max','min']}).reset_index()
# Make multi-index (item) hierarchies into a single-level index (item) while connecting them with "_".
# You can also use df_tmp.columns = ['customer_id', 'sales_ymd_max', 'sales_ymd_min'].
df_tmp.columns = ["_".join(pair) for pair in df_tmp.columns]
df_tmp.query('sales_ymd_max != sales_ymd_min').head(10)
| | customer_id_ | sales_ymd_max | sales_ymd_min |
|---|---|---|---|
| 1 | CS001114000005 | 20190731 | 20180503 | 
| 2 | CS001115000010 | 20190405 | 20171228 | 
| 3 | CS001205000004 | 20190625 | 20170914 | 
| 4 | CS001205000006 | 20190224 | 20180207 | 
| 13 | CS001214000009 | 20190902 | 20170306 | 
| 14 | CS001214000017 | 20191006 | 20180828 | 
| 16 | CS001214000048 | 20190929 | 20171109 | 
| 17 | CS001214000052 | 20190617 | 20180208 | 
| 20 | CS001215000005 | 20181021 | 20170206 | 
| 21 | CS001215000040 | 20171022 | 20170214 | 
Commentary :
The code groups the rows of the `df_receipt` DataFrame by the `customer_id` column and applies two aggregation functions (`max()` and `min()`) to the `sales_ymd` column of each group, producing a DataFrame with the latest and earliest purchase dates for each customer. It then selects only the rows where the two dates differ, using the `query()` method, and returns the first 10 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that represents a DataFrame object in pandas.
- `.groupby('customer_id')`: groups the rows of the DataFrame by the values in the `customer_id` column, creating a `GroupBy` object.
- `.agg({'sales_ymd':['max','min']})`: applies the `max()` and `min()` aggregations to the `sales_ymd` column of each group. The result has a two-level column index, which is then flattened into `sales_ymd_max` and `sales_ymd_min` by joining the levels with "_".
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.query('sales_ymd_max != sales_ymd_min')`: selects only the rows where `sales_ymd_max` is not equal to `sales_ymd_min`, using a boolean expression passed as a string.
- `.head(10)`: returns the first 10 rows of the DataFrame.
In summary, the code is grouping the rows of a DataFrame by the values in a specific column, and then finding the maximum and minimum values of another column for each group, returning a DataFrame with the maximum and minimum values and a default index. It then selects only the rows where the maximum and minimum values are different and returns the first 10 rows of the resulting DataFrame. This code is useful for identifying customers who have made purchases on different dates.
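Named aggregation (pandas 0.25 or later) avoids the MultiIndex flattening step entirely, because the output column names are given explicitly; a sketch assuming the same df_receipt:
# Hedged alternative: flat column names via named aggregation
df_tmp2 = df_receipt.groupby('customer_id', as_index=False) \
                    .agg(sales_ymd_max=('sales_ymd', 'max'),
                         sales_ymd_min=('sales_ymd', 'min'))
df_tmp2.query('sales_ymd_max != sales_ymd_min').head(10)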
P-027: For the receipt details data (df_receipt), calculate the average of the sales amount (amount) for each shop code (store_cd) and display the top 5 in descending order.
df_receipt.groupby('store_cd').agg({'amount':'mean'}).reset_index(). \
                            sort_values('amount', ascending=False).head(5)
| | store_cd | amount |
|---|---|---|
| 28 | S13052 | 402.867470 | 
| 12 | S13015 | 351.111960 | 
| 7 | S13003 | 350.915519 | 
| 30 | S14010 | 348.791262 | 
| 5 | S13001 | 348.470386 | 
Commentary :
The code `df_receipt.groupby('store_cd').agg({'amount':'mean'}).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column and calculates the mean of the `amount` column for each group. It then resets the index of the resulting DataFrame, sorts it by the mean `amount` values in descending order, and returns the first 5 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that represents a DataFrame object in pandas.
- `.groupby('store_cd')`: groups the rows of the DataFrame by the values in the `store_cd` column, creating a `GroupBy` object.
- `.agg({'amount':'mean'})`: applies the `mean()` aggregation to the `amount` column of each group, producing a DataFrame whose `amount` column contains the mean value for each group.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.sort_values('amount', ascending=False)`: sorts the rows by the values in the `amount` column in descending order.
- `.head(5)`: returns the first 5 rows of the DataFrame.
In summary, the code is grouping the rows of a DataFrame by the values in a specific column, and then finding the mean value of another column for each group, returning a DataFrame with the mean values and the grouping column. It then resets the index of the resulting DataFrame and sorts it by the mean values in descending order, and returns the first 5 rows of the resulting DataFrame. This code is useful for identifying the stores with the highest average sales.
P-028: Calculate the median sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display the TOP 5 in descending order.
df_receipt.groupby('store_cd').agg({'amount':'median'}).reset_index(). \
                            sort_values('amount', ascending=False).head(5)
| | store_cd | amount |
|---|---|---|
| 28 | S13052 | 190.0 | 
| 30 | S14010 | 188.0 | 
| 51 | S14050 | 185.0 | 
| 44 | S14040 | 180.0 | 
| 7 | S13003 | 180.0 | 
Commentary :
The code `df_receipt.groupby('store_cd').agg({'amount':'median'}).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column and calculates the median of the `amount` column for each group. It then resets the index, sorts by the median `amount` values in descending order, and returns the first 5 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that represents a DataFrame object in pandas.
- `.groupby('store_cd')`: groups the rows of the DataFrame by the values in the `store_cd` column, creating a `GroupBy` object.
- `.agg({'amount':'median'})`: applies the `median()` aggregation to the `amount` column of each group.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.sort_values('amount', ascending=False)`: sorts the rows by the values in the `amount` column in descending order.
- `.head(5)`: returns the first 5 rows of the DataFrame.
In summary, the code is grouping the rows of a DataFrame by the values in a specific column, and then finding the median value of another column for each group, returning a DataFrame with the median values and the grouping column. It then resets the index of the resulting DataFrame and sorts it by the median values in descending order, and returns the first 5 rows of the resulting DataFrame. This code is useful for identifying the stores with the highest median sales.
P-029: For receipt details data (df_receipt), calculate the mode value of the product code (product_cd) for each shop code (store_cd) and display 10 cases.
df_receipt.groupby('store_cd').product_cd. \
            apply(lambda x: x.mode()).reset_index().head(10)
| | store_cd | level_1 | product_cd |
|---|---|---|---|
| 0 | S12007 | 0 | P060303001 | 
| 1 | S12013 | 0 | P060303001 | 
| 2 | S12014 | 0 | P060303001 | 
| 3 | S12029 | 0 | P060303001 | 
| 4 | S12030 | 0 | P060303001 | 
| 5 | S13001 | 0 | P060303001 | 
| 6 | S13002 | 0 | P060303001 | 
| 7 | S13003 | 0 | P071401001 | 
| 8 | S13004 | 0 | P060303001 | 
| 9 | S13005 | 0 | P040503001 | 
Commentary :
The code `df_receipt.groupby('store_cd').product_cd.apply(lambda x: x.mode()).reset_index().head(10)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, and then finds the mode of the `product_cd` column within each group. The `reset_index()` method turns the result into a DataFrame with a default index; the extra `level_1` column numbers the modal values within each store, since a group can have more than one mode.
Here's an explanation of each component:
- `df_receipt`: a variable that represents a DataFrame object in pandas.
- `.groupby('store_cd')`: groups the rows of the DataFrame by the values in the `store_cd` column, creating a `GroupBy` object.
- `.product_cd`: accesses the `product_cd` column of the grouped data.
- `.apply(lambda x: x.mode())`: applies the `mode()` function to each group. `mode()` returns the most frequently occurring value(s) in the group as a Series, which may contain more than one value in case of ties.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
In summary, the code is grouping the rows of a DataFrame by the values in a specific column, and then finding the mode value of another column for each group, returning a DataFrame with the mode values and the grouping column. Themode()function is used to find the most frequently occurring product code for each store. This code is useful for identifying the most popular products sold at each store.
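If only one modal product per store is needed, value_counts() combined with idxmax() returns just the most frequent code (keeping the first one in case of a tie); a hedged sketch assuming the same df_receipt:
# Hedged alternative: single most frequent product_cd per store
df_receipt.groupby('store_cd')['product_cd'] \
          .apply(lambda s: s.value_counts().idxmax()) \
          .reset_index().head(10)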
P-030: Calculate the variance of the sales amount (amount) for each shop code (store_cd) against the receipt details data (df_receipt) and display five cases in descending order.
df_receipt.groupby('store_cd').amount.var(ddof=0).reset_index(). \
                            sort_values('amount', ascending=False).head(5)
| | store_cd | amount |
|---|---|---|
| 28 | S13052 | 440088.701311 | 
| 31 | S14011 | 306314.558164 | 
| 42 | S14034 | 296920.081011 | 
| 5 | S13001 | 295431.993329 | 
| 12 | S13015 | 295294.361116 | 
Commentary :
The code `df_receipt.groupby('store_cd').amount.var(ddof=0).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, calculates the variance of the `amount` column for each group with `ddof=0`, resets the index, sorts the result by the variance values in descending order, and returns the first 5 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that represents a DataFrame object in pandas.
- `.groupby('store_cd')`: groups the rows of the DataFrame by the values in the `store_cd` column, creating a `GroupBy` object.
- `.amount.var(ddof=0)`: calculates the variance of the `amount` column for each group. The `ddof` parameter specifies the delta degrees of freedom used as the divisor in the calculation; `ddof=0` computes the population variance rather than the sample variance. The result is a Series of variance values indexed by `store_cd`.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.sort_values('amount', ascending=False)`: sorts the rows by the values in the `amount` column in descending order.
- `.head(5)`: returns the first 5 rows of the DataFrame.
In summary, the code is grouping the rows of a DataFrame by the values in a specific column, and then finding the population variance of another column for each group, returning a DataFrame with the variance values and the grouping column. It then resets the index of the resulting DataFrame and sorts it by the variance values in descending order, and returns the first 5 rows of the resulting DataFrame. This code is useful for identifying the stores with the highest variability in sales.
P-031: Calculate the standard deviation of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display 5 cases in descending order.
TIPS:
Note that the default values for ddof are different in Pandas and Numpy
Pandas:
DataFrame.std(self, axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
Numpy:
numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>)
df_receipt.groupby('store_cd').amount.std(ddof=0).reset_index(). \
                            sort_values('amount', ascending=False).head(5)
| | store_cd | amount |
|---|---|---|
| 28 | S13052 | 663.391816 | 
| 31 | S14011 | 553.456916 | 
| 42 | S14034 | 544.903736 | 
| 5 | S13001 | 543.536561 | 
| 12 | S13015 | 543.409938 | 
Commentary :
The code `df_receipt.groupby('store_cd').amount.std(ddof=0).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, calculates the standard deviation of the `amount` column for each group with `ddof=0`, resets the index, sorts the result by the standard deviation values in descending order, and returns the first 5 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that represents a DataFrame object in pandas.
- `.groupby('store_cd')`: groups the rows of the DataFrame by the values in the `store_cd` column, creating a `GroupBy` object.
- `.amount.std(ddof=0)`: calculates the standard deviation of the `amount` column for each group. The `ddof` parameter specifies the delta degrees of freedom used as the divisor; `ddof=0` computes the population standard deviation rather than the sample standard deviation. The result is a Series of standard deviation values indexed by `store_cd`.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.sort_values('amount', ascending=False)`: sorts the rows by the values in the `amount` column in descending order.
- `.head(5)`: returns the first 5 rows of the DataFrame.
In summary, the code is grouping the rows of a DataFrame by the values in a specific column, and then finding the population standard deviation of another column for each group, returning a DataFrame with the standard deviation values and the grouping column. It then resets the index of the resulting DataFrame and sorts it by the standard deviation values in descending order, and returns the first 5 rows of the resulting DataFrame. This code is useful for identifying the stores with the highest variability in sales.
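To make the ddof difference from the TIPS above concrete, here is a small sketch on a toy Series (the numbers are illustrative only, not from the dataset):
# Hedged sketch: pandas defaults to ddof=1 (sample), numpy to ddof=0 (population)
s = pd.Series([1, 2, 3, 4, 5])
print(s.std())                   # ddof=1 -> about 1.5811
print(np.std(s.to_numpy()))      # ddof=0 -> about 1.4142
print(s.std(ddof=0), np.std(s.to_numpy(), ddof=1))  # aligned ddof gives matching values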
P-032: Find the percentile values for the sales amount (amount) in the receipt details data (df_receipt) in 25% increments.
# Code Example 1
np.percentile(df_receipt['amount'], q=np.arange(1, 5) * 25)
array([ 102., 170., 288., 10925.])
Commentary :
The code `np.percentile(df_receipt['amount'], q=np.arange(1, 5) * 25)` computes the quartiles of the values in the `amount` column of the `df_receipt` DataFrame using the `np.percentile()` function from the NumPy library.
Here's an explanation of each component:
- `np.percentile()`: a NumPy function that computes percentiles of an array. Its first argument is the array for which percentiles are to be calculated, and the `q` parameter specifies which percentiles to compute.
- `df_receipt['amount']`: a pandas Series representing the `amount` column of the `df_receipt` DataFrame.
- `q=np.arange(1, 5) * 25`: `np.arange(1, 5)` generates `[1, 2, 3, 4]`, so multiplying by 25 yields `[25, 50, 75, 100]`, i.e. the 25th, 50th, 75th and 100th percentiles.
The code therefore returns the quartiles of the `amount` column, the values that divide the distribution into four equal parts: the first quartile (Q1) is the 25th percentile, the second quartile (Q2) is the 50th percentile (the median), the third quartile (Q3) is the 75th percentile, and the 100th percentile is the maximum. The output is an array of these values, which summarises the distribution of the `amount` column.
# Code Example 2
df_receipt.amount.quantile(q=np.arange(1, 5) / 4)
0.25      102.0
0.50      170.0
0.75      288.0
1.00    10925.0
Name: amount, dtype: float64
Commentary :
This code calculates the quartiles of the `amount` column in the `df_receipt` DataFrame using the `quantile()` method from pandas. The `q` parameter specifies the quantiles to compute: 1/4, 2/4 (the median), 3/4 and 4/4 (the maximum).
`np.arange(1, 5) / 4` produces the array `[0.25, 0.5, 0.75, 1.0]`, which corresponds to quartiles 1, 2, 3 and 4 respectively.
The result is a Series of four values, one for each quartile.
P-033: Calculate the average of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and extract those that are 330 or more.
df_receipt.groupby('store_cd').amount.mean(). \
                    reset_index().query('amount >= 330')
| | store_cd | amount |
|---|---|---|
| 1 | S12013 | 330.194130 | 
| 5 | S13001 | 348.470386 | 
| 7 | S13003 | 350.915519 | 
| 8 | S13004 | 330.943949 | 
| 12 | S13015 | 351.111960 | 
| 16 | S13019 | 330.208616 | 
| 17 | S13020 | 337.879932 | 
| 28 | S13052 | 402.867470 | 
| 30 | S14010 | 348.791262 | 
| 31 | S14011 | 335.718333 | 
| 38 | S14026 | 332.340588 | 
| 46 | S14045 | 330.082073 | 
| 48 | S14047 | 330.077073 | 
Commentary :
The code `df_receipt.groupby('store_cd').amount.mean().reset_index().query('amount >= 330')` groups the `df_receipt` DataFrame by the `store_cd` column, calculates the mean value of the `amount` column for each group, resets the index of the resulting DataFrame, and then filters it to keep only the rows where the mean `amount` is greater than or equal to 330.
Here's an explanation of each component:
- `df_receipt.groupby('store_cd')`: groups the `df_receipt` DataFrame by the values in the `store_cd` column.
- `.amount.mean()`: calculates the mean value of the `amount` column for each group.
- `.reset_index()`: resets the index of the resulting DataFrame.
- `.query('amount >= 330')`: filters the resulting DataFrame to keep only the rows where the mean `amount` is greater than or equal to 330.
In summary, this code selects only the stores whose average transaction amount is greater than or equal to 330 yen.
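The same filter can also be written without reset_index(), by indexing the grouped Series with a boolean mask; a hedged sketch assuming the same df_receipt (same result, different style):
# Hedged alternative: boolean indexing on the grouped Series
s_mean = df_receipt.groupby('store_cd').amount.mean()
s_mean[s_mean >= 330].reset_index()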
P-034: For the receipt details data (df_receipt), sum the amount of sales (amount) for each customer ID (customer_id) and find the average for all customers. However, exclude customer IDs starting with “Z” as they represent non-members.
# Code example 1: Writing without query
df_receipt[~df_receipt['customer_id'].str.startswith("Z")]. \
                            groupby('customer_id').amount.sum().mean()
2547.742234529256
Commentary :
The code `df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()` does the following:
- `df_receipt[~df_receipt['customer_id'].str.startswith("Z")]`: filters the `df_receipt` DataFrame to exclude rows where the `customer_id` column starts with the letter "Z". The tilde (`~`) is a negation operator that inverts the boolean values returned by `str.startswith()`.
- `.groupby('customer_id').amount.sum()`: groups the filtered DataFrame by the `customer_id` column and calculates the sum of the `amount` column for each group.
- `.mean()`: calculates the mean of the resulting Series, i.e. the average of the per-customer totals.
In summary, this code computes the average total amount spent per customer, considering only customers whose `customer_id` does not start with "Z". This may be useful for understanding the purchasing behaviour of member customers.
# Code example 2: Writing with query
df_receipt.query('not customer_id.str.startswith("Z")', 
                 engine='python').groupby('customer_id').amount.sum().mean()
2547.742234529256
Commentary :
This code computes the same average total spend for member customers, this time using `query`. Here's how it works:
- `df_receipt` is a DataFrame containing information about transactions.
- The `query` method filters out transactions where the `customer_id` starts with the letter "Z", using the `str.startswith()` string method (with `engine='python'` so that string methods can be used inside the query expression).
- The resulting DataFrame is grouped by `customer_id` using the `groupby` method.
- The `amount` column is selected and `sum` computes the total amount spent by each customer.
- The `mean` method then averages these per-customer totals, giving the final result.
P-035: For the receipt details data (df_receipt), sum the sales amount (amount) for each customer ID (customer_id) to obtain the average of all customers, extract the customers who spend more than the average and display 10 items. However, exclude customer IDs starting with “Z” as they represent non-members.
df_amount_sum = df_receipt[~df_receipt['customer_id'].str.startswith("Z")].\
                                    groupby('customer_id').amount.sum()
amount_mean = df_amount_sum.mean()
df_amount_sum = df_amount_sum.reset_index()
df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)
| | customer_id | amount |
|---|---|---|
| 2 | CS001115000010 | 3044 | 
| 4 | CS001205000006 | 3337 | 
| 13 | CS001214000009 | 4685 | 
| 14 | CS001214000017 | 4132 | 
| 17 | CS001214000052 | 5639 | 
| 21 | CS001215000040 | 3496 | 
| 30 | CS001304000006 | 3726 | 
| 32 | CS001305000005 | 3485 | 
| 33 | CS001305000011 | 4370 | 
| 53 | CS001315000180 | 3300 | 
Commentary :
The code does the following:
- `df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum()`: filters the `df_receipt` DataFrame to exclude rows where the `customer_id` column starts with "Z", groups the result by `customer_id`, and sums the `amount` column for each group. The result is a Series with the total amount spent by each member customer.
- `amount_mean = df_amount_sum.mean()`: calculates the mean of that Series, i.e. the average total spend across all member customers.
- `df_amount_sum = df_amount_sum.reset_index()`: resets the index of the Series and converts it into a DataFrame with two columns, `customer_id` and `amount`.
- `df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)`: filters the DataFrame to keep only rows where `amount` is greater than or equal to `amount_mean`, and returns the first 10 such rows.
In summary, this code selects the customers whose total spending is greater than or equal to the average spending of all member customers (those whose `customer_id` does not start with "Z"). The resulting DataFrame contains the `customer_id` and total amount spent for the first 10 such customers, in `customer_id` order.
P-036: Combine receipt details data (df_receipt) and shop data (df_store) internally and display all items of receipt details data and store name (store_name) of shop data for 10 items.
pd.merge(df_receipt, df_store[['store_cd','store_name']], 
         how='inner', on='store_cd').head(10)
| | sales_ymd | sales_epoch | store_cd | receipt_no | receipt_sub_no | customer_id | product_cd | quantity | amount | store_name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20181103 | 1541203200 | S14006 | 112 | 1 | CS006214000001 | P070305012 | 1 | 158 | 葛が谷店 | 
| 1 | 20181116 | 1542326400 | S14006 | 112 | 2 | ZZ000000000000 | P080401001 | 1 | 48 | 葛が谷店 | 
| 2 | 20170118 | 1484697600 | S14006 | 1162 | 1 | CS006815000006 | P050406035 | 1 | 220 | 葛が谷店 | 
| 3 | 20190524 | 1558656000 | S14006 | 1192 | 1 | CS006514000034 | P060104003 | 1 | 80 | 葛が谷店 | 
| 4 | 20190419 | 1555632000 | S14006 | 112 | 2 | ZZ000000000000 | P060501002 | 1 | 148 | 葛が谷店 | 
| 5 | 20181119 | 1542585600 | S14006 | 1152 | 2 | ZZ000000000000 | P050701001 | 1 | 88 | 葛が谷店 | 
| 6 | 20171211 | 1512950400 | S14006 | 1132 | 2 | CS006515000175 | P090903001 | 1 | 80 | 葛が谷店 | 
| 7 | 20191021 | 1571616000 | S14006 | 1112 | 2 | CS006415000221 | P040602001 | 1 | 405 | 葛が谷店 | 
| 8 | 20170710 | 1499644800 | S14006 | 1132 | 2 | CS006411000036 | P090301051 | 1 | 330 | 葛が谷店 | 
| 9 | 20190805 | 1564963200 | S14006 | 112 | 1 | CS006211000012 | P050104001 | 1 | 115 | 葛が谷店 | 
Commentary :
The code `pd.merge(df_receipt, df_store[['store_cd','store_name']], how='inner', on='store_cd').head(10)` does the following:
- `df_store[['store_cd','store_name']]`: selects a subset of the `df_store` DataFrame with two columns, `store_cd` and `store_name`.
- `pd.merge(..., how='inner', on='store_cd')`: performs an inner join between the `df_receipt` DataFrame and the two-column store DataFrame on the `store_cd` column, keeping only rows that have matching values in both DataFrames.
- `.head(10)`: returns the first 10 rows of the resulting merged DataFrame.
In summary, this code merges the `df_receipt` DataFrame with the `df_store` DataFrame on the `store_cd` column, adding the `store_name` column from `df_store` to the result. The resulting DataFrame contains all columns of `df_receipt` plus the name of the store where each transaction occurred.
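When joining a transaction table to a master table like this, pd.merge also accepts an optional validate argument that raises a MergeError if the key relationship is not what you expect; a hedged sketch assuming the same df_receipt and df_store:
# Hedged sketch: assert that store_cd is unique in df_store before joining
pd.merge(df_receipt, df_store[['store_cd','store_name']],
         how='inner', on='store_cd', validate='many_to_one').head(10)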
P-037: Join product data (df_product) and category data (df_category) internally and display all items of the product data and 10 category sub-category names (category_small_name) of the category data.
pd.merge(df_product
         , df_category[['category_small_cd','category_small_name']]
         , how='inner', on='category_small_cd').head(10)
| | product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_small_name |
|---|---|---|---|---|---|---|---|
| 0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 弁当類 | 
| 1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 弁当類 | 
| 2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 弁当類 | 
| 3 | P040101004 | 04 | 0401 | 040101 | 248.0 | 186.0 | 弁当類 | 
| 4 | P040101005 | 04 | 0401 | 040101 | 268.0 | 201.0 | 弁当類 | 
| 5 | P040101006 | 04 | 0401 | 040101 | 298.0 | 224.0 | 弁当類 | 
| 6 | P040101007 | 04 | 0401 | 040101 | 338.0 | 254.0 | 弁当類 | 
| 7 | P040101008 | 04 | 0401 | 040101 | 420.0 | 315.0 | 弁当類 | 
| 8 | P040101009 | 04 | 0401 | 040101 | 498.0 | 374.0 | 弁当類 | 
| 9 | P040101010 | 04 | 0401 | 040101 | 580.0 | 435.0 | 弁当類 | 
Commentary :
The code `pd.merge(df_product, df_category[['category_small_cd','category_small_name']], how='inner', on='category_small_cd').head(10)` does the following:
- `df_category[['category_small_cd','category_small_name']]`: selects a subset of the `df_category` DataFrame with two columns, `category_small_cd` and `category_small_name`.
- `pd.merge(..., how='inner', on='category_small_cd')`: performs an inner join between the `df_product` DataFrame and the two-column category DataFrame on the `category_small_cd` column, keeping only rows that have matching values in both DataFrames.
- `.head(10)`: returns the first 10 rows of the resulting merged DataFrame.
In summary, this code merges the `df_product` DataFrame with the `df_category` DataFrame on the `category_small_cd` column, adding the `category_small_name` column from `df_category` to the result. The resulting DataFrame contains all columns of `df_product` plus the sub-category name of each product.
P-038: Calculate the total sales amount for each customer from the customer data (df_customer) and receipt details data (df_receipt) and display 10 items. However, for customers with no sales records, the sales amount should be displayed as 0. Customers whose gender code (gender_cd) is female (1) should be included, and non-members (whose customer ID starts with “Z”) should be excluded.
df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()
df_tmp = df_customer. \
            query('gender_cd == "1" and not customer_id.str.startswith("Z")', 
                  engine='python')
pd.merge(df_tmp['customer_id'], df_amount_sum, 
         how='left', on='customer_id').fillna(0).head(10)
| | customer_id | amount |
|---|---|---|
| 0 | CS021313000114 | 0.0 | 
| 1 | CS031415000172 | 5088.0 | 
| 2 | CS028811000001 | 0.0 | 
| 3 | CS001215000145 | 875.0 | 
| 4 | CS015414000103 | 3122.0 | 
| 5 | CS033513000180 | 868.0 | 
| 6 | CS035614000014 | 0.0 | 
| 7 | CS011215000048 | 3444.0 | 
| 8 | CS009413000079 | 0.0 | 
| 9 | CS040412000191 | 210.0 | 
Commentary :
The code does the following:
- `df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()`: creates a new DataFrame `df_amount_sum` by grouping the `df_receipt` DataFrame by `customer_id` and summing the `amount` column for each customer. The resulting DataFrame has two columns, `customer_id` and `amount`.
- `df_customer.query('gender_cd == "1" and not customer_id.str.startswith("Z")', engine='python')`: selects the subset of the `df_customer` DataFrame where `gender_cd` equals "1" (female) and `customer_id` does not start with "Z". The `engine='python'` argument allows string methods such as `startswith` to be used inside the query expression.
- `pd.merge(df_tmp['customer_id'], df_amount_sum, how='left', on='customer_id')`: merges the `customer_id` column from `df_tmp` with `df_amount_sum` on the `customer_id` column using a left join, so every selected customer is kept and the `amount` column is added where available.
- `.fillna(0)`: fills any missing values in the `amount` column with 0 (customers with no sales records).
- `.head(10)`: returns the first 10 rows of the resulting merged DataFrame.
In summary, this code calculates the total amount spent by each customer in `df_receipt`, selects female member customers from `df_customer`, and left-joins the two on `customer_id` so that customers with no purchases appear with an amount of 0. The resulting DataFrame contains the `customer_id` and `amount` columns for the selected customers.
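Because the NaN placeholders force the amount column to float, fillna(0) leaves values such as 875.0; if integer amounts are preferred, the column can be cast back afterwards. A hedged sketch reusing df_tmp and df_amount_sum from the cell above:
# Hedged sketch: cast amount back to int after filling missing values with 0
df_out = pd.merge(df_tmp[['customer_id']], df_amount_sum,
                  how='left', on='customer_id').fillna(0)
df_out['amount'] = df_out['amount'].astype(int)
df_out.head(10)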
P-039: From the receipt details data (df_receipt), create a data set containing the top 20 customers with the highest number of days of sales and a data set containing the top 20 customers with the highest total sales value, respectively, and furthermore, merge these two completely externally. However, exclude non-members (whose customer ID starts with “Z”).
df_data = df_receipt \
          .query('not customer_id.str.startswith("Z")', engine='python')
df_cnt = df_data[~df_data.duplicated(subset=['customer_id', 'sales_ymd'])] \
         .groupby('customer_id').sales_ymd.count().reset_index() \
         .sort_values('sales_ymd', ascending=False).head(20)
df_sum = df_data.groupby('customer_id').amount.sum().reset_index() \
         .sort_values('amount', ascending=False).head(20)
pd.merge(df_cnt, df_sum, how='outer', on='customer_id')
| | customer_id | sales_ymd | amount |
|---|---|---|---|
| 0 | CS040214000008 | 23.0 | NaN | 
| 1 | CS015415000185 | 22.0 | 20153.0 | 
| 2 | CS010214000010 | 22.0 | 18585.0 | 
| 3 | CS010214000002 | 21.0 | NaN | 
| 4 | CS028415000007 | 21.0 | 19127.0 | 
| 5 | CS017415000097 | 20.0 | 23086.0 | 
| 6 | CS016415000141 | 20.0 | 18372.0 | 
| 7 | CS031414000051 | 19.0 | 19202.0 | 
| 8 | CS014214000023 | 19.0 | NaN | 
| 9 | CS022515000226 | 19.0 | NaN | 
| 10 | CS021515000172 | 19.0 | NaN | 
| 11 | CS039414000052 | 19.0 | NaN | 
| 12 | CS021514000045 | 19.0 | NaN | 
| 13 | CS022515000028 | 18.0 | NaN | 
| 14 | CS030214000008 | 18.0 | NaN | 
| 15 | CS021515000056 | 18.0 | NaN | 
| 16 | CS014415000077 | 18.0 | NaN | 
| 17 | CS021515000211 | 18.0 | NaN | 
| 18 | CS032415000209 | 18.0 | NaN | 
| 19 | CS031414000073 | 18.0 | NaN | 
| 20 | CS001605000009 | NaN | 18925.0 | 
| 21 | CS006515000023 | NaN | 18372.0 | 
| 22 | CS011414000106 | NaN | 18338.0 | 
| 23 | CS038415000104 | NaN | 17847.0 | 
| 24 | CS035414000024 | NaN | 17615.0 | 
| 25 | CS021515000089 | NaN | 17580.0 | 
| 26 | CS032414000072 | NaN | 16563.0 | 
| 27 | CS016415000101 | NaN | 16348.0 | 
| 28 | CS011415000006 | NaN | 16094.0 | 
| 29 | CS034415000047 | NaN | 16083.0 | 
| 30 | CS007514000094 | NaN | 15735.0 | 
| 31 | CS009414000059 | NaN | 15492.0 | 
| 32 | CS030415000034 | NaN | 15468.0 | 
| 33 | CS015515000034 | NaN | 15300.0 | 
Commentary :
This code is performing the following steps:
- It filters the rows in the `df_receipt` DataFrame where the `customer_id` column does not start with "Z" using the `.query()` method, and saves the result to `df_data`.
- It removes duplicate rows from `df_data` based on the `customer_id` and `sales_ymd` columns, groups by `customer_id`, counts the number of distinct sales dates per customer, sorts by `sales_ymd` in descending order, takes the top 20 rows, and saves the result to `df_cnt`.
- It groups `df_data` by `customer_id`, sums the `amount` column, sorts by `amount` in descending order, takes the top 20 rows, and saves the result to `df_sum`.
- Finally, it merges `df_cnt` and `df_sum` on the `customer_id` column with `pd.merge()` using `how='outer'`, so that every customer appearing in either top-20 list is included; customers appearing in only one of the lists get `NaN` in the other column. The result shows the top 20 customers by number of sales days and the top 20 customers by total purchase amount, merged into one DataFrame.
P-040: You want to create data combining all shops and all products. Direct product the shop data (df_store) and the product data (df_product) and calculate the number of cases.
df_store_tmp = df_store.copy()
df_product_tmp = df_product.copy()

df_store_tmp['key'] = 0
df_product_tmp['key'] = 0

len(pd.merge(df_store_tmp, df_product_tmp, how='outer', on='key'))
531590
Commentary :
This code merges the two DataFrames `df_store` and `df_product` using an outer join on a dummy key.
First, two temporary DataFrames, `df_store_tmp` and `df_product_tmp`, are created as copies of `df_store` and `df_product`, respectively. Then a new column called `key` is added to both DataFrames with a constant value of 0, because `pd.merge()` needs a common column to join on.
Next, the two DataFrames are merged with `pd.merge()`, using `how='outer'` and `on='key'`. Since every row in both DataFrames has the same `key` value, every store row matches every product row, so the merge is effectively a Cartesian (cross) product: the result contains one row for each combination of store and product.
Lastly, `len()` returns the number of rows in the merged DataFrame, which equals `len(df_store) * len(df_product)`. Because every key value matches, the count would be the same whether an inner, left, right or outer join were used.
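On pandas 1.2 or later, the same Cartesian product can be written directly with how='cross', without the dummy key column; a hedged sketch assuming the same df_store and df_product:
# Hedged alternative: cross join pairs every store row with every product row
len(pd.merge(df_store, df_product, how='cross'))  # expected: len(df_store) * len(df_product) = 531590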