Reference: "Data Science 100 Knocks (Structured Data Processing)" (データサイエンス100本ノック 構造化データ加工編) by the Data Scientist Society Skill Definition Committee (データサイエンティスト協会スキル定義委員)
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society (データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
-
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers.
-
Since this data was created for a Japanese audience, you may run into some language issues while practicing, but do not worry: they will not affect the exercises much.
Getting Started
- First, run the following cells
- Import the necessary libraries and read data from the database (PostgreSQL)
- The following cell imports libraries that are expected to be used, such as pandas.
- If there are other libraries you wish to use, install them as needed (you can also install them by “!pip install library name”).
- The processing may be split across multiple cells.
- Names, addresses, etc. are dummy data and are not real.
import os
import pandas as pd
import numpy as np
from datetime import datetime, date
from dateutil.relativedelta import relativedelta
import math
import psycopg2
from sqlalchemy import create_engine
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from imblearn.under_sampling import RandomUnderSampler
if 'PG_PORT' in os.environ:
    pgconfig = {
        'host': 'db',
        'port': os.environ['PG_PORT'],
        'database': os.environ['PG_DATABASE'],
        'user': os.environ['PG_USER'],
        'password': os.environ['PG_PASSWORD'],
    }
    # Connector for pd.read_sql
    conn = psycopg2.connect(**pgconfig)
    df_customer = pd.read_sql(sql='select * from customer', con=conn)
    df_category = pd.read_sql(sql='select * from category', con=conn)
    df_product = pd.read_sql(sql='select * from product', con=conn)
    df_receipt = pd.read_sql(sql='select * from receipt', con=conn)
    df_store = pd.read_sql(sql='select * from store', con=conn)
    df_geocode = pd.read_sql(sql='select * from geocode', con=conn)
else:
    if not os.path.exists('data/'):
        !git clone https://github.com/The-Japan-DataScientist-Society/100knocks-preprocess
        os.chdir('100knocks-preprocess/docker/work/')
    # Read code and text columns as strings (keeps leading zeros in the code columns)
    dtype = {
        'customer_id': str,
        'gender_cd': str,
        'postal_cd': str,
        'application_store_cd': str,
        'status_cd': str,
        'category_major_cd': str,
        'category_medium_cd': str,
        'category_small_cd': str,
        'product_cd': str,
        'store_cd': str,
        'prefecture_cd': str,
        'tel_no': str,
        'street': str
    }
    df_customer = pd.read_csv("data/customer.csv", dtype=dtype)
    df_category = pd.read_csv("data/category.csv", dtype=dtype)
    df_product = pd.read_csv("data/product.csv", dtype=dtype)
    df_receipt = pd.read_csv("data/receipt.csv", dtype=dtype)
    df_store = pd.read_csv("data/store.csv", dtype=dtype)
    df_geocode = pd.read_csv("data/geocode.csv", dtype=dtype)
Exercise
P-021: Count the number of records in the receipt details data (df_receipt).
len(df_receipt)
104681
Commentary :
The code `len(df_receipt)` returns the length of the DataFrame `df_receipt`.
Here's an explanation of each component:
- `len()`: a built-in Python function that returns the length of an object, such as a list or a string.
- `df_receipt`: a variable that holds a pandas DataFrame. The name `df_receipt` could be any valid variable name.
- `len(df_receipt)`: passing the DataFrame to `len()` asks Python for the number of rows in `df_receipt`.
In summary, the code simply returns the number of rows in a DataFrame.
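As a side note (not part of the original exercise), the same count can be obtained in other ways; a minimal sketch, assuming `df_receipt` has been loaded as above:
# Alternative row counts, equivalent to len(df_receipt)
print(df_receipt.shape[0])    # number of rows from the (rows, columns) shape tuple
print(len(df_receipt.index))  # length of the row index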
P-022: Count the number of unique customer IDs (customer_id) in the receipt details data (df_receipt).
len(df_receipt['customer_id'].unique())
8307
Commentary :
The code `len(df_receipt['customer_id'].unique())` returns the number of unique customer IDs in the `customer_id` column of the `df_receipt` DataFrame.
Here's an explanation of each component:
- `len()`: a built-in Python function that returns the length of an object, such as a list or a string.
- `df_receipt`: a variable that holds a pandas DataFrame.
- `['customer_id']`: an indexing operation that selects the column labelled `customer_id` from `df_receipt`. The result is a pandas Series.
- `.unique()`: a pandas Series method that returns a NumPy array of the unique values in the Series; here, an array of unique customer IDs.
- Combining them: passing that array to `len()` returns the number of unique customer IDs.
In summary, the code returns the number of unique customer IDs in a DataFrame.
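A more direct alternative (not in the original notebook) is `Series.nunique()`, which counts distinct values and ignores NaN by default; a minimal sketch:
# Count unique customer IDs directly; pass dropna=False to also count missing values
print(df_receipt['customer_id'].nunique())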
P-023: Sum the sales amount (amount) and sales quantity (quantity) for each shop code (store_cd) in the receipt details data (df_receipt).
# Code example 1
df_receipt.groupby('store_cd').agg({'amount':'sum',
'quantity':'sum'}).reset_index()
 | store_cd | amount | quantity |
---|---|---|---|
0 | S12007 | 638761 | 2099 |
1 | S12013 | 787513 | 2425 |
2 | S12014 | 725167 | 2358 |
3 | S12029 | 794741 | 2555 |
4 | S12030 | 684402 | 2403 |
5 | S13001 | 811936 | 2347 |
6 | S13002 | 727821 | 2340 |
7 | S13003 | 764294 | 2197 |
8 | S13004 | 779373 | 2390 |
9 | S13005 | 629876 | 2004 |
10 | S13008 | 809288 | 2491 |
11 | S13009 | 808870 | 2486 |
12 | S13015 | 780873 | 2248 |
13 | S13016 | 793773 | 2432 |
14 | S13017 | 748221 | 2376 |
15 | S13018 | 790535 | 2562 |
16 | S13019 | 827833 | 2541 |
17 | S13020 | 796383 | 2383 |
18 | S13031 | 705968 | 2336 |
19 | S13032 | 790501 | 2491 |
20 | S13035 | 715869 | 2219 |
21 | S13037 | 693087 | 2344 |
22 | S13038 | 708884 | 2337 |
23 | S13039 | 611888 | 1981 |
24 | S13041 | 728266 | 2233 |
25 | S13043 | 587895 | 1881 |
26 | S13044 | 520764 | 1729 |
27 | S13051 | 107452 | 354 |
28 | S13052 | 100314 | 250 |
29 | S14006 | 712839 | 2284 |
30 | S14010 | 790361 | 2290 |
31 | S14011 | 805724 | 2434 |
32 | S14012 | 720600 | 2412 |
33 | S14021 | 699511 | 2231 |
34 | S14022 | 651328 | 2047 |
35 | S14023 | 727630 | 2258 |
36 | S14024 | 736323 | 2417 |
37 | S14025 | 755581 | 2394 |
38 | S14026 | 824537 | 2503 |
39 | S14027 | 714550 | 2303 |
40 | S14028 | 786145 | 2458 |
41 | S14033 | 725318 | 2282 |
42 | S14034 | 653681 | 2024 |
43 | S14036 | 203694 | 635 |
44 | S14040 | 701858 | 2233 |
45 | S14042 | 534689 | 1935 |
46 | S14045 | 458484 | 1398 |
47 | S14046 | 412646 | 1354 |
48 | S14047 | 338329 | 1041 |
49 | S14048 | 234276 | 769 |
50 | S14049 | 230808 | 788 |
51 | S14050 | 167090 | 580 |
Commentary :
The code `df_receipt.groupby('store_cd').agg({'amount':'sum', 'quantity':'sum'}).reset_index()` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, applies the `sum` aggregation to the `amount` and `quantity` columns of each group, and finally resets the index of the resulting DataFrame.
Here's an explanation of each component:
- `df_receipt`: a variable that holds a pandas DataFrame.
- `.groupby('store_cd')`: a pandas DataFrame method that groups the rows by the values in the `store_cd` column, creating a `GroupBy` object to which functions can be applied per group.
- `.agg({'amount':'sum', 'quantity':'sum'})`: a GroupBy method that applies the `sum()` aggregation to the `amount` and `quantity` columns of each group. The result is a DataFrame with two columns, `amount` and `quantity`, containing the sums for each group.
- `.reset_index()`: a pandas DataFrame method that resets the index of the resulting DataFrame to a default index, turning `store_cd` back into a regular column.
In summary, the code groups a DataFrame by the values in one column, computes the sum of two other columns for each group, and returns a DataFrame with those sums and a default index.
# Code example 2
df_receipt.groupby('store_cd')[['amount','quantity']].agg('sum').reset_index()
 | store_cd | amount | quantity |
---|---|---|---|
0 | S12007 | 638761 | 2099 |
1 | S12013 | 787513 | 2425 |
2 | S12014 | 725167 | 2358 |
3 | S12029 | 794741 | 2555 |
4 | S12030 | 684402 | 2403 |
5 | S13001 | 811936 | 2347 |
6 | S13002 | 727821 | 2340 |
7 | S13003 | 764294 | 2197 |
8 | S13004 | 779373 | 2390 |
9 | S13005 | 629876 | 2004 |
10 | S13008 | 809288 | 2491 |
11 | S13009 | 808870 | 2486 |
12 | S13015 | 780873 | 2248 |
13 | S13016 | 793773 | 2432 |
14 | S13017 | 748221 | 2376 |
15 | S13018 | 790535 | 2562 |
16 | S13019 | 827833 | 2541 |
17 | S13020 | 796383 | 2383 |
18 | S13031 | 705968 | 2336 |
19 | S13032 | 790501 | 2491 |
20 | S13035 | 715869 | 2219 |
21 | S13037 | 693087 | 2344 |
22 | S13038 | 708884 | 2337 |
23 | S13039 | 611888 | 1981 |
24 | S13041 | 728266 | 2233 |
25 | S13043 | 587895 | 1881 |
26 | S13044 | 520764 | 1729 |
27 | S13051 | 107452 | 354 |
28 | S13052 | 100314 | 250 |
29 | S14006 | 712839 | 2284 |
30 | S14010 | 790361 | 2290 |
31 | S14011 | 805724 | 2434 |
32 | S14012 | 720600 | 2412 |
33 | S14021 | 699511 | 2231 |
34 | S14022 | 651328 | 2047 |
35 | S14023 | 727630 | 2258 |
36 | S14024 | 736323 | 2417 |
37 | S14025 | 755581 | 2394 |
38 | S14026 | 824537 | 2503 |
39 | S14027 | 714550 | 2303 |
40 | S14028 | 786145 | 2458 |
41 | S14033 | 725318 | 2282 |
42 | S14034 | 653681 | 2024 |
43 | S14036 | 203694 | 635 |
44 | S14040 | 701858 | 2233 |
45 | S14042 | 534689 | 1935 |
46 | S14045 | 458484 | 1398 |
47 | S14046 | 412646 | 1354 |
48 | S14047 | 338329 | 1041 |
49 | S14048 | 234276 | 769 |
50 | S14049 | 230808 | 788 |
51 | S14050 | 167090 | 580 |
Commentary :
This code performs a groupby operation on the `df_receipt` DataFrame using the `store_cd` column, and then aggregates the `amount` and `quantity` columns by summing them.
The resulting DataFrame has one row for each unique value in the `store_cd` column and two columns, `amount` and `quantity`, each containing the sum of the respective column across all rows with the same `store_cd` value.
The `.reset_index()` method is then called on the result to convert `store_cd` from the index back into a regular column.
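As an aside, pandas 0.25 and later also support named aggregation, which lets you name the output columns explicitly; a hedged sketch of the same aggregation (assuming a sufficiently recent pandas):
# Named aggregation: the output column name is given on the left of each tuple
df_receipt.groupby('store_cd', as_index=False).agg(
    amount=('amount', 'sum'),
    quantity=('quantity', 'sum')
)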
P-024: For the receipt details data (df_receipt), find the most recent sales date (sales_ymd) for each customer ID (customer_id) and display 10 records.
df_receipt.groupby('customer_id').agg({'sales_ymd': 'max'}).reset_index().head(10)
 | customer_id | sales_ymd |
---|---|---|
0 | CS001113000004 | 20190308 |
1 | CS001114000005 | 20190731 |
2 | CS001115000010 | 20190405 |
3 | CS001205000004 | 20190625 |
4 | CS001205000006 | 20190224 |
5 | CS001211000025 | 20190322 |
6 | CS001212000027 | 20170127 |
7 | CS001212000031 | 20180906 |
8 | CS001212000046 | 20170811 |
9 | CS001212000070 | 20191018 |
Commentary :
The code performs a groupby operation on the DataFrame `df_receipt` based on the `customer_id` column. It then aggregates the values of the `sales_ymd` column for each group using the `max` function to find the latest purchase date for each customer.
The result of the aggregation is a new DataFrame with two columns, `customer_id` and `sales_ymd`. The `reset_index()` function resets the index of the resulting DataFrame to the default, and `head(10)` displays its first 10 rows.
So the output of this code is a DataFrame showing the latest purchase date (`sales_ymd`) for each `customer_id`, with only the first 10 rows displayed.
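A slightly shorter, equivalent form (shown here as an optional variation, not in the original notebook) selects the column first and applies the `max()` reducer directly:
# Equivalent: select sales_ymd, take the per-customer maximum, then tidy up the index
df_receipt.groupby('customer_id').sales_ymd.max().reset_index().head(10)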
P-025: For the receipt details data (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 records.
# The same approach as P-024 would also work, but this example deliberately uses a different style of writing.
df_receipt.groupby('customer_id').sales_ymd.min().reset_index().head(10)
 | customer_id | sales_ymd |
---|---|---|
0 | CS001113000004 | 20190308 |
1 | CS001114000005 | 20180503 |
2 | CS001115000010 | 20171228 |
3 | CS001205000004 | 20170914 |
4 | CS001205000006 | 20180207 |
5 | CS001211000025 | 20190322 |
6 | CS001212000027 | 20170127 |
7 | CS001212000031 | 20180906 |
8 | CS001212000046 | 20170811 |
9 | CS001212000070 | 20191018 |
Commentary :
The code `df_receipt.groupby('customer_id').sales_ymd.min().reset_index().head(10)` groups the rows of the `df_receipt` DataFrame by the `customer_id` column, takes the minimum of the `sales_ymd` column for each group, resets the index of the resulting DataFrame, and returns the first 10 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that holds a pandas DataFrame.
- `.groupby('customer_id')`: a pandas DataFrame method that groups the rows by the values in the `customer_id` column, creating a `GroupBy` object to which functions can be applied per group.
- `.sales_ymd.min()`: selects the `sales_ymd` column of each group and applies the `min()` aggregation, producing a Series containing the smallest (oldest) `sales_ymd` value for each customer.
- `.reset_index()`: converts that Series back into a DataFrame with a default index and `customer_id` as a regular column.
- `.head(10)`: returns the first 10 rows of the DataFrame.
In summary, the code groups a DataFrame by one column, finds the minimum value of another column for each group, and returns the first 10 rows. This is useful for identifying the oldest purchase date for each customer.
P-026: For the receipt details data (df_receipt), find the newest sales date (sales_ymd) and the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 records where the two differ.
df_tmp = df_receipt.groupby('customer_id'). \
agg({'sales_ymd':['max','min']}).reset_index()
# Flatten the MultiIndex columns into a single level, joining the levels with "_".
# You can also use df_tmp.columns = ['customer_id', 'sales_ymd_max', 'sales_ymd_min'].
df_tmp.columns = ["_".join(pair) for pair in df_tmp.columns]
df_tmp.query('sales_ymd_max != sales_ymd_min').head(10)
 | customer_id_ | sales_ymd_max | sales_ymd_min |
---|---|---|---|
1 | CS001114000005 | 20190731 | 20180503 |
2 | CS001115000010 | 20190405 | 20171228 |
3 | CS001205000004 | 20190625 | 20170914 |
4 | CS001205000006 | 20190224 | 20180207 |
13 | CS001214000009 | 20190902 | 20170306 |
14 | CS001214000017 | 20191006 | 20180828 |
16 | CS001214000048 | 20190929 | 20171109 |
17 | CS001214000052 | 20190617 | 20180208 |
20 | CS001215000005 | 20181021 | 20170206 |
21 | CS001215000040 | 20171022 | 20170214 |
Commentary :
The code groups the rows of the `df_receipt` DataFrame by the `customer_id` column, applies two aggregation functions (`max()` and `min()`) to the `sales_ymd` column of each group to get the latest and earliest purchase dates for each customer, flattens the resulting column names, and then uses the `query()` method to keep only the rows where the maximum and minimum purchase dates differ, returning the first 10 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that holds a pandas DataFrame.
- `.groupby('customer_id')`: groups the rows by the values in the `customer_id` column, creating a `GroupBy` object.
- `.agg({'sales_ymd':['max','min']})`: applies both `max()` and `min()` to the `sales_ymd` column of each group. The result has a two-level column index, which the list comprehension joins with "_" into `sales_ymd_max` and `sales_ymd_min` (note that `customer_id` becomes `customer_id_` because its second level is empty).
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.query('sales_ymd_max != sales_ymd_min')`: keeps only the rows where the value in `sales_ymd_max` is not equal to the value in `sales_ymd_min`, using a boolean expression passed as a string.
- `.head(10)`: returns the first 10 rows of the DataFrame.
In summary, the code finds the latest and earliest purchase dates per customer and returns the first 10 customers whose two dates differ. This is useful for identifying customers who have made purchases on more than one date.
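As an optional variation on the column flattening above, `to_flat_index()` can be combined with a join that also strips the trailing underscore left on `customer_id_`; a minimal sketch:
# Flatten the MultiIndex columns and drop the trailing "_" on single-level names
df_tmp = df_receipt.groupby('customer_id').agg({'sales_ymd': ['max', 'min']}).reset_index()
df_tmp.columns = ['_'.join(col).rstrip('_') for col in df_tmp.columns.to_flat_index()]
df_tmp.query('sales_ymd_max != sales_ymd_min').head(10)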
P-027: For the receipt details data (df_receipt), calculate the average of the sales amount (amount) for each shop code (store_cd) and display the top 5 in descending order.
df_receipt.groupby('store_cd').agg({'amount':'mean'}).reset_index(). \
sort_values('amount', ascending=False).head(5)
 | store_cd | amount |
---|---|---|
28 | S13052 | 402.867470 |
12 | S13015 | 351.111960 |
7 | S13003 | 350.915519 |
30 | S14010 | 348.791262 |
5 | S13001 | 348.470386 |
Commentary :
The code `df_receipt.groupby('store_cd').agg({'amount':'mean'}).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, calculates the mean of the `amount` column for each group, resets the index, sorts the result by the mean `amount` in descending order, and returns the first 5 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that holds a pandas DataFrame.
- `.groupby('store_cd')`: groups the rows by the values in the `store_cd` column, creating a `GroupBy` object.
- `.agg({'amount':'mean'})`: applies the `mean()` aggregation to the `amount` column of each group, producing a DataFrame with one column, `amount`, containing the mean for each group.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.sort_values('amount', ascending=False)`: sorts the rows by the values in the `amount` column in descending order (`ascending=False`).
- `.head(5)`: returns the first 5 rows of the DataFrame.
In summary, the code computes the average sales amount per store, sorts the stores by that average in descending order, and returns the top 5. This is useful for identifying the stores with the highest average sales.
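For reference (an optional variation), `Series.nlargest()` can sort and slice in one step:
# nlargest(5) keeps the five largest mean amounts, already sorted in descending order
df_receipt.groupby('store_cd').amount.mean().nlargest(5).reset_index()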
P-028: Calculate the median sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display the TOP 5 in descending order.
df_receipt.groupby('store_cd').agg({'amount':'median'}).reset_index(). \
sort_values('amount', ascending=False).head(5)
 | store_cd | amount |
---|---|---|
28 | S13052 | 190.0 |
30 | S14010 | 188.0 |
51 | S14050 | 185.0 |
44 | S14040 | 180.0 |
7 | S13003 | 180.0 |
Commentary :
The code `df_receipt.groupby('store_cd').agg({'amount':'median'}).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, calculates the median of the `amount` column for each group, resets the index, sorts the result by the median `amount` in descending order, and returns the first 5 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that holds a pandas DataFrame.
- `.groupby('store_cd')`: groups the rows by the values in the `store_cd` column, creating a `GroupBy` object.
- `.agg({'amount':'median'})`: applies the `median()` aggregation to the `amount` column of each group, producing a DataFrame with one column, `amount`, containing the median for each group.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.sort_values('amount', ascending=False)`: sorts the rows by the values in the `amount` column in descending order.
- `.head(5)`: returns the first 5 rows of the DataFrame.
In summary, the code computes the median sales amount per store, sorts the stores by that median in descending order, and returns the top 5. This is useful for identifying the stores with the highest median sales.
P-029: For the receipt details data (df_receipt), calculate the mode of the product code (product_cd) for each shop code (store_cd) and display 10 records.
df_receipt.groupby('store_cd').product_cd. \
apply(lambda x: x.mode()).reset_index().head(10)
 | store_cd | level_1 | product_cd |
---|---|---|---|
0 | S12007 | 0 | P060303001 |
1 | S12013 | 0 | P060303001 |
2 | S12014 | 0 | P060303001 |
3 | S12029 | 0 | P060303001 |
4 | S12030 | 0 | P060303001 |
5 | S13001 | 0 | P060303001 |
6 | S13002 | 0 | P060303001 |
7 | S13003 | 0 | P071401001 |
8 | S13004 | 0 | P060303001 |
9 | S13005 | 0 | P040503001 |
Commentary :
The code `df_receipt.groupby('store_cd').product_cd.apply(lambda x: x.mode()).reset_index().head(10)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, finds the mode of the `product_cd` column for each group, resets the index, and returns the first 10 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that holds a pandas DataFrame.
- `.groupby('store_cd')`: groups the rows by the values in the `store_cd` column, creating a `GroupBy` object.
- `.product_cd`: attribute access that selects the `product_cd` column from the grouped object.
- `.apply(lambda x: x.mode())`: applies the `mode()` function to each group. `mode()` returns the most frequently occurring value(s) in each group; if several values tie for the highest frequency, all of them are returned, which is why the output has an extra `level_1` column numbering the mode values within each store.
- `.reset_index()`: resets the index of the resulting object to a default index.
In summary, the code finds the most frequently occurring product code for each store. This is useful for identifying the most popular product sold at each store.
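If a single value per store is preferred even when there is a tie, `value_counts().idxmax()` keeps only the first most frequent product code; a hedged alternative sketch:
# One product code per store: the (first) most frequent product_cd in each group
df_receipt.groupby('store_cd')['product_cd'] \
    .apply(lambda x: x.value_counts().idxmax()).reset_index().head(10)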
P-030: For the receipt details data (df_receipt), calculate the variance of the sales amount (amount) for each shop code (store_cd) and display 5 records in descending order.
df_receipt.groupby('store_cd').amount.var(ddof=0).reset_index(). \
sort_values('amount', ascending=False).head(5)
 | store_cd | amount |
---|---|---|
28 | S13052 | 440088.701311 |
31 | S14011 | 306314.558164 |
42 | S14034 | 296920.081011 |
5 | S13001 | 295431.993329 |
12 | S13015 | 295294.361116 |
Commentary :
The code `df_receipt.groupby('store_cd').amount.var(ddof=0).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, calculates the variance of the `amount` column for each group with `ddof=0`, resets the index, sorts the result by the variance in descending order, and returns the first 5 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that holds a pandas DataFrame.
- `.groupby('store_cd')`: groups the rows by the values in the `store_cd` column, creating a `GroupBy` object.
- `.amount.var(ddof=0)`: calculates the variance of the `amount` column for each group with `ddof=0`. The `ddof` parameter is the delta degrees of freedom, which determines the divisor used in the calculation; `ddof=0` computes the population variance rather than the sample variance. The result is a Series of variance values indexed by `store_cd`.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.sort_values('amount', ascending=False)`: sorts the rows by the values in the `amount` column in descending order.
- `.head(5)`: returns the first 5 rows of the DataFrame.
In summary, the code computes the population variance of the sales amount per store, sorts the stores by variance in descending order, and returns the top 5. This is useful for identifying the stores with the highest variability in sales.
P-031: For the receipt details data (df_receipt), calculate the standard deviation of the sales amount (amount) for each shop code (store_cd) and display 5 records in descending order.
TIPS:
Note that the default values for ddof are different in Pandas and Numpy
Pandas:
DataFrame.std(self, axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
Numpy:
numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>)
df_receipt.groupby('store_cd').amount.std(ddof=0).reset_index(). \
sort_values('amount', ascending=False).head(5)
 | store_cd | amount |
---|---|---|
28 | S13052 | 663.391816 |
31 | S14011 | 553.456916 |
42 | S14034 | 544.903736 |
5 | S13001 | 543.536561 |
12 | S13015 | 543.409938 |
Commentary :
The code `df_receipt.groupby('store_cd').amount.std(ddof=0).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of the `df_receipt` DataFrame by the `store_cd` column, calculates the standard deviation of the `amount` column for each group with `ddof=0`, resets the index, sorts the result by the standard deviation in descending order, and returns the first 5 rows.
Here's an explanation of each component:
- `df_receipt`: a variable that holds a pandas DataFrame.
- `.groupby('store_cd')`: groups the rows by the values in the `store_cd` column, creating a `GroupBy` object.
- `.amount.std(ddof=0)`: calculates the standard deviation of the `amount` column for each group with `ddof=0`. The `ddof` parameter is the delta degrees of freedom, which determines the divisor used in the calculation; `ddof=0` computes the population standard deviation rather than the sample standard deviation. The result is a Series of standard deviation values indexed by `store_cd`.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.sort_values('amount', ascending=False)`: sorts the rows by the values in the `amount` column in descending order.
- `.head(5)`: returns the first 5 rows of the DataFrame.
In summary, the code computes the population standard deviation of the sales amount per store, sorts the stores by that value in descending order, and returns the top 5. This is useful for identifying the stores with the highest variability in sales.
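To make the TIPS above concrete, here is a tiny illustration on a toy Series (not part of the exercise data):
# Pandas defaults to the sample statistic (ddof=1), NumPy to the population one (ddof=0)
s = pd.Series([1, 2, 3, 4])
print(s.std())                 # ddof=1 -> about 1.2910
print(np.std(s.to_numpy()))    # ddof=0 -> about 1.1180
print(s.std(ddof=0))           # matches the NumPy result once ddof is aligned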
P-032: Find the percentile values for the sales amount (amount) in the receipt details data (df_receipt) in 25% increments.
# Code Example 1
np.percentile(df_receipt['amount'], q=np.arange(1, 5) * 25)
array([ 102., 170., 288., 10925.])
Commentary :
The code `np.percentile(df_receipt['amount'], q=np.arange(1, 5) * 25)` computes the quartiles of the values in the `amount` column of the `df_receipt` DataFrame using the `np.percentile()` function from the NumPy library.
Here's an explanation of each component:
- `np.percentile()`: a NumPy function that computes percentiles of an array. The first argument is the array, and the `q` parameter specifies which percentiles to compute.
- `df_receipt['amount']`: a pandas Series representing the `amount` column of the `df_receipt` DataFrame.
- `q=np.arange(1, 5) * 25`: `np.arange(1, 5)` produces `[1, 2, 3, 4]`, so multiplying by 25 gives `[25, 50, 75, 100]`, i.e. the 25th, 50th, 75th and 100th percentiles.
The code therefore returns the values that split the distribution of `amount` into four equal parts: the first quartile (Q1, 25th percentile), the median (Q2, 50th percentile), the third quartile (Q3, 75th percentile), and the maximum (100th percentile). The output is an array of these quartile values, which summarizes the distribution of the `amount` column.
# Code Example 2
df_receipt.amount.quantile(q=np.arange(1, 5) / 4)
0.25      102.0
0.50      170.0
0.75      288.0
1.00    10925.0
Name: amount, dtype: float64
Commentary :
This code calculates the quartiles of the `amount` column in the `df_receipt` DataFrame using the pandas `quantile()` method. The `q` parameter specifies the quantiles to compute: 1/4, 2/4 (the median), 3/4 and 4/4 (the maximum).
`quantile()` is called on the `amount` column of `df_receipt` with `q` set to the array `[0.25, 0.5, 0.75, 1.0]`, which corresponds to quartiles 1 through 4.
The result is a Series of four values, one for each quartile.
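As a quick optional cross-check (not part of the exercise), `describe()` also reports the 25%, 50% and 75% points along with other summary statistics:
# describe() includes count, mean, std, min, the quartiles and max for the amount column
df_receipt['amount'].describe()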
P-033: Calculate the average of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and extract those that are 330 or more.
df_receipt.groupby('store_cd').amount.mean(). \
reset_index().query('amount >= 330')
 | store_cd | amount |
---|---|---|
1 | S12013 | 330.194130 |
5 | S13001 | 348.470386 |
7 | S13003 | 350.915519 |
8 | S13004 | 330.943949 |
12 | S13015 | 351.111960 |
16 | S13019 | 330.208616 |
17 | S13020 | 337.879932 |
28 | S13052 | 402.867470 |
30 | S14010 | 348.791262 |
31 | S14011 | 335.718333 |
38 | S14026 | 332.340588 |
46 | S14045 | 330.082073 |
48 | S14047 | 330.077073 |
Commentary :
The code `df_receipt.groupby('store_cd').amount.mean().reset_index().query('amount >= 330')` groups the `df_receipt` DataFrame by the `store_cd` column, calculates the mean of the `amount` column for each group, resets the index, and then filters the result to keep only the rows where the mean `amount` is greater than or equal to 330.
Here's an explanation of each component:
- `df_receipt.groupby('store_cd')`: groups the `df_receipt` DataFrame by the values in the `store_cd` column.
- `.amount.mean()`: calculates the mean of the `amount` column for each group.
- `.reset_index()`: resets the index of the resulting DataFrame.
- `.query('amount >= 330')`: keeps only the rows where the mean `amount` is greater than or equal to 330.
In summary, this code selects only the stores whose average transaction amount is at least 330 yen.
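An equivalent way (shown as an optional variation) is to keep the grouped result as a Series and filter it with a boolean mask before resetting the index:
# Boolean-mask version of the same filter on the per-store mean
s_mean = df_receipt.groupby('store_cd').amount.mean()
s_mean[s_mean >= 330].reset_index()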
P-034: For the receipt details data (df_receipt), sum the amount of sales (amount) for each customer ID (customer_id) and find the average for all customers. However, exclude customer IDs starting with “Z” as they represent non-members.
# Code example 1: Writing without query
df_receipt[~df_receipt['customer_id'].str.startswith("Z")]. \
groupby('customer_id').amount.sum().mean()
2547.742234529256
Commentary :
The code `df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()` does the following:
- `df_receipt[~df_receipt['customer_id'].str.startswith("Z")]`: filters the `df_receipt` DataFrame to exclude rows where the `customer_id` column starts with the letter "Z". The tilde (~) is a negation operator that inverts the boolean values returned by `str.startswith()`.
- `.groupby('customer_id').amount.sum()`: groups the filtered DataFrame by the `customer_id` column and calculates the sum of the `amount` column for each group, giving the total amount spent by each remaining customer.
- `.mean()`: calculates the mean of that resulting Series, i.e. the average total spend per customer.
In summary, this code computes the average total amount spent per customer for customers whose `customer_id` does not start with "Z" in the `df_receipt` DataFrame. This is useful for understanding the purchasing behaviour of member customers.
# Code example 2: Writing with query
df_receipt.query('not customer_id.str.startswith("Z")',
engine='python').groupby('customer_id').amount.sum().mean()
2547.742234529256
Commentary :
This code computes the average total amount spent per member customer. Here's how it works:
- `df_receipt` is a DataFrame containing information about transactions.
- The `query()` method filters out transactions whose `customer_id` starts with the letter "Z", using `str.startswith()`, which checks whether a string starts with a given character or substring.
- The resulting DataFrame is grouped by `customer_id` using the `groupby()` method.
- The `amount` column is selected and summed with `sum()` to compute the total amount spent by each customer.
- `mean()` is then applied to those totals to compute their average, which is the final result of the code.
P-035: For the receipt details data (df_receipt), sum the sales amount (amount) for each customer ID (customer_id), obtain the average across all customers, then extract the customers who spent at least that average and display 10 records. However, exclude customer IDs starting with “Z” as they represent non-members.
df_amount_sum = df_receipt[~df_receipt['customer_id'].str.startswith("Z")].\
groupby('customer_id').amount.sum()
amount_mean = df_amount_sum.mean()
df_amount_sum = df_amount_sum.reset_index()
df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)
 | customer_id | amount |
---|---|---|
2 | CS001115000010 | 3044 |
4 | CS001205000006 | 3337 |
13 | CS001214000009 | 4685 |
14 | CS001214000017 | 4132 |
17 | CS001214000052 | 5639 |
21 | CS001215000040 | 3496 |
30 | CS001304000006 | 3726 |
32 | CS001305000005 | 3485 |
33 | CS001305000011 | 4370 |
53 | CS001315000180 | 3300 |
Commentary :
The code does the following:
- `df_amount_sum = df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum()`: filters the `df_receipt` DataFrame to exclude rows where the `customer_id` column starts with "Z", groups the result by `customer_id`, and sums the `amount` column for each group. The result is a Series giving the total amount spent by each member customer.
- `amount_mean = df_amount_sum.mean()`: calculates the mean of that Series, i.e. the average total spend across all member customers.
- `df_amount_sum = df_amount_sum.reset_index()`: resets the index of the Series and converts it into a DataFrame with two columns, `customer_id` and `amount`.
- `df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)`: filters the DataFrame to keep only rows where `amount` is greater than or equal to `amount_mean`, and returns the first 10 rows.
In summary, this code selects customers whose total spending is at least the average spending across all customers whose `customer_id` does not start with "Z". The resulting DataFrame contains the `customer_id` and total amount spent for each selected customer (the rows keep their original order; they are not sorted by spending).
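The same result can be written more compactly (an optional variation) by filtering the Series before resetting the index:
# Filter the per-customer totals against their own mean, then tidy up the index
s = df_receipt[~df_receipt['customer_id'].str.startswith("Z")] \
        .groupby('customer_id').amount.sum()
s[s >= s.mean()].reset_index().head(10)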
P-036: Inner join the receipt details data (df_receipt) and the store data (df_store), and display all columns of the receipt details data together with the store name (store_name) from the store data, for 10 records.
pd.merge(df_receipt, df_store[['store_cd','store_name']],
how='inner', on='store_cd').head(10)
 | sales_ymd | sales_epoch | store_cd | receipt_no | receipt_sub_no | customer_id | product_cd | quantity | amount | store_name |
---|---|---|---|---|---|---|---|---|---|---|
0 | 20181103 | 1541203200 | S14006 | 112 | 1 | CS006214000001 | P070305012 | 1 | 158 | 葛が谷店 |
1 | 20181116 | 1542326400 | S14006 | 112 | 2 | ZZ000000000000 | P080401001 | 1 | 48 | 葛が谷店 |
2 | 20170118 | 1484697600 | S14006 | 1162 | 1 | CS006815000006 | P050406035 | 1 | 220 | 葛が谷店 |
3 | 20190524 | 1558656000 | S14006 | 1192 | 1 | CS006514000034 | P060104003 | 1 | 80 | 葛が谷店 |
4 | 20190419 | 1555632000 | S14006 | 112 | 2 | ZZ000000000000 | P060501002 | 1 | 148 | 葛が谷店 |
5 | 20181119 | 1542585600 | S14006 | 1152 | 2 | ZZ000000000000 | P050701001 | 1 | 88 | 葛が谷店 |
6 | 20171211 | 1512950400 | S14006 | 1132 | 2 | CS006515000175 | P090903001 | 1 | 80 | 葛が谷店 |
7 | 20191021 | 1571616000 | S14006 | 1112 | 2 | CS006415000221 | P040602001 | 1 | 405 | 葛が谷店 |
8 | 20170710 | 1499644800 | S14006 | 1132 | 2 | CS006411000036 | P090301051 | 1 | 330 | 葛が谷店 |
9 | 20190805 | 1564963200 | S14006 | 112 | 1 | CS006211000012 | P050104001 | 1 | 115 | 葛が谷店 |
Commentary :
The code `pd.merge(df_receipt, df_store[['store_cd','store_name']], how='inner', on='store_cd').head(10)` does the following:
- `df_store[['store_cd','store_name']]`: selects a subset of the `df_store` DataFrame with two columns, `store_cd` and `store_name`.
- `pd.merge(df_receipt, df_store[['store_cd','store_name']], how='inner', on='store_cd')`: performs an inner join between the `df_receipt` DataFrame and that two-column subset on the `store_cd` column, keeping only rows with matching values in both DataFrames.
- `.head(10)`: returns the first 10 rows of the merged DataFrame.
In summary, this code joins `df_receipt` with `df_store` on the `store_cd` column, adding the `store_name` column from `df_store` to the result. The resulting DataFrame contains all columns of `df_receipt` plus the name of the store where each transaction occurred.
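As an optional safety net (not in the original notebook), `pd.merge()` accepts a `validate` argument that raises an error if the join does not have the expected shape; a hedged sketch of the same join with a many-to-one check:
# validate='many_to_one' asserts that each store_cd appears at most once in df_store
pd.merge(df_receipt, df_store[['store_cd', 'store_name']],
         how='inner', on='store_cd', validate='many_to_one').head(10)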
P-037: Inner join the product data (df_product) and the category data (df_category), and display all columns of the product data together with the sub-category name (category_small_name) from the category data, for 10 records.
pd.merge(df_product
, df_category[['category_small_cd','category_small_name']]
, how='inner', on='category_small_cd').head(10)
 | product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_small_name |
---|---|---|---|---|---|---|---|
0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 弁当類 |
1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 弁当類 |
2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 弁当類 |
3 | P040101004 | 04 | 0401 | 040101 | 248.0 | 186.0 | 弁当類 |
4 | P040101005 | 04 | 0401 | 040101 | 268.0 | 201.0 | 弁当類 |
5 | P040101006 | 04 | 0401 | 040101 | 298.0 | 224.0 | 弁当類 |
6 | P040101007 | 04 | 0401 | 040101 | 338.0 | 254.0 | 弁当類 |
7 | P040101008 | 04 | 0401 | 040101 | 420.0 | 315.0 | 弁当類 |
8 | P040101009 | 04 | 0401 | 040101 | 498.0 | 374.0 | 弁当類 |
9 | P040101010 | 04 | 0401 | 040101 | 580.0 | 435.0 | 弁当類 |
Commentary :
The code `pd.merge(df_product, df_category[['category_small_cd','category_small_name']], how='inner', on='category_small_cd').head(10)` does the following:
- `df_category[['category_small_cd','category_small_name']]`: selects a subset of the `df_category` DataFrame with two columns, `category_small_cd` and `category_small_name`.
- `pd.merge(df_product, df_category[['category_small_cd','category_small_name']], how='inner', on='category_small_cd')`: performs an inner join between the `df_product` DataFrame and that two-column subset on the `category_small_cd` column, keeping only rows with matching values in both DataFrames.
- `.head(10)`: returns the first 10 rows of the merged DataFrame.
In summary, this code joins `df_product` with `df_category` on the `category_small_cd` column, adding the `category_small_name` column from `df_category` to the result. The resulting DataFrame contains all columns of `df_product` plus the sub-category name of each product.
P-038: From the customer data (df_customer) and the receipt details data (df_receipt), calculate the total sales amount for each customer and display 10 records. However, customers with no sales record should be shown with a sales amount of 0. Only customers whose gender code (gender_cd) is female (“1”) should be included, and non-members (customer IDs starting with “Z”) should be excluded.
df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()
df_tmp = df_customer. \
query('gender_cd == "1" and not customer_id.str.startswith("Z")',
engine='python')
pd.merge(df_tmp['customer_id'], df_amount_sum,
how='left', on='customer_id').fillna(0).head(10)
 | customer_id | amount |
---|---|---|
0 | CS021313000114 | 0.0 |
1 | CS031415000172 | 5088.0 |
2 | CS028811000001 | 0.0 |
3 | CS001215000145 | 875.0 |
4 | CS015414000103 | 3122.0 |
5 | CS033513000180 | 868.0 |
6 | CS035614000014 | 0.0 |
7 | CS011215000048 | 3444.0 |
8 | CS009413000079 | 0.0 |
9 | CS040412000191 | 210.0 |
Commentary :
The code does the following:
- `df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()`: creates a new DataFrame `df_amount_sum` by grouping `df_receipt` by `customer_id` and summing the `amount` column for each customer. The result has two columns, `customer_id` and `amount`.
- `df_customer.query('gender_cd == "1" and not customer_id.str.startswith("Z")', engine='python')`: selects the subset of `df_customer` where `gender_cd` equals "1" (female) and `customer_id` does not start with "Z". The `engine='python'` argument allows string methods such as `startswith` inside the query expression.
- `pd.merge(df_tmp['customer_id'], df_amount_sum, how='left', on='customer_id')`: left-joins the `customer_id` column of `df_tmp` with `df_amount_sum` on `customer_id`, keeping every selected customer and adding the `amount` column where a matching sales record exists.
- `.fillna(0)`: fills missing values in the `amount` column with 0, so customers with no sales records get a sales amount of 0.
- `.head(10)`: returns the first 10 rows of the merged DataFrame.
In summary, this code calculates the total amount spent by each customer in `df_receipt`, selects female member customers from `df_customer`, left-joins the two on `customer_id`, and fills missing totals with 0. The resulting DataFrame contains the `customer_id` and `amount` columns for the selected female customers.
P-039: From the receipt details data (df_receipt), create one data set containing the top 20 customers by number of distinct sales days and another containing the top 20 customers by total sales amount, and then combine the two with a full outer join. However, exclude non-members (customer IDs starting with “Z”).
df_data = df_receipt \
.query('not customer_id.str.startswith("Z")', engine='python')
df_cnt = df_data[~df_data.duplicated(subset=['customer_id', 'sales_ymd'])] \
.groupby('customer_id').sales_ymd.count().reset_index() \
.sort_values('sales_ymd', ascending=False).head(20)
df_sum = df_data.groupby('customer_id').amount.sum().reset_index() \
.sort_values('amount', ascending=False).head(20)
pd.merge(df_cnt, df_sum, how='outer', on='customer_id')
 | customer_id | sales_ymd | amount |
---|---|---|---|
0 | CS040214000008 | 23.0 | NaN |
1 | CS015415000185 | 22.0 | 20153.0 |
2 | CS010214000010 | 22.0 | 18585.0 |
3 | CS010214000002 | 21.0 | NaN |
4 | CS028415000007 | 21.0 | 19127.0 |
5 | CS017415000097 | 20.0 | 23086.0 |
6 | CS016415000141 | 20.0 | 18372.0 |
7 | CS031414000051 | 19.0 | 19202.0 |
8 | CS014214000023 | 19.0 | NaN |
9 | CS022515000226 | 19.0 | NaN |
10 | CS021515000172 | 19.0 | NaN |
11 | CS039414000052 | 19.0 | NaN |
12 | CS021514000045 | 19.0 | NaN |
13 | CS022515000028 | 18.0 | NaN |
14 | CS030214000008 | 18.0 | NaN |
15 | CS021515000056 | 18.0 | NaN |
16 | CS014415000077 | 18.0 | NaN |
17 | CS021515000211 | 18.0 | NaN |
18 | CS032415000209 | 18.0 | NaN |
19 | CS031414000073 | 18.0 | NaN |
20 | CS001605000009 | NaN | 18925.0 |
21 | CS006515000023 | NaN | 18372.0 |
22 | CS011414000106 | NaN | 18338.0 |
23 | CS038415000104 | NaN | 17847.0 |
24 | CS035414000024 | NaN | 17615.0 |
25 | CS021515000089 | NaN | 17580.0 |
26 | CS032414000072 | NaN | 16563.0 |
27 | CS016415000101 | NaN | 16348.0 |
28 | CS011415000006 | NaN | 16094.0 |
29 | CS034415000047 | NaN | 16083.0 |
30 | CS007514000094 | NaN | 15735.0 |
31 | CS009414000059 | NaN | 15492.0 |
32 | CS030415000034 | NaN | 15468.0 |
33 | CS015515000034 | NaN | 15300.0 |
Commentary :
This code performs the following steps:
- Filter the rows of the `df_receipt` DataFrame whose `customer_id` does not start with "Z" using the `.query()` method, and save the result to `df_data`.
- Drop duplicate rows of `df_data` based on the `customer_id` and `sales_ymd` columns, group by `customer_id`, count the number of distinct sales dates per customer, sort by that count (the `sales_ymd` column) in descending order, take the top 20 rows, and save the result to `df_cnt`.
- Group `df_data` by `customer_id`, sum the `amount` column, sort by `amount` in descending order, take the top 20 rows, and save the result to `df_sum`.
- Merge `df_cnt` and `df_sum` on the `customer_id` column with `pd.merge()` and `how='outer'`, so that records from both DataFrames are kept even when a customer appears in only one of them. The resulting DataFrame contains the top 20 customers by number of sales dates and the top 20 customers by total purchase amount.
P-040: You want to create data that combines every shop with every product. Take the Cartesian (direct) product of the shop data (df_store) and the product data (df_product) and count the number of resulting records.
df_store_tmp = df_store.copy()
df_product_tmp = df_product.copy()
df_store_tmp['key'] = 0
df_product_tmp['key'] = 0
len(pd.merge(df_store_tmp, df_product_tmp, how='outer', on='key'))
531590
Commentary :
This code combines the two DataFrames `df_store` and `df_product` as a Cartesian (cross) product.
First, two temporary DataFrames, `df_store_tmp` and `df_product_tmp`, are created as copies of `df_store` and `df_product`. A new column called `key` is added to both with a constant value of 0, because `pd.merge()` needs a common column to join on.
The two DataFrames are then merged with `pd.merge()`, using `how='outer'` and `on='key'`. Because every row in both DataFrames has the same `key` value, every row of `df_store_tmp` is paired with every row of `df_product_tmp`, producing the full Cartesian product; with this dummy key, an inner, left, right or outer join would all give the same result.
Finally, `len()` returns the number of rows in the merged DataFrame, which equals the number of stores multiplied by the number of products.