Reference: "Data Science 100 Knocks (Structured Data Processing)" by the Data Scientist Society Skill Definition Committee (データサイエンティスト協会スキル定義委員)

The Data Scientist Society Github :

Data Science 100 Knocks (Structured Data Processing) URL :

Note: This is an ipynb file originally created by The Data Scientist Society (データサイエンティスト協会スキル定義委員) and translated from Japanese to English with DeepL.

I updated this file to help spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers.

Since the data was created for Japanese users, you may run into language issues while practicing. Don't worry: they will not affect the exercises much.

## Getting Started

- First, run the following cells.
- They import the necessary libraries and read the data from the database (PostgreSQL).
- The next cell imports the libraries that are expected to be used, such as pandas.
- If you wish to use other libraries, install them as needed (you can also install them with `!pip install <library name>`).
- You can split the work across multiple sessions.
- Names, addresses, etc. are dummy data and are not real.

```
import os
import pandas as pd
import numpy as np
from datetime import datetime, date
from dateutil.relativedelta import relativedelta
import math
import psycopg2
from sqlalchemy import create_engine
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from imblearn.under_sampling import RandomUnderSampler

if 'PG_PORT' in os.environ:
    # Connect to the PostgreSQL database described by the environment variables
    pgconfig = {
        'host': 'db',
        'port': os.environ['PG_PORT'],
        'database': os.environ['PG_DATABASE'],
        'user': os.environ['PG_USER'],
        'password': os.environ['PG_PASSWORD'],
    }
    # Connector for pd.read_sql
    conn = psycopg2.connect(**pgconfig)
    df_customer = pd.read_sql(sql='select * from customer', con=conn)
    df_category = pd.read_sql(sql='select * from category', con=conn)
    df_product = pd.read_sql(sql='select * from product', con=conn)
    df_receipt = pd.read_sql(sql='select * from receipt', con=conn)
    df_store = pd.read_sql(sql='select * from store', con=conn)
    df_geocode = pd.read_sql(sql='select * from geocode', con=conn)
else:
    # No database available: clone the official repository and read the CSV files
    if not os.path.exists('data/'):
        !git clone https://github.com/The-Japan-DataScientist-Society/100knocks-preprocess
        os.chdir('100knocks-preprocess/docker/work/')
    # Read code-like columns as strings so that leading zeros are preserved
    dtype = {
        'customer_id': str,
        'gender_cd': str,
        'postal_cd': str,
        'application_store_cd': str,
        'status_cd': str,
        'category_major_cd': str,
        'category_medium_cd': str,
        'category_small_cd': str,
        'product_cd': str,
        'store_cd': str,
        'prefecture_cd': str,
        'tel_no': str,
        'street': str
    }
    df_customer = pd.read_csv("data/customer.csv", dtype=dtype)
    df_category = pd.read_csv("data/category.csv", dtype=dtype)
    df_product = pd.read_csv("data/product.csv", dtype=dtype)
    df_receipt = pd.read_csv("data/receipt.csv", dtype=dtype)
    df_store = pd.read_csv("data/store.csv", dtype=dtype)
    df_geocode = pd.read_csv("data/geocode.csv", dtype=dtype)
```


## Exercise

P-021: Count the number of cases against the receipt details data (df_receipt).

```
len(df_receipt)
```

104681

Commentary :

The code `len(df_receipt)` returns the length of the DataFrame `df_receipt`.

Here's an explanation of each component:

- `len()`: a built-in Python function that returns the length of an object, such as a list or a string.
- `df_receipt`: a variable holding a pandas DataFrame object. The name could be any valid variable name.
- Passing the DataFrame `df_receipt` as an argument to `len()` asks Python for its number of rows.

In summary, the code is simply returning the number of rows in a DataFrame.
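As a side note (not part of the original solution), `len(df)` is equivalent to `df.shape[0]`, while `df.size` counts cells rather than rows; a minimal check on toy data:

```python
import pandas as pd

# Toy DataFrame (hypothetical data, not the exercise dataset)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# len(df) counts rows, as does df.shape[0]; df.size counts cells instead
n_rows = len(df)
print(n_rows, df.shape[0], df.size)  # → 3 3 6
```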

P-022: Count the number of unique cases against the customer ID (customer_id) of the receipt details data (df_receipt).

```
len(df_receipt['customer_id'].unique())
```

8307

Commentary :

The code `len(df_receipt['customer_id'].unique())` returns the number of unique customer IDs in the `customer_id` column of the `df_receipt` DataFrame.

Here's an explanation of each component:

- `df_receipt['customer_id']`: a DataFrame indexing operation that selects the column labelled `customer_id`; the result is a pandas Series.
- `.unique()`: a Series method that returns a NumPy array of the unique values in the Series — here, the unique customer IDs.
- `len()`: the built-in function that returns the length of that array, i.e. the number of unique customer IDs.

In summary, the code is returning the number of unique customer IDs in a DataFrame.
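pandas also provides `Series.nunique()`, which counts distinct values directly without materializing the intermediate array; a minimal sketch on toy data (the IDs are hypothetical):

```python
import pandas as pd

# Toy receipt data (hypothetical IDs, not from the real dataset)
df = pd.DataFrame({'customer_id': ['A', 'B', 'A', 'C', 'B', 'A']})

# nunique() counts distinct values directly
n1 = len(df['customer_id'].unique())
n2 = df['customer_id'].nunique()
print(n1, n2)  # → 3 3
```

One caveat: `nunique()` excludes NaN by default (`dropna=True`), while `unique()` includes it, so the two can differ on columns with missing values.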

P-023: Sum the sales amount (amount) and sales quantity (quantity) for each shop code (store_cd) in the receipt details data (df_receipt).

```
# Code example 1
df_receipt.groupby('store_cd').agg({'amount':'sum',
'quantity':'sum'}).reset_index()
```

| | store_cd | amount | quantity |
|---|---|---|---|
| 0 | S12007 | 638761 | 2099 |
| 1 | S12013 | 787513 | 2425 |
| 2 | S12014 | 725167 | 2358 |
| 3 | S12029 | 794741 | 2555 |
| 4 | S12030 | 684402 | 2403 |
| 5 | S13001 | 811936 | 2347 |
| 6 | S13002 | 727821 | 2340 |
| 7 | S13003 | 764294 | 2197 |
| 8 | S13004 | 779373 | 2390 |
| 9 | S13005 | 629876 | 2004 |
| 10 | S13008 | 809288 | 2491 |
| 11 | S13009 | 808870 | 2486 |
| 12 | S13015 | 780873 | 2248 |
| 13 | S13016 | 793773 | 2432 |
| 14 | S13017 | 748221 | 2376 |
| 15 | S13018 | 790535 | 2562 |
| 16 | S13019 | 827833 | 2541 |
| 17 | S13020 | 796383 | 2383 |
| 18 | S13031 | 705968 | 2336 |
| 19 | S13032 | 790501 | 2491 |
| 20 | S13035 | 715869 | 2219 |
| 21 | S13037 | 693087 | 2344 |
| 22 | S13038 | 708884 | 2337 |
| 23 | S13039 | 611888 | 1981 |
| 24 | S13041 | 728266 | 2233 |
| 25 | S13043 | 587895 | 1881 |
| 26 | S13044 | 520764 | 1729 |
| 27 | S13051 | 107452 | 354 |
| 28 | S13052 | 100314 | 250 |
| 29 | S14006 | 712839 | 2284 |
| 30 | S14010 | 790361 | 2290 |
| 31 | S14011 | 805724 | 2434 |
| 32 | S14012 | 720600 | 2412 |
| 33 | S14021 | 699511 | 2231 |
| 34 | S14022 | 651328 | 2047 |
| 35 | S14023 | 727630 | 2258 |
| 36 | S14024 | 736323 | 2417 |
| 37 | S14025 | 755581 | 2394 |
| 38 | S14026 | 824537 | 2503 |
| 39 | S14027 | 714550 | 2303 |
| 40 | S14028 | 786145 | 2458 |
| 41 | S14033 | 725318 | 2282 |
| 42 | S14034 | 653681 | 2024 |
| 43 | S14036 | 203694 | 635 |
| 44 | S14040 | 701858 | 2233 |
| 45 | S14042 | 534689 | 1935 |
| 46 | S14045 | 458484 | 1398 |
| 47 | S14046 | 412646 | 1354 |
| 48 | S14047 | 338329 | 1041 |
| 49 | S14048 | 234276 | 769 |
| 50 | S14049 | 230808 | 788 |
| 51 | S14050 | 167090 | 580 |

Commentary :

The code `df_receipt.groupby('store_cd').agg({'amount':'sum', 'quantity':'sum'}).reset_index()` groups the rows of `df_receipt` by the `store_cd` column, applies the `sum` aggregation to the `amount` and `quantity` columns of each group, and finally resets the index of the resulting DataFrame.

Here's an explanation of each component:

- `.groupby('store_cd')`: groups the rows by the values in the `store_cd` column, creating a `GroupBy` object to which functions can be applied per group.
- `.agg({'amount':'sum', 'quantity':'sum'})`: applies the `sum` aggregation to the `amount` and `quantity` columns of each group; the result is a DataFrame with two columns containing the per-group sums.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index, turning `store_cd` back into a regular column.

In summary, the code groups a DataFrame by one column and computes the sums of two other columns for each group, returning a DataFrame with those sums and a default index.

```
# Code example 2
df_receipt.groupby('store_cd')[['amount','quantity']].agg('sum').reset_index()
```

| | store_cd | amount | quantity |
|---|---|---|---|
| 0 | S12007 | 638761 | 2099 |
| 1 | S12013 | 787513 | 2425 |
| 2 | S12014 | 725167 | 2358 |
| 3 | S12029 | 794741 | 2555 |
| 4 | S12030 | 684402 | 2403 |
| 5 | S13001 | 811936 | 2347 |
| 6 | S13002 | 727821 | 2340 |
| 7 | S13003 | 764294 | 2197 |
| 8 | S13004 | 779373 | 2390 |
| 9 | S13005 | 629876 | 2004 |
| 10 | S13008 | 809288 | 2491 |
| 11 | S13009 | 808870 | 2486 |
| 12 | S13015 | 780873 | 2248 |
| 13 | S13016 | 793773 | 2432 |
| 14 | S13017 | 748221 | 2376 |
| 15 | S13018 | 790535 | 2562 |
| 16 | S13019 | 827833 | 2541 |
| 17 | S13020 | 796383 | 2383 |
| 18 | S13031 | 705968 | 2336 |
| 19 | S13032 | 790501 | 2491 |
| 20 | S13035 | 715869 | 2219 |
| 21 | S13037 | 693087 | 2344 |
| 22 | S13038 | 708884 | 2337 |
| 23 | S13039 | 611888 | 1981 |
| 24 | S13041 | 728266 | 2233 |
| 25 | S13043 | 587895 | 1881 |
| 26 | S13044 | 520764 | 1729 |
| 27 | S13051 | 107452 | 354 |
| 28 | S13052 | 100314 | 250 |
| 29 | S14006 | 712839 | 2284 |
| 30 | S14010 | 790361 | 2290 |
| 31 | S14011 | 805724 | 2434 |
| 32 | S14012 | 720600 | 2412 |
| 33 | S14021 | 699511 | 2231 |
| 34 | S14022 | 651328 | 2047 |
| 35 | S14023 | 727630 | 2258 |
| 36 | S14024 | 736323 | 2417 |
| 37 | S14025 | 755581 | 2394 |
| 38 | S14026 | 824537 | 2503 |
| 39 | S14027 | 714550 | 2303 |
| 40 | S14028 | 786145 | 2458 |
| 41 | S14033 | 725318 | 2282 |
| 42 | S14034 | 653681 | 2024 |
| 43 | S14036 | 203694 | 635 |
| 44 | S14040 | 701858 | 2233 |
| 45 | S14042 | 534689 | 1935 |
| 46 | S14045 | 458484 | 1398 |
| 47 | S14046 | 412646 | 1354 |
| 48 | S14047 | 338329 | 1041 |
| 49 | S14048 | 234276 | 769 |
| 50 | S14049 | 230808 | 788 |
| 51 | S14050 | 167090 | 580 |

Commentary :

This code performs the same grouping as Code example 1 but selects the columns first: `groupby('store_cd')[['amount','quantity']].agg('sum')` groups by `store_cd` and sums the selected `amount` and `quantity` columns.

The resulting DataFrame has one row for each unique `store_cd` value, with the `amount` and `quantity` columns containing the sums across all rows sharing that `store_cd`. The `.reset_index()` call then converts `store_cd` from the index back into a regular column.
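Since pandas 0.25 the same grouping can also be written with named aggregation, which lets you choose the output column names in one step; a sketch on toy data (the values are hypothetical):

```python
import pandas as pd

# Toy data with the same column names as the exercise
df = pd.DataFrame({
    'store_cd': ['S1', 'S1', 'S2'],
    'amount':   [100, 200, 50],
    'quantity': [1, 2, 3],
})

# Named aggregation: keyword = (source column, aggregation function)
out = df.groupby('store_cd').agg(
    amount=('amount', 'sum'),
    quantity=('quantity', 'sum'),
).reset_index()
print(out)
```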

P-024: Find the most recent sales date (sales_ymd) for each customer ID (customer_id) for the receipt details data (df_receipt) and display the 10 most recent sales dates (sales_ymd).

```
df_receipt.groupby('customer_id').agg({'sales_ymd': 'max'}).reset_index().head(10)
```

| | customer_id | sales_ymd |
|---|---|---|
| 0 | CS001113000004 | 20190308 |
| 1 | CS001114000005 | 20190731 |
| 2 | CS001115000010 | 20190405 |
| 3 | CS001205000004 | 20190625 |
| 4 | CS001205000006 | 20190224 |
| 5 | CS001211000025 | 20190322 |
| 6 | CS001212000027 | 20170127 |
| 7 | CS001212000031 | 20180906 |
| 8 | CS001212000046 | 20170811 |
| 9 | CS001212000070 | 20191018 |

Commentary :

The code performs a groupby on the `df_receipt` DataFrame by the `customer_id` column, then aggregates the `sales_ymd` column with `max` to find the latest purchase date for each customer.

The aggregation yields a DataFrame with two columns, `customer_id` and `sales_ymd`. `reset_index()` restores a default index, and `head(10)` displays the first 10 rows.

So the output of this code is a DataFrame showing the maximum purchase date (`sales_ymd`) for each `customer_id`, with only the first 10 rows displayed.

P-025: For receipt details data (df_receipt), find the oldest sales date (sales_ymd) for each customer ID (customer_id) and display the 10 results.

```
# The same style as P-024 would also work, but here we deliberately show a different way of writing it.
df_receipt.groupby('customer_id').sales_ymd.min().reset_index().head(10)
```

| | customer_id | sales_ymd |
|---|---|---|
| 0 | CS001113000004 | 20190308 |
| 1 | CS001114000005 | 20180503 |
| 2 | CS001115000010 | 20171228 |
| 3 | CS001205000004 | 20170914 |
| 4 | CS001205000006 | 20180207 |
| 5 | CS001211000025 | 20190322 |
| 6 | CS001212000027 | 20170127 |
| 7 | CS001212000031 | 20180906 |
| 8 | CS001212000046 | 20170811 |
| 9 | CS001212000070 | 20191018 |

Commentary :

The code `df_receipt.groupby('customer_id').sales_ymd.min().reset_index().head(10)` groups the rows of `df_receipt` by the `customer_id` column, takes the minimum of the `sales_ymd` column for each group, resets the index, and returns the first 10 rows.

Here's an explanation of each component:

- `.groupby('customer_id')`: groups the rows by the values in the `customer_id` column, creating a `GroupBy` object.
- `.sales_ymd`: attribute access selecting the `sales_ymd` column of each group as a Series (equivalent to `['sales_ymd']`).
- `.min()`: returns the smallest `sales_ymd` value in each group, i.e. each customer's oldest sales date.
- `.reset_index()`: converts the grouped result back into a DataFrame with `customer_id` as a regular column.
- `.head(10)`: returns the first 10 rows.

In summary, the code finds the oldest purchase date for each customer and displays the first 10 results; it is the Series-method counterpart of the dict-based `agg` style used in P-024.
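The two styles (dict-based `agg` and the direct Series method) produce identical results; a minimal check on toy data (the customers and dates are hypothetical):

```python
import pandas as pd

# Toy data (hypothetical customers and dates)
df = pd.DataFrame({'customer_id': ['A', 'A', 'B'],
                   'sales_ymd': [20190101, 20190301, 20190201]})

via_agg = df.groupby('customer_id').agg({'sales_ymd': 'min'}).reset_index()
via_method = df.groupby('customer_id').sales_ymd.min().reset_index()
print(via_agg.equals(via_method))  # → True
```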

P-026: For receipt details data (df_receipt), find the newest sales date (sales_ymd) and the oldest sales date (sales_ymd) for each customer ID (customer_id) and display 10 cases where both are different.

```
df_tmp = df_receipt.groupby('customer_id'). \
agg({'sales_ymd':['max','min']}).reset_index()
# Flatten the MultiIndex columns into a single level, joining the levels with "_".
# You could also write df_tmp.columns = ['customer_id', 'sales_ymd_max', 'sales_ymd_min'] directly.
df_tmp.columns = ["_".join(pair) for pair in df_tmp.columns]
df_tmp.query('sales_ymd_max != sales_ymd_min').head(10)
```

| | customer_id_ | sales_ymd_max | sales_ymd_min |
|---|---|---|---|
| 1 | CS001114000005 | 20190731 | 20180503 |
| 2 | CS001115000010 | 20190405 | 20171228 |
| 3 | CS001205000004 | 20190625 | 20170914 |
| 4 | CS001205000006 | 20190224 | 20180207 |
| 13 | CS001214000009 | 20190902 | 20170306 |
| 14 | CS001214000017 | 20191006 | 20180828 |
| 16 | CS001214000048 | 20190929 | 20171109 |
| 17 | CS001214000052 | 20190617 | 20180208 |
| 20 | CS001215000005 | 20181021 | 20170206 |
| 21 | CS001215000040 | 20171022 | 20170214 |

Commentary :

The code first groups the rows of `df_receipt` by the `customer_id` column and applies two aggregation functions (`max` and `min`) to the `sales_ymd` column of each group, producing a DataFrame with each customer's latest and earliest purchase dates. It then flattens the column names, selects only the rows where the two dates differ, and returns the first 10 rows.

Here's an explanation of each component:

- `.agg({'sales_ymd':['max','min']})`: applies both `max` and `min` to `sales_ymd`; because two functions are given, the resulting columns form a MultiIndex such as `('sales_ymd', 'max')` and `('sales_ymd', 'min')`.
- `df_tmp.columns = ["_".join(pair) for pair in df_tmp.columns]`: flattens the MultiIndex into single-level names `sales_ymd_max` and `sales_ymd_min`. The ID column becomes `customer_id_` because its second level is an empty string, which is why that header appears in the output above.
- `.query('sales_ymd_max != sales_ymd_min')`: selects only the rows where the maximum and minimum purchase dates differ, using a boolean expression passed as a string.
- `.head(10)`: returns the first 10 rows.

In summary, the code identifies customers who have made purchases on more than one date.
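The column-flattening step can be observed in isolation on toy data; this sketch (hypothetical values) shows why the ID column ends up named `customer_id_`:

```python
import pandas as pd

df = pd.DataFrame({'customer_id': ['A', 'A', 'B'],
                   'sales_ymd': [20190101, 20190301, 20190201]})

df_tmp = df.groupby('customer_id').agg({'sales_ymd': ['max', 'min']}).reset_index()
# Columns are now a MultiIndex: ('customer_id', ''), ('sales_ymd', 'max'), ('sales_ymd', 'min')
df_tmp.columns = ["_".join(pair) for pair in df_tmp.columns]
print(list(df_tmp.columns))  # → ['customer_id_', 'sales_ymd_max', 'sales_ymd_min']
```

Joining `('customer_id', '')` with `"_"` yields `customer_id_`: the empty second level still contributes the separator.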

P-027: For the receipt details data (df_receipt), calculate the average of the sales amount (amount) for each shop code (store_cd) and display the top 5 in descending order.

```
df_receipt.groupby('store_cd').agg({'amount':'mean'}).reset_index(). \
sort_values('amount', ascending=False).head(5)
```

| | store_cd | amount |
|---|---|---|
| 28 | S13052 | 402.867470 |
| 12 | S13015 | 351.111960 |
| 7 | S13003 | 350.915519 |
| 30 | S14010 | 348.791262 |
| 5 | S13001 | 348.470386 |

Commentary :

The code `df_receipt.groupby('store_cd').agg({'amount':'mean'}).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of `df_receipt` by the `store_cd` column, computes the mean of the `amount` column for each group, resets the index, sorts by the mean `amount` in descending order, and returns the first 5 rows.

Here's an explanation of each component:

- `.groupby('store_cd')`: groups the rows by the values in the `store_cd` column, creating a `GroupBy` object.
- `.agg({'amount':'mean'})`: applies the `mean` aggregation to the `amount` column of each group.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.sort_values('amount', ascending=False)`: sorts the rows by `amount` in descending order.
- `.head(5)`: returns the first 5 rows.

In summary, the code is useful for identifying the stores with the highest average sales.
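The `sort_values(...).head(n)` pattern can also be written with `DataFrame.nlargest`, which keeps only the top n rows; a sketch on toy per-store means (hypothetical values):

```python
import pandas as pd

# Toy per-store means (hypothetical values)
df = pd.DataFrame({'store_cd': ['S1', 'S2', 'S3'],
                   'amount': [10.0, 30.0, 20.0]})

# nlargest returns the n rows with the largest values in the given column
top2 = df.nlargest(2, 'amount')
print(list(top2['store_cd']))  # → ['S2', 'S3']
```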

P-028: Calculate the median sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display the TOP 5 in descending order.

```
df_receipt.groupby('store_cd').agg({'amount':'median'}).reset_index(). \
sort_values('amount', ascending=False).head(5)
```

| | store_cd | amount |
|---|---|---|
| 28 | S13052 | 190.0 |
| 30 | S14010 | 188.0 |
| 51 | S14050 | 185.0 |
| 44 | S14040 | 180.0 |
| 7 | S13003 | 180.0 |

Commentary :

The code `df_receipt.groupby('store_cd').agg({'amount':'median'}).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of `df_receipt` by the `store_cd` column, computes the median of the `amount` column for each group, resets the index, sorts by the median `amount` in descending order, and returns the first 5 rows.

Here's an explanation of each component:

- `.groupby('store_cd')`: groups the rows by the values in the `store_cd` column, creating a `GroupBy` object.
- `.agg({'amount':'median'})`: applies the `median` aggregation to the `amount` column of each group.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.sort_values('amount', ascending=False)`: sorts the rows by `amount` in descending order.
- `.head(5)`: returns the first 5 rows.

In summary, the code is useful for identifying the stores with the highest median sales.

P-029: For receipt details data (df_receipt), calculate the mode value of the product code (product_cd) for each shop code (store_cd) and display 10 cases.

```
df_receipt.groupby('store_cd').product_cd. \
apply(lambda x: x.mode()).reset_index().head(10)
```

| | store_cd | level_1 | product_cd |
|---|---|---|---|
| 0 | S12007 | 0 | P060303001 |
| 1 | S12013 | 0 | P060303001 |
| 2 | S12014 | 0 | P060303001 |
| 3 | S12029 | 0 | P060303001 |
| 4 | S12030 | 0 | P060303001 |
| 5 | S13001 | 0 | P060303001 |
| 6 | S13002 | 0 | P060303001 |
| 7 | S13003 | 0 | P071401001 |
| 8 | S13004 | 0 | P060303001 |
| 9 | S13005 | 0 | P040503001 |

Commentary :

The code `df_receipt.groupby('store_cd').product_cd.apply(lambda x: x.mode()).reset_index().head(10)` groups the rows of `df_receipt` by the `store_cd` column, finds the mode of the `product_cd` column for each group, resets the index, and displays the first 10 rows.

Here's an explanation of each component:

- `.groupby('store_cd')`: groups the rows by the values in the `store_cd` column, creating a `GroupBy` object.
- `.product_cd`: attribute access selecting the `product_cd` column.
- `.apply(lambda x: x.mode())`: applies `Series.mode()` to each group, returning the most frequently occurring value(s). Because `mode()` can return more than one value in the case of a tie, the result carries a second index level, which appears as the `level_1` column after `reset_index()` (0 for the first mode of each store).
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.

In summary, the `mode()` function is used to find the most frequently occurring product code for each store, which is useful for identifying the most popular products sold at each store.
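If exactly one value per store is enough, a common alternative (a sketch, not the original solution) picks the most frequent code with `value_counts().idxmax()`, which breaks ties by taking the first label:

```python
import pandas as pd

# Toy data (hypothetical store and product codes)
df = pd.DataFrame({'store_cd':   ['S1', 'S1', 'S1', 'S2', 'S2'],
                   'product_cd': ['P1', 'P1', 'P2', 'P3', 'P3']})

# idxmax() returns the index label of the highest count, so each group
# yields exactly one product code and no extra level_1 column appears
top = (df.groupby('store_cd')['product_cd']
         .apply(lambda x: x.value_counts().idxmax())
         .reset_index())
print(top)
```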

P-030: Calculate the variance of the sales amount (amount) for each shop code (store_cd) against the receipt details data (df_receipt) and display five cases in descending order.

```
df_receipt.groupby('store_cd').amount.var(ddof=0).reset_index(). \
sort_values('amount', ascending=False).head(5)
```

| | store_cd | amount |
|---|---|---|
| 28 | S13052 | 440088.701311 |
| 31 | S14011 | 306314.558164 |
| 42 | S14034 | 296920.081011 |
| 5 | S13001 | 295431.993329 |
| 12 | S13015 | 295294.361116 |

Commentary :

The code `df_receipt.groupby('store_cd').amount.var(ddof=0).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of `df_receipt` by the `store_cd` column, computes the variance of the `amount` column for each group with `ddof=0`, resets the index, sorts by the variance in descending order, and returns the first 5 rows.

Here's an explanation of each component:

- `.groupby('store_cd')`: groups the rows by the values in the `store_cd` column, creating a `GroupBy` object.
- `.amount.var(ddof=0)`: calculates the variance of `amount` for each group. The `ddof` (delta degrees of freedom) parameter controls the divisor used in the calculation; `ddof=0` computes the population variance (dividing by n) rather than the sample variance (dividing by n - 1). The result is a Series indexed by `store_cd`.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.sort_values('amount', ascending=False)`: sorts the rows by the variance values in descending order.
- `.head(5)`: returns the first 5 rows.

In summary, the code is useful for identifying the stores with the highest variability in sales.

P-031: Calculate the standard deviation of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and display 5 cases in descending order.

`TIPS:`

Note that the default value of `ddof` differs between pandas and NumPy:

- pandas: `DataFrame.std(self, axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)`
- NumPy: `numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>)`

```
df_receipt.groupby('store_cd').amount.std(ddof=0).reset_index(). \
sort_values('amount', ascending=False).head(5)
```

| | store_cd | amount |
|---|---|---|
| 28 | S13052 | 663.391816 |
| 31 | S14011 | 553.456916 |
| 42 | S14034 | 544.903736 |
| 5 | S13001 | 543.536561 |
| 12 | S13015 | 543.409938 |

Commentary :

The code `df_receipt.groupby('store_cd').amount.std(ddof=0).reset_index().sort_values('amount', ascending=False).head(5)` groups the rows of `df_receipt` by the `store_cd` column, computes the standard deviation of the `amount` column for each group with `ddof=0`, resets the index, sorts by the standard deviation in descending order, and returns the first 5 rows.

Here's an explanation of each component:

- `.groupby('store_cd')`: groups the rows by the values in the `store_cd` column, creating a `GroupBy` object.
- `.amount.std(ddof=0)`: calculates the standard deviation of `amount` for each group. As in P-030, `ddof=0` means the population standard deviation is computed (dividing by n) rather than the sample standard deviation (dividing by n - 1). The result is a Series indexed by `store_cd`.
- `.reset_index()`: resets the index of the resulting DataFrame to a default index.
- `.sort_values('amount', ascending=False)`: sorts the rows by the standard deviation values in descending order.
- `.head(5)`: returns the first 5 rows.

In summary, the code is useful for identifying the stores with the highest variability in sales.
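The `ddof` difference flagged in the TIPS can be checked numerically: for n values, `ddof=0` divides the squared deviations by n (population) and `ddof=1` by n - 1 (sample), so pandas' default `std()` gives a larger value than NumPy's default `np.std()` on the same data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # mean 2.5, sum of squared deviations 5.0

pop = s.std(ddof=0)   # population std: divide by n (matches np.std's default)
samp = s.std()        # pandas default ddof=1: divide by n - 1
print(round(pop, 4), round(samp, 4))
```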

P-032: Find the percentile values for the sales amount (amount) in the receipt details data (df_receipt) in 25% increments.

```
# Code Example 1
np.percentile(df_receipt['amount'], q=np.arange(1, 5) * 25)
```

array([ 102., 170., 288., 10925.])

Commentary :

The code `np.percentile(df_receipt['amount'], q=np.arange(1, 5) * 25)` computes the quartiles of the values in the `amount` column of the `df_receipt` DataFrame using the `np.percentile()` function from the NumPy library.

Here's an explanation of each component:

- `np.percentile()`: This NumPy function computes the percentiles of an array. Its first argument is the array for which percentiles are to be calculated, and the `q` parameter specifies the percentiles to compute.
- `df_receipt['amount']`: This is a pandas Series object that represents the `amount` column of the `df_receipt` DataFrame, which is assumed to exist in the current environment.
- `q=np.arange(1, 5) * 25`: The `np.arange(1, 5) * 25` expression generates the array `[25, 50, 75, 100]`, so the call computes the 25th, 50th, 75th and 100th percentiles of the distribution.

The code thus computes the quartiles of the `amount` column, the values that divide the distribution into four equal parts. The first quartile (Q1) is the 25th percentile, the second quartile (Q2) is the 50th percentile (also known as the median), the third quartile (Q3) is the 75th percentile, and the 100th percentile is the maximum. The output is an array of these values, which can be used to summarize the distribution of the `amount` column.

```
# Code Example 2
df_receipt.amount.quantile(q=np.arange(1, 5) / 4)
```

0.25 102.0 0.50 170.0 0.75 288.0 1.00 10925.0 Name: amount, dtype: float64

Commentary :

This code calculates the quartiles of the `amount` column in the `df_receipt` DataFrame using the pandas `quantile()` method. The `q` parameter specifies the quantiles to compute: `np.arange(1, 5) / 4` evaluates to the array `[0.25, 0.5, 0.75, 1.0]`, which corresponds to the 1/4, 2/4 (median), 3/4, and 4/4 (maximum) quantiles.

The resulting output is a Series of four values, each representing the value of the corresponding quartile.
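The two code examples can be seen to agree on a small made-up series (a standalone sketch): `np.percentile` takes percentiles in 0-100 units, while `Series.quantile` takes fractions in 0-1 units, and both use linear interpolation by default.

```python
import numpy as np
import pandas as pd

# A made-up series, just to compare the two APIs.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

by_numpy = np.percentile(s, q=np.arange(1, 5) * 25)       # q in [25, 50, 75, 100]
by_pandas = s.quantile(q=np.arange(1, 5) / 4).to_numpy()  # q in [0.25, 0.5, 0.75, 1.0]

# Both default to linear interpolation, so the results match.
print(np.allclose(by_numpy, by_pandas))  # True
```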

P-033: Calculate the average of the sales amount (amount) for each shop code (store_cd) for the receipt details data (df_receipt) and extract those that are 330 or more.

```
df_receipt.groupby('store_cd').amount.mean(). \
reset_index().query('amount >= 330')
```

| | store_cd | amount |
|---|---|---|
| 1 | S12013 | 330.194130 |
| 5 | S13001 | 348.470386 |
| 7 | S13003 | 350.915519 |
| 8 | S13004 | 330.943949 |
| 12 | S13015 | 351.111960 |
| 16 | S13019 | 330.208616 |
| 17 | S13020 | 337.879932 |
| 28 | S13052 | 402.867470 |
| 30 | S14010 | 348.791262 |
| 31 | S14011 | 335.718333 |
| 38 | S14026 | 332.340588 |
| 46 | S14045 | 330.082073 |
| 48 | S14047 | 330.077073 |

Commentary :

The code `df_receipt.groupby('store_cd').amount.mean().reset_index().query('amount >= 330')` groups the `df_receipt` DataFrame by the `store_cd` column, calculates the mean value of the `amount` column for each group, resets the index of the resulting DataFrame, and then filters it to only include rows where the mean value of `amount` is greater than or equal to 330.

Here's an explanation of each component:

- `df_receipt.groupby('store_cd')`: This groups the `df_receipt` DataFrame by the values in the `store_cd` column.
- `.amount.mean()`: This calculates the mean value of the `amount` column for each group.
- `.reset_index()`: This resets the index of the resulting DataFrame.
- `.query('amount >= 330')`: This filters the resulting DataFrame to only include rows where the mean value of `amount` is greater than or equal to 330.

In summary, this code selects only the stores whose average transaction amount is greater than or equal to 330 yen.
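The same pattern can be tried on a hypothetical mini receipt table (store codes and amounts below are made up for illustration):

```python
import pandas as pd

# Made-up data standing in for df_receipt.
df = pd.DataFrame({
    'store_cd': ['S01', 'S01', 'S02', 'S02', 'S03'],
    'amount':   [300,   360,   100,   200,   500],
})

# Per-store mean, then filter with query, as in the answer above.
result = df.groupby('store_cd').amount.mean(). \
    reset_index().query('amount >= 330')
# Keeps S01 (mean 330.0) and S03 (mean 500.0); drops S02 (mean 150.0).
print(result)
```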

P-034: For the receipt details data (df_receipt), sum the amount of sales (amount) for each customer ID (customer_id) and find the average for all customers. However, exclude customer IDs starting with “Z” as they represent non-members.

```
# Code example 1: Writing without query
df_receipt[~df_receipt['customer_id'].str.startswith("Z")]. \
groupby('customer_id').amount.sum().mean()
```

2547.742234529256

Commentary :

The code `df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean()` does the following:

- `df_receipt[~df_receipt['customer_id'].str.startswith("Z")]`: This filters the `df_receipt` DataFrame to exclude rows where the `customer_id` column starts with the letter "Z". The tilde (`~`) character is used as a negation operator to invert the boolean values returned by the `str.startswith()` method.
- `.groupby('customer_id').amount.sum()`: This groups the filtered DataFrame by the `customer_id` column, and calculates the sum of the `amount` column for each group.
- `.mean()`: This calculates the mean value of the resulting Series, which represents the total amount spent by each customer (excluding customers whose `customer_id` starts with "Z").

In summary, this code computes the average total amount spent per customer among customers whose `customer_id` does not start with the letter "Z" in the `df_receipt` DataFrame. This may be useful for understanding the purchasing behavior of regular customers.

```
# Code example 2: Writing with query
df_receipt.query('not customer_id.str.startswith("Z")',
engine='python').groupby('customer_id').amount.sum().mean()
```

2547.742234529256

Commentary :

This code computes the same average using `query`. Here's how it works:

- `df_receipt` is a DataFrame containing information about transactions.
- The `query` method is used to filter out transactions where the `customer_id` starts with the letter "Z". This is done using the `str.startswith()` method, which checks whether a string starts with a certain character or substring; the `engine='python'` argument allows string methods to be used inside the `query` expression.
- The resulting DataFrame is grouped by `customer_id` using the `groupby` method.
- The `sum` method computes the total `amount` spent by each customer.
- The `mean` method then averages the per-customer totals computed in the previous step. This is the final result of the code.
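The equivalence of the two styles (boolean mask with `~` versus `query` with `not`) can be verified on dummy data, where `Z...` IDs stand in for non-members:

```python
import pandas as pd

# Made-up IDs: 'Z...' rows stand in for non-members.
df = pd.DataFrame({
    'customer_id': ['CS001', 'Z0001', 'CS002', 'Z0002'],
    'amount':      [100,     999,     200,     999],
})

# Boolean-mask version: ~ inverts the startswith mask.
via_mask = df[~df['customer_id'].str.startswith('Z')]

# query version: engine='python' allows str methods inside the expression.
via_query = df.query('not customer_id.str.startswith("Z")', engine='python')

print(via_mask.equals(via_query))  # True: both keep only CS001 and CS002
```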

P-035: For the receipt details data (df_receipt), sum the sales amount (amount) for each customer ID (customer_id) to obtain the average of all customers, extract the customers who spend more than the average and display 10 items. However, exclude customer IDs starting with “Z” as they represent non-members.

```
df_amount_sum = df_receipt[~df_receipt['customer_id'].str.startswith("Z")].\
groupby('customer_id').amount.sum()
amount_mean = df_amount_sum.mean()
df_amount_sum = df_amount_sum.reset_index()
df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)
```

| | customer_id | amount |
|---|---|---|
| 2 | CS001115000010 | 3044 |
| 4 | CS001205000006 | 3337 |
| 13 | CS001214000009 | 4685 |
| 14 | CS001214000017 | 4132 |
| 17 | CS001214000052 | 5639 |
| 21 | CS001215000040 | 3496 |
| 30 | CS001304000006 | 3726 |
| 32 | CS001305000005 | 3485 |
| 33 | CS001305000011 | 4370 |
| 53 | CS001315000180 | 3300 |

Commentary :

The code in this answer does the following:

- `df_amount_sum = df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum()`: This filters the `df_receipt` DataFrame to exclude rows where the `customer_id` column starts with the letter "Z", groups the resulting DataFrame by the `customer_id` column, and calculates the sum of the `amount` column for each group. The resulting object is a Series that represents the total amount spent by each customer (excluding customers whose `customer_id` starts with "Z").
- `amount_mean = df_amount_sum.mean()`: This calculates the mean value of the `df_amount_sum` Series, i.e. the average total spending per customer.
- `df_amount_sum = df_amount_sum.reset_index()`: This resets the index of the `df_amount_sum` Series and converts it to a DataFrame with two columns: `customer_id` and `amount`.
- `df_amount_sum[df_amount_sum['amount'] >= amount_mean].head(10)`: This filters the `df_amount_sum` DataFrame to include only rows where the `amount` column is greater than or equal to `amount_mean`, and returns the first 10 rows of the resulting DataFrame.

In summary, this code selects the customers whose total spending is greater than or equal to the average spending of all customers whose `customer_id` does not start with the letter "Z" in the `df_receipt` DataFrame. The resulting DataFrame contains the `customer_id` and total amount spent for each selected customer; note that the rows keep their original order, so `head(10)` returns the first 10 matches rather than the 10 biggest spenders.

P-036: Join the receipt details data (df_receipt) and the shop data (df_store) with an inner join, and display 10 rows containing all columns of the receipt details data together with the store name (store_name) from the shop data.

```
pd.merge(df_receipt, df_store[['store_cd','store_name']],
how='inner', on='store_cd').head(10)
```

| | sales_ymd | sales_epoch | store_cd | receipt_no | receipt_sub_no | customer_id | product_cd | quantity | amount | store_name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20181103 | 1541203200 | S14006 | 112 | 1 | CS006214000001 | P070305012 | 1 | 158 | 葛が谷店 |
| 1 | 20181116 | 1542326400 | S14006 | 112 | 2 | ZZ000000000000 | P080401001 | 1 | 48 | 葛が谷店 |
| 2 | 20170118 | 1484697600 | S14006 | 1162 | 1 | CS006815000006 | P050406035 | 1 | 220 | 葛が谷店 |
| 3 | 20190524 | 1558656000 | S14006 | 1192 | 1 | CS006514000034 | P060104003 | 1 | 80 | 葛が谷店 |
| 4 | 20190419 | 1555632000 | S14006 | 112 | 2 | ZZ000000000000 | P060501002 | 1 | 148 | 葛が谷店 |
| 5 | 20181119 | 1542585600 | S14006 | 1152 | 2 | ZZ000000000000 | P050701001 | 1 | 88 | 葛が谷店 |
| 6 | 20171211 | 1512950400 | S14006 | 1132 | 2 | CS006515000175 | P090903001 | 1 | 80 | 葛が谷店 |
| 7 | 20191021 | 1571616000 | S14006 | 1112 | 2 | CS006415000221 | P040602001 | 1 | 405 | 葛が谷店 |
| 8 | 20170710 | 1499644800 | S14006 | 1132 | 2 | CS006411000036 | P090301051 | 1 | 330 | 葛が谷店 |
| 9 | 20190805 | 1564963200 | S14006 | 112 | 1 | CS006211000012 | P050104001 | 1 | 115 | 葛が谷店 |

Commentary :

The code `pd.merge(df_receipt, df_store[['store_cd','store_name']], how='inner', on='store_cd').head(10)` does the following:

- `df_store[['store_cd','store_name']]`: This selects a subset of the `df_store` DataFrame with two columns: `store_cd` and `store_name`.
- `pd.merge(df_receipt, df_store[['store_cd','store_name']], how='inner', on='store_cd')`: This performs an inner join between the `df_receipt` DataFrame and the two-column `df_store` subset on the `store_cd` column, keeping only rows that have matching values in both DataFrames.
- `.head(10)`: This returns the first 10 rows of the resulting merged DataFrame.

In summary, this code merges the `df_receipt` DataFrame with the `df_store` DataFrame on the `store_cd` column, adding the `store_name` column from `df_store` to the result. The resulting DataFrame contains all columns of `df_receipt` plus the name of the store where each transaction occurred.
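The inner-join behavior is easy to see with two tiny made-up tables; a receipt whose `store_cd` has no match in the store table is dropped:

```python
import pandas as pd

# Hypothetical lookup join: receipts gain a store_name column.
receipts = pd.DataFrame({'store_cd': ['S1', 'S2', 'S9'],
                         'amount':   [100,  200,  300]})
stores = pd.DataFrame({'store_cd':   ['S1', 'S2'],
                       'store_name': ['A',  'B']})

merged = pd.merge(receipts, stores[['store_cd', 'store_name']],
                  how='inner', on='store_cd')
# 'S9' has no match in stores, so the inner join drops that receipt.
print(merged)
```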

P-037: Join the product data (df_product) and the category data (df_category) with an inner join, and display 10 rows containing all columns of the product data together with the sub-category name (category_small_name) from the category data.

```
pd.merge(df_product
, df_category[['category_small_cd','category_small_name']]
, how='inner', on='category_small_cd').head(10)
```

| | product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_small_name |
|---|---|---|---|---|---|---|---|
| 0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 弁当類 |
| 1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 弁当類 |
| 2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 弁当類 |
| 3 | P040101004 | 04 | 0401 | 040101 | 248.0 | 186.0 | 弁当類 |
| 4 | P040101005 | 04 | 0401 | 040101 | 268.0 | 201.0 | 弁当類 |
| 5 | P040101006 | 04 | 0401 | 040101 | 298.0 | 224.0 | 弁当類 |
| 6 | P040101007 | 04 | 0401 | 040101 | 338.0 | 254.0 | 弁当類 |
| 7 | P040101008 | 04 | 0401 | 040101 | 420.0 | 315.0 | 弁当類 |
| 8 | P040101009 | 04 | 0401 | 040101 | 498.0 | 374.0 | 弁当類 |
| 9 | P040101010 | 04 | 0401 | 040101 | 580.0 | 435.0 | 弁当類 |

Commentary :

The code `pd.merge(df_product, df_category[['category_small_cd','category_small_name']], how='inner', on='category_small_cd').head(10)` does the following:

- `df_category[['category_small_cd','category_small_name']]`: This selects a subset of the `df_category` DataFrame with two columns: `category_small_cd` and `category_small_name`.
- `pd.merge(df_product, df_category[['category_small_cd','category_small_name']], how='inner', on='category_small_cd')`: This performs an inner join between the `df_product` DataFrame and the two-column `df_category` subset on the `category_small_cd` column, keeping only rows that have matching values in both DataFrames.
- `.head(10)`: This returns the first 10 rows of the resulting merged DataFrame.

In summary, this code merges the `df_product` DataFrame with the `df_category` DataFrame on the `category_small_cd` column, adding the `category_small_name` column from `df_category` to the result. The resulting DataFrame contains all columns of `df_product` plus the sub-category name of each product.

P-038: Calculate the total sales amount for each customer from the customer data (df_customer) and receipt details data (df_receipt) and display 10 items. However, for customers with no sales records, the sales amount should be displayed as 0. Customers whose gender code (gender_cd) is female (1) should be included, and non-members (whose customer ID starts with “Z”) should be excluded.

```
df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()
df_tmp = df_customer. \
query('gender_cd == "1" and not customer_id.str.startswith("Z")',
engine='python')
pd.merge(df_tmp['customer_id'], df_amount_sum,
how='left', on='customer_id').fillna(0).head(10)
```

| | customer_id | amount |
|---|---|---|
| 0 | CS021313000114 | 0.0 |
| 1 | CS031415000172 | 5088.0 |
| 2 | CS028811000001 | 0.0 |
| 3 | CS001215000145 | 875.0 |
| 4 | CS015414000103 | 3122.0 |
| 5 | CS033513000180 | 868.0 |
| 6 | CS035614000014 | 0.0 |
| 7 | CS011215000048 | 3444.0 |
| 8 | CS009413000079 | 0.0 |
| 9 | CS040412000191 | 210.0 |

Commentary :

The code in this answer does the following:

- `df_amount_sum = df_receipt.groupby('customer_id').amount.sum().reset_index()`: This creates a new DataFrame `df_amount_sum` by grouping the `df_receipt` DataFrame by `customer_id` and summing the `amount` column for each customer. The resulting DataFrame has two columns: `customer_id` and `amount`.
- `df_customer.query('gender_cd == "1" and not customer_id.str.startswith("Z")', engine='python')`: This selects the subset of the `df_customer` DataFrame where `gender_cd` equals "1" (female) and `customer_id` does not start with "Z". The `engine='python'` argument is used to allow the use of string methods like `startswith`.
- `pd.merge(df_tmp['customer_id'], df_amount_sum, how='left', on='customer_id')`: This merges the `customer_id` column from the `df_tmp` DataFrame with the `df_amount_sum` DataFrame on the `customer_id` column using a left join. This keeps all rows from `df_tmp` and adds the `amount` column from `df_amount_sum` where available.
- `.fillna(0)`: This fills any missing values in the `amount` column with 0.
- `.head(10)`: This returns the first 10 rows of the resulting merged DataFrame.

In summary, this code calculates the total amount spent by each customer in the `df_receipt` DataFrame, then selects the female members from the `df_customer` DataFrame (excluding customer IDs that start with "Z"). The two DataFrames are merged on the `customer_id` column with a left join, so customers with no sales records are kept, and their missing `amount` values are filled with 0. The resulting DataFrame contains the `customer_id` and `amount` columns for the selected female customers.
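The left-join-plus-`fillna(0)` pattern can be demonstrated with made-up data, where one customer has no sales record at all:

```python
import pandas as pd

# Made-up customers; C2 has no sales record.
customers = pd.DataFrame({'customer_id': ['C1', 'C2', 'C3']})
sums = pd.DataFrame({'customer_id': ['C1', 'C3'], 'amount': [500, 250]})

# Left join keeps every customer; fillna(0) covers those with no sales.
out = pd.merge(customers['customer_id'], sums,
               how='left', on='customer_id').fillna(0)
print(out)
```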

P-039: From the receipt details data (df_receipt), create a data set containing the top 20 customers with the highest number of days of sales and a data set containing the top 20 customers with the highest total sales value, respectively, and furthermore, merge these two completely externally. However, exclude non-members (whose customer ID starts with “Z”).

```
df_data = df_receipt \
.query('not customer_id.str.startswith("Z")', engine='python')
df_cnt = df_data[~df_data.duplicated(subset=['customer_id', 'sales_ymd'])] \
.groupby('customer_id').sales_ymd.count().reset_index() \
.sort_values('sales_ymd', ascending=False).head(20)
df_sum = df_data.groupby('customer_id').amount.sum().reset_index() \
.sort_values('amount', ascending=False).head(20)
pd.merge(df_cnt, df_sum, how='outer', on='customer_id')
```

| | customer_id | sales_ymd | amount |
|---|---|---|---|
| 0 | CS040214000008 | 23.0 | NaN |
| 1 | CS015415000185 | 22.0 | 20153.0 |
| 2 | CS010214000010 | 22.0 | 18585.0 |
| 3 | CS010214000002 | 21.0 | NaN |
| 4 | CS028415000007 | 21.0 | 19127.0 |
| 5 | CS017415000097 | 20.0 | 23086.0 |
| 6 | CS016415000141 | 20.0 | 18372.0 |
| 7 | CS031414000051 | 19.0 | 19202.0 |
| 8 | CS014214000023 | 19.0 | NaN |
| 9 | CS022515000226 | 19.0 | NaN |
| 10 | CS021515000172 | 19.0 | NaN |
| 11 | CS039414000052 | 19.0 | NaN |
| 12 | CS021514000045 | 19.0 | NaN |
| 13 | CS022515000028 | 18.0 | NaN |
| 14 | CS030214000008 | 18.0 | NaN |
| 15 | CS021515000056 | 18.0 | NaN |
| 16 | CS014415000077 | 18.0 | NaN |
| 17 | CS021515000211 | 18.0 | NaN |
| 18 | CS032415000209 | 18.0 | NaN |
| 19 | CS031414000073 | 18.0 | NaN |
| 20 | CS001605000009 | NaN | 18925.0 |
| 21 | CS006515000023 | NaN | 18372.0 |
| 22 | CS011414000106 | NaN | 18338.0 |
| 23 | CS038415000104 | NaN | 17847.0 |
| 24 | CS035414000024 | NaN | 17615.0 |
| 25 | CS021515000089 | NaN | 17580.0 |
| 26 | CS032414000072 | NaN | 16563.0 |
| 27 | CS016415000101 | NaN | 16348.0 |
| 28 | CS011415000006 | NaN | 16094.0 |
| 29 | CS034415000047 | NaN | 16083.0 |
| 30 | CS007514000094 | NaN | 15735.0 |
| 31 | CS009414000059 | NaN | 15492.0 |
| 32 | CS030415000034 | NaN | 15468.0 |
| 33 | CS015515000034 | NaN | 15300.0 |

Commentary :

This code performs the following steps:

- Filter the rows of the `df_receipt` DataFrame where the `customer_id` column does not start with "Z" using the `.query()` method, saving the result as `df_data`.
- Drop duplicate (`customer_id`, `sales_ymd`) pairs from `df_data`, group by `customer_id`, count the number of distinct sales dates, sort by `sales_ymd` in descending order, and keep the top 20 rows as `df_cnt`.
- Group `df_data` by `customer_id`, sum the `amount` column, sort by `amount` in descending order, and keep the top 20 rows as `df_sum`.
- Merge `df_cnt` and `df_sum` on the `customer_id` column using `pd.merge()` with `how='outer'`, so that every customer appearing in either top-20 list is included in the result. Customers that appear in only one of the two lists get `NaN` in the other list's column.
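The full outer join can be illustrated with two tiny made-up top lists; adding pandas' `indicator=True` option (not used in the answer above) labels which side each row came from:

```python
import pandas as pd

# Toy stand-ins for the two top-20 lists.
top_days = pd.DataFrame({'customer_id': ['C1', 'C2'], 'sales_ymd': [23, 22]})
top_amount = pd.DataFrame({'customer_id': ['C2', 'C3'], 'amount': [20000, 18000]})

# Full outer join keeps customers appearing in either list;
# indicator=True adds a _merge column showing the origin of each row.
both = pd.merge(top_days, top_amount, how='outer',
                on='customer_id', indicator=True)
print(both)
```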

P-040: You want to create data combining all shops and all products. Take the direct product (Cartesian product) of the shop data (df_store) and the product data (df_product), and count the number of resulting rows.

```
df_store_tmp = df_store.copy()
df_product_tmp = df_product.copy()
df_store_tmp['key'] = 0
df_product_tmp['key'] = 0
len(pd.merge(df_store_tmp, df_product_tmp, how='outer', on='key'))
```

531590

Commentary :

This code takes the Cartesian product (cross join) of the two DataFrames `df_store` and `df_product`.

First, two temporary DataFrames `df_store_tmp` and `df_product_tmp` are created as copies of `df_store` and `df_product`, respectively. Then a new column called `key` is added to both DataFrames with a constant value of 0. This is done because `pd.merge()` requires a common column to merge on.

The two DataFrames are then merged with `pd.merge()`, where `how='outer'` requests an outer join and `on='key'` specifies the common column. Because every row in both DataFrames has the same `key` value, each store row matches every product row, so the result is the full Cartesian product; with a constant key there are no unmatched rows, and inner, left, right, and outer joins would all give the same result.

Lastly, `len()` returns the number of rows in the merged DataFrame, which equals the number of store rows multiplied by the number of product rows (here 531,590).
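On toy data the dummy-key trick can be compared with `pd.merge`'s built-in `how='cross'` (available in pandas 1.2 and later), which produces the same Cartesian product without the helper column:

```python
import pandas as pd

# Made-up miniature store and product tables.
stores = pd.DataFrame({'store_cd': ['S1', 'S2']})
products = pd.DataFrame({'product_cd': ['P1', 'P2', 'P3']})

# Dummy-key trick from the answer above.
stores_tmp = stores.copy()
products_tmp = products.copy()
stores_tmp['key'] = 0
products_tmp['key'] = 0
via_key = pd.merge(stores_tmp, products_tmp, how='outer', on='key')

# Built-in cross join (pandas >= 1.2) gives the same row count.
via_cross = pd.merge(stores, products, how='cross')

print(len(via_key), len(via_cross))  # 6 6
```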

## Comment