参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック(構造化データ加工編)」
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
-
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers.
-
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
Getting Started
- First, run the following cells
- Import the necessary libraries and read data from the database (PostgreSQL)
- The following cell imports libraries that are expected to be used, such as pandas.
- If there are other libraries you wish to use, install them as needed (you can also install them by “!pip install library name”).
- The process can be divided into multiple times.
- Names, addresses, etc. are dummy data and are not real.
import os
import pandas as pd
import numpy as np
from datetime import datetime, date
from dateutil.relativedelta import relativedelta
import math
import psycopg2
from sqlalchemy import create_engine
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from imblearn.under_sampling import RandomUnderSampler
if 'PG_PORT' in os.environ:
pgconfig = {
'host': 'db',
'port': os.environ['PG_PORT'],
'database': os.environ['PG_DATABASE'],
'user': os.environ['PG_USER'],
'password': os.environ['PG_PASSWORD'],
}
# Connector for pd.read_sql
conn = psycopg2.connect(**pgconfig)
df_customer = pd.read_sql(sql='select * from customer', con=conn)
df_category = pd.read_sql(sql='select * from category', con=conn)
df_product = pd.read_sql(sql='select * from product', con=conn)
df_receipt = pd.read_sql(sql='select * from receipt', con=conn)
df_store = pd.read_sql(sql='select * from store', con=conn)
df_geocode = pd.read_sql(sql='select * from geocode', con=conn)
else:
if not os.path.exists('data/'):
!git clone https://github.com/The-Japan-DataScientist-Society/100knocks-preprocess
os.chdir('100knocks-preprocess/docker/work/')
dtype = {
'customer_id': str,
'gender_cd': str,
'postal_cd': str,
'application_store_cd': str,
'status_cd': str,
'category_major_cd': str,
'category_medium_cd': str,
'category_small_cd': str,
'product_cd': str,
'store_cd': str,
'prefecture_cd': str,
'tel_no': str,
'postal_cd': str,
'street': str
}
df_customer = pd.read_csv("data/customer.csv", dtype=dtype)
df_category = pd.read_csv("data/category.csv", dtype=dtype)
df_product = pd.read_csv("data/product.csv", dtype=dtype)
df_receipt = pd.read_csv("data/receipt.csv", dtype=dtype)
df_store = pd.read_csv("data/store.csv", dtype=dtype)
df_geocode = pd.read_csv("data/geocode.csv", dtype=dtype)
fatal: destination path '100knocks-preprocess' already exists and is not an empty directory.
Exercise
P-061: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), convert the total sales amount to ordinary logarithm (bottom 10) and display 10 items with customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
engine='python'). \
groupby('customer_id'). \
agg({'amount':'sum'}).reset_index()
df_sales_amount['log_amount'] = np.log10(df_sales_amount['amount'] + 0.5)
df_sales_amount.head(10)
customer_id | amount | log_amount | |
---|---|---|---|
0 | CS001113000004 | 1298 | 3.113442 |
1 | CS001114000005 | 626 | 2.796921 |
2 | CS001115000010 | 3044 | 3.483516 |
3 | CS001205000004 | 1988 | 3.298526 |
4 | CS001205000006 | 3337 | 3.523421 |
5 | CS001211000025 | 456 | 2.659441 |
6 | CS001212000027 | 448 | 2.651762 |
7 | CS001212000031 | 296 | 2.472025 |
8 | CS001212000046 | 228 | 2.358886 |
9 | CS001212000070 | 456 | 2.659441 |
Commentary :
The code performs the following actions:
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python')
It selects all rows from a DataFrame called df_receipt
where the customer_id
column does not start with the letter "Z". The resulting DataFrame is saved as df_sales_amount
.
df_sales_amount = df_sales_amount.groupby('customer_id').agg({'amount':'sum'}).reset_index()
It groups the df_sales_amount
DataFrame by customer_id
and calculates the sum of the amount
column for each group using the agg
method. The resulting DataFrame is saved as df_sales_amount
.
df_sales_amount['log_amount'] = np.log10(df_sales_amount['amount'] + 0.5)
It calculates the logarithm (base 10) of the amount
column in the df_sales_amount
DataFrame, adding 0.5 to the amount
values first to avoid taking the logarithm of 0 or negative values. The resulting logarithmic values are saved in a new column called log_amount
.
df_sales_amount.head(10)
It prints the first 10 rows of the resulting df_sales_amount
DataFrame using the head
method.
In summary, the code selects rows from a DataFrame where the customer ID does not start with "Z", groups those rows by customer ID, calculates the sum of sales amounts for each customer, takes the logarithm (base 10) of the sales amounts, and prints the first 10 rows of the resulting DataFrame.
P-062: Sum the sales amount (amount) of the receipt details data (df_receipt) for each customer ID (customer_id), convert the total sales amount to the natural logarithm (bottom e) and display 10 items with the customer ID and total sales amount. However, exclude customer IDs starting with “Z” as they represent non-members.
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
engine='python'). \
groupby('customer_id'). \
agg({'amount':'sum'}).reset_index()
df_sales_amount['log_amount'] = np.log(df_sales_amount['amount'] + 0.5)
df_sales_amount.head(10)
customer_id | amount | log_amount | |
---|---|---|---|
0 | CS001113000004 | 1298 | 7.168965 |
1 | CS001114000005 | 626 | 6.440149 |
2 | CS001115000010 | 3044 | 8.021092 |
3 | CS001205000004 | 1988 | 7.595136 |
4 | CS001205000006 | 3337 | 8.112977 |
5 | CS001211000025 | 456 | 6.123589 |
6 | CS001212000027 | 448 | 6.105909 |
7 | CS001212000031 | 296 | 5.692047 |
8 | CS001212000046 | 228 | 5.431536 |
9 | CS001212000070 | 456 | 6.123589 |
Commentary :
This code performs some data processing on a pandas DataFrame calleddf_receipt
.
First, it filters out any rows where thecustomer_id
column starts with the letter "Z". This is done using thequery()
method with a condition that uses thestr.startswith()
method. Theengine='python'
argument is provided to use the Python parser instead of the default pandas parser for string operations.
The resulting DataFrame is then grouped by thecustomer_id
column using thegroupby()
method.
Theagg()
method is used to apply a function to theamount
column for each group. In this case, thesum()
function is used to calculate the total sales amount for each customer.
The resulting DataFrame is then reset to have a new index usingreset_index()
.
Finally, a new column calledlog_amount
is created by taking the natural logarithm of theamount
column plus 0.5 (to avoid taking the logarithm of zero or negative values).
The first 10 rows of the resulting DataFrame are displayed using thehead()
method.
P-063: Calculate the profit amount of each product from the unit price(unit price) and cost of the product(unit_cost) data (df_product) and display 10 results.
df_tmp = df_product.copy()
df_tmp['unit_profit'] = df_tmp['unit_price'] - df_tmp['unit_cost']
df_tmp.head(10)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | unit_profit | |
---|---|---|---|---|---|---|---|
0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 49.0 |
1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 54.0 |
2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 57.0 |
3 | P040101004 | 04 | 0401 | 040101 | 248.0 | 186.0 | 62.0 |
4 | P040101005 | 04 | 0401 | 040101 | 268.0 | 201.0 | 67.0 |
5 | P040101006 | 04 | 0401 | 040101 | 298.0 | 224.0 | 74.0 |
6 | P040101007 | 04 | 0401 | 040101 | 338.0 | 254.0 | 84.0 |
7 | P040101008 | 04 | 0401 | 040101 | 420.0 | 315.0 | 105.0 |
8 | P040101009 | 04 | 0401 | 040101 | 498.0 | 374.0 | 124.0 |
9 | P040101010 | 04 | 0401 | 040101 | 580.0 | 435.0 | 145.0 |
Commentary :
The code performs the following actions:
df_tmp = df_product.copy()
It creates a copy of a DataFrame called df_product
and assigns it to a new variable called df_tmp
. The copy is made to avoid modifying the original DataFrame.
df_tmp['unit_profit'] = df_tmp['unit_price'] - df_tmp['unit_cost']
It calculates the profit per unit for each product by subtracting the unit cost from the unit price, and saves the result in a new column called unit_profit
. The resulting DataFrame is saved as df_tmp
.
df_tmp.head(10)
It prints the first 10 rows of the df_tmp
DataFrame using the head
method.
In summary, the code creates a copy of a DataFrame, calculates the profit per unit for each product by subtracting the unit cost from the unit price, adds a new column with the profit per unit to the copied DataFrame, and prints the first 10 rows of the resulting DataFrame.
P-064: Calculate the overall average profit margin for each product from the unit price (unit_price) and cost (unit_cost) of the product data (df_product). Note, however, that unit_price and cost are missing.
df_tmp = df_product.copy()
df_tmp['unit_profit_rate'] = \
(df_tmp['unit_price'] - df_tmp['unit_cost']) / df_tmp['unit_price']
df_tmp['unit_profit_rate'].mean(skipna=True)
0.24911389885177
Commentary :
The code performs the following actions:
df_tmp = df_product.copy()
It creates a copy of a DataFrame called df_product
and assigns it to a new variable called df_tmp
. The copy is made to avoid modifying the original DataFrame.
df_tmp['unit_profit_rate'] = (df_tmp['unit_price'] - df_tmp['unit_cost']) / df_tmp['unit_price']
It calculates the profit rate per unit for each product by dividing the profit (unit price minus unit cost) by the unit price, and saves the result in a new column called unit_profit_rate
. The resulting DataFrame is saved as df_tmp
.
df_tmp['unit_profit_rate'].mean(skipna=True)
It calculates the mean of the unit_profit_rate
column in the df_tmp
DataFrame using the mean
method, skipping missing values (NaN) if any, and prints the result
In summary, the code creates a copy of a DataFrame, calculates the profit rate per unit for each product by dividing the profit (unit price minus unit cost) by the unit price, adds a new column with the profit rate per unit to the copied DataFrame, calculates the mean of the profit rate column, and prints the resulting mean value.
P-065: For each product in the product data (df_product), find the new unit price that gives a profit margin of 30%. However, round down to the nearest yen. Then display 10 results and confirm that the profit margin is approximately around 30%. Note that there is a deficit in unit_price(unit_price)and unit_cost(unit_cost).
df_tmp = df_product[['product_cd', 'unit_price', 'unit_cost']].copy()
df_tmp['new_price'] = np.floor(df_tmp['unit_cost'] / 0.7)
df_tmp['new_profit_rate'] = \
(df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']
df_tmp.head(10)
product_cd | unit_price | unit_cost | new_price | new_profit_rate | |
---|---|---|---|---|---|
0 | P040101001 | 198.0 | 149.0 | 212.0 | 0.297170 |
1 | P040101002 | 218.0 | 164.0 | 234.0 | 0.299145 |
2 | P040101003 | 230.0 | 173.0 | 247.0 | 0.299595 |
3 | P040101004 | 248.0 | 186.0 | 265.0 | 0.298113 |
4 | P040101005 | 268.0 | 201.0 | 287.0 | 0.299652 |
5 | P040101006 | 298.0 | 224.0 | 320.0 | 0.300000 |
6 | P040101007 | 338.0 | 254.0 | 362.0 | 0.298343 |
7 | P040101008 | 420.0 | 315.0 | 450.0 | 0.300000 |
8 | P040101009 | 498.0 | 374.0 | 534.0 | 0.299625 |
9 | P040101010 | 580.0 | 435.0 | 621.0 | 0.299517 |
Commentary :
The code performs the following actions:
df_tmp = df_product[['product_cd', 'unit_price', 'unit_cost']].copy()
It creates a new DataFrame called df_tmp
that includes only three columns of interest: product_cd
, unit_price
, and unit_cost
, and makes a copy of these columns from the original df_product
DataFrame.
df_tmp['new_price'] = np.floor(df_tmp['unit_cost'] / 0.7)
It calculates a new price for each product by dividing the unit cost by 0.7 (a markup factor of 1.43), rounding down to the nearest integer using the floor
method, and saves the result in a new column called new_price
. The resulting DataFrame is saved as df_tmp
.
df_tmp['new_profit_rate'] = (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']
It calculates the profit rate per unit for each product using the new price calculated in the previous step, and saves the result in a new column called new_profit_rate
. The resulting DataFrame is saved as df_tmp
.
df_tmp.head(10)
It prints the first 10 rows of the df_tmp
DataFrame using the head
method.
In summary, the code creates a new DataFrame with three columns of interest, calculates a new price for each product based on a markup factor of 1.43, calculates the profit rate per unit for each product using the new price, adds two new columns to the copied DataFrame, and prints the first 10 rows of the resulting DataFrame.
P-066: For each product in the product data (df_product), find a new unit price that gives a profit margin of 30%. This time, round to the nearest yen (rounding or rounding to even numbers is fine). Then display 10 results and check that the profit margin is around 30%. Note, however, that there are deficiencies in unit_price(unit_price) and unit_cost(unit_cost).
df_tmp = df_product[['product_cd', 'unit_price', 'unit_cost']].copy()
df_tmp['new_price'] = np.round(df_tmp['unit_cost'] / 0.7)
df_tmp['new_profit_rate'] = \
(df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']
df_tmp.head(10)
product_cd | unit_price | unit_cost | new_price | new_profit_rate | |
---|---|---|---|---|---|
0 | P040101001 | 198.0 | 149.0 | 213.0 | 0.300469 |
1 | P040101002 | 218.0 | 164.0 | 234.0 | 0.299145 |
2 | P040101003 | 230.0 | 173.0 | 247.0 | 0.299595 |
3 | P040101004 | 248.0 | 186.0 | 266.0 | 0.300752 |
4 | P040101005 | 268.0 | 201.0 | 287.0 | 0.299652 |
5 | P040101006 | 298.0 | 224.0 | 320.0 | 0.300000 |
6 | P040101007 | 338.0 | 254.0 | 363.0 | 0.300275 |
7 | P040101008 | 420.0 | 315.0 | 450.0 | 0.300000 |
8 | P040101009 | 498.0 | 374.0 | 534.0 | 0.299625 |
9 | P040101010 | 580.0 | 435.0 | 621.0 | 0.299517 |
Commentary :
The code performs the following actions:
df_tmp = df_product[['product_cd', 'unit_price', 'unit_cost']].copy()
It creates a new DataFrame called df_tmp
that includes only three columns of interest: product_cd
, unit_price
, and unit_cost
, and makes a copy of these columns from the original df_product
DataFrame.
df_tmp['new_price'] = np.round(df_tmp['unit_cost'] / 0.7)
It calculates a new price for each product by dividing the unit cost by 0.7 (a markup factor of 1.43), rounding to the nearest integer using the round
method, and saves the result in a new column called new_price
. The resulting DataFrame is saved as df_tmp
.
df_tmp['new_profit_rate'] = (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']
It calculates the profit rate per unit for each product using the new price calculated in the previous step, and saves the result in a new column called new_profit_rate
. The resulting DataFrame is saved as df_tmp
.
df_tmp.head(10)
It prints the first 10 rows of the df_tmp
DataFrame using the head
method.
In summary, the code creates a new DataFrame with three columns of interest, calculates a new price for each product based on a markup factor of 1.43 and rounding to the nearest integer, calculates the profit rate per unit for each product using the new price, adds two new columns to the copied DataFrame, and prints the first 10 rows of the resulting DataFrame.
P-067: For each product in the product data (df_product), find a new unit price that gives a profit margin of 30%. This time, round up to the nearest yen. Then display 10 results and check that the profit margin is around 30%. Note, however, that there is a deficit in the unit price (unit_price) and cost (unit_cost).
df_tmp = df_product[['product_cd', 'unit_price', 'unit_cost']].copy()
df_tmp['new_price'] = np.ceil(df_tmp['unit_cost'] / 0.7)
df_tmp['new_profit_rate'] = \
(df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']
df_tmp.head(10)
product_cd | unit_price | unit_cost | new_price | new_profit_rate | |
---|---|---|---|---|---|
0 | P040101001 | 198.0 | 149.0 | 213.0 | 0.300469 |
1 | P040101002 | 218.0 | 164.0 | 235.0 | 0.302128 |
2 | P040101003 | 230.0 | 173.0 | 248.0 | 0.302419 |
3 | P040101004 | 248.0 | 186.0 | 266.0 | 0.300752 |
4 | P040101005 | 268.0 | 201.0 | 288.0 | 0.302083 |
5 | P040101006 | 298.0 | 224.0 | 320.0 | 0.300000 |
6 | P040101007 | 338.0 | 254.0 | 363.0 | 0.300275 |
7 | P040101008 | 420.0 | 315.0 | 451.0 | 0.301552 |
8 | P040101009 | 498.0 | 374.0 | 535.0 | 0.300935 |
9 | P040101010 | 580.0 | 435.0 | 622.0 | 0.300643 |
Commentary :
This code manipulats a pandas DataFrame object nameddf_product
.
Here's a breakdown of the code line-by-line:df_tmp = df_product[['product_cd', 'unit_price', 'unit_cost']].copy()
: This line creates a new DataFrame object nameddf_tmp
that is a subset ofdf_product
and includes only the columns "product_cd", "unit_price", and "unit_cost". The.copy()
method is used to create a new copy of the DataFrame instead of referencing the originaldf_product
.df_tmp['new_price'] = np.ceil(df_tmp['unit_cost'] / 0.7)
: This line calculates a new column called "new_price" indf_tmp
, which is the result of dividing the "unit_cost" column by 0.7 and then taking the ceiling of the result usingnumpy
'sceil()
function. This calculats a new price based on a profit margin of 30%.df_tmp['new_profit_rate'] = \ (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']
: This line calculates a new column called "new_profit_rate" indf_tmp
, which is the result of subtracting the "unit_cost" column from the "new_price" column and then dividing by the "new_price" column. This calculats the profit rate for the new price.df_tmp.head(10)
: This line prints out the first 10 rows of thedf_tmp
DataFrame. This uses to check that the calculations in the previous lines have been performed correctly.
P-068: For each product in the product data (df_product), find the amount including tax at a consumption tax rate of 10%, rounding down fractions of a yen and displaying 10 results. Note that the unit price (unit_price) is missing.
df_tmp = df_tmp = df_product[['product_cd', 'unit_price']].copy()
df_tmp['tax_price'] = np.floor(df_tmp['unit_price'] * 1.1)
df_tmp.head(10)
product_cd | unit_price | tax_price | |
---|---|---|---|
0 | P040101001 | 198.0 | 217.0 |
1 | P040101002 | 218.0 | 239.0 |
2 | P040101003 | 230.0 | 253.0 |
3 | P040101004 | 248.0 | 272.0 |
4 | P040101005 | 268.0 | 294.0 |
5 | P040101006 | 298.0 | 327.0 |
6 | P040101007 | 338.0 | 371.0 |
7 | P040101008 | 420.0 | 462.0 |
8 | P040101009 | 498.0 | 547.0 |
9 | P040101010 | 580.0 | 638.0 |
Commentary :
This code also manipulats a pandas DataFrame object nameddf_product
.
Here's a breakdown of the code line-by-line:df_tmp = df_tmp = df_product[['product_cd', 'unit_price']].copy()
: This line creates a new DataFrame object nameddf_tmp
that is a subset ofdf_product
and includes only the columns "product_cd" and "unit_price". The.copy()
method is used to create a new copy of the DataFrame instead of referencing the originaldf_product
.df_tmp['tax_price'] = np.floor(df_tmp['unit_price'] * 1.1)
: This line calculates a new column called "tax_price" indf_tmp
, which is the result of multiplying the "unit_price" column by 1.1 and then taking the floor of the result usingnumpy
'sfloor()
function. This calculates a new price with a 10% tax added.df_tmp.head(10)
: This line prints out the first 10 rows of thedf_tmp
DataFrame. This uses to check that the calculations in the previous lines have been performed correctly.
P-069: Combine receipt details data (df_receipt) and product data (df_product), calculate the total sales value of all products for each customer and the total sales value of products whose category major classification code (category_major_cd) is “07” (bottled canned food), and find the ratio between the two. Only customers with a sales record for category major category code “07” (bottled canned food) should be selected, and 10 results should be displayed.
# Code example 1
df_tmp_1 = df_receipt.groupby('customer_id').agg({'amount':'sum'}). \
reset_index().rename(columns={'amount':'sum_all'})
df_tmp_2 = pd.merge(df_receipt, df_product.query('category_major_cd == "07"'),
how='inner', on='product_cd').groupby('customer_id').\
agg({'amount':'sum'}).reset_index().\
rename(columns={'amount':'sum_07'})
df_tmp_3 = pd.merge(df_tmp_1, df_tmp_2, how='inner', on='customer_id')
df_tmp_3['sales_rate'] = df_tmp_3['sum_07'] / df_tmp_3['sum_all']
df_tmp_3.head(10)
customer_id | sum_all | sum_07 | sales_rate | |
---|---|---|---|---|
0 | CS001113000004 | 1298 | 1298 | 1.000000 |
1 | CS001114000005 | 626 | 486 | 0.776358 |
2 | CS001115000010 | 3044 | 2694 | 0.885020 |
3 | CS001205000004 | 1988 | 346 | 0.174044 |
4 | CS001205000006 | 3337 | 2004 | 0.600539 |
5 | CS001212000027 | 448 | 200 | 0.446429 |
6 | CS001212000031 | 296 | 296 | 1.000000 |
7 | CS001212000046 | 228 | 108 | 0.473684 |
8 | CS001212000070 | 456 | 308 | 0.675439 |
9 | CS001213000018 | 243 | 145 | 0.596708 |
Commentary :
This code also manipulats pandas DataFrame objects,df_receipt
anddf_product
.
Here's a breakdown of the code line-by-line:df_tmp_1 = df_receipt.groupby('customer_id').agg({'amount':'sum'}).reset_index().rename(columns={'amount':'sum_all'})
: This line groups thedf_receipt
DataFrame by "customer_id", calculates the sum of the "amount" column for each group, and then resets the index of the resulting DataFrame. The resulting DataFrame is then renamed to "sum_all" using the.rename()
method.df_tmp_2 = pd.merge(df_receipt, df_product.query('category_major_cd == "07"'), how='inner', on='product_cd').groupby('customer_id').agg({'amount':'sum'}).reset_index().rename(columns={'amount':'sum_07'})
: This line first filtersdf_product
to include only rows where the "category_major_cd" column equals "07". It then merges this filtered DataFrame withdf_receipt
on the "product_cd" column using an inner join. The resulting DataFrame is then grouped by "customer_id", the sum of the "amount" column is calculated for each group, and the index is reset. The resulting DataFrame is then renamed to "sum_07" using the.rename()
method.df_tmp_3 = pd.merge(df_tmp_1, df_tmp_2, how='inner', on='customer_id')
: This line mergesdf_tmp_1
anddf_tmp_2
on the "customer_id" column using an inner join. This creates a new DataFrame that contains the "sum_all" and "sum_07" columns for each customer.df_tmp_3['sales_rate'] = df_tmp_3['sum_07'] / df_tmp_3['sum_all']
: This line calculates a new column indf_tmp_3
called "sales_rate", which is the result of dividing the "sum_07" column by the "sum_all" column. This calculates the percentage of a customer's total purchases that are from the "07" category.df_tmp_3.head(10)
: This line prints out the first 10 rows of thedf_tmp_3
DataFrame. This uses to check that the calculations in the previous lines have been performed correctly.
# Code example 2 (for reference, using unstack and horizontal sum)
df_temp = df_receipt.merge(df_product, how='left', on='product_cd'). \
groupby(['customer_id', 'category_major_cd'])['amount'].sum().unstack()
df_temp = df_temp[df_temp['07'] > 0]
df_temp['sum_all'] = df_temp.sum(axis=1)
df_temp['sales_rate'] = df_temp['07'] / df_temp['sum_all']
# The rest of the process is for formatting and displaying the data frame
df_temp.columns.name = ''
df_temp = df_temp.reset_index()
df_temp.head(10)
customer_id | 04 | 05 | 06 | 07 | 08 | 09 | sum_all | sales_rate | |
---|---|---|---|---|---|---|---|---|---|
0 | CS001113000004 | NaN | NaN | NaN | 1298.0 | NaN | NaN | 1298.0 | 1.000000 |
1 | CS001114000005 | NaN | 40.0 | NaN | 486.0 | 100.0 | NaN | 626.0 | 0.776358 |
2 | CS001115000010 | NaN | NaN | NaN | 2694.0 | NaN | 350.0 | 3044.0 | 0.885020 |
3 | CS001205000004 | 100.0 | 128.0 | 286.0 | 346.0 | 368.0 | 760.0 | 1988.0 | 0.174044 |
4 | CS001205000006 | 635.0 | 60.0 | 198.0 | 2004.0 | 80.0 | 360.0 | 3337.0 | 0.600539 |
5 | CS001212000027 | 248.0 | NaN | NaN | 200.0 | NaN | NaN | 448.0 | 0.446429 |
6 | CS001212000031 | NaN | NaN | NaN | 296.0 | NaN | NaN | 296.0 | 1.000000 |
7 | CS001212000046 | NaN | NaN | NaN | 108.0 | NaN | 120.0 | 228.0 | 0.473684 |
8 | CS001212000070 | NaN | NaN | 148.0 | 308.0 | NaN | NaN | 456.0 | 0.675439 |
9 | CS001213000018 | NaN | NaN | NaN | 145.0 | 98.0 | NaN | 243.0 | 0.596708 |
Commentary :
This code also manipulates pandas DataFrame objects,df_receipt
anddf_product
.
Here's a breakdown of the code line-by-line:df_temp = df_receipt.merge(df_product, how='left', on='product_cd').groupby(['customer_id', 'category_major_cd'])['amount'].sum().unstack()
: This line first mergesdf_receipt
anddf_product
on the "product_cd" column using a left join. It then groups the resulting DataFrame by "customer_id" and "category_major_cd", calculates the sum of the "amount" column for each group, and unstacks the resulting Series to create a DataFrame with "customer_id" as the index, "category_major_cd" as the column names, and the sum of the "amount" column as the values.df_temp = df_temp[df_temp['07'] > 0]
: This line filters thedf_temp
DataFrame to only include rows where the value in the "07" column is greater than 0. This removes customers who have not purchased anything from the "07" category.df_temp['sum_all'] = df_temp.sum(axis=1)
: This line calculates a new column indf_temp
called "sum_all", which is the sum of all the columns in each row. This calculats the total amount spent by each customer across all categories.df_temp['sales_rate'] = df_temp['07'] / df_temp['sum_all']
: This line calculates a new column indf_temp
called "sales_rate", which is the result of dividing the "07" column by the "sum_all" column. This calculates the percentage of a customer's total purchases that are from the "07" category.df_temp.columns.name = ''
: This line removes the column name from thedf_temp
DataFrame.df_temp = df_temp.reset_index()
: This line resets the index ofdf_temp
to create a new column called "index" with a range of integers starting from 0.df_temp.head(10)
: This line prints out the first 10 rows of thedf_temp
DataFrame. This uses to check that the calculations in the previous lines have been performed correctly.
P-070: Calculate the number of days elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt) and display 10 items with the customer ID (customer_id), sales date and membership application date. (Note that sales_ymd is numeric and application_date is a string).
df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates()
df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']],
how='inner', on='customer_id')
df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))
df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date'])
df_tmp['elapsed_days'] = df_tmp['sales_ymd'] - df_tmp['application_date']
df_tmp['elapsed_days'] = df_tmp['elapsed_days'].dt.days
df_tmp.head(10)
customer_id | sales_ymd | application_date | elapsed_days | |
---|---|---|---|---|
0 | CS006214000001 | 2018-11-03 | 1970-01-01 00:00:00.020150201 | 17837 |
1 | CS006214000001 | 2017-05-09 | 1970-01-01 00:00:00.020150201 | 17294 |
2 | CS006214000001 | 2017-06-08 | 1970-01-01 00:00:00.020150201 | 17324 |
3 | CS006214000001 | 2018-10-28 | 1970-01-01 00:00:00.020150201 | 17831 |
4 | CS006214000001 | 2019-09-08 | 1970-01-01 00:00:00.020150201 | 18146 |
5 | CS006214000001 | 2018-01-31 | 1970-01-01 00:00:00.020150201 | 17561 |
6 | CS006214000001 | 2017-07-05 | 1970-01-01 00:00:00.020150201 | 17351 |
7 | CS006214000001 | 2018-11-10 | 1970-01-01 00:00:00.020150201 | 17844 |
8 | CS006214000001 | 2019-04-10 | 1970-01-01 00:00:00.020150201 | 17995 |
9 | CS006214000001 | 2019-06-01 | 1970-01-01 00:00:00.020150201 | 18047 |
Commentary :
This code is performing data manipulation on a DataFrame nameddf_receipt
and another DataFrame nameddf_customer
. It calculates the number of days between a customer's application date and their purchase date.
Here's a step-by-step breakdown of the code:df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates()
: creates a new DataFrame nameddf_tmp
by selecting only the columns "customer_id" and "sales_ymd" from thedf_receipt
DataFrame, and then dropping any duplicate rows. The resulting DataFrame will have a unique combination of customer IDs and sales dates.df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']], how='inner', on='customer_id')
: performs an inner join ondf_tmp
and thedf_customer
DataFrame, selecting only the "customer_id" and "application_date" columns from thedf_customer
DataFrame. The resulting DataFrame will only contain rows with customer IDs that are present in bothdf_tmp
anddf_customer
.df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))
: converts the "sales_ymd" column ofdf_tmp
to a datetime format using thepd.to_datetime()
function. This function converts the column from a string format to a datetime format that can be used for calculations.df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date'])
: converts the "application_date" column ofdf_tmp
to a datetime format using thepd.to_datetime()
function. This function converts the column from a string format to a datetime format that can be used for calculations.df_tmp['elapsed_days'] = df_tmp['sales_ymd'] - df_tmp['application_date']
: calculates the difference between the "sales_ymd" and "application_date" columns and assigns the result to a new column named "elapsed_days" indf_tmp
. This calculation will result in a timedelta object, which represents the difference between two dates in days, hours, minutes, and seconds.df_tmp['elapsed_days'] = df_tmp['elapsed_days'].dt.days
: extracts only the days from the "elapsed_days" column and assigns the result back to the same column. This step is necessary because the "elapsed_days" column currently contains a timedelta object, but we only want to keep the number of days.df_tmp.head(10)
: displays the first 10 rows ofdf_tmp
to verify that the calculations were performed correctly.
Overall, this code is useful for analyzing customer behavior and determining how long it takes for customers to make a purchase after applying for a membership or account.
P-071: Calculate the number of months elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and membership application date (Note that sales_ymd is a number and application_date is a string).
df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates()
df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']],
how='inner', on='customer_id')
df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))
df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date'])
df_tmp['elapsed_months'] = df_tmp[['sales_ymd', 'application_date']]. \
apply(lambda x: relativedelta(x[0], x[1]).years * 12 + \
relativedelta(x[0], x[1]).months, axis=1)
df_tmp.head(10)
customer_id | sales_ymd | application_date | elapsed_months | |
---|---|---|---|---|
0 | CS006214000001 | 2018-11-03 | 1970-01-01 00:00:00.020150201 | 586 |
1 | CS006214000001 | 2017-05-09 | 1970-01-01 00:00:00.020150201 | 568 |
2 | CS006214000001 | 2017-06-08 | 1970-01-01 00:00:00.020150201 | 569 |
3 | CS006214000001 | 2018-10-28 | 1970-01-01 00:00:00.020150201 | 585 |
4 | CS006214000001 | 2019-09-08 | 1970-01-01 00:00:00.020150201 | 596 |
5 | CS006214000001 | 2018-01-31 | 1970-01-01 00:00:00.020150201 | 576 |
6 | CS006214000001 | 2017-07-05 | 1970-01-01 00:00:00.020150201 | 570 |
7 | CS006214000001 | 2018-11-10 | 1970-01-01 00:00:00.020150201 | 586 |
8 | CS006214000001 | 2019-04-10 | 1970-01-01 00:00:00.020150201 | 591 |
9 | CS006214000001 | 2019-06-01 | 1970-01-01 00:00:00.020150201 | 592 |
Commentary :
This code is similar to the previous code we discussed, but instead of calculating the number of days between a customer's application date and their purchase date, it calculates the number of elapsed months.
Here's a step-by-step breakdown of the code:df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates()
: creates a new DataFrame nameddf_tmp
by selecting only the columns "customer_id" and "sales_ymd" from thedf_receipt
DataFrame, and then dropping any duplicate rows. The resulting DataFrame will have a unique combination of customer IDs and sales dates.df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']], how='inner', on='customer_id')
: performs an inner join ondf_tmp
and thedf_customer
DataFrame, selecting only the "customer_id" and "application_date" columns from thedf_customer
DataFrame. The resulting DataFrame will only contain rows with customer IDs that are present in bothdf_tmp
anddf_customer
.df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))
: converts the "sales_ymd" column ofdf_tmp
to a datetime format using thepd.to_datetime()
function.df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date'])
: converts the "application_date" column ofdf_tmp
to a datetime format using thepd.to_datetime()
function.df_tmp['elapsed_months'] = df_tmp[['sales_ymd', 'application_date']]. \ apply(lambda x: relativedelta(x[0], x[1]).years * 12 + \ relativedelta(x[0], x[1]).months, axis=1)
: calculates the difference between the "sales_ymd" and "application_date" columns and assigns the result to a new column named "elapsed_months" indf_tmp
. This calculation uses therelativedelta
function from thedateutil
library to compute the difference between two datetime values in terms of years and months. Theapply()
method applies this function to each row of the "sales_ymd" and "application_date" columns, and returns the total number of elapsed months.df_tmp.head(10)
: displays the first 10 rows ofdf_tmp
to verify that the calculations were performed correctly.
Overall, this code is useful for analyzing customer behavior and determining how long it takes for customers to make a purchase after applying for a membership or account, in terms of elapsed months.
P-072: Calculate the number of years elapsed from the membership application date (application_date) of the customer data (df_customer) against the sales date (df_customer) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and membership application date (Note that sales_ymd is a number and application_date is a string).
df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates()
df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']],
how='inner', on='customer_id')
df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))
df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date'])
df_tmp['elapsed_years'] = df_tmp[['sales_ymd', 'application_date']]. \
apply(lambda x: relativedelta(x[0], x[1]).years, axis=1)
df_tmp.head(10)
customer_id | sales_ymd | application_date | elapsed_years | |
---|---|---|---|---|
0 | CS006214000001 | 2018-11-03 | 1970-01-01 00:00:00.020150201 | 48 |
1 | CS006214000001 | 2017-05-09 | 1970-01-01 00:00:00.020150201 | 47 |
2 | CS006214000001 | 2017-06-08 | 1970-01-01 00:00:00.020150201 | 47 |
3 | CS006214000001 | 2018-10-28 | 1970-01-01 00:00:00.020150201 | 48 |
4 | CS006214000001 | 2019-09-08 | 1970-01-01 00:00:00.020150201 | 49 |
5 | CS006214000001 | 2018-01-31 | 1970-01-01 00:00:00.020150201 | 48 |
6 | CS006214000001 | 2017-07-05 | 1970-01-01 00:00:00.020150201 | 47 |
7 | CS006214000001 | 2018-11-10 | 1970-01-01 00:00:00.020150201 | 48 |
8 | CS006214000001 | 2019-04-10 | 1970-01-01 00:00:00.020150201 | 49 |
9 | CS006214000001 | 2019-06-01 | 1970-01-01 00:00:00.020150201 | 49 |
Commentary :
This code is similar to the previous examples we discussed, but instead of calculating the number of elapsed days or months between a customer's application date and purchase date, it calculates the number of elapsed years.
Here's a step-by-step breakdown of the code:df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates()
: creates a new DataFrame nameddf_tmp
by selecting only the columns "customer_id" and "sales_ymd" from thedf_receipt
DataFrame, and then dropping any duplicate rows. The resulting DataFrame will have a unique combination of customer IDs and sales dates.df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']], how='inner', on='customer_id')
: performs an inner join ondf_tmp
and thedf_customer
DataFrame, selecting only the "customer_id" and "application_date" columns from thedf_customer
DataFrame. The resulting DataFrame will only contain rows with customer IDs that are present in bothdf_tmp
anddf_customer
.df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))
: converts the "sales_ymd" column ofdf_tmp
to a datetime format using thepd.to_datetime()
function.df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date'])
: converts the "application_date" column ofdf_tmp
to a datetime format using thepd.to_datetime()
function.df_tmp['elapsed_years'] = df_tmp[['sales_ymd', 'application_date']]. \ apply(lambda x: relativedelta(x[0], x[1]).years, axis=1)
: calculates the difference between the "sales_ymd" and "application_date" columns and assigns the result to a new column named "elapsed_years" indf_tmp
. This calculation uses therelativedelta
function from thedateutil
library to compute the difference between two datetime values in terms of years. Theapply()
method applies this function to each row of the "sales_ymd" and "application_date" columns, and returns the total number of elapsed years.df_tmp.head(10)
: displays the first 10 rows ofdf_tmp
to verify that the calculations were performed correctly.
Overall, this code is useful for analyzing customer behavior and determining how long it takes for customers to make a purchase after applying for a membership or account, in terms of elapsed years.
P-073: Calculate the elapsed time in epoch seconds from the membership application date (application_date) of the customer data (df_customer) against the sales date (sales_ymd) of the receipt details data (df_receipt), and display 10 items with customer ID (customer_id), sales date and Note that sales_ymd is held as a number and application_date as a string. Note that no time information is held, so each date should represent 0:00:00:00.
df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates()
df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']],
how='inner', on='customer_id')
df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))
df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date'])
df_tmp['elapsed_epoch'] = df_tmp['sales_ymd'].view(np.int64) - \
df_tmp['application_date'].view(np.int64)
df_tmp['elapsed_epoch'] = df_tmp['elapsed_epoch'] / 10**9
df_tmp.head(10)
customer_id | sales_ymd | application_date | elapsed_epoch | |
---|---|---|---|---|
0 | CS006214000001 | 2018-11-03 | 1970-01-01 00:00:00.020150201 | 1.541203e+09 |
1 | CS006214000001 | 2017-05-09 | 1970-01-01 00:00:00.020150201 | 1.494288e+09 |
2 | CS006214000001 | 2017-06-08 | 1970-01-01 00:00:00.020150201 | 1.496880e+09 |
3 | CS006214000001 | 2018-10-28 | 1970-01-01 00:00:00.020150201 | 1.540685e+09 |
4 | CS006214000001 | 2019-09-08 | 1970-01-01 00:00:00.020150201 | 1.567901e+09 |
5 | CS006214000001 | 2018-01-31 | 1970-01-01 00:00:00.020150201 | 1.517357e+09 |
6 | CS006214000001 | 2017-07-05 | 1970-01-01 00:00:00.020150201 | 1.499213e+09 |
7 | CS006214000001 | 2018-11-10 | 1970-01-01 00:00:00.020150201 | 1.541808e+09 |
8 | CS006214000001 | 2019-04-10 | 1970-01-01 00:00:00.020150201 | 1.554854e+09 |
9 | CS006214000001 | 2019-06-01 | 1970-01-01 00:00:00.020150201 | 1.559347e+09 |
Commentary :
This code calculates the elapsed time in seconds between a customer's application date and purchase date by converting the datetime values to Unix timestamps.
Here's a step-by-step breakdown of the code:df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates()
: creates a new DataFrame nameddf_tmp
by selecting only the columns "customer_id" and "sales_ymd" from thedf_receipt
DataFrame, and then dropping any duplicate rows. The resulting DataFrame will have a unique combination of customer IDs and sales dates.df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']], how='inner', on='customer_id')
: performs an inner join ondf_tmp
and thedf_customer
DataFrame, selecting only the "customer_id" and "application_date" columns from thedf_customer
DataFrame. The resulting DataFrame will only contain rows with customer IDs that are present in bothdf_tmp
anddf_customer
.df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))
: converts the "sales_ymd" column ofdf_tmp
to a datetime format using thepd.to_datetime()
function.df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date'])
: converts the "application_date" column ofdf_tmp
to a datetime format using thepd.to_datetime()
function.df_tmp['elapsed_epoch'] = df_tmp['sales_ymd'].view(np.int64) - \ df_tmp['application_date'].view(np.int64)
: calculates the difference between the "sales_ymd" and "application_date" columns and assigns the result to a new column named "elapsed_epoch" indf_tmp
. This calculation uses theview()
method to convert the datetime values to an int64 format (Unix timestamp), and then subtracts the two values element-wise to obtain the elapsed time in nanoseconds.df_tmp['elapsed_epoch'] = df_tmp['elapsed_epoch'] / 10**9
: converts the elapsed time from nanoseconds to seconds by dividing the "elapsed_epoch" column by 10^9.df_tmp.head(10)
: displays the first 10 rows ofdf_tmp
to verify that the calculations were performed correctly.
Overall, this code is useful for analyzing customer behavior and determining the length of time it takes for customers to make a purchase after applying for a membership or account, in terms of elapsed seconds.
P-074: For the sales date (sales_ymd) of the receipt details data (df_receipt), calculate the number of days elapsed since Monday of the week in question and display 10 items together with the sales date and the previous Monday (note that sales_ymd holds data in numerical form).
df_tmp = df_receipt[['sales_ymd']].copy()
df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str'))
df_tmp['elapsed_days'] = df_tmp['sales_ymd'].apply(lambda x:x.weekday())
df_tmp['monday'] = \
df_tmp['sales_ymd'].apply(lambda x: x - relativedelta(days=x.weekday()))
df_tmp.head(10)
sales_ymd | elapsed_days | monday | |
---|---|---|---|
0 | 2018-11-03 | 5 | 2018-10-29 |
1 | 2018-11-18 | 6 | 2018-11-12 |
2 | 2017-07-12 | 2 | 2017-07-10 |
3 | 2019-02-05 | 1 | 2019-02-04 |
4 | 2018-08-21 | 1 | 2018-08-20 |
5 | 2019-06-05 | 2 | 2019-06-03 |
6 | 2018-12-05 | 2 | 2018-12-03 |
7 | 2019-09-22 | 6 | 2019-09-16 |
8 | 2017-05-04 | 3 | 2017-05-01 |
9 | 2019-10-10 | 3 | 2019-10-07 |
Commentary :
The code is performing the following steps:
A new DataFrame calleddf_tmp
is created by selecting a single columnsales_ymd
from an existing DataFramedf_receipt
, and then making a copy of that column.
Thesales_ymd
column indf_tmp
is then converted to a pandas datetime format using thepd.to_datetime
function. This allows the code to work with dates in a more flexible way, such as being able to extract information like the day of the week.
A new column calledelapsed_days
is created indf_tmp
using theapply
method. This method applies a lambda function to each row of thesales_ymd
column, which in this case returns the day of the week as an integer (with Monday being 0 and Sunday being 6).
Another new column calledmonday
is created indf_tmp
using theapply
method with another lambda function. This function takes each row in thesales_ymd
column, subtracts the number of days elapsed since the beginning of the week (which is obtained fromx.weekday()
), and returns the resulting date rounded down to the nearest Monday. This effectively groups all the dates indf_tmp
into weekly intervals starting on a Monday.
Finally, thehead
method is called ondf_tmp
to display the first 10 rows of the resulting DataFrame.
Overall, this code processes a column of sales dates to extract useful information for analysis or aggregation purposes. Specifically, it creates new columns in the DataFrame that represent the day of the week and the start date of the corresponding week, which could be used for grouping and aggregating sales data by week or day of the week.
P-075: Randomly extract 1% of the data from the customer data (df_customer) and display the first 10 items.
df_customer.sample(frac=0.01).head(10)
customer_id | customer_name | gender_cd | gender | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | |
---|---|---|---|---|---|---|---|---|---|---|---|
9589 | CS014703000006 | 長谷部 扶樹 | 0 | 男性 | 1944-05-15 | 74 | 264-0037 | 千葉県千葉市若葉区源町********** | S12014 | 20150707 | 0-00000000-0 |
13962 | CS024612000073 | 柴田 奈月 | 1 | 女性 | 1949-06-11 | 69 | 216-0015 | 神奈川県川崎市宮前区菅生********** | S14024 | 20150927 | 0-00000000-0 |
14351 | CS006504000007 | 小島 貴嶺 | 0 | 男性 | 1964-06-30 | 54 | 224-0062 | 神奈川県横浜市都筑区葛が谷********** | S14006 | 20150821 | 0-00000000-0 |
6092 | CS003513000071 | 菊地 瞬 | 1 | 女性 | 1962-12-27 | 56 | 182-0022 | 東京都調布市国領町********** | S13003 | 20150616 | 0-00000000-0 |
18415 | CS037415000043 | 市川 路子 | 1 | 女性 | 1976-03-17 | 43 | 136-0072 | 東京都江東区大島********** | S13037 | 20151130 | D-20101020-D |
6023 | CS019603000011 | 秋本 隼士 | 0 | 男性 | 1952-02-24 | 67 | 171-0051 | 東京都豊島区長崎********** | S13019 | 20150413 | 0-00000000-0 |
17698 | CS015515000283 | 上杉 淳子 | 1 | 女性 | 1963-02-20 | 56 | 136-0076 | 東京都江東区南砂********** | S13015 | 20180918 | 0-00000000-0 |
9157 | CS017415000099 | 小柳 光 | 1 | 女性 | 1976-03-20 | 43 | 166-0002 | 東京都杉並区高円寺北********** | S13017 | 20150724 | E-20100528-D |
18939 | CS003613000341 | 益田 京子 | 1 | 女性 | 1955-02-11 | 64 | 201-0013 | 東京都狛江市元和泉********** | S13003 | 20171205 | 0-00000000-0 |
3321 | CS003401000072 | 伊集院 三省 | 0 | 男性 | 1971-09-25 | 47 | 206-0812 | 東京都稲城市矢野口********** | S13003 | 20180422 | 0-00000000-0 |
Commentary :
The code is sampling a subset of rows from a pandas DataFrame calleddf_customer
, and then displaying the first 10 rows of that subset.
Here's what each part of the code does:df_customer
is a pandas DataFrame that contains customer data, presumably with many rows and columns.
Thesample
method is called ondf_customer
with the argumentfrac=0.01
. This method randomly selects a fraction of the rows in the DataFrame, wherefrac
represents the fraction of rows to sample. In this case,frac=0.01
means that 1% of the rows indf_customer
will be selected at random.
Thehead
method is called on the resulting subset ofdf_customer
, with the argument10
. This method returns the first 10 rows of the subset, which are then displayed.
So overall, this code is selecting a random sample of 1% of the rows fromdf_customer
, and then displaying the first 10 rows of that sample. This could be useful for exploring the data or performing quick analyses on a smaller subset of the data without having to work with the entire dataset.
P-076: Extract 10% of the data stratified randomly from the customer data (df_customer) based on the percentage of gender code (gender_cd) and count the number of cases by gender code.
# Example using sklearn.model_selection.train_test_split
_, df_tmp = train_test_split(df_customer, test_size=0.1,
stratify=df_customer['gender_cd'])
df_tmp.groupby('gender_cd').agg({'customer_id' : 'count'})
customer_id | |
---|---|
gender_cd | |
0 | 298 |
1 | 1793 |
9 | 107 |
Commentary :
The code performs the following steps:train_test_split
is a function from the scikit-learn library that randomly splits a dataset into two subsets: a "training" subset and a "testing" subset. In this code, it is applied to a pandas DataFrame calleddf_customer
, with the argumenttest_size=0.1
specifying that 10% of the data should be reserved for testing, and the argumentstratify=df_customer['gender_cd']
specifying that the split should be stratified by the values in thegender_cd
column. The function returns two DataFrames, but the first one is ignored (assigned to_
, which is a convention in Python for a variable that is not going to be used).
The second DataFrame returned bytrain_test_split
is assigned to a new variable calleddf_tmp
.
Thegroupby
method is called ondf_tmp
with the argumentgender_cd
. This method groups the rows ofdf_tmp
by the values in thegender_cd
column.
Theagg
method is called on the result of thegroupby
operation, with the argument{'customer_id' : 'count'}
. This method applies a function (in this case,count
) to thecustomer_id
column of each group, and returns a DataFrame with the resulting counts for each group.
The resulting DataFrame is displayed.
So overall, this code randomly splits a pandas DataFrame into training and testing subsets, stratifying the split by the values in a specific column (gender_cd
). It then creates a new DataFrame from the testing subset and groups it by thegender_cd
column, and calculates the count of unique values of thecustomer_id
column for each group. This can be useful for understanding the distribution of customers by gender in the testing subset of the data.
P-077: Sum the sales amounts in the receipt details data (df_receipt) by customer unit and extract outliers of the summed sales amounts. The outliers should be calculated by logarithmising the total sales amount, calculating the mean and standard deviation, and then deviating from the mean by more than 3σ (either the natural logarithm or the common logarithm is acceptable). Display 10 results.
df_sales_amount = df_receipt.groupby('customer_id'). \
agg({'amount':'sum'}).reset_index()
df_sales_amount['log_sum_amount'] = np.log(df_sales_amount['amount'] + 0.5)
df_sales_amount['log_sum_amount_ss'] = preprocessing.scale(df_sales_amount['log_sum_amount'])
df_sales_amount.query('abs(log_sum_amount_ss) > 3').head(10)
customer_id | amount | log_sum_amount | log_sum_amount_ss | |
---|---|---|---|---|
8306 | ZZ000000000000 | 12395003 | 16.332804 | 7.967822 |
Commentary :
This code is performing data preprocessing and filtering on a pandas DataFrame object nameddf_receipt
.
Here is a step-by-step explanation of the code:df_sales_amount = df_receipt.groupby('customer_id').agg({'amount':'sum'}).reset_index()
: This line groups the rows ofdf_receipt
by the values in thecustomer_id
column and calculates the sum of theamount
column for each group. The resulting DataFramedf_sales_amount
has one row per uniquecustomer_id
value and two columns:customer_id
andamount
. Thereset_index()
method is used to reset the index to integer values after the grouping.df_sales_amount['log_sum_amount'] = np.log(df_sales_amount['amount'] + 0.5)
: This line creates a new column indf_sales_amount
namedlog_sum_amount
that contains the natural logarithm of the sum ofamount
for each customer, plus 0.5 to avoid taking the log of 0.df_sales_amount['log_sum_amount_ss'] = preprocessing.scale(df_sales_amount['log_sum_amount'])
: This line creates another new column indf_sales_amount
namedlog_sum_amount_ss
that contains the scaled values oflog_sum_amount
using thepreprocessing.scale
function from the scikit-learn library. This function subtracts the mean and divides by the standard deviation of the input array, resulting in a new array with zero mean and unit variance.df_sales_amount.query('abs(log_sum_amount_ss) > 3').head(10)
: This line filtersdf_sales_amount
to only keep the rows where the absolute value oflog_sum_amount_ss
is greater than 3, and then selects the first 10 rows of the resulting DataFrame. This filter selects customers whose total sales amount is more than 3 standard deviations away from the mean, indicating that they may be outliers. Thequery
method is used to filter the rows based on a string expression, which in this case is'abs(log_sum_amount_ss) > 3'
.
P-078: Sum the sales amount (amount) of the receipt details data (df_receipt) in customer units and extract outliers of the summed sales amount. However, exclude those whose customer ID starts with “Z”, as they represent non-members. Outliers are defined as the difference between the first and third quartiles using IQR, and are defined as being below “first quartile – 1.5 x IQR” or above “third quartile + 1.5 x IQR”. Show 10 results.
df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")',
engine='python'). \
groupby('customer_id'). \
agg({'amount':'sum'}).reset_index()
pct25 = np.percentile(df_sales_amount['amount'], q=25)
pct75 = np.percentile(df_sales_amount['amount'], q=75)
iqr = pct75 - pct25
amount_low = pct25 - (iqr * 1.5)
amount_hight = pct75 + (iqr * 1.5)
df_sales_amount.query('amount < @amount_low or @amount_hight < amount').head(10)
customer_id | amount | |
---|---|---|
98 | CS001414000048 | 8584 |
332 | CS001605000009 | 18925 |
549 | CS002415000594 | 9568 |
1180 | CS004414000181 | 9584 |
1558 | CS005415000137 | 8734 |
1733 | CS006414000001 | 9156 |
1736 | CS006414000029 | 9179 |
1752 | CS006415000105 | 10042 |
1755 | CS006415000147 | 12723 |
1757 | CS006415000157 | 10648 |
Commentary :
This code is performing data preprocessing and filtering on a pandas DataFrame object nameddf_receipt
.
Here is a step-by-step explanation of the code:df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python').groupby('customer_id').agg({'amount':'sum'}).reset_index()
: This line first filtersdf_receipt
to exclude rows where thecustomer_id
column starts with the letter "Z". Thequery
method is used to filter the rows based on a string expression. Theengine='python'
argument is used to specify the use of Python engine for the query, which allows for the use of thestr.startswith
method. The resulting DataFrame is then grouped bycustomer_id
and the sum of theamount
column is calculated for each group. The resulting DataFramedf_sales_amount
has one row per uniquecustomer_id
value and two columns:customer_id
andamount
. Thereset_index()
method is used to reset the index to integer values after the grouping.pct25 = np.percentile(df_sales_amount['amount'], q=25)
: This line calculates the 25th percentile of theamount
column indf_sales_amount
using thenumpy.percentile
function and assigns the result to the variablepct25
.pct75 = np.percentile(df_sales_amount['amount'], q=75)
: This line calculates the 75th percentile of theamount
column indf_sales_amount
using thenumpy.percentile
function and assigns the result to the variablepct75
.iqr = pct75 - pct25
: This line calculates the interquartile range (IQR) of theamount
column indf_sales_amount
by subtracting the 25th percentile (pct25
) from the 75th percentile (pct75
) and assigns the result to the variableiqr
.amount_low = pct25 - (iqr * 1.5)
: This line calculates the lower bound of the "normal" range of sales amounts using the formulaamount_low = pct25 - (iqr * 1.5)
. This is commonly known as the lower "whisker" of the box plot.amount_high = pct75 + (iqr * 1.5)
: This line calculates the upper bound of the "normal" range of sales amounts using the formulaamount_high = pct75 + (iqr * 1.5)
. This is commonly known as the upper "whisker" of the box plot.df_sales_amount.query('amount < @amount_low or @amount_high < amount').head(10)
: This line filtersdf_sales_amount
to only keep the rows where theamount
column is less thanamount_low
or greater thanamount_high
, and then selects the first 10 rows of the resulting DataFrame. This filter selects customers whose total sales amount falls outside the "normal" range of sales amounts, which is defined as the range between the lower and upper whiskers of the box plot. Thequery
method is used to filter the rows based on a string expression that uses the@
symbol to refer to the Python variableamount_low
andamount_high
.
P-079: For each item of product data (df_product), check the number of missing items.
df_product.isnull().sum()
product_cd 0 category_major_cd 0 category_medium_cd 0 category_small_cd 0 unit_price 7 unit_cost 7 dtype: int64
Commentary :
This code is used to calculate the number of missing (null) values in each column of a pandas DataFramedf_product
. Here is a step-by-step explanation:df_product
: This is the name of the DataFrame being used.isnull()
: This method returns a DataFrame of the same shape asdf_product
where each element is a boolean value indicating whether the corresponding element indf_product
is null or not.sum()
: This method is applied to the boolean DataFrame returned byisnull()
and returns the sum of the boolean values for each column. Since boolean values are treated as integers, this effectively counts the number of null values in each column ofdf_product
.
Therefore,df_product.isnull().sum()
returns a Series object where the index is the column names ofdf_product
and the values are the number of null values in each column. This can be used to identify which columns have missing values and how many missing values there are in each column.
P-080: Create a new product data, deleting all records where any item in the product data (df_product) is missing. Note that the number of items before and after the deletion should be displayed, and also confirm that the number of items has decreased by the number of items checked in 079.
df_product_1 = df_product.copy()
df_product_1.dropna(inplace=True)
print('Before deletion:', len(df_product))
print('After deletion:', len(df_product_1))
Before deletion: 10030
After deletion: 10023
Commentary :
This code is used to remove all rows containing any null (missing) values from a pandas DataFramedf_product
and create a new DataFramedf_product_1
that contains only the non-null rows. Here is a step-by-step explanation:df_product
: This is the name of the original DataFrame that is being used.df_product.copy()
: This method creates a copy of the original DataFramedf_product
so that the original DataFrame is not modified.df_product_1
: This is the name of the new DataFrame that is being created.dropna()
: This method is used to drop all rows that contain any null values fromdf_product_1
. Theinplace=True
argument is used to modifydf_product_1
in place rather than returning a new DataFrame.print('Before deletion:', len(df_product))
: This line prints the number of rows in the original DataFramedf_product
before the null values are dropped.print('After deletion:', len(df_product_1))
: This line prints the number of rows in the new DataFramedf_product_1
after the null values are dropped.
Therefore, this code first creates a copy of the original DataFramedf_product
, drops all rows that contain any null values from the copied DataFrame to create a new DataFramedf_product_1
, and finally prints the number of rows in the original and new DataFrames before and after the deletion of null values. This code can be used to identify the number of rows that are removed due to null values and to create a clean DataFrame that does not contain any null values.
Comment