Reference: "Data Science 100 Knocks (Structured Data Processing)" by the Data Scientist Society Skill Definition Committee (データサイエンティスト協会スキル定義委員会)
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
-
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers.
-
Since this data was created for a Japanese audience, you may run into some language issues while practicing, but they should not affect the exercises much.
Getting Started
- First, run the following cells
- Import the necessary libraries and read data from the database (PostgreSQL)
- The following cell imports libraries that are expected to be used, such as pandas.
- If there are other libraries you wish to use, install them as needed (you can also install them by “!pip install library name”).
- The processing may be split over several runs.
- Names, addresses, etc. are dummy data and are not real.
import os
import pandas as pd
import numpy as np
from datetime import datetime, date
from dateutil.relativedelta import relativedelta
import math
import psycopg2
from sqlalchemy import create_engine
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from imblearn.under_sampling import RandomUnderSampler
if 'PG_PORT' in os.environ:
    pgconfig = {
        'host': 'db',
        'port': os.environ['PG_PORT'],
        'database': os.environ['PG_DATABASE'],
        'user': os.environ['PG_USER'],
        'password': os.environ['PG_PASSWORD'],
    }
    # Connector for pd.read_sql
    conn = psycopg2.connect(**pgconfig)
    df_customer = pd.read_sql(sql='select * from customer', con=conn)
    df_category = pd.read_sql(sql='select * from category', con=conn)
    df_product = pd.read_sql(sql='select * from product', con=conn)
    df_receipt = pd.read_sql(sql='select * from receipt', con=conn)
    df_store = pd.read_sql(sql='select * from store', con=conn)
    df_geocode = pd.read_sql(sql='select * from geocode', con=conn)
else:
    if not os.path.exists('data/'):
        !git clone https://github.com/The-Japan-DataScientist-Society/100knocks-preprocess
        os.chdir('100knocks-preprocess/docker/work/')
    dtype = {
        'customer_id': str,
        'gender_cd': str,
        'postal_cd': str,
        'application_store_cd': str,
        'status_cd': str,
        'category_major_cd': str,
        'category_medium_cd': str,
        'category_small_cd': str,
        'product_cd': str,
        'store_cd': str,
        'prefecture_cd': str,
        'tel_no': str,
        'street': str
    }
    df_customer = pd.read_csv("data/customer.csv", dtype=dtype)
    df_category = pd.read_csv("data/category.csv", dtype=dtype)
    df_product = pd.read_csv("data/product.csv", dtype=dtype)
    df_receipt = pd.read_csv("data/receipt.csv", dtype=dtype)
    df_store = pd.read_csv("data/store.csv", dtype=dtype)
    df_geocode = pd.read_csv("data/geocode.csv", dtype=dtype)
fatal: destination path '100knocks-preprocess' already exists and is not an empty directory.
Exercise
P-081: Create new product data in which the missing values of unit_price and unit_cost are filled with the average value of each column. Note that the average should be rounded to the nearest yen (either rounding half up or rounding half to even is acceptable). After completion, also check that there are no missing values for each item.
# Code example 1 (fillna in Pandas)
df_product_2 = df_product.fillna({
'unit_price':np.round(np.nanmean(df_product['unit_price'])),
'unit_cost':np.round(np.nanmean(df_product['unit_cost']))})
df_product_2.isnull().sum()
product_cd 0 category_major_cd 0 category_medium_cd 0 category_small_cd 0 unit_price 0 unit_cost 0 dtype: int64
Commentary :
This code performs two operations on a pandas DataFrame named `df_product`:
It fills missing values in the `unit_price` and `unit_cost` columns with the rounded mean of each column using the `fillna()` method. The rounded mean is calculated using the `np.round()` and `np.nanmean()` functions from the NumPy library. The `fillna()` method takes a dictionary as its argument, where the keys are the column names and the values are the values to fill missing values with.
It then checks for any remaining missing values in the DataFrame using the `isnull()` method, and sums up the number of missing values in each column using the `sum()` method. The resulting output is the number of missing values in each column of the updated DataFrame `df_product_2`.
The updated DataFrame `df_product_2` has the missing values of `unit_price` and `unit_cost` replaced with their respective rounded means, and the output shows how many missing values remain after the fillna operation.
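The exercise accepts either rounding rule, so it is worth knowing which one `np.round()` applies. A minimal sketch (not part of the original answer, using the hypothetical value 412.5) of the difference between NumPy's round-half-to-even and an explicit round-half-up:
# np.round() rounds .5 to the nearest even number (banker's rounding);
# decimal.Decimal can round .5 up instead.
from decimal import Decimal, ROUND_HALF_UP
x = 412.5  # hypothetical mean ending in .5
print(np.round(x))  # 412.0 (half to even)
print(Decimal(str(x)).quantize(Decimal('1'), rounding=ROUND_HALF_UP))  # 413 (half up)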
# Code example 2 (SimpleImputer in scikit-learn)
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_values = imp_mean.fit_transform(df_product[['unit_price', 'unit_cost']])
df_product_2 = df_product.copy()
df_product_2[['unit_price', 'unit_cost']] = imp_values.round()
df_product_2.isnull().sum()
product_cd 0 category_major_cd 0 category_medium_cd 0 category_small_cd 0 unit_price 0 unit_cost 0 dtype: int64
Commentary :
This code performs the following operations:
It uses the `SimpleImputer` class from the `sklearn.impute` module (imported in the setup cell). The `SimpleImputer` class is used for imputing missing values in a dataset.
It creates an instance of `SimpleImputer` called `imp_mean`. This imputer replaces missing values (represented as `np.nan` here) with the mean value of the corresponding column, because the imputation strategy is set to `'mean'`.
It applies the `SimpleImputer` object to the `unit_price` and `unit_cost` columns of the `df_product` DataFrame using the `fit_transform()` method, which returns a NumPy array with the missing values imputed.
It creates a copy of the original `df_product` DataFrame called `df_product_2` using the `copy()` method.
It replaces the `unit_price` and `unit_cost` columns of `df_product_2` with the imputed values by assigning `imp_values.round()` to those columns. The `round()` method rounds the imputed values to the nearest integer.
It checks for any remaining missing values in the DataFrame using the `isnull()` method, and sums up the number of missing values in each column using the `sum()` method. The resulting output is the number of missing values in each column of the updated DataFrame `df_product_2`.
The final output shows the number of missing values in the updated `df_product_2` DataFrame after imputing missing values with the mean of each column. This approach can be useful for handling missing data, especially when only a small number of values are missing.
P-082: Create new product data in which the missing values of unit_price and unit_cost are filled with the median value of each column. Note that the median should be rounded to the nearest yen (either rounding half up or rounding half to even is acceptable). After completion, also check that there are no missing values for each item.
# Code example 1 (fillna in Pandas)
df_product_3 = df_product.fillna({
'unit_price':np.round(np.nanmedian(df_product['unit_price'])),
'unit_cost':np.round(np.nanmedian(df_product['unit_cost']))})
df_product_3.isnull().sum()
product_cd 0 category_major_cd 0 category_medium_cd 0 category_small_cd 0 unit_price 0 unit_cost 0 dtype: int64
Commentary :
The code first fills the missing values in the `unit_price` and `unit_cost` columns of the DataFrame `df_product` using the `fillna` method with the following argument:
The argument is a dictionary with two keys, `unit_price` and `unit_cost`, each mapping to a fill value. These values are computed by applying NumPy's `nanmedian` function to the respective columns of the DataFrame; `nanmedian` calculates the median of a column while ignoring any missing values represented by NaN. Columns not listed in the dictionary are left untouched.
The result of this operation is a new DataFrame called `df_product_3`.
The next line checks whether any missing values remain in the `df_product_3` DataFrame by calling the `isnull()` method and then the `sum()` method. This returns a Series containing the number of missing values for each column. If the output shows no missing values in either column, the fillna operation was successful.
# Code example 2 (SimpleImputer in scikit-learn)
imp_mean = SimpleImputer(missing_values=np.nan, strategy='median')
imp_values = imp_mean.fit_transform(df_product[['unit_price', 'unit_cost']])
df_product_3 = df_product.copy()
df_product_3[['unit_price', 'unit_cost']] = imp_values.round()
df_product_3.isnull().sum()
product_cd 0 category_major_cd 0 category_medium_cd 0 category_small_cd 0 unit_price 0 unit_cost 0 dtype: int64
Commentary :
This code performs imputation of missing values in the `unit_price` and `unit_cost` columns of the DataFrame `df_product` using scikit-learn's SimpleImputer.
The first line initializes an instance of SimpleImputer named `imp_mean`. This class imputes missing values with a given strategy, which in this case is the median of the non-missing values. The `missing_values` argument specifies the value to be treated as missing, which is NaN here.
The second line uses the `fit_transform` method of the `imp_mean` object to impute missing values in the `unit_price` and `unit_cost` columns of the `df_product` DataFrame. The method returns a NumPy array named `imp_values` that contains the imputed values.
The third line creates a copy of the `df_product` DataFrame named `df_product_3`. This is done to avoid modifying the original DataFrame.
The fourth line assigns the imputed values to the `unit_price` and `unit_cost` columns of the `df_product_3` DataFrame. The `round()` method is called on the `imp_values` array to round the values to the nearest integer.
The last line checks whether any missing values remain in the `df_product_3` DataFrame by calling the `isnull()` method and then the `sum()` method. If the output shows no missing values in either column, the imputation was successful.
P-083: Create new product data in which the missing values of unit_price and unit_cost are filled with the median value calculated for each product sub-category code (category_small_cd). Note that the median should be rounded to the nearest yen (either rounding half up or rounding half to even is acceptable). After completion, also check that there are no missing values for each item.
# Code example 1
df_tmp = (df_product.groupby('category_small_cd')
.agg(median_price=('unit_price', 'median'),
median_cost=('unit_cost', 'median')).reset_index())
df_product_4 = pd.merge(df_product, df_tmp, how='inner', on='category_small_cd')
df_product_4['unit_price'] = df_product_4[['unit_price', 'median_price']]. \
apply(lambda x: np.round(x[1]) if np.isnan(x[0]) else x[0], axis=1)
df_product_4['unit_cost'] = df_product_4[['unit_cost', 'median_cost']]. \
apply(lambda x: np.round(x[1]) if np.isnan(x[0]) else x[0], axis=1)
df_product_4.isnull().sum()
product_cd 0 category_major_cd 0 category_medium_cd 0 category_small_cd 0 unit_price 0 unit_cost 0 median_price 0 median_cost 0 dtype: int64
Commentary :
This code imputes missing values in the `unit_price` and `unit_cost` columns of the DataFrame `df_product` using the median value of each sub-category. The code achieves this in the following steps:
The first statement groups `df_product` by the `category_small_cd` column and computes the median of `unit_price` and `unit_cost` for each group. The resulting DataFrame is named `df_tmp` and has three columns: `category_small_cd`, `median_price`, and `median_cost`.
The second statement uses the `merge()` function of the Pandas library to merge `df_product` and `df_tmp` on the `category_small_cd` column. The resulting DataFrame, `df_product_4`, contains all the columns of `df_product` plus the `median_price` and `median_cost` columns from `df_tmp`.
The third statement fills the `unit_price` column of `df_product_4` by applying a lambda function to the `unit_price` and `median_price` columns. The lambda function checks whether `unit_price` is missing (i.e., NaN) and, if so, returns the rounded value of `median_price`; otherwise it returns the original value of `unit_price`.
The fourth statement fills the `unit_cost` column in the same way, using a similar lambda function on the `unit_cost` and `median_cost` columns.
The last line checks whether any missing values remain in `df_product_4` by calling the `isnull()` method and then the `sum()` method. If the output shows no missing values in either column, the imputation was successful.
# Code example 2 (use of MASK)
df_tmp = (df_product.groupby('category_small_cd')
.agg(median_price=('unit_price', 'median'),
median_cost=('unit_cost', 'median')).reset_index())
df_product_4 = df_product.merge(df_tmp, how='inner', on='category_small_cd')
df_product_4['unit_price'] = (df_product_4['unit_price']
.mask(df_product_4['unit_price'].isnull(),
df_product_4['median_price'].round()))
df_product_4['unit_cost'] = (df_product_4['unit_cost']
.mask(df_product_4['unit_cost'].isnull(),
df_product_4['median_cost'].round()))
df_product_4.isnull().sum()
product_cd 0 category_major_cd 0 category_medium_cd 0 category_small_cd 0 unit_price 0 unit_cost 0 median_price 0 median_cost 0 dtype: int64
Commentary :
This code imputes missing values in the `unit_price` and `unit_cost` columns of the DataFrame `df_product` using the median value of each sub-category. The code achieves this in the following steps:
The first statement groups `df_product` by the `category_small_cd` column and computes the median of `unit_price` and `unit_cost` for each group. The resulting DataFrame is named `df_tmp` and has three columns: `category_small_cd`, `median_price`, and `median_cost`.
The second statement merges `df_product` and `df_tmp` on the `category_small_cd` column. The resulting DataFrame, `df_product_4`, contains all the columns of `df_product` plus the `median_price` and `median_cost` columns from `df_tmp`.
The third statement fills the `unit_price` column of `df_product_4` using the `mask()` method of Pandas. The `mask()` method replaces the missing values (i.e., NaN) in `unit_price` with the rounded value of `median_price`.
The fourth statement fills the `unit_cost` column in the same way, replacing its missing values with the rounded `median_cost`.
The last line checks whether any missing values remain in `df_product_4` by calling the `isnull()` method and then the `sum()` method. If the output shows no missing values in either column, the imputation was successful.
# Code example 3 (utilising fillna and transform)
df_product_4 = df_product.copy()
for x in ['unit_price', 'unit_cost']:
    df_product_4[x] = (df_product_4[x]
                       .fillna(df_product_4.groupby('category_small_cd')[x]
                               .transform('median')
                               .round()))
df_product_4.isnull().sum()
product_cd 0 category_major_cd 0 category_medium_cd 0 category_small_cd 0 unit_price 0 unit_cost 0 dtype: int64
Commentary :
This code imputes missing values in the `unit_price` and `unit_cost` columns of the DataFrame `df_product` using the median value of each sub-category. The code achieves this in the following steps:
The first line creates a copy of the original DataFrame `df_product` and names it `df_product_4`.
The second line starts a for loop that iterates over the columns `unit_price` and `unit_cost`.
Inside the loop, the missing values in each column are filled. The `fillna()` method replaces the missing values with the median value of the respective sub-category, obtained by grouping `df_product_4` by `category_small_cd` and computing the per-group median with the `transform()` method. The `round()` method rounds the median to the nearest integer.
The last line checks whether any missing values remain in `df_product_4` by calling the `isnull()` method and then the `sum()` method. If the output shows no missing values in either column, the imputation was successful.
Thus, the code creates a new DataFrame `df_product_4` in which the missing values of `unit_price` and `unit_cost` have been replaced with the rounded per-category median.
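Since `transform()` is the key difference from the previous two examples, here is a tiny illustration (toy data, not from the exercise) of how `transform()` differs from `agg()`: it returns one value per original row, aligned with the original index, which is why its result can be passed straight to `fillna()`.
# Toy illustration: agg() returns one row per group, transform() one value per original row
toy = pd.DataFrame({'grp': ['a', 'a', 'b'], 'val': [1.0, 3.0, 10.0]})
print(toy.groupby('grp')['val'].agg('median'))        # a -> 2.0, b -> 10.0
print(toy.groupby('grp')['val'].transform('median'))  # 2.0, 2.0, 10.0 (same length as toy)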
P-084: For all customers in the customer data (df_customer), calculate the ratio of the 2019 sales amount to the sales amount over the whole period and create new data. However, if there are no sales results, treat the value as 0. Then extract the rows whose calculated ratio exceeds 0 and display 10 results. Also check that there are no missing values in the data created.
df_receipt_2019 = df_receipt.query('20190101 <= sales_ymd <= 20191231') \
.groupby('customer_id') \
.agg(amount_2019=('amount', 'sum')) \
.reset_index()
df_receipt_all = df_receipt.groupby('customer_id')\
.agg(amount_all=('amount', 'sum')) \
.reset_index()
df_sales_rate = df_customer[['customer_id']] \
.merge(df_receipt_2019, how='left', on='customer_id') \
.merge(df_receipt_all, how='left', on='customer_id')
df_sales_rate['amount_2019'] = df_sales_rate['amount_2019'].fillna(0)
df_sales_rate['amount_all'] = df_sales_rate['amount_all'].fillna(0)
df_sales_rate['amount_rate'] = \
df_sales_rate[['amount_2019','amount_all']] \
.apply(lambda x: 0 if x[0] == 0 else x[0] / x[1], axis=1)
df_sales_rate['amount_rate'] = df_sales_rate['amount_rate'].fillna(0)
df_sales_rate.query('amount_rate > 0').head(10)
customer_id | amount_2019 | amount_all | amount_rate | |
---|---|---|---|---|
2 | CS031415000172 | 2971.0 | 5088.0 | 0.583923 |
6 | CS015414000103 | 874.0 | 3122.0 | 0.279949 |
12 | CS011215000048 | 248.0 | 3444.0 | 0.072009 |
15 | CS029415000023 | 3767.0 | 5167.0 | 0.729050 |
21 | CS035415000029 | 5823.0 | 7504.0 | 0.775986 |
23 | CS023513000066 | 208.0 | 771.0 | 0.269780 |
24 | CS035513000134 | 463.0 | 1565.0 | 0.295847 |
27 | CS001515000263 | 216.0 | 216.0 | 1.000000 |
30 | CS006415000279 | 229.0 | 229.0 | 1.000000 |
32 | CS031415000106 | 215.0 | 7741.0 | 0.027774 |
df_sales_rate.isnull().sum()
customer_id 0 amount_2019 0 amount_all 0 amount_rate 0 dtype: int64
Commentary :
This code processes the Pandas DataFrame `df_receipt` to calculate each customer's 2019 sales as a share of their total sales. Here's a step-by-step explanation of what the code does:
The first statement creates a DataFrame called `df_receipt_2019` that filters the rows of `df_receipt` to sales from 2019 only (`20190101 <= sales_ymd <= 20191231`). It then groups the rows by `customer_id` and sums the `amount` column for each group. The resulting DataFrame has two columns: `customer_id` and `amount_2019`.
The second statement creates a DataFrame called `df_receipt_all` that groups `df_receipt` by `customer_id` and sums the `amount` column for each group. The resulting DataFrame has two columns: `customer_id` and `amount_all`.
The third statement creates a DataFrame called `df_sales_rate` that merges the `customer_id` column of `df_customer` with `df_receipt_2019` and `df_receipt_all`. Both merges are left joins, so all rows of `df_customer` are kept and only the matching rows of `df_receipt_2019` and `df_receipt_all` are brought in. The resulting DataFrame has three columns: `customer_id`, `amount_2019`, and `amount_all`.
The fourth and fifth statements fill any missing values in the `amount_2019` and `amount_all` columns with 0.
The sixth statement creates a new column called `amount_rate` in `df_sales_rate` by applying a lambda function to the `amount_2019` and `amount_all` columns. The lambda function returns 0 if `amount_2019` is 0, and the ratio of `amount_2019` to `amount_all` otherwise. The resulting `amount_rate` column represents each customer's 2019 sales as a share of their total sales.
The seventh statement fills any missing values in the `amount_rate` column with 0.
The last statement filters the rows of `df_sales_rate` to customers with a positive `amount_rate` and displays the first 10 rows.
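As a side note, the same ratio can be computed without a row-wise `apply()`. A minimal vectorised sketch (not the original answer; it relies on the fact that 2019 sales are a subset of all-period sales, so only 0/0 can occur):
# Vectorised alternative: 0/0 yields NaN, which fillna(0) turns into 0
df_sales_rate['amount_rate'] = (df_sales_rate['amount_2019']
                                .div(df_sales_rate['amount_all'])
                                .fillna(0))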
P-085: Create new customer data by linking geocode data (df_geocode) to all customers in the customer data (df_customer) using postcode (postal_cd). However, if multiple longitude (longitude) and latitude (latitude) information is associated with a single postcode (postal_cd), calculate and use the average values of longitude (longitude) and latitude (latitude). Also, display 10 results to check the creation results.
df_geocode_1 = df_geocode.groupby('postal_cd') \
.agg(m_longitude=('longitude', 'mean'),
m_latitude=('latitude', 'mean')).reset_index()
df_customer_1 = pd.merge(df_customer, df_geocode_1,
how='inner', on='postal_cd')
df_customer_1.head(10)
customer_id | customer_name | gender_cd | gender | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | m_longitude | m_latitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | CS021313000114 | 大野 あや子 | 1 | 女性 | 1981-04-29 | 37 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20150905 | 0-00000000-0 | 139.31779 | 35.41358 |
1 | CS021303000023 | 堀 一徳 | 0 | 男性 | 1980-04-05 | 38 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20160411 | 0-00000000-0 | 139.31779 | 35.41358 |
2 | CS021303000007 | 石倉 俊二 | 0 | 男性 | 1987-07-04 | 31 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20150707 | 0-00000000-0 | 139.31779 | 35.41358 |
3 | CS021313000183 | 草野 未來 | 1 | 女性 | 1986-12-21 | 32 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20170611 | 0-00000000-0 | 139.31779 | 35.41358 |
4 | CS021314000098 | 筒井 れいな | 1 | 女性 | 1985-09-21 | 33 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20160901 | 0-00000000-0 | 139.31779 | 35.41358 |
5 | CS021314000093 | 江藤 美佐 | 1 | 女性 | 1986-06-03 | 32 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20151015 | 0-00000000-0 | 139.31779 | 35.41358 |
6 | CS021413000049 | 大野 幸子 | 1 | 女性 | 1973-04-17 | 45 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20150703 | 0-00000000-0 | 139.31779 | 35.41358 |
7 | CS037613000071 | 六角 雅彦 | 9 | 不明 | 1952-04-01 | 66 | 136-0076 | 東京都江東区南砂********** | S13037 | 20150414 | 0-00000000-0 | 139.83502 | 35.67193 |
8 | CS015415000209 | 大谷 倫子 | 1 | 女性 | 1970-11-25 | 48 | 136-0076 | 東京都江東区南砂********** | S13015 | 20150909 | B-20090610-C | 139.83502 | 35.67193 |
9 | CS037614000045 | 長沢 麻緒 | 1 | 女性 | 1952-06-19 | 66 | 136-0076 | 東京都江東区南砂********** | S13037 | 20150209 | 5-20091207-6 | 139.83502 | 35.67193 |
Commentary :
This code processes two Pandas DataFrames, `df_geocode` and `df_customer`. Here's a step-by-step explanation of what the code does:
The first statement creates a DataFrame called `df_geocode_1` by grouping the rows of `df_geocode` by `postal_cd` (postcode) and calculating the mean longitude and latitude for each group. The resulting DataFrame has three columns: `postal_cd`, `m_longitude`, and `m_latitude`.
The second statement creates a DataFrame called `df_customer_1` by merging `df_customer` and `df_geocode_1` on the `postal_cd` column. The merge is an inner join, so only rows with matching values in both DataFrames are kept. The resulting DataFrame has all the columns of `df_customer` plus the `m_longitude` and `m_latitude` columns from `df_geocode_1`.
The third statement displays the first 10 rows of the resulting DataFrame (`df_customer_1`) using the `.head(10)` method. This shows a sample of the merged data: customer information together with the mean longitude and latitude of each customer's postcode.
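The question asks to link geocode data to all customers, so it is worth confirming that the inner join does not drop anyone. A small check (a sketch, not part of the original answer) using a left join and counting unmatched postcodes:
# If both counts are 0, every postal_cd in df_customer has a geocode entry
df_customer_left = pd.merge(df_customer, df_geocode_1, how='left', on='postal_cd')
print(df_customer_left[['m_longitude', 'm_latitude']].isnull().sum())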
P-086: For the customer data with latitude(latitude) and longitude(longitude) created in 085, combine it with shop data (df_store) using the member application shop code (application_store_cd) as key. Calculate the distance (in km) between the application shop and the customer address using the latitude(latitude) and longitude information (longitude) of the application shop and the latitude and longitude of the customer address (address), and display it together with the customer ID (customer_id), customer address and shop address (address). The following simple formula should be used as the calculation formula, but libraries using other highly accurate methods are also acceptable. Display 10 results.
Latitude (radians): φ, longitude (radians): λ. Distance: L = 6371 × arccos(sin φ1 × sin φ2 + cos φ1 × cos φ2 × cos(λ1 − λ2))
# Code example 1
def calc_distance(x1, y1, x2, y2):
    distance = 6371 * math.acos(math.sin(math.radians(x1))
                                * math.sin(math.radians(x2))
                                + math.cos(math.radians(x1))
                                * math.cos(math.radians(x2))
                                * math.cos(math.radians(y1) - math.radians(y2)))
    return distance
df_tmp = pd.merge(df_customer_1, df_store,
how='inner',
left_on='application_store_cd',
right_on='store_cd') \
.rename(columns={'address_x':'customer_address',
'address_y':'store_address'})
df_tmp['distance'] = df_tmp[['m_latitude',
'm_longitude',
'latitude',
'longitude']] \
.apply(lambda x: calc_distance(x[0], x[1], x[2], x[3]),
axis=1)
df_tmp[['customer_id', 'customer_address',
'store_address', 'distance']].head(10)
customer_id | customer_address | store_address | distance | |
---|---|---|---|---|
0 | CS021313000114 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
1 | CS021303000023 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
2 | CS021303000007 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
3 | CS021313000183 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
4 | CS021314000098 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
5 | CS021314000093 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
6 | CS021413000049 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
7 | CS021313000025 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
8 | CS021413000022 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
9 | CS021413000094 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
Commentary :
This code performs geospatial calculations and data processing on the Pandas DataFrames `df_customer_1` and `df_store`. Here's a step-by-step explanation of what the code does:
The first block defines a Python function called `calc_distance` that takes four arguments: `x1` and `y1` (latitude and longitude of point 1) and `x2` and `y2` (latitude and longitude of point 2). The function uses the spherical law of cosines given in the question to calculate the distance between the two points in kilometres, assuming a spherical earth with a radius of 6371 km. The result is returned as `distance`.
The second statement creates a DataFrame called `df_tmp` by merging `df_customer_1` and `df_store` on the `application_store_cd` and `store_cd` columns, respectively. The merge is an inner join, so only rows with matching values in both DataFrames are kept. The resulting DataFrame has all the columns of both DataFrames, with two columns renamed: `address_x` becomes `customer_address` and `address_y` becomes `store_address`.
The third statement creates a new column in `df_tmp` called `distance` by applying the `calc_distance` function to the `m_latitude`, `m_longitude`, `latitude`, and `longitude` columns with the `.apply()` method. The `axis=1` argument applies the function row-wise, and the resulting distance is stored in the `distance` column.
The fourth statement selects a subset of columns from `df_tmp` (`customer_id`, `customer_address`, `store_address`, and `distance`) using double square brackets and displays the first 10 rows with `.head(10)`. This shows a sample of the merged data: customer and store information with the corresponding distance between them.
# Code example 2
def calc_distance_numpy(x1, y1, x2, y2):
    x1_r = np.radians(x1)
    x2_r = np.radians(x2)
    y1_r = np.radians(y1)
    y2_r = np.radians(y2)
    return 6371 * np.arccos(np.sin(x1_r) * np.sin(x2_r)
                            + np.cos(x1_r) * np.cos(x2_r)
                            * np.cos(y1_r - y2_r))
df_tmp = df_customer_1.merge(df_store,
how='inner',
left_on='application_store_cd',
right_on='store_cd') \
.rename(columns={'address_x':'customer_address',
'address_y':'store_address'})
df_tmp['distance'] = calc_distance_numpy(df_tmp['m_latitude'],
df_tmp['m_longitude'],
df_tmp['latitude'],
df_tmp['longitude'])
df_tmp[['customer_id', 'customer_address',
'store_address', 'distance']].head(10)
customer_id | customer_address | store_address | distance | |
---|---|---|---|---|
0 | CS021313000114 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
1 | CS021303000023 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
2 | CS021303000007 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
3 | CS021313000183 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
4 | CS021314000098 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
5 | CS021314000093 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
6 | CS021413000049 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
7 | CS021313000025 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
8 | CS021413000022 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
9 | CS021413000094 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
Commentary :
This code performs the same geospatial calculation as code example 1, but with NumPy. Here's a step-by-step explanation of what the code does:
The first block defines a Python function called `calc_distance_numpy` that takes four arguments: `x1` and `y1` (latitude and longitude of point 1) and `x2` and `y2` (latitude and longitude of point 2). The function uses the spherical law of cosines to calculate the distance between the two points in kilometres, assuming a spherical earth with a radius of 6371 km. This implementation uses NumPy functions, so it operates on whole columns at once and is more efficient than the previous row-wise implementation.
The second statement creates a DataFrame called `df_tmp` by merging `df_customer_1` and `df_store` on the `application_store_cd` and `store_cd` columns, respectively. The merge is an inner join, so only rows with matching values in both DataFrames are kept. The resulting DataFrame has all the columns of both DataFrames, with two columns renamed: `address_x` becomes `customer_address` and `address_y` becomes `store_address`.
The third statement creates a new column in `df_tmp` called `distance` by calling `calc_distance_numpy` directly on four columns of `df_tmp`. Instead of using the `.apply()` method, this implementation takes advantage of NumPy's ability to perform element-wise computations on arrays. The resulting distances are stored in the `distance` column.
The fourth statement selects a subset of columns from `df_tmp` (`customer_id`, `customer_address`, `store_address`, and `distance`) and displays the first 10 rows with `.head(10)`.
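The question also allows libraries with more accurate methods. A hedged sketch (assuming the third-party geopy package is installed, e.g. with !pip install geopy): geodesic() uses an ellipsoidal earth model rather than the spherical formula above, so its results will differ slightly.
# Optional cross-check with geopy's ellipsoidal distance (assumes geopy is installed)
from geopy.distance import geodesic
df_tmp['distance_geodesic'] = df_tmp.apply(
    lambda r: geodesic((r['m_latitude'], r['m_longitude']),
                       (r['latitude'], r['longitude'])).km, axis=1)
df_tmp[['customer_id', 'distance', 'distance_geodesic']].head(3)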
P-087: In the customer data (df_customer), the same customer is registered more than once, e.g. due to applications at different shops. Create de-duplicated (name-matched) customer data in which customers with the same name (customer_name) and postcode (postal_cd) are regarded as the same customer, so that there is one record per customer, and calculate the number of records in the customer data, the number of records after de-duplication, and the number of duplicates. For the same customer, the record with the highest total sales amount shall be retained; for customers with the same total sales amount or no sales record, the one with the lowest customer ID (customer_id) shall be retained.
df_receipt_tmp = df_receipt.groupby('customer_id') \
.agg(sum_amount=('amount','sum')).reset_index()
df_customer_u = pd.merge(df_customer, df_receipt_tmp,
how='left',
on='customer_id')
df_customer_u['sum_amount'] = df_customer_u['sum_amount'].fillna(0)
df_customer_u = df_customer_u.sort_values(['sum_amount', 'customer_id'],
ascending=[False, True])
df_customer_u.drop_duplicates(subset=['customer_name', 'postal_cd'],
keep='first', inplace=True)
print('df_customer_cnt:', len(df_customer),
'df_customer_u_cnt:', len(df_customer_u),
'diff:', len(df_customer) - len(df_customer_u))
df_customer_cnt: 21971
df_customer_u_cnt: 21941
diff: 30
Commentary :
This code performs the following tasks:
Group the `df_receipt` dataframe by `customer_id` and calculate the sum of `amount` for each customer. The resulting dataframe is stored in `df_receipt_tmp`.
Merge the `df_customer` dataframe with the `df_receipt_tmp` dataframe on the `customer_id` column. This is a left join, so all customers in `df_customer` are included in the resulting dataframe `df_customer_u`.
Replace any missing values in the `sum_amount` column with 0.
Sort the rows in `df_customer_u` by `sum_amount` in descending order, and then by `customer_id` in ascending order.
Remove any duplicates in `df_customer_u` based on the `customer_name` and `postal_cd` columns, keeping only the first occurrence of each duplicate; thanks to the preceding sort, that first occurrence is the record with the highest total sales and, in case of ties, the smallest customer ID. The result is stored back in `df_customer_u`.
Print the number of rows in the original `df_customer` dataframe (`df_customer_cnt`), the number of rows in the de-duplicated `df_customer_u` dataframe (`df_customer_u_cnt`), and the difference between the two (`diff`).
Overall, this code cleans the customer data into a new dataframe `df_customer_u` that holds each customer's total purchase amount (`sum_amount`), with duplicates removed based on the `customer_name` and `postal_cd` columns.
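A quick sanity check (a sketch, not part of the original answer): the printed diff should equal the number of (customer_name, postal_cd) pairs that occur more than once, counted beyond their first occurrence.
# Count duplicate registrations beyond the first occurrence of each (name, postcode) pair
print(df_customer.duplicated(subset=['customer_name', 'postal_cd']).sum())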
P-088: Based on the data created in 087, create customer data to which an integration ID is assigned. The integration ID shall be assigned according to the following specifications.
- Non-duplicated customers: set customer ID (customer_id)
- Duplicate customers: set the customer ID of the record extracted in the previous question
Also check the difference between the number of unique customer IDs and the number of unique integration IDs.
df_customer_n = pd.merge(df_customer,
df_customer_u[['customer_name',
'postal_cd', 'customer_id']],
how='inner', on =['customer_name', 'postal_cd'])
df_customer_n.rename(columns={'customer_id_x':'customer_id',
'customer_id_y':'integration_id'}, inplace=True)
print('Difference in number of IDs', len(df_customer_n['customer_id'].unique())
- len(df_customer_n['integration_id'].unique()))
Difference in number of IDs 30
Commentary :
This code merges two dataframes, `df_customer` and `df_customer_u`, to match customers based on their name and postal code.
The resulting merged dataframe is stored in a new dataframe called `df_customer_n`. The `merge` function joins the two dataframes, and the argument `on=['customer_name', 'postal_cd']` specifies that the merge is performed on these two columns.
The merged dataframe contains two customer ID columns, `customer_id_x` and `customer_id_y`. To avoid confusion, the `rename` function renames them to `customer_id` and `integration_id`, respectively.
The last line compares the number of unique values in the `customer_id` column with the number of unique values in the `integration_id` column. A non-zero difference is the number of duplicate registrations that were collapsed onto another record's ID; a difference of zero would mean there were no duplicates.
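To see which records were affected, a small follow-up (a sketch, not from the original answer) lists the rows whose integration ID differs from their original customer ID:
# Rows where a duplicate registration was mapped onto another record's customer ID
df_customer_n.query('customer_id != integration_id')[
    ['customer_id', 'integration_id', 'customer_name', 'postal_cd']].head(10)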
Note: df_customer_1 and df_customer_n are no longer used from here on, so they are deleted.
del df_customer_1
del df_customer_n
P-089: You want to split customers with sales records into training data and test data for building a forecasting model. Split the data randomly in the ratio of 8:2 respectively.
df_sales_customer = df_receipt.groupby('customer_id') \
.agg({'amount':sum}).reset_index()
df_sales_customer = df_sales_customer.query('amount > 0')
df_tmp = pd.merge(df_customer, df_sales_customer['customer_id'],
how='inner', on='customer_id')
df_train, df_test = train_test_split(df_tmp, test_size=0.2, random_state=71)
print('Percentage of training data: ', len(df_train) / len(df_tmp))
print('Test Data Percentage: ', len(df_test) / len(df_tmp))
Percentage of training data: 0.7999036840837949
Test Data Percentage: 0.20009631591620516
Commentary :
This code performs a train-test split of the customers based on their sales records.
First, it aggregates the sales of each customer from the `df_receipt` dataframe by grouping by `customer_id` and summing the `amount` column. The resulting dataframe is stored in `df_sales_customer`.
Next, it filters out the customers without a positive sales amount using the `query` method, keeping only rows with `amount > 0` in `df_sales_customer`.
Then, it performs an inner join of `df_customer` with the `customer_id` column of `df_sales_customer`, so that only customers with a sales record remain. The resulting dataframe is stored in `df_tmp`.
Finally, it splits `df_tmp` into training and test sets using the `train_test_split` function from scikit-learn, with a test size of 0.2 and a random state of 71, and prints the percentage of data in each set.
P-090: The receipt details data (df_receipt) has data from 1 Jan 2017 to 31 Oct 2019. Create three sets of data for building time-series models, 12 months for training and 6 months for testing, by aggregating the sales amount (amount) on a monthly basis.
# Code example 1 (home-made function)
df_ts_amount = df_receipt[['sales_ymd', 'amount']].copy()
df_ts_amount['sales_ym'] = df_ts_amount['sales_ymd'].astype('str').str[0:6]
df_ts_amount = df_ts_amount.groupby('sales_ym') \
.agg({'amount':'sum'}).reset_index()
# Wrapping the split in a function makes it easy to create many data sets over a long period, e.g. in a loop.
def split_data(df, train_size, test_size, slide_window, start_point):
    train_start = start_point * slide_window
    test_start = train_start + train_size
    return df[train_start:test_start], df[test_start:test_start + test_size]
df_train_1, df_test_1 = split_data(df_ts_amount, train_size=12,
test_size=6, slide_window=6, start_point=0)
df_train_2, df_test_2 = split_data(df_ts_amount, train_size=12,
test_size=6, slide_window=6, start_point=1)
df_train_3, df_test_3 = split_data(df_ts_amount, train_size=12,
test_size=6, slide_window=6, start_point=2)
# df_train_2 and df_train_3 display omitted
df_train_1
sales_ym | amount | |
---|---|---|
0 | 201701 | 902056 |
1 | 201702 | 764413 |
2 | 201703 | 962945 |
3 | 201704 | 847566 |
4 | 201705 | 884010 |
5 | 201706 | 894242 |
6 | 201707 | 959205 |
7 | 201708 | 954836 |
8 | 201709 | 902037 |
9 | 201710 | 905739 |
10 | 201711 | 932157 |
11 | 201712 | 939654 |
# df_test_2 and df_test_3 display omitted
df_test_1
sales_ym | amount | |
---|---|---|
12 | 201801 | 944509 |
13 | 201802 | 864128 |
14 | 201803 | 946588 |
15 | 201804 | 937099 |
16 | 201805 | 1004438 |
17 | 201806 | 1012329 |
Commentary :
The code performs the following tasks:
A dataframe `df_ts_amount` is created with the columns `sales_ymd` and `amount`, and a `sales_ym` column is derived from the first six characters of `sales_ymd` converted to a string.
`df_ts_amount` is then grouped by the `sales_ym` column, the `amount` column is aggregated with the `sum` function, and the index is reset. The result is stored back in `df_ts_amount`.
A function `split_data` is defined that takes a dataframe `df`, a train size, a test size, a slide window, and a start point as arguments, and returns two dataframes: a train and a test dataframe.
The `split_data` function is called three times with different start points to split `df_ts_amount` into train and test dataframes. The three sets are stored in `df_train_1`, `df_test_1`, `df_train_2`, `df_test_2`, `df_train_3`, and `df_test_3`.
The purpose of splitting the data into multiple sets is to use them as training and validation sets for a time-series forecasting model. By training on different time periods and evaluating on different test periods, overfitting is easier to detect and the estimate of the model's performance is more reliable.
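As noted in the code comment above, the function also lends itself to a loop. An equivalent sketch of the three explicit calls:
# Produce the three train/test pairs in a loop instead of three separate calls
splits = [split_data(df_ts_amount, train_size=12, test_size=6,
                     slide_window=6, start_point=i) for i in range(3)]
df_train_1, df_test_1 = splits[0]
df_train_2, df_test_2 = splits[1]
df_train_3, df_test_3 = splits[2]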
# Code example 2 (TimeSeriesSplit in scikit-learn)
tscv = TimeSeriesSplit(gap=0, max_train_size=12, n_splits=3, test_size=6)
# TimeSeriesSplit allocates the splits so that the most recent data is used,
# so the data period is adjusted here to match the SQL and R solution examples.
# This is unnecessary if you simply want to use the most recent data.
df_ts_amount = df_ts_amount.query('sales_ym <= "201906"')
series_list = []
for train_index, test_index in tscv.split(df_ts_amount):
    series_list.append((df_ts_amount.loc[train_index],
                        df_ts_amount.loc[test_index]))
df_train_1, df_test_1 = series_list[0]
df_train_2, df_test_2 = series_list[1]
df_train_3, df_test_3 = series_list[2]
# df_train_2 and df_train_3 display omitted
df_train_1
sales_ym | amount | |
---|---|---|
0 | 201701 | 902056 |
1 | 201702 | 764413 |
2 | 201703 | 962945 |
3 | 201704 | 847566 |
4 | 201705 | 884010 |
5 | 201706 | 894242 |
6 | 201707 | 959205 |
7 | 201708 | 954836 |
8 | 201709 | 902037 |
9 | 201710 | 905739 |
10 | 201711 | 932157 |
11 | 201712 | 939654 |
# df_test_2 and df_test_3 display omitted
df_test_1
sales_ym | amount | |
---|---|---|
12 | 201801 | 944509 |
13 | 201802 | 864128 |
14 | 201803 | 946588 |
15 | 201804 | 937099 |
16 | 201805 | 1004438 |
17 | 201806 | 1012329 |
Commentary :
This code performs time-series cross-validation on the pandas DataFrame `df_ts_amount`. It uses the `TimeSeriesSplit` class from the `sklearn.model_selection` module to generate three train-test splits. `TimeSeriesSplit` is a cross-validator that provides train/test indices for splitting time-series data samples. Four parameters are used here:
- `n_splits`: the number of splits
- `max_train_size`: the maximum size of the training data
- `test_size`: the size of the test data
- `gap`: the number of samples to exclude between the training and test data
The code sets `n_splits=3`, `max_train_size=12`, `test_size=6`, and `gap=0` to create three splits with a training period of 12 months and a testing period of 6 months.
The code then filters the data to include only sales up to June 2019 (`df_ts_amount.query('sales_ym <= "201906"')`) and creates a list `series_list` to store the train and test sets generated by `TimeSeriesSplit`. The for loop iterates over the splits and appends each one to `series_list`. Finally, the code assigns the first train-test split to `df_train_1` and `df_test_1`, the second to `df_train_2` and `df_test_2`, and the third to `df_train_3` and `df_test_3`.
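A short inspection (a sketch, not part of the original answer) makes the sliding 12-month/6-month window visible by printing the row-index range of each split:
# Print the row-index range of each training and test window
for i, (train_index, test_index) in enumerate(tscv.split(df_ts_amount), 1):
    print(f'split {i}: train {train_index.min()}-{train_index.max()}, '
          f'test {test_index.min()}-{test_index.max()}')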
P-091: For each customer in the customer data (df_customer), under-sample the number of customers with sales performance and the number of customers without sales performance so that the ratio is 1:1.
df_tmp = df_receipt.groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_tmp = pd.merge(df_customer, df_tmp, how='left', on='customer_id')
df_tmp['is_buy_flag'] = np.where(df_tmp['amount'].isnull(), 0, 1)
rs = RandomUnderSampler(random_state=71)
df_down_sampling, _ = rs.fit_resample(df_tmp, df_tmp.is_buy_flag)
print('Number of cases 0', len(df_down_sampling.query('is_buy_flag == 0')))
print('Number of cases 1', len(df_down_sampling.query('is_buy_flag == 1')))
Number of cases 0 8306
Number of cases 1 8306
Commentary :
This code performs random under-sampling to balance the binary target variable `is_buy_flag`.
First, the total purchase amount for each customer is computed with `groupby` and stored in `df_tmp`. Then `df_customer` and `df_tmp` are merged with a left join so that all customers are included, even those who have not made any purchases. The `is_buy_flag` column is created with `np.where`, assigning 1 to customers who made purchases and 0 to those who did not.
Next, the `RandomUnderSampler` class from the `imblearn` library is used to balance the target variable. The `fit_resample` method is applied to `df_tmp` with the `is_buy_flag` column as the target. This returns a new DataFrame, `df_down_sampling`, with an equal number of cases for both target values.
Finally, the number of cases with `is_buy_flag` values of 0 and 1 are printed to confirm that the random under-sampling was successful.
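A pandas-only alternative for reference (a sketch, assuming pandas 1.1 or later for GroupBy.sample; not the original answer): sample from each class as many rows as the smaller class contains, which gives the same 1:1 ratio without imbalanced-learn.
# Under-sample to a 1:1 ratio without imbalanced-learn
n_min = df_tmp['is_buy_flag'].value_counts().min()
df_down_sampling_pd = df_tmp.groupby('is_buy_flag').sample(n=n_min, random_state=71)
print(df_down_sampling_pd['is_buy_flag'].value_counts())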
P-092: Normalise the gender information in the customer data (df_customer) to third normal form.
df_gender_std = df_customer[['gender_cd', 'gender']].drop_duplicates()
df_customer_std = df_customer.drop(columns='gender')
# Check data content
df_customer_std.head(3)
customer_id | customer_name | gender_cd | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | |
---|---|---|---|---|---|---|---|---|---|---|
0 | CS021313000114 | 大野 あや子 | 1 | 1981-04-29 | 37 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20150905 | 0-00000000-0 |
1 | CS037613000071 | 六角 雅彦 | 9 | 1952-04-01 | 66 | 136-0076 | 東京都江東区南砂********** | S13037 | 20150414 | 0-00000000-0 |
2 | CS031415000172 | 宇多田 貴美子 | 1 | 1976-10-04 | 42 | 151-0053 | 東京都渋谷区代々木********** | S13031 | 20150529 | D-20100325-C |
# Check data content
df_gender_std.head(3)
gender_cd | gender | |
---|---|---|
0 | 1 | 女性 |
1 | 9 | 不明 |
5 | 0 | 男性 |
Commentary :
In this code, `df_gender_std` is created by selecting the `gender_cd` and `gender` columns from the `df_customer` dataframe and removing duplicate rows with the `drop_duplicates()` method. This produces a new dataframe with the unique combinations of `gender_cd` and `gender` values.
`df_customer_std` is created by dropping the `gender` column from the `df_customer` dataframe using the `drop()` method with the `columns` parameter set to `'gender'`. This produces a new dataframe with all the columns from `df_customer` except `gender`.
The purpose of this code is to split the customer data into two tables: one holding the unique `gender_cd`-to-`gender` mapping, and one holding all customer columns except `gender`. This removes the transitive dependency of `gender` on `gender_cd` and brings the data into third normal form.
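To confirm that the decomposition is lossless, a small sketch (not from the original answer) re-joins the two tables and recovers the gender name:
# The gender name can always be recovered by joining back on gender_cd
df_rejoined = df_customer_std.merge(df_gender_std, how='left', on='gender_cd')
df_rejoined[['customer_id', 'gender_cd', 'gender']].head(3)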
P-093: Product data (df_product) only holds the code values for each category, but not the category names. Combine with category data (df_category) and denormalise to create new product data that holds category names.
df_product_full = pd.merge(df_product, df_category[['category_small_cd',
'category_major_name',
'category_medium_name',
'category_small_name']],
how = 'inner', on = 'category_small_cd')
# Check data content
df_product_full.head(3)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name | |
---|---|---|---|---|---|---|---|---|---|
0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
This code performs an inner join between the dataframes `df_product` and `df_category` on the column `category_small_cd`. The resulting dataframe `df_product_full` contains all the columns from `df_product` plus the selected columns from `df_category`, matched on `category_small_cd`.
The added columns from `df_category` are `category_major_name`, `category_medium_name`, and `category_small_name`, which hold the category names for each product. Because the join uses `how='inner'`, only rows with matching values in both dataframes are kept. The `.head(3)` method displays the first 3 rows of the merged dataframe.
P-094: Output the product data with category names created in 093 to a file with the following specifications.
|File format|With/without header|Character encoding|
|:-:|:-:|:-:|
|CSV (comma separated)|With|UTF-8|
The path of the file output destination shall be as follows.
|Output|
|:-:|
|./data|
# Code example 1.
# Note: the answer notebook is located in a different directory from the question notebook, so the output path is '../data'.
df_product_full.to_csv('../data/P_df_product_full_UTF-8_header.csv',
encoding='UTF-8', index=False)
Commentary :
This code exports the Pandas DataFrame `df_product_full` to a CSV file at the path `../data/P_df_product_full_UTF-8_header.csv`. The exported CSV file uses UTF-8 encoding to support a wide range of characters, and the argument `index=False` specifies that the row index of the DataFrame should not be included in the exported file.
# Code example 2 (with BOM to prevent garbled Excel)
df_product_full.to_csv('../data/P_df_product_full_UTF-8BOM_header.csv',
encoding='utf_8_sig', index=False)
Commentary :
This line of code exports the `df_product_full` DataFrame to a CSV file named `P_df_product_full_UTF-8BOM_header.csv`.
The `to_csv()` method of a pandas DataFrame writes the contents of the DataFrame to a CSV file. It takes several parameters, including the file path and the encoding to use.
In this case, the path is set to `'../data/P_df_product_full_UTF-8BOM_header.csv'`, where the `'../data/'` prefix specifies the relative directory in which the file is saved.
The `encoding` parameter is set to `'utf_8_sig'`, a variant of UTF-8 that writes a byte order mark (BOM) at the beginning of the file. The BOM is a special marker that helps programs such as Excel recognise the encoding when opening the file.
The `index` parameter is set to `False`, so the DataFrame index is not included in the CSV file.
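A quick read-back check (a sketch, not part of the original answer): pandas can re-import the BOM variant with the same encoding name, so the round trip is easy to verify.
# Re-read the BOM file to confirm it loads correctly
pd.read_csv('../data/P_df_product_full_UTF-8BOM_header.csv', encoding='utf_8_sig').head(3)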
P-095: Output the product data with category names created in 093 to a file with the following specifications.
|File format|With/without header|Character encoding|
|:-:|:-:|:-:|
|CSV (comma separated)|With|CP932|
The path of the file output destination shall be as follows.
|Output|
|:-:|
|./data|
# Note: the answer notebook is located in a different directory from the question notebook, so the output path is '../data'.
df_product_full.to_csv('../data/P_df_product_full_CP932_header.csv',
encoding='CP932', index=False)
Commentary :
This line of code exports the pandas DataFrame `df_product_full` to a CSV file named "P_df_product_full_CP932_header.csv" in the "../data/" directory.
The exported CSV file uses the "CP932" character encoding, a Japanese character encoding also known as "Windows-31J". This encoding is commonly used in Japan for text data and is compatible with Windows operating systems.
The parameter `encoding='CP932'` specifies the character encoding to use when exporting the CSV file, and `index=False` specifies that the index column should not be included.
P-096: Output the product data with category names created in 093 to a file with the following specifications.
|File format|With/without header|Character encoding|
|:-:|:-:|:-:|
|CSV (comma separated)|Without|UTF-8|
The path of the file output destination shall be as follows.
|Output|
|:-:|
|./data|
# Note: the answer notebook is located in a different directory from the question notebook, so the output path is '../data'.
df_product_full.to_csv('../data/P_df_product_full_UTF-8_noh.csv',
header=False, encoding='UTF-8', index=False)
Commentary :
This code saves the `df_product_full` DataFrame to a CSV file called "P_df_product_full_UTF-8_noh.csv" in the "../data/" directory.
The first argument specifies the filename and path where the CSV file will be saved. The `header` argument is set to `False`, so the header row of the DataFrame is not included in the output file. The `encoding` argument is set to `'UTF-8'`, which specifies the character encoding to use when writing the file. Finally, `index` is set to `False`, so the index of the DataFrame is not included either.
Overall, this code exports the `df_product_full` DataFrame to a CSV file without a header row, using UTF-8 encoding.
P-097: Load the file in the following format created in 094, display three data items and check that they have been imported correctly.
|File format|With/without header|Character encoding|
|:-:|:-:|:-:|
|CSV (comma separated)|With|UTF-8|
# Note: the answer notebook is located in a different directory from the question notebook, so the file path is '../data'.
df_product_full = pd.read_csv('../data/P_df_product_full_UTF-8_header.csv',
dtype={'category_major_cd':str,
'category_medium_cd':str,
'category_small_cd':str},
encoding='UTF-8')
df_product_full.head(3)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name | |
---|---|---|---|---|---|---|---|---|---|
0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code above reads the CSV file `P_df_product_full_UTF-8_header.csv`, which was written in 094 with a header row and UTF-8 encoding.
`pd.read_csv('../data/P_df_product_full_UTF-8_header.csv', dtype={'category_major_cd':str, 'category_medium_cd':str, 'category_small_cd':str}, encoding='UTF-8')`: this reads the CSV file using its header row for the column names, and forces the three category code columns to the string data type so that leading zeros such as "04" are preserved.
`df_product_full.head(3)`: this displays the first 3 rows of the dataframe `df_product_full`, confirming that the file was imported correctly.
P-098: Load the file in the following format created in 096, display three data items and check that they are imported correctly.
|File format|With/without header|Character encoding|
|:-:|:-:|:-:|
|CSV (comma separated)|Without|UTF-8|
# Code example 1 (item names to be added later).
# Note: the answer notebook is located in a different directory from the question notebook, so the file path is '../data'.
df_product_full = pd.read_csv('../data/P_df_product_full_UTF-8_noh.csv',
dtype={1:str,
2:str,
3:str},
encoding='UTF-8', header=None)
df_product_full.columns = ['product_cd','category_major_cd',
'category_medium_cd', 'category_small_cd',
'unit_price','unit_cost','category_major_name',
'category_medium_name', 'category_small_name']
df_product_full.head(3)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name | |
---|---|---|---|---|---|---|---|---|---|
0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
This code reads the CSV file 'P_df_product_full_UTF-8_noh.csv' located in the '../data' directory. Because the file has no header row, `header=None` is passed, and the `dtype` parameter is keyed by column position: `{1:str, 2:str, 3:str}` forces the second, third and fourth columns (the category codes) to string type. The `encoding` parameter specifies that the file is encoded in UTF-8. The columns of the resulting DataFrame are initially numbered 0 through 8, so the next statement assigns the proper column names via `df_product_full.columns = [...]`.
The resulting DataFrame is assigned to the variable `df_product_full` and the first three rows are displayed using the `head` method.
# Code example 2 (define column names first)
c_names = ['product_cd','category_major_cd','category_medium_cd',
'category_small_cd','unit_price','unit_cost',
'category_major_name','category_medium_name','category_small_name']
df_product_full = pd.read_csv('../data/P_df_product_full_UTF-8_noh.csv',
names=c_names,
dtype={'category_major_cd':str,
'category_medium_cd':str,
'category_small_cd':str},
encoding='UTF-8', header=None)
df_product_full.head(3)
| | product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
| 1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
| 2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
This code reads a CSV file into a pandas DataFrame with customized column names using the `read_csv` function.
Here is an explanation of the code:
- `c_names`: A list of strings that defines the customized column names.
- `pd.read_csv()`: A pandas function that reads a CSV file and returns a DataFrame.
- `'../data/P_df_product_full_UTF-8_noh.csv'`: The file path of the CSV file to be read.
- `names=c_names`: Specifies the customized column names using the `c_names` list.
- `dtype={'category_major_cd':str, 'category_medium_cd':str, 'category_small_cd':str}`: Specifies that the data type of the columns `category_major_cd`, `category_medium_cd`, and `category_small_cd` should be `str`.
- `encoding='UTF-8'`: Specifies that the file encoding is 'UTF-8'.
- `header=None`: Specifies that there is no header row in the CSV file.
Finally, the resulting DataFrame is displayed using the `head()` method to show the first three rows.
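Both examples should yield exactly the same DataFrame. If you want to verify that explicitly, the optional sketch below loads the file both ways under the hypothetical names `df_a` and `df_b` and compares them; it assumes the same '../data/P_df_product_full_UTF-8_noh.csv' path as above.
# Minimal sketch: the two loading styles in P-098 produce identical DataFrames.
import pandas as pd

c_names = ['product_cd', 'category_major_cd', 'category_medium_cd',
           'category_small_cd', 'unit_price', 'unit_cost',
           'category_major_name', 'category_medium_name', 'category_small_name']

# Variant 1: read without a header, then assign column names afterwards.
df_a = pd.read_csv('../data/P_df_product_full_UTF-8_noh.csv',
                   dtype={1: str, 2: str, 3: str},
                   encoding='UTF-8', header=None)
df_a.columns = c_names

# Variant 2: pass the column names up front via `names`.
df_b = pd.read_csv('../data/P_df_product_full_UTF-8_noh.csv',
                   names=c_names,
                   dtype={'category_major_cd': str,
                          'category_medium_cd': str,
                          'category_small_cd': str},
                   encoding='UTF-8', header=None)

print(df_a.equals(df_b))  # expected: True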
P-099: Output the product data with category names created in 093 to a file with the following specifications.
| File format | With/without header | Character encoding |
|:--:|:--:|:--:|
| TSV (Tab Separated) | With | UTF-8 |
The path of the file output destination shall be as follows
| Output |
|:--:|
| ./data |
# Note that the answer file is stored in a different location from the question file, so the path is '../data'.
df_product_full.to_csv('../data/P_df_product_full_UTF-8_header.tsv',
                       sep='\t', encoding='UTF-8', index=False)
Commentary :
The code writes the contents of a pandas DataFrame called `df_product_full` to a TSV (Tab-Separated Values) file located at `'../data/P_df_product_full_UTF-8_header.tsv'`.
The `.to_csv()` method is called on the `df_product_full` DataFrame; it is the method pandas provides for exporting data to a CSV (Comma-Separated Values) or TSV file.
The first argument, `'../data/P_df_product_full_UTF-8_header.tsv'`, is the file path and name where the DataFrame should be saved. The file will be saved in the `../data/` directory with the name `P_df_product_full_UTF-8_header.tsv`.
The second argument, `sep='\t'`, specifies that the values in the file should be separated by tabs instead of commas.
The third argument, `encoding='UTF-8'`, specifies that the encoding of the file should be UTF-8, which is a standard encoding for Unicode characters.
The fourth argument, `index=False`, specifies that the DataFrame's index should not be included in the output file.
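As an optional check (not part of the original answer), you could read back the first line of the exported file and confirm that it is tab-delimited and starts with the header. The path assumes the '../data/' output location used above.
# Minimal sketch: confirm the output file is tab-separated and keeps its header.
with open('../data/P_df_product_full_UTF-8_header.tsv', encoding='UTF-8') as f:
    header_line = f.readline().rstrip('\n')

print('\t' in header_line)          # expected: True (tab-delimited)
print(header_line.split('\t')[:3])  # expected: the first three column names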
P-100: Load the file in the following format created in 099, display three items of data and check that they have been imported correctly.
| File format | With/without header | Character encoding |
|:--:|:--:|:--:|
| TSV (Tab Separated) | With | UTF-8 |
# Code example 1 (read_table).
# Note that the answer file is stored in a different location from the question file, so the path is '../data'.
df_product_full = pd.read_table('../data/P_df_product_full_UTF-8_header.tsv',
dtype={'category_major_cd':str,
'category_medium_cd':str,
'category_small_cd':str},
encoding='UTF-8')
df_product_full.head(3)
| | product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
| 1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
| 2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code reads data from a TSV (Tab-Separated Values) file located at `'../data/P_df_product_full_UTF-8_header.tsv'` into a pandas DataFrame called `df_product_full`.
The `pd.read_table()` function is called to read the data from the file. The first argument, `'../data/P_df_product_full_UTF-8_header.tsv'`, specifies the path and name of the file to read. Because `read_table` uses a tab as its default separator, no `sep` argument is needed.
The second argument, `dtype={'category_major_cd':str, 'category_medium_cd':str, 'category_small_cd':str}`, specifies the data types for certain columns: `category_major_cd`, `category_medium_cd`, and `category_small_cd` are read as strings (`str`).
The third argument, `encoding='UTF-8'`, specifies that the encoding of the file is UTF-8.
The resulting DataFrame, `df_product_full`, is then displayed with the `head(3)` method, which shows its first three rows.
# Code example 2 (read_csv)
df_product_full = pd.read_csv('../data/P_df_product_full_UTF-8_header.tsv',
dtype={'category_major_cd':str,
'category_medium_cd':str,
'category_small_cd':str},
sep='\t', encoding='UTF-8')
df_product_full.head(3)
| | product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
| 1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
| 2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code reads data from the same TSV (Tab-Separated Values) file, `'../data/P_df_product_full_UTF-8_header.tsv'`, into a pandas DataFrame called `df_product_full`, this time using `pd.read_csv()`.
The first argument, `'../data/P_df_product_full_UTF-8_header.tsv'`, specifies the path and name of the file to read.
The second argument, `dtype={'category_major_cd':str, 'category_medium_cd':str, 'category_small_cd':str}`, specifies the data types for certain columns: `category_major_cd`, `category_medium_cd`, and `category_small_cd` are read as strings (`str`).
The third argument, `sep='\t'`, specifies that the values in the file are separated by tabs instead of commas.
The fourth argument, `encoding='UTF-8'`, specifies that the encoding of the file is UTF-8.
The resulting DataFrame, `df_product_full`, is then displayed with the `head(3)` method, which shows its first three rows.
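The two examples are interchangeable: `pd.read_table` is essentially `pd.read_csv` with a tab separator by default. The optional sketch below loads the file both ways under the hypothetical names `df_tbl` and `df_csv` and checks that the results match.
# Minimal sketch: read_table and read_csv with sep='\t' load the same data.
import pandas as pd

dtype_spec = {'category_major_cd': str,
              'category_medium_cd': str,
              'category_small_cd': str}

df_tbl = pd.read_table('../data/P_df_product_full_UTF-8_header.tsv',
                       dtype=dtype_spec, encoding='UTF-8')
df_csv = pd.read_csv('../data/P_df_product_full_UTF-8_header.tsv',
                     dtype=dtype_spec, sep='\t', encoding='UTF-8')

print(df_tbl.equals(df_csv))  # expected: True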
This is the end of the 100 knocks. Well done!