Reference: "Data Science 100 Knocks (Structured Data Processing)" by the Data Scientist Society Skill Definition Committee (データサイエンティスト協会スキル定義委員会)
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society (データサイエンティスト協会) and translated from Japanese to English with DeepL.
-
I am sharing this file to spread this exercise set, which is useful for anyone who wants to practice Python, from beginners to advanced engineers.
-
Since the data was created for a Japanese audience, you may run into Japanese text while practicing, but it does not affect the exercises much.
Getting Started
- First, run the following cells
- Import the necessary libraries and read data from the database (PostgreSQL)
- The following cell imports libraries that are expected to be used, such as pandas.
- If you need other libraries, install them as required (you can also install them with `!pip install <library name>`).
- A task may be split across multiple cells.
- Names, addresses, etc. are dummy data and are not real.
import os
import pandas as pd
import numpy as np
from datetime import datetime, date
from dateutil.relativedelta import relativedelta
import math
import psycopg2
from sqlalchemy import create_engine
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from imblearn.under_sampling import RandomUnderSampler

if 'PG_PORT' in os.environ:
    pgconfig = {
        'host': 'db',
        'port': os.environ['PG_PORT'],
        'database': os.environ['PG_DATABASE'],
        'user': os.environ['PG_USER'],
        'password': os.environ['PG_PASSWORD'],
    }
    # Connector for pd.read_sql
    conn = psycopg2.connect(**pgconfig)
    df_customer = pd.read_sql(sql='select * from customer', con=conn)
    df_category = pd.read_sql(sql='select * from category', con=conn)
    df_product = pd.read_sql(sql='select * from product', con=conn)
    df_receipt = pd.read_sql(sql='select * from receipt', con=conn)
    df_store = pd.read_sql(sql='select * from store', con=conn)
    df_geocode = pd.read_sql(sql='select * from geocode', con=conn)
else:
    if not os.path.exists('data/'):
        !git clone https://github.com/The-Japan-DataScientist-Society/100knocks-preprocess
        os.chdir('100knocks-preprocess/docker/work/')
    dtype = {
        'customer_id': str,
        'gender_cd': str,
        'postal_cd': str,
        'application_store_cd': str,
        'status_cd': str,
        'category_major_cd': str,
        'category_medium_cd': str,
        'category_small_cd': str,
        'product_cd': str,
        'store_cd': str,
        'prefecture_cd': str,
        'tel_no': str,
        'street': str
    }
    df_customer = pd.read_csv("data/customer.csv", dtype=dtype)
    df_category = pd.read_csv("data/category.csv", dtype=dtype)
    df_product = pd.read_csv("data/product.csv", dtype=dtype)
    df_receipt = pd.read_csv("data/receipt.csv", dtype=dtype)
    df_store = pd.read_csv("data/store.csv", dtype=dtype)
    df_geocode = pd.read_csv("data/geocode.csv", dtype=dtype)
fatal: destination path '100knocks-preprocess' already exists and is not an empty directory.
Exercise
P-081: Create new product data in which the missing values of unit price (unit_price) and unit cost (unit_cost) are filled with the mean of each column. For the mean values, round to the nearest yen (either rounding half up or rounding half to even is acceptable). After completion, also check that there are no missing values for each item.
# Code example 1 (fillna in Pandas)
df_product_2 = df_product.fillna({
'unit_price':np.round(np.nanmean(df_product['unit_price'])),
'unit_cost':np.round(np.nanmean(df_product['unit_cost']))})
df_product_2.isnull().sum()
product_cd            0
category_major_cd     0
category_medium_cd    0
category_small_cd     0
unit_price            0
unit_cost             0
dtype: int64
Commentary :
This code performs two operations on the pandas DataFrame `df_product`:
It fills missing values in the `unit_price` and `unit_cost` columns with the rounded mean of each column using the `fillna()` method. The rounded means are calculated with NumPy's `np.round()` and `np.nanmean()` functions. `fillna()` takes a dictionary whose keys are column names and whose values are the fill values.
It then checks for remaining missing values with `isnull()` and sums them per column with `sum()`. The output is the number of missing values in each column of the updated DataFrame `df_product_2`.
In `df_product_2`, the missing `unit_price` and `unit_cost` values are replaced with their respective rounded means, and the output of the last line shows how many missing values remain after the `fillna` operation.
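The exercise allows either rounding rule; note that `np.round()` rounds halves to the nearest even number (banker's rounding). A minimal sketch of how the two acceptable rules can be compared, assuming `df_product` is already loaded:
# Compare the two acceptable rounding rules for the mean unit price (sketch).
mean_price = np.nanmean(df_product['unit_price'])
half_to_even = np.round(mean_price)       # banker's rounding, as used above
half_up = np.floor(mean_price + 0.5)      # conventional "round half up"
print(mean_price, half_to_even, half_up)  # either rounded value is acceptable here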
# Code example 2 (SimpleImputer in scikit-learn)
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_values = imp_mean.fit_transform(df_product[['unit_price', 'unit_cost']])
df_product_2 = df_product.copy()
df_product_2[['unit_price', 'unit_cost']] = imp_values.round()
df_product_2.isnull().sum()
product_cd            0
category_major_cd     0
category_medium_cd    0
category_small_cd     0
unit_price            0
unit_cost             0
dtype: int64
Commentary :
This code performs the following operations:
It uses the `SimpleImputer` class from the `sklearn.impute` module (imported in the setup cell), which imputes missing values in a dataset.
It creates an instance of `SimpleImputer` called `imp_mean`. This imputer replaces missing values (represented as `np.nan`) with the mean of the corresponding column; the imputation strategy is set to `'mean'`.
It applies the imputer to the `unit_price` and `unit_cost` columns of `df_product` using `fit_transform()`, which returns a NumPy array of imputed values.
It creates a copy of the original `df_product` DataFrame called `df_product_2` using `copy()`.
It assigns `imp_values.round()` to the `unit_price` and `unit_cost` columns of `df_product_2`; `round()` rounds the imputed values to the nearest integer.
It checks for remaining missing values with `isnull()` and sums them per column with `sum()`.
The final output shows the number of missing values in `df_product_2` after imputing with the column means. This approach is useful for handling missing data, especially when only a small number of values are missing.
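As an optional sanity check (a sketch assuming the original, unfilled `df_product` is still in memory), you can count how many values were actually missing before imputation:
# How many values were missing before imputation (sketch).
print(df_product[['unit_price', 'unit_cost']].isnull().sum())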
P-082: Create new product data for missing values of unit_price(unit_price) and unit_cost(unit_cost), supplemented by the median value of each. Note that for the median value, round off the figures to the nearest yen (rounding off or rounding to even numbers is acceptable). After completion, also check that there are no missing values for each item.
# Code example 1 (fillna in Pandas)
df_product_3 = df_product.fillna({
'unit_price':np.round(np.nanmedian(df_product['unit_price'])),
'unit_cost':np.round(np.nanmedian(df_product['unit_cost']))})
df_product_3.isnull().sum()
product_cd            0
category_major_cd     0
category_medium_cd    0
category_small_cd     0
unit_price            0
unit_cost             0
dtype: int64
Commentary :
The code first fills the missing values in the `unit_price` and `unit_cost` columns of `df_product` using the `fillna` method:
The argument is a dictionary with the keys `unit_price` and `unit_cost`, each mapping to a fill value. These values are computed with NumPy's `nanmedian` function applied to the respective columns; `nanmedian` calculates the median of a column while ignoring NaN values.
Columns not listed in the dictionary are left untouched.
The result of this operation is a new DataFrame called `df_product_3`.
The next line checks whether any missing values remain in `df_product_3` by calling `isnull()` and then `sum()`, which returns a Series with the number of missing values per column. If the output shows no missing values in either column, the `fillna` operation was successful.
# Code example 2 (SimpleImputer in scikit-learn)
imp_mean = SimpleImputer(missing_values=np.nan, strategy='median')
imp_values = imp_mean.fit_transform(df_product[['unit_price', 'unit_cost']])
df_product_3 = df_product.copy()
df_product_3[['unit_price', 'unit_cost']] = imp_values.round()
df_product_3.isnull().sum()
product_cd            0
category_major_cd     0
category_medium_cd    0
category_small_cd     0
unit_price            0
unit_cost             0
dtype: int64
Commentary :
This code performs imputation of missing values in the "unit_price" and "unit_cost" columns of the DataFrame "df_product" using scikit-learn's SimpleImputer.
The first line initializes a `SimpleImputer` instance named `imp_mean`. Despite the variable name, the strategy here is `'median'`, so missing values are replaced with the median of the non-missing values; the `missing_values` argument specifies that NaN marks a missing value.
The second line calls `fit_transform` on the `unit_price` and `unit_cost` columns of `df_product` and returns a NumPy array named `imp_values` containing the imputed values.
The third line copies `df_product` into `df_product_3` so that the original DataFrame is not modified.
The fourth line assigns the imputed values to the `unit_price` and `unit_cost` columns of `df_product_3`; `round()` rounds them to the nearest integer.
The last line checks for remaining missing values with `isnull()` followed by `sum()`. If neither column has missing values, the imputation was successful.
P-083: Create new product data in which the missing values of unit price (unit_price) and unit cost (unit_cost) are filled with the median calculated for each product sub-category code (category_small_cd). For the median values, round to the nearest yen (either rounding half up or rounding half to even is acceptable). After completion, also check that there are no missing values for each item.
# Code example 1
df_tmp = (df_product.groupby('category_small_cd')
.agg(median_price=('unit_price', 'median'),
median_cost=('unit_cost', 'median')).reset_index())
df_product_4 = pd.merge(df_product, df_tmp, how='inner', on='category_small_cd')
df_product_4['unit_price'] = df_product_4[['unit_price', 'median_price']]. \
apply(lambda x: np.round(x[1]) if np.isnan(x[0]) else x[0], axis=1)
df_product_4['unit_cost'] = df_product_4[['unit_cost', 'median_cost']]. \
apply(lambda x: np.round(x[1]) if np.isnan(x[0]) else x[0], axis=1)
df_product_4.isnull().sum()
product_cd            0
category_major_cd     0
category_medium_cd    0
category_small_cd     0
unit_price            0
unit_cost             0
median_price          0
median_cost           0
dtype: int64
Commentary :
This code performs imputation of missing values in the "unit_price" and "unit_cost" columns of the DataFrame "df_product" using the median value of each category. The code achieves this in the following steps:
The first line groups the DataFrame "df_product" by the "category_small_cd" column and computes the median values of "unit_price" and "unit_cost" for each group. The resulting DataFrame is named "df_tmp" and has three columns: "category_small_cd", "median_price", and "median_cost".
The second line uses pandas' `merge()` to join `df_product` and `df_tmp` on the `category_small_cd` column. The resulting DataFrame, `df_product_4`, contains all columns of `df_product` plus the `median_price` and `median_cost` columns from `df_tmp`.
The third line fills the `unit_price` column of `df_product_4` by applying a lambda function to the `unit_price` and `median_price` columns: if `unit_price` is missing (NaN), it returns the rounded `median_price`; otherwise it keeps the original `unit_price`.
The fourth line fills the `unit_cost` column in the same way, using `unit_cost` and `median_cost`.
The last line checks for remaining missing values with `isnull()` followed by `sum()`. If neither column has missing values, the imputation was successful.
# Code example 2 (use of MASK)
df_tmp = (df_product.groupby('category_small_cd')
.agg(median_price=('unit_price', 'median'),
median_cost=('unit_cost', 'median')).reset_index())
df_product_4 = df_product.merge(df_tmp, how='inner', on='category_small_cd')
df_product_4['unit_price'] = (df_product_4['unit_price']
.mask(df_product_4['unit_price'].isnull(),
df_product_4['median_price'].round()))
df_product_4['unit_cost'] = (df_product_4['unit_cost']
.mask(df_product_4['unit_cost'].isnull(),
df_product_4['median_cost'].round()))
df_product_4.isnull().sum()
product_cd            0
category_major_cd     0
category_medium_cd    0
category_small_cd     0
unit_price            0
unit_cost             0
median_price          0
median_cost           0
dtype: int64
Commentary :
This code performs imputation of missing values in the "unit_price" and "unit_cost" columns of the DataFrame "df_product" using the median value of each category. The code achieves this in the following steps:
The first line groups the DataFrame "df_product" by the "category_small_cd" column and computes the median values of "unit_price" and "unit_cost" for each group. The resulting DataFrame is named "df_tmp" and has three columns: "category_small_cd", "median_price", and "median_cost".
The second line uses `merge()` to join `df_product` and `df_tmp` on `category_small_cd`. The resulting `df_product_4` contains all columns of `df_product` plus `median_price` and `median_cost` from `df_tmp`.
The third line fills the `unit_price` column with `mask()`: wherever `unit_price` is missing (NaN), it is replaced with the rounded `median_price`.
The fourth line fills the `unit_cost` column in the same way using `median_cost`.
The last line checks for remaining missing values with `isnull()` followed by `sum()`. If neither column has missing values, the imputation was successful.
# Code example 3 (utilising fillna and transform)
df_product_4 = df_product.copy()
for x in ['unit_price', 'unit_cost']:
    df_product_4[x] = (df_product_4[x]
                       .fillna(df_product_4.groupby('category_small_cd')[x]
                               .transform('median')
                               .round()))
df_product_4.isnull().sum()
product_cd            0
category_major_cd     0
category_medium_cd    0
category_small_cd     0
unit_price            0
unit_cost             0
dtype: int64
Commentary :
This code performs imputation of missing values in the "unit_price" and "unit_cost" columns of the DataFrame "df_product" using the median value of each category. The code achieves this in the following steps:
The first line creates a copy of the original DataFrame "df_product" and names it "df_product_4".
The second line initiates a for loop to iterate over the columns "unit_price" and "unit_cost".
The loop body fills the missing values in each column with `fillna()`, using the median of the column within each `category_small_cd` group. The group medians are computed with `groupby('category_small_cd')[x].transform('median')` and rounded to the nearest integer with `round()`.
The last line checks for remaining missing values with `isnull()` followed by `sum()`.
In short, the code copies the original DataFrame, fills each column's missing values with the rounded median of its sub-category, and verifies that no missing values remain.
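As an optional spot check (a sketch assuming code example 3's `df_product_4` is in memory, so its index matches `df_product`), the values filled into one sub-category should equal that sub-category's rounded median:
# Spot-check one sub-category: filled values should equal its rounded median (sketch).
cd = df_product['category_small_cd'].iloc[0]
expected = np.round(df_product.loc[df_product['category_small_cd'] == cd,
                                   'unit_price'].median())
filled = df_product_4.loc[(df_product_4['category_small_cd'] == cd)
                          & (df_product['unit_price'].isnull()), 'unit_price']
print(expected, filled.unique())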
P-084: Prepare new data for all customers in the customer data (df_customer) by calculating the percentage of the 2019 sales value to the sales value for the whole period. However, if there are no sales results, treat them as 0. Then, extract the calculated percentages that exceed 0 and display the results for 10 cases. Also check that there are no missing data in the data created.
df_receipt_2019 = df_receipt.query('20190101 <= sales_ymd <= 20191231') \
.groupby('customer_id') \
.agg(amount_2019=('amount', 'sum')) \
.reset_index()
df_receipt_all = df_receipt.groupby('customer_id')\
.agg(amount_all=('amount', 'sum')) \
.reset_index()
df_sales_rate = df_customer[['customer_id']] \
.merge(df_receipt_2019, how='left', on='customer_id') \
.merge(df_receipt_all, how='left', on='customer_id')
df_sales_rate['amount_2019'] = df_sales_rate['amount_2019'].fillna(0)
df_sales_rate['amount_all'] = df_sales_rate['amount_all'].fillna(0)
df_sales_rate['amount_rate'] = \
df_sales_rate[['amount_2019','amount_all']] \
.apply(lambda x: 0 if x[0] == 0 else x[0] / x[1], axis=1)
df_sales_rate['amount_rate'] = df_sales_rate['amount_rate'].fillna(0)
df_sales_rate.query('amount_rate > 0').head(10)
| | customer_id | amount_2019 | amount_all | amount_rate |
|---|---|---|---|---|
| 2 | CS031415000172 | 2971.0 | 5088.0 | 0.583923 |
| 6 | CS015414000103 | 874.0 | 3122.0 | 0.279949 |
| 12 | CS011215000048 | 248.0 | 3444.0 | 0.072009 |
| 15 | CS029415000023 | 3767.0 | 5167.0 | 0.729050 |
| 21 | CS035415000029 | 5823.0 | 7504.0 | 0.775986 |
| 23 | CS023513000066 | 208.0 | 771.0 | 0.269780 |
| 24 | CS035513000134 | 463.0 | 1565.0 | 0.295847 |
| 27 | CS001515000263 | 216.0 | 216.0 | 1.000000 |
| 30 | CS006415000279 | 229.0 | 229.0 | 1.000000 |
| 32 | CS031415000106 | 215.0 | 7741.0 | 0.027774 |
df_sales_rate.isnull().sum()
customer_id    0
amount_2019    0
amount_all     0
amount_rate    0
dtype: int64
Commentary :
This code processes the pandas DataFrame `df_receipt` to calculate each customer's 2019 sales as a share of their total sales. Here's a step-by-step explanation of what the code does:
The first statement creates `df_receipt_2019`, which filters `df_receipt` to sales from 2019 (`20190101 <= sales_ymd <= 20191231`), groups the rows by `customer_id`, and sums the `amount` column per group. The result has two columns: `customer_id` and `amount_2019`.
The second statement creates `df_receipt_all`, which groups `df_receipt` by `customer_id` and sums `amount` over the whole period. The result has two columns: `customer_id` and `amount_all`.
The third statement creates `df_sales_rate` by left-joining the `customer_id` column of `df_customer` with `df_receipt_2019` and `df_receipt_all`. Because it is a left join, every customer in `df_customer` is kept, even those without sales. The result has three columns: `customer_id`, `amount_2019`, and `amount_all`.
The next two lines fill missing values in `amount_2019` and `amount_all` with 0.
The following statement creates the `amount_rate` column by applying a lambda to `amount_2019` and `amount_all`: it returns 0 if `amount_2019` is 0, and the ratio `amount_2019 / amount_all` otherwise. This is each customer's 2019 share of their total sales.
The next line fills any remaining missing values in `amount_rate` with 0.
The last line filters `df_sales_rate` to rows with a positive `amount_rate` and displays the first 10 of them.
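A vectorised alternative to the row-wise `apply` (a sketch using the same column names as above; `np.where` returns 0 for customers whose total is 0 and the ratio otherwise):
# Vectorised computation of the sales-rate column (sketch).
df_sales_rate['amount_rate'] = np.where(df_sales_rate['amount_all'] == 0, 0,
                                        df_sales_rate['amount_2019'] / df_sales_rate['amount_all'])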
P-085: Create new customer data by linking geocode data (df_geocode) to all customers in the customer data (df_customer) using postcode (postal_cd). However, if multiple longitude (longitude) and latitude (latitude) information is associated with a single postcode (postal_cd), calculate and use the average values of longitude (longitude) and latitude (latitude). Also, display 10 results to check the creation results.
df_geocode_1 = df_geocode.groupby('postal_cd') \
.agg(m_longitude=('longitude', 'mean'),
m_latitude=('latitude', 'mean')).reset_index()
df_customer_1 = pd.merge(df_customer, df_geocode_1,
how='inner', on='postal_cd')
df_customer_1.head(10)
| | customer_id | customer_name | gender_cd | gender | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | m_longitude | m_latitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CS021313000114 | 大野 あや子 | 1 | 女性 | 1981-04-29 | 37 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20150905 | 0-00000000-0 | 139.31779 | 35.41358 |
| 1 | CS021303000023 | 堀 一徳 | 0 | 男性 | 1980-04-05 | 38 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20160411 | 0-00000000-0 | 139.31779 | 35.41358 |
| 2 | CS021303000007 | 石倉 俊二 | 0 | 男性 | 1987-07-04 | 31 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20150707 | 0-00000000-0 | 139.31779 | 35.41358 |
| 3 | CS021313000183 | 草野 未來 | 1 | 女性 | 1986-12-21 | 32 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20170611 | 0-00000000-0 | 139.31779 | 35.41358 |
| 4 | CS021314000098 | 筒井 れいな | 1 | 女性 | 1985-09-21 | 33 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20160901 | 0-00000000-0 | 139.31779 | 35.41358 |
| 5 | CS021314000093 | 江藤 美佐 | 1 | 女性 | 1986-06-03 | 32 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20151015 | 0-00000000-0 | 139.31779 | 35.41358 |
| 6 | CS021413000049 | 大野 幸子 | 1 | 女性 | 1973-04-17 | 45 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20150703 | 0-00000000-0 | 139.31779 | 35.41358 |
| 7 | CS037613000071 | 六角 雅彦 | 9 | 不明 | 1952-04-01 | 66 | 136-0076 | 東京都江東区南砂********** | S13037 | 20150414 | 0-00000000-0 | 139.83502 | 35.67193 |
| 8 | CS015415000209 | 大谷 倫子 | 1 | 女性 | 1970-11-25 | 48 | 136-0076 | 東京都江東区南砂********** | S13015 | 20150909 | B-20090610-C | 139.83502 | 35.67193 |
| 9 | CS037614000045 | 長沢 麻緒 | 1 | 女性 | 1952-06-19 | 66 | 136-0076 | 東京都江東区南砂********** | S13037 | 20150209 | 5-20091207-6 | 139.83502 | 35.67193 |
Commentary :
This code processes two pandas DataFrames, `df_geocode` and `df_customer`. Here's a step-by-step explanation of what the code does:
The first statement creates `df_geocode_1` by grouping `df_geocode` by `postal_cd` (postal code) and computing the mean longitude and latitude per group. The result has three columns: `postal_cd`, `m_longitude`, and `m_latitude`.
The second statement creates `df_customer_1` by merging `df_customer` and `df_geocode_1` on `postal_cd`. The merge is an inner join, so only rows with matching postal codes in both DataFrames are kept. The result contains all columns of `df_customer` plus `m_longitude` and `m_latitude`.
The last line displays the first 10 rows of `df_customer_1` with `.head(10)`, showing customer records together with the mean longitude and latitude of their postal code.
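An optional check (a sketch) of how often the averaging actually matters, i.e. how many postal codes carry more than one geocode row:
# Number of postal codes with more than one geocode record (sketch).
print((df_geocode.groupby('postal_cd').size() > 1).sum())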
P-086: For the customer data with latitude(latitude) and longitude(longitude) created in 085, combine it with shop data (df_store) using the member application shop code (application_store_cd) as key. Calculate the distance (in km) between the application shop and the customer address using the latitude(latitude) and longitude information (longitude) of the application shop and the latitude and longitude of the customer address (address), and display it together with the customer ID (customer_id), customer address and shop address (address). The following simple formula should be used as the calculation formula, but libraries using other highly accurate methods are also acceptable. Display 10 results.
Latitude (radians): φ, Longitude (radians): λ
Distance: L = 6371 × arccos( sin φ1 × sin φ2 + cos φ1 × cos φ2 × cos(λ1 − λ2) )
# Code example 1
def calc_distance(x1, y1, x2, y2):
    distance = 6371 * math.acos(math.sin(math.radians(x1))
                                * math.sin(math.radians(x2))
                                + math.cos(math.radians(x1))
                                * math.cos(math.radians(x2))
                                * math.cos(math.radians(y1) - math.radians(y2)))
    return distance
df_tmp = pd.merge(df_customer_1, df_store,
how='inner',
left_on='application_store_cd',
right_on='store_cd') \
.rename(columns={'address_x':'customer_address',
'address_y':'store_address'})
df_tmp['distance'] = df_tmp[['m_latitude',
'm_longitude',
'latitude',
'longitude']] \
.apply(lambda x: calc_distance(x[0], x[1], x[2], x[3]),
axis=1)
df_tmp[['customer_id', 'customer_address',
'store_address', 'distance']].head(10)
| | customer_id | customer_address | store_address | distance |
|---|---|---|---|---|
| 0 | CS021313000114 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
| 1 | CS021303000023 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
| 2 | CS021303000007 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
| 3 | CS021313000183 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
| 4 | CS021314000098 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
| 5 | CS021314000093 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
| 6 | CS021413000049 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
| 7 | CS021313000025 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
| 8 | CS021413000022 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
| 9 | CS021413000094 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
Commentary :
This code performs a geospatial calculation on the DataFrames `df_customer_1` and `df_store`. Here's a step-by-step explanation of what the code does:
The first block defines a Python function `calc_distance` that takes four arguments: `x1` and `y1` (latitude and longitude of point 1) and `x2` and `y2` (latitude and longitude of point 2). It uses the spherical law of cosines given in the exercise to compute the great-circle distance between the two points in kilometres, assuming a spherical Earth with a radius of 6371 km, and returns it as `distance`.
The second statement creates `df_tmp` by inner-joining `df_customer_1` and `df_store` on `application_store_cd` and `store_cd`, so only rows with matching store codes are kept. The `address_x` and `address_y` columns produced by the merge are renamed to `customer_address` and `store_address`.
The third statement creates a `distance` column in `df_tmp` by applying `calc_distance` row-wise (`axis=1`) to the `m_latitude`, `m_longitude`, `latitude`, and `longitude` columns.
The last line selects `customer_id`, `customer_address`, `store_address`, and `distance` and displays the first 10 rows with `.head(10)`.
# Code example 2
def calc_distance_numpy(x1, y1, x2, y2):
    x1_r = np.radians(x1)
    x2_r = np.radians(x2)
    y1_r = np.radians(y1)
    y2_r = np.radians(y2)
    return 6371 * np.arccos(np.sin(x1_r) * np.sin(x2_r)
                            + np.cos(x1_r) * np.cos(x2_r)
                            * np.cos(y1_r - y2_r))
df_tmp = df_customer_1.merge(df_store,
how='inner',
left_on='application_store_cd',
right_on='store_cd') \
.rename(columns={'address_x':'customer_address',
'address_y':'store_address'})
df_tmp['distance'] = calc_distance_numpy(df_tmp['m_latitude'],
df_tmp['m_longitude'],
df_tmp['latitude'],
df_tmp['longitude'])
df_tmp[['customer_id', 'customer_address',
'store_address', 'distance']].head(10)
| | customer_id | customer_address | store_address | distance |
|---|---|---|---|---|
| 0 | CS021313000114 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
| 1 | CS021303000023 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
| 2 | CS021303000007 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
| 3 | CS021313000183 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
| 4 | CS021314000098 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
| 5 | CS021314000093 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
| 6 | CS021413000049 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
| 7 | CS021313000025 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
| 8 | CS021413000022 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
| 9 | CS021413000094 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
Commentary :
This code is the same calculation implemented with NumPy. Here's a step-by-step explanation of what the code does:
The first block defines `calc_distance_numpy`, which takes the same four arguments and evaluates the same spherical-law-of-cosines formula, but is built from NumPy functions so that it operates element-wise on whole Series at once.
The second statement creates `df_tmp` by inner-joining `df_customer_1` and `df_store` on `application_store_cd` and `store_cd`, renaming `address_x` and `address_y` to `customer_address` and `store_address`.
The third statement creates the `distance` column by calling `calc_distance_numpy` directly on four columns of `df_tmp`. Instead of the row-wise `.apply()`, this relies on NumPy's vectorised, element-wise operations, which is typically much faster.
The last line selects `customer_id`, `customer_address`, `store_address`, and `distance` and displays the first 10 rows with `.head(10)`.
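The exercise also allows more accurate methods from external libraries. A sketch using geopy's geodesic distance (this assumes geopy is installed, e.g. with `!pip install geopy`, and is not part of the original answer):
# Code example 3 (sketch): geodesic distance on the WGS-84 ellipsoid via geopy.
from geopy.distance import geodesic

df_tmp['distance_geodesic'] = df_tmp.apply(
    lambda r: geodesic((r['m_latitude'], r['m_longitude']),
                       (r['latitude'], r['longitude'])).km,
    axis=1)
df_tmp[['customer_id', 'distance', 'distance_geodesic']].head(10)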
P-087: In the customer data (df_customer), the same customer may be registered more than once, e.g. because of applications at different shops. Create de-duplicated (name-merged) customer data in which customers with the same name (customer_name) and postcode (postal_cd) are treated as the same customer, so that there is one record per customer, and report the number of records in the customer data, the number of records in the de-duplicated data, and the number of duplicates. For duplicated customers, retain the record with the highest total sales amount; for customers with equal total sales or no sales record, retain the one with the lowest customer ID (customer_id).
df_receipt_tmp = df_receipt.groupby('customer_id') \
.agg(sum_amount=('amount','sum')).reset_index()
df_customer_u = pd.merge(df_customer, df_receipt_tmp,
how='left',
on='customer_id')
df_customer_u['sum_amount'] = df_customer_u['sum_amount'].fillna(0)
df_customer_u = df_customer_u.sort_values(['sum_amount', 'customer_id'],
ascending=[False, True])
df_customer_u.drop_duplicates(subset=['customer_name', 'postal_cd'],
keep='first', inplace=True)
print('df_customer_cnt:', len(df_customer),
'df_customer_u_cnt:', len(df_customer_u),
'diff:', len(df_customer) - len(df_customer_u))
df_customer_cnt: 21971
df_customer_u_cnt: 21941
diff: 30
Commentary :
This code performs the following tasks:
Group the `df_receipt` dataframe by `customer_id` and calculate the sum of `amount` for each customer. The result is stored in `df_receipt_tmp`.
Merge `df_customer` with `df_receipt_tmp` on the `customer_id` column. This is a left join, so all customers in `df_customer` are included in the resulting dataframe `df_customer_u`.
Replace any missing values in the `sum_amount` column with 0.
Sort the rows in `df_customer_u` by `sum_amount` in descending order, and then by `customer_id` in ascending order.
Remove duplicates in `df_customer_u` based on the `customer_name` and `postal_cd` columns, keeping only the first occurrence of each group, i.e. the record with the highest sales, or the lowest customer ID when sales are tied.
Print the number of rows in the original `df_customer` dataframe (`df_customer_cnt`), the number of rows in the updated `df_customer_u` dataframe (`df_customer_u_cnt`), and the difference between the two (`diff`).
Overall, this code builds a de-duplicated customer table `df_customer_u` that keeps one record per (name, postcode) pair, retaining the record with the highest total purchase amount (`sum_amount`).
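A quick consistency check (a sketch): the number of rows dropped should equal the number of duplicated (customer_name, postal_cd) pairs in the original customer data.
# Count duplicated (name, postcode) pairs in the original data (sketch).
print(df_customer.duplicated(subset=['customer_name', 'postal_cd']).sum())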
P-088: Based on the data created in 087, add an integration ID to the customer data. The integration ID shall be assigned according to the following specification:
- Non-duplicated customers: set their own customer ID (customer_id)
- Duplicated customers: set the customer ID of the record retained in the previous question
Also check the difference between the number of unique customer IDs and the number of unique integration IDs.
df_customer_n = pd.merge(df_customer,
df_customer_u[['customer_name',
'postal_cd', 'customer_id']],
how='inner', on =['customer_name', 'postal_cd'])
df_customer_n.rename(columns={'customer_id_x':'customer_id',
'customer_id_y':'integration_id'}, inplace=True)
print('Difference in number of IDs', len(df_customer_n['customer_id'].unique())
- len(df_customer_n['integration_id'].unique()))
Difference in number of IDs 30
Commentary :
This code merges `df_customer` and `df_customer_u` to assign each customer an integration ID based on name and postal code.
The merged result is stored in a new dataframe called `df_customer_n`. The `merge` function joins the two dataframes, and `on=['customer_name', 'postal_cd']` specifies the join keys.
Because both inputs contain a `customer_id` column, the merge produces `customer_id_x` and `customer_id_y`; `rename` turns these into `customer_id` (the original ID) and `integration_id` (the ID of the record retained in 087).
The last line compares the number of unique values in `customer_id` with the number of unique values in `integration_id`. A difference of 0 would mean there are no duplicate customers; here the difference is 30, matching the 30 duplicate records consolidated in the previous question.
Note: df_customer_1 and df_customer_n are no longer used, so delete them.
del df_customer_1
del df_customer_n
P-089: You want to split customers with sales records into training data and test data for building a forecasting model. Split the data randomly in the ratio of 8:2 respectively.
df_sales_customer = df_receipt.groupby('customer_id') \
.agg({'amount': 'sum'}).reset_index()
df_sales_customer = df_sales_customer.query('amount > 0')
df_tmp = pd.merge(df_customer, df_sales_customer['customer_id'],
how='inner', on='customer_id')
df_train, df_test = train_test_split(df_tmp, test_size=0.2, random_state=71)
print('Percentage of training data: ', len(df_train) / len(df_tmp))
print('Test Data Percentage: ', len(df_test) / len(df_tmp))
Percentage of training data: 0.7999036840837949
Test Data Percentage: 0.20009631591620516
Commentary :
This code performs a train-test split of the data based on the sales information of each customer.
First, it aggregates each customer's sales from `df_receipt` by grouping on `customer_id` and summing the `amount` column; the result is stored in `df_sales_customer`.
Next, it uses `query` to keep only customers whose total `amount` is greater than 0, i.e. customers with an actual sales record.
Then it performs an inner join of `df_customer` with the `customer_id` column of `df_sales_customer`, keeping only customers who made purchases. The result is stored in `df_tmp`.
Finally, it splits `df_tmp` into training and test sets with scikit-learn's `train_test_split`, using a test size of 0.2 and a random state of 71, and prints the share of rows in each set.
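An optional check (a sketch) that the two splits do not share any customers:
# The train and test sets should be disjoint by customer (sketch).
assert set(df_train['customer_id']).isdisjoint(set(df_test['customer_id']))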
P-090: The receipt details data (df_receipt) has data from 1 Jan 2017 to 31 Oct 2019. Create three sets of data for building time-series models, 12 months for training and 6 months for testing, by aggregating the sales amount (amount) on a monthly basis.
# Code example 1 (home-made function)
df_ts_amount = df_receipt[['sales_ymd', 'amount']].copy()
df_ts_amount['sales_ym'] = df_ts_amount['sales_ymd'].astype('str').str[0:6]
df_ts_amount = df_ts_amount.groupby('sales_ym') \
.agg({'amount':'sum'}).reset_index()
# Wrap the split in a function so that, for long-period data, many train/test sets can be generated, e.g. in a loop.
def split_data(df, train_size, test_size, slide_window, start_point):
    train_start = start_point * slide_window
    test_start = train_start + train_size
    return df[train_start:test_start], df[test_start:test_start + test_size]
df_train_1, df_test_1 = split_data(df_ts_amount, train_size=12,
test_size=6, slide_window=6, start_point=0)
df_train_2, df_test_2 = split_data(df_ts_amount, train_size=12,
test_size=6, slide_window=6, start_point=1)
df_train_3, df_test_3 = split_data(df_ts_amount, train_size=12,
test_size=6, slide_window=6, start_point=2)
# df_train_2 and df_train_3 display omitted
df_train_1
| | sales_ym | amount |
|---|---|---|
| 0 | 201701 | 902056 |
| 1 | 201702 | 764413 |
| 2 | 201703 | 962945 |
| 3 | 201704 | 847566 |
| 4 | 201705 | 884010 |
| 5 | 201706 | 894242 |
| 6 | 201707 | 959205 |
| 7 | 201708 | 954836 |
| 8 | 201709 | 902037 |
| 9 | 201710 | 905739 |
| 10 | 201711 | 932157 |
| 11 | 201712 | 939654 |
# df_test_2 and df_test_3 display omitted
df_test_1
| | sales_ym | amount |
|---|---|---|
| 12 | 201801 | 944509 |
| 13 | 201802 | 864128 |
| 14 | 201803 | 946588 |
| 15 | 201804 | 937099 |
| 16 | 201805 | 1004438 |
| 17 | 201806 | 1012329 |
Commentary :
The code performs the following tasks:
- `df_ts_amount` is created from the `sales_ymd` and `amount` columns, and a `sales_ym` column is derived from the first six characters of `sales_ymd` converted to a string.
- `df_ts_amount` is then grouped by `sales_ym`, the `amount` column is summed per month, and the index is reset; the result is stored back in `df_ts_amount`.
- A function `split_data` is defined that takes a dataframe `df`, a train size, a test size, a slide window, and a start point, and returns a train dataframe and a test dataframe.
- `split_data` is called three times with different `start_point` values to split `df_ts_amount` into three train/test pairs: `df_train_1`/`df_test_1`, `df_train_2`/`df_test_2`, and `df_train_3`/`df_test_3`.
The purpose of splitting the data into several sets is to train and evaluate a time-series forecasting model on different periods. Training on different windows and evaluating on different test periods helps avoid overfitting and gives a more reliable estimate of the model's performance.
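The three calls above can also be generated in a loop, as the comment in the code suggests (a sketch):
# Generate the three train/test pairs in a loop instead of three explicit calls (sketch).
splits = [split_data(df_ts_amount, train_size=12, test_size=6,
                     slide_window=6, start_point=i) for i in range(3)]
df_train_1, df_test_1 = splits[0]  # splits[1] and splits[2] give the other two pairs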
# Code example 2 (TimeSeriesSplit in scikit-learn)
tscv = TimeSeriesSplit(gap=0, max_train_size=12, n_splits=3, test_size=6)
# TimeSeriesSplit assigns folds so that the most recent data falls in the last test set,
# so trim the data period here to match the SQL and R answer examples.
# This step is unnecessary if you simply want to use the most recent data.
df_ts_amount = df_ts_amount.query('sales_ym <= "201906"')
series_list = []
for train_index, test_index in tscv.split(df_ts_amount):
    series_list.append((df_ts_amount.loc[train_index],
                        df_ts_amount.loc[test_index]))
df_train_1, df_test_1 = series_list[0]
df_train_2, df_test_2 = series_list[1]
df_train_3, df_test_3 = series_list[2]
# df_train_2 and df_train_3 display omitted
df_train_1
| | sales_ym | amount |
|---|---|---|
| 0 | 201701 | 902056 |
| 1 | 201702 | 764413 |
| 2 | 201703 | 962945 |
| 3 | 201704 | 847566 |
| 4 | 201705 | 884010 |
| 5 | 201706 | 894242 |
| 6 | 201707 | 959205 |
| 7 | 201708 | 954836 |
| 8 | 201709 | 902037 |
| 9 | 201710 | 905739 |
| 10 | 201711 | 932157 |
| 11 | 201712 | 939654 |
# df_test_2 and df_test_3 display omitted
df_test_1
| | sales_ym | amount |
|---|---|---|
| 12 | 201801 | 944509 |
| 13 | 201802 | 864128 |
| 14 | 201803 | 946588 |
| 15 | 201804 | 937099 |
| 16 | 201805 | 1004438 |
| 17 | 201806 | 1012329 |
Commentary :
This code performs time-series cross-validation on the pandas DataFrame `df_ts_amount`. It uses `TimeSeriesSplit` from the `sklearn.model_selection` module, a cross-validator that provides train/test indices for time-series data, to generate three train-test splits. It takes four parameters here:
- `n_splits`: the number of splits
- `max_train_size`: the maximum size of the training data
- `test_size`: the size of the test data
- `gap`: the number of samples to skip between the training and test sets
The code sets `n_splits=3`, `max_train_size=12`, `test_size=6`, and `gap=0` to create three splits, each with a 12-month training period and a 6-month test period.
The code then filters the data to sales up to June 2019 (`df_ts_amount.query('sales_ym <= "201906"')`) and creates a list `series_list` to hold the train/test pairs. The for loop iterates over the splits generated by `TimeSeriesSplit` and appends each pair to `series_list`. Finally, the first pair is assigned to `df_train_1` and `df_test_1`, the second to `df_train_2` and `df_test_2`, and the third to `df_train_3` and `df_test_3`.
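An optional check (a sketch) that prints the month range covered by each fold:
# Print the training and test month ranges of each fold (sketch).
for i, (df_tr, df_te) in enumerate(series_list, start=1):
    print(i, df_tr['sales_ym'].min(), '-', df_tr['sales_ym'].max(),
          '|', df_te['sales_ym'].min(), '-', df_te['sales_ym'].max())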
P-091: For each customer in the customer data (df_customer), under-sample the number of customers with sales performance and the number of customers without sales performance so that the ratio is 1:1.
df_tmp = df_receipt.groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_tmp = pd.merge(df_customer, df_tmp, how='left', on='customer_id')
df_tmp['is_buy_flag'] = np.where(df_tmp['amount'].isnull(), 0, 1)
rs = RandomUnderSampler(random_state=71)
df_down_sampling, _ = rs.fit_resample(df_tmp, df_tmp.is_buy_flag)
print('Number of cases 0', len(df_down_sampling.query('is_buy_flag == 0')))
print('Number of cases 1', len(df_down_sampling.query('is_buy_flag == 1')))
Number of cases 0 8306
Number of cases 1 8306
Commentary :
This code performs random under-sampling to balance the binary target variable `is_buy_flag`.
First, the total purchase amount per customer is computed with `groupby` and stored in `df_tmp`. Then `df_customer` and `df_tmp` are merged with a left join so that all customers are included, even those without purchases. The `is_buy_flag` column is created with `np.where`, assigning 1 to customers who made purchases and 0 to those who did not.
Next, `RandomUnderSampler` from the `imblearn` library balances the target variable: `fit_resample` is applied to `df_tmp` with the `is_buy_flag` column as the target, producing a new DataFrame `df_down_sampling` with an equal number of rows for both classes.
Finally, the number of rows with `is_buy_flag` equal to 0 and to 1 are printed to confirm that the under-sampling was successful.
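The same 1:1 under-sampling can be done with pandas alone (a sketch; the sampled rows will differ from imblearn's result even with the same random_state):
# Pandas-only alternative: sample every class down to the size of the smallest class (sketch).
n_min = df_tmp['is_buy_flag'].value_counts().min()
df_down_sampling_alt = (df_tmp.groupby('is_buy_flag', group_keys=False)
                              .apply(lambda g: g.sample(n=n_min, random_state=71)))
print(df_down_sampling_alt['is_buy_flag'].value_counts())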
P-092: Normalise the customer data (df_customer) to third normal form with respect to gender.
df_gender_std = df_customer[['gender_cd', 'gender']].drop_duplicates()
df_customer_std = df_customer.drop(columns='gender')
# Check data content
df_customer_std.head(3)
| | customer_id | customer_name | gender_cd | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CS021313000114 | 大野 あや子 | 1 | 1981-04-29 | 37 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20150905 | 0-00000000-0 |
| 1 | CS037613000071 | 六角 雅彦 | 9 | 1952-04-01 | 66 | 136-0076 | 東京都江東区南砂********** | S13037 | 20150414 | 0-00000000-0 |
| 2 | CS031415000172 | 宇多田 貴美子 | 1 | 1976-10-04 | 42 | 151-0053 | 東京都渋谷区代々木********** | S13031 | 20150529 | D-20100325-C |
# Check data content
df_gender_std.head(3)
| | gender_cd | gender |
|---|---|---|
| 0 | 1 | 女性 |
| 1 | 9 | 不明 |
| 5 | 0 | 男性 |
Commentary :
In this code, `df_gender_std` is created by selecting the `gender_cd` and `gender` columns from `df_customer` and removing duplicate rows with `drop_duplicates()`, giving a table of unique (`gender_cd`, `gender`) pairs. `df_customer_std` is created by dropping the `gender` column from `df_customer` with `drop(columns='gender')`, keeping every other column.
The result is two separate tables: one mapping gender codes to gender names, and one holding the customer attributes without the redundant gender name. Together they form the third-normal-form version of the customer data.
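As a check (a sketch), joining the two normalised tables back together on `gender_cd` reproduces the original gender names:
# Re-join the normalised tables to recover the gender name (sketch).
df_restored = df_customer_std.merge(df_gender_std, how='inner', on='gender_cd')
df_restored[['customer_id', 'gender_cd', 'gender']].head(3)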
P-093: Product data (df_product) only holds the code values for each category, but not the category names. Combine with category data (df_category) and denormalise to create new product data that holds category names.
df_product_full = pd.merge(df_product, df_category[['category_small_cd',
'category_major_name',
'category_medium_name',
'category_small_name']],
how = 'inner', on = 'category_small_cd')
# Check data content
df_product_full.head(3)
| | product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
| 1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
| 2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
This code performs an inner join between the dataframes `df_product` and `df_category` on the column `category_small_cd`. The resulting dataframe `df_product_full` contains all columns of `df_product`, joined on matching `category_small_cd` values.
It also gains the `category_major_name`, `category_medium_name`, and `category_small_name` columns from `df_category`, which hold the category names. With `how='inner'`, only rows with matching keys in both dataframes are kept. The `.head(3)` method displays the first 3 rows of the merged result.
P-094: Output the product data with category names created in 093 to a file with the following specifications.
|File format|With/without header|Character encoding|
|:--:|:--:|:--:|
|CSV (comma separated)|With|UTF-8|
The path of the file output destination shall be as follows
|Output|
|:--:|
|./data|
# Code example 1.
# Note: this answer notebook is in a different directory from the question notebook, so the output path is '../data/'.
df_product_full.to_csv('../data/P_df_product_full_UTF-8_header.csv',
encoding='UTF-8', index=False)
Commentary :
This code exports the pandas DataFrame `df_product_full` to a CSV file at `../data/P_df_product_full_UTF-8_header.csv`. The file uses UTF-8 encoding to support a wide range of characters, and `index=False` keeps the DataFrame's row index out of the output.
# Code example 2 (with BOM to prevent garbled Excel)
df_product_full.to_csv('../data/P_df_product_full_UTF-8BOM_header.csv',
encoding='utf_8_sig', index=False)
Commentary :
This line exports the `df_product_full` DataFrame to a CSV file named `P_df_product_full_UTF-8BOM_header.csv`.
The `to_csv()` method of a pandas DataFrame writes its contents to a CSV file; its parameters include the filename and the encoding.
Here the path is `'../data/P_df_product_full_UTF-8BOM_header.csv'`; the `'../data/'` prefix is the relative directory in which the file is saved.
The `encoding` parameter is `'utf_8_sig'`, a variant of UTF-8 that writes a byte order mark (BOM) at the beginning of the file. The BOM helps programs such as Excel detect the encoding when opening the file, which prevents garbled characters.
`index=False` excludes the DataFrame index from the CSV file.
P-095: Output the product data with category names created in 093 to a file with the following specifications.
|File format|With/without header|Character encoding|
|:--:|:--:|:--:|
|CSV (comma separated)|With|CP932|
The path of the file output destination shall be as follows
|Output|
|:--:|
|./data|
# Note: this answer notebook is in a different directory from the question notebook, so the output path is '../data/'.
df_product_full.to_csv('../data/P_df_product_full_CP932_header.csv',
encoding='CP932', index=False)
Commentary :
This line exports the pandas DataFrame `df_product_full` to a CSV file named `P_df_product_full_CP932_header.csv` in the `../data/` directory.
The exported file uses the CP932 character encoding (also known as Windows-31J), a Japanese encoding commonly used on Windows systems.
The parameter `encoding='CP932'` sets that encoding for the output file.
The parameter `index=False` excludes the index column from the output file.
P-096: Output the product data with category names created in 093 to a file with the following specifications.
|File format|With/without header|Character encoding|
|:--:|:--:|:--:|
|CSV (comma separated)|Without|UTF-8|
The path of the file output destination shall be as follows
|Output|
|:--:|
|./data|
# Note: this answer notebook is in a different directory from the question notebook, so the output path is '../data/'.
df_product_full.to_csv('../data/P_df_product_full_UTF-8_noh.csv',
header=False, encoding='UTF-8', index=False)
Commentary :
This code saves the `df_product_full` DataFrame to a CSV file called "P_df_product_full_UTF-8_noh.csv" in the `../data/` directory.
The first argument is the output path. `header=False` omits the header row, `encoding='UTF-8'` sets the character encoding, and `index=False` omits the DataFrame index.
Overall, this code exports `df_product_full` as a header-less, UTF-8-encoded CSV file.
P-097: Load the file in the following format created in 094, display three data items and check that they have been imported correctly.
|File format|With/without header|Character encoding|
|:--:|:--:|:--:|
|CSV (comma separated)|With|UTF-8|
# Note: this answer notebook is in a different directory from the question notebook, so the file is read from '../data/'.
df_product_full = pd.read_csv('../data/P_df_product_full_UTF-8_header.csv',
dtype={'category_major_cd':str,
'category_medium_cd':str,
'category_small_cd':str},
encoding='UTF-8')
df_product_full.head(3)
| | product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
| 1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
| 2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code reads the CSV file `P_df_product_full_UTF-8_header.csv` (written in 094) with UTF-8 encoding. The `dtype` argument forces the `category_major_cd`, `category_medium_cd`, and `category_small_cd` columns to be read as strings so that leading zeros are preserved. Because the file has a header row, the column names are taken from it automatically. `df_product_full.head(3)` displays the first 3 rows to confirm that the data was imported correctly.
P-098: Load the file in the following format created in 096, display three data items and check that they are imported correctly.
|File format|With/without header|Character encoding|
|:--:|:--:|:--:|
|CSV (comma separated)|Without|UTF-8|
# Code example 1 (item names to be added later).
# Note: this answer notebook is in a different directory from the question notebook, so the file is read from '../data/'.
df_product_full = pd.read_csv('../data/P_df_product_full_UTF-8_noh.csv',
dtype={1:str,
2:str,
3:str},
encoding='UTF-8', header=None)
df_product_full.columns = ['product_cd','category_major_cd',
'category_medium_cd', 'category_small_cd',
'unit_price','unit_cost','category_major_name',
'category_medium_name', 'category_small_name']
df_product_full.head(3)
| | product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
| 1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
| 2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
This code reads the headerless CSV file 'P_df_product_full_UTF-8_noh.csv' located in the '../data' directory. Because the 'header' parameter is set to 'None', pandas does not treat the first line as column names and labels the columns 0 through 8. The 'dtype' parameter, `{1: str, 2: str, 3: str}`, sets the second, third, and fourth columns (the category code columns) to the string data type, and the 'encoding' parameter specifies that the file is encoded as UTF-8.
The list assigned to `df_product_full.columns` then replaces the numeric labels with descriptive column names, and the first three rows are displayed using the 'head' method.
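Because the `dtype` mapping forces the category code columns to strings, codes such as '04' keep their leading zeros. A minimal sketch to confirm this (not part of the original answer):
# Optional check: leading zeros survive the import when the columns are read
# as strings; if they had been read as integers, '04' would have become 4.
print(df_product_full['category_major_cd'].head(3).tolist())
print(df_product_full['category_major_cd'].map(len).unique())  # all major codes should be 2 characters long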
# Code example 2 (define column names first)
c_names = ['product_cd','category_major_cd','category_medium_cd',
'category_small_cd','unit_price','unit_cost',
'category_major_name','category_medium_name','category_small_name']
df_product_full = pd.read_csv('../data/P_df_product_full_UTF-8_noh.csv',
names=c_names,
dtype={'category_major_cd':str,
'category_medium_cd':str,
'category_small_cd':str},
encoding='UTF-8', header=None)
df_product_full.head(3)
| | product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
| 1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
| 2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
This code reads a CSV file into a pandas DataFrame with customized column names using the `read_csv` function.
Here is an explanation of the code:
- `c_names`: A list of strings that defines the customized column names.
- `pd.read_csv()`: A pandas function that reads a CSV file and returns a DataFrame.
- `'../data/P_df_product_full_UTF-8_noh.csv'`: The file path of the CSV file to be read.
- `names=c_names`: Specifies the customized column names using the `c_names` list.
- `dtype={'category_major_cd': str, 'category_medium_cd': str, 'category_small_cd': str}`: Specifies that the columns `category_major_cd`, `category_medium_cd`, and `category_small_cd` should be of type `str`.
- `encoding='UTF-8'`: Specifies that the file encoding is 'UTF-8'.
- `header=None`: Specifies that there is no header row in the CSV file.
Finally, the resulting DataFrame is displayed using the `head()` method to show the first three rows.
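Both code examples should produce the same DataFrame. The sketch below (using the throwaway names `df_a` and `df_b`, which are not part of the original answers) re-reads the file both ways and compares the results.
# Optional check: example 1 (rename after reading) and example 2 (names=)
# should yield identical DataFrames.
df_a = pd.read_csv('../data/P_df_product_full_UTF-8_noh.csv',
                   dtype={1: str, 2: str, 3: str},
                   encoding='UTF-8', header=None)
df_a.columns = c_names
df_b = pd.read_csv('../data/P_df_product_full_UTF-8_noh.csv',
                   names=c_names,
                   dtype={'category_major_cd': str,
                          'category_medium_cd': str,
                          'category_small_cd': str},
                   encoding='UTF-8', header=None)
print(df_a.equals(df_b))  # expected: True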
P-099: Output the product data with category names created in 093 to a file with the following specifications.
| File format | With/without header | Character encoding |
|:-:|:-:|:-:|
| TSV (Tab Separated) | With | UTF-8 |

The path of the file output destination shall be as follows:

| Output |
|:-:|
| ./data |
# Note: the answer file is in a different directory from the question file, so the path used here is '../data'.
df_product_full.to_csv('../data/P_df_product_full_UTF-8_header.tsv',
                       sep='\t', encoding='UTF-8', index=False)
Commentary :
The code writes the contents of a pandas DataFrame called `df_product_full` to a TSV (Tab-Separated Values) file located at '../data/P_df_product_full_UTF-8_header.tsv'.
The `.to_csv()` method is called on the `df_product_full` DataFrame; it is the method provided by pandas for exporting data to a CSV (Comma-Separated Values) file or, with a different separator, a TSV file.
The first argument, '../data/P_df_product_full_UTF-8_header.tsv', is the file path and name where the DataFrame should be saved. The file will be saved in the '../data/' directory with the name 'P_df_product_full_UTF-8_header.tsv'.
The second argument, `sep='\t'`, specifies that the values in the file should be separated by tabs instead of commas.
The third argument, `encoding='UTF-8'`, specifies that the encoding of the file should be UTF-8, a standard encoding for Unicode characters.
The fourth argument, `index=False`, specifies that the DataFrame's index should not be included in the output file.
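To verify the output (an optional sketch, assuming the file was written to '../data/P_df_product_full_UTF-8_header.tsv'), you can print the first two raw lines of the file and confirm that the header row is present and that the fields are separated by tabs.
# Optional check: the first line should be the tab-separated header row,
# followed by the first data row.
with open('../data/P_df_product_full_UTF-8_header.tsv', encoding='UTF-8') as f:
    for _ in range(2):
        print(f.readline().rstrip('\n'))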
P-100: Load the file in the following format created in 099, display the first three records, and check that it was imported correctly.

| File format | With/without header | Character encoding |
|:-:|:-:|:-:|
| TSV (Tab Separated) | With | UTF-8 |
# Code example 1 (read_table).
# Note: the answer file is in a different directory from the question file, so the path used here is '../data'.
df_product_full = pd.read_table('../data/P_df_product_full_UTF-8_header.tsv',
dtype={'category_major_cd':str,
'category_medium_cd':str,
'category_small_cd':str},
encoding='UTF-8')
df_product_full.head(3)
| | product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
| 1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
| 2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code reads data from a TSV (Tab-Separated Values) file located at '../data/P_df_product_full_UTF-8_header.tsv' into a pandas DataFrame called `df_product_full`.
The `pd.read_table()` function is called to read the data from the file. The first argument, '../data/P_df_product_full_UTF-8_header.tsv', specifies the path and name of the file to read.
The second argument, `dtype={'category_major_cd': str, 'category_medium_cd': str, 'category_small_cd': str}`, specifies the data types for certain columns: `category_major_cd`, `category_medium_cd`, and `category_small_cd` are read as strings (`str`).
The third argument, `encoding='UTF-8'`, specifies that the encoding of the file is UTF-8.
The resulting DataFrame, `df_product_full`, is then displayed with the `head(3)` method, which shows the first three rows of the DataFrame.
# Code example 2 (read_csv)
df_product_full = pd.read_csv('../data/P_df_product_full_UTF-8_header.tsv',
dtype={'category_major_cd':str,
'category_medium_cd':str,
'category_small_cd':str},
sep='\t', encoding='UTF-8')
df_product_full.head(3)
| | product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
| 1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
| 2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code reads data from a TSV (Tab-Separated Values) file located at '../data/P_df_product_full_UTF-8_header.tsv' into a pandas DataFrame called `df_product_full`.
The `pd.read_csv()` function is called to read the data from the file. The first argument, '../data/P_df_product_full_UTF-8_header.tsv', specifies the path and name of the file to read.
The second argument, `dtype={'category_major_cd': str, 'category_medium_cd': str, 'category_small_cd': str}`, specifies the data types for certain columns: `category_major_cd`, `category_medium_cd`, and `category_small_cd` are read as strings (`str`).
The third argument, `sep='\t'`, specifies that the values in the file are separated by tabs instead of commas.
The fourth argument, `encoding='UTF-8'`, specifies that the encoding of the file is UTF-8.
The resulting DataFrame, `df_product_full`, is then displayed with the `head(3)` method, which shows the first three rows of the DataFrame.
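`pd.read_table()` uses a tab separator by default, so the two code examples should load identical DataFrames. A small sketch to confirm this (the names `df_t` and `df_c` are introduced here for illustration only):
# Optional check: read_table (default tab separator) and read_csv with
# sep='\t' should produce the same DataFrame.
df_t = pd.read_table('../data/P_df_product_full_UTF-8_header.tsv',
                     dtype={'category_major_cd': str,
                            'category_medium_cd': str,
                            'category_small_cd': str},
                     encoding='UTF-8')
df_c = pd.read_csv('../data/P_df_product_full_UTF-8_header.tsv',
                   dtype={'category_major_cd': str,
                          'category_medium_cd': str,
                          'category_small_cd': str},
                   sep='\t', encoding='UTF-8')
print(df_t.equals(df_c))  # expected: True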
This is the end of the 100 knocks. Well done!









Comment