Reference: "Data Science 100 Knocks (Structured Data Processing)" by the Data Scientist Society Skill Definition Committee (データサイエンティスト協会スキル定義委員会)
The Data Scientist Society Github :
Data Science 100 Knocks (Structured Data Processing) URL :
-
Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
-
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers.
-
Since this data was created for a Japanese audience, you may run into some language issues while practicing, but they should not affect the exercises much.
Getting Started
- First, run the following cells
- Import the necessary libraries and read data from the database (PostgreSQL)
- The following cell imports libraries that are expected to be used, such as pandas.
- If there are other libraries you wish to use, install them as needed (you can also install them by “!pip install library name”).
- The processing may be split over several runs.
- Names, addresses, etc. are dummy data and are not real.
import os
import pandas as pd
import numpy as np
from datetime import datetime, date
from dateutil.relativedelta import relativedelta
import math
import psycopg2
from sqlalchemy import create_engine
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from imblearn.under_sampling import RandomUnderSampler
if 'PG_PORT' in os.environ:
    pgconfig = {
        'host': 'db',
        'port': os.environ['PG_PORT'],
        'database': os.environ['PG_DATABASE'],
        'user': os.environ['PG_USER'],
        'password': os.environ['PG_PASSWORD'],
    }
    # Connector for pd.read_sql
    conn = psycopg2.connect(**pgconfig)
    df_customer = pd.read_sql(sql='select * from customer', con=conn)
    df_category = pd.read_sql(sql='select * from category', con=conn)
    df_product = pd.read_sql(sql='select * from product', con=conn)
    df_receipt = pd.read_sql(sql='select * from receipt', con=conn)
    df_store = pd.read_sql(sql='select * from store', con=conn)
    df_geocode = pd.read_sql(sql='select * from geocode', con=conn)
else:
    if not os.path.exists('data/'):
        !git clone https://github.com/The-Japan-DataScientist-Society/100knocks-preprocess
        os.chdir('100knocks-preprocess/docker/work/')
    dtype = {
        'customer_id': str,
        'gender_cd': str,
        'postal_cd': str,
        'application_store_cd': str,
        'status_cd': str,
        'category_major_cd': str,
        'category_medium_cd': str,
        'category_small_cd': str,
        'product_cd': str,
        'store_cd': str,
        'prefecture_cd': str,
        'tel_no': str,
        'street': str
    }
    df_customer = pd.read_csv("data/customer.csv", dtype=dtype)
    df_category = pd.read_csv("data/category.csv", dtype=dtype)
    df_product = pd.read_csv("data/product.csv", dtype=dtype)
    df_receipt = pd.read_csv("data/receipt.csv", dtype=dtype)
    df_store = pd.read_csv("data/store.csv", dtype=dtype)
    df_geocode = pd.read_csv("data/geocode.csv", dtype=dtype)
fatal: destination path '100knocks-preprocess' already exists and is not an empty directory.
Exercise
P-081: Create new product data in which the missing values of unit_price and unit_cost are filled with the average value of each column. Note that the average should be rounded to the nearest yen (either rounding half up or rounding half to even is acceptable). After completion, also check that there are no missing values for each item.
# Code example 1 (fillna in Pandas)
df_product_2 = df_product.fillna({
'unit_price':np.round(np.nanmean(df_product['unit_price'])),
'unit_cost':np.round(np.nanmean(df_product['unit_cost']))})
df_product_2.isnull().sum()
product_cd 0 category_major_cd 0 category_medium_cd 0 category_small_cd 0 unit_price 0 unit_cost 0 dtype: int64
Commentary :
This code performs two operations on a pandas DataFrame named `df_product`:
It fills missing values in the `unit_price` and `unit_cost` columns with the rounded mean of each column using the `fillna()` method. The rounded mean is calculated using the `np.round()` and `np.nanmean()` functions from the NumPy library. The `fillna()` method takes a dictionary as its argument, where the keys are the column names and the values are the values to fill missing values with.
It then checks for any remaining missing values in the DataFrame using the `isnull()` method, and sums up the number of missing values in each column using the `sum()` method. The resulting output is the number of missing values in each column of the updated DataFrame `df_product_2`.
The updated DataFrame `df_product_2` has the missing values of `unit_price` and `unit_cost` replaced with their respective rounded means, and the output shows how many missing values remain after the fillna operation.
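The exercise accepts either rounding rule, so it is worth knowing which one `np.round()` applies. A minimal sketch (not part of the original answer, using the hypothetical value 412.5) of the difference between NumPy's round-half-to-even and an explicit round-half-up:
# np.round() rounds .5 to the nearest even number (banker's rounding);
# decimal.Decimal can round .5 up instead.
from decimal import Decimal, ROUND_HALF_UP
x = 412.5  # hypothetical mean ending in .5
print(np.round(x))  # 412.0 (half to even)
print(Decimal(str(x)).quantize(Decimal('1'), rounding=ROUND_HALF_UP))  # 413 (half up)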
# Code example 2 (SimpleImputer in scikit-learn)
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_values = imp_mean.fit_transform(df_product[['unit_price', 'unit_cost']])
df_product_2 = df_product.copy()
df_product_2[['unit_price', 'unit_cost']] = imp_values.round()
df_product_2.isnull().sum()
product_cd 0 category_major_cd 0 category_medium_cd 0 category_small_cd 0 unit_price 0 unit_cost 0 dtype: int64
Commentary :
This code performs the following operations:
It uses the `SimpleImputer` class from the `sklearn.impute` module (imported in the setup cell). The `SimpleImputer` class is used for imputing missing values in a dataset.
It creates an instance of `SimpleImputer` called `imp_mean`. This imputer replaces missing values (represented as `np.nan` here) with the mean value of the corresponding column, because the imputation strategy is set to `'mean'`.
It applies the `SimpleImputer` object to the `unit_price` and `unit_cost` columns of the `df_product` DataFrame using the `fit_transform()` method, which returns a NumPy array with the missing values imputed.
It creates a copy of the original `df_product` DataFrame called `df_product_2` using the `copy()` method.
It replaces the `unit_price` and `unit_cost` columns of `df_product_2` with the imputed values by assigning `imp_values.round()` to those columns. The `round()` method rounds the imputed values to the nearest integer.
It checks for any remaining missing values in the DataFrame using the `isnull()` method, and sums up the number of missing values in each column using the `sum()` method. The resulting output is the number of missing values in each column of the updated DataFrame `df_product_2`.
The final output shows the number of missing values in the updated `df_product_2` DataFrame after imputing missing values with the mean of each column. This approach can be useful for handling missing data, especially when only a small number of values are missing.
P-082: Create new product data in which the missing values of unit_price and unit_cost are filled with the median value of each column. Note that the median should be rounded to the nearest yen (either rounding half up or rounding half to even is acceptable). After completion, also check that there are no missing values for each item.
# Code example 1 (fillna in Pandas)
df_product_3 = df_product.fillna({
'unit_price':np.round(np.nanmedian(df_product['unit_price'])),
'unit_cost':np.round(np.nanmedian(df_product['unit_cost']))})
df_product_3.isnull().sum()
product_cd 0 category_major_cd 0 category_medium_cd 0 category_small_cd 0 unit_price 0 unit_cost 0 dtype: int64
Commentary :
The code first fills the missing values in the `unit_price` and `unit_cost` columns of the DataFrame `df_product` using the `fillna` method with the following argument:
The argument is a dictionary with two keys, `unit_price` and `unit_cost`, each mapping to a fill value. These values are computed by applying NumPy's `nanmedian` function to the respective columns of the DataFrame; `nanmedian` calculates the median of a column while ignoring any missing values represented by NaN. Columns not listed in the dictionary are left untouched.
The result of this operation is a new DataFrame called `df_product_3`.
The next line checks whether any missing values remain in the `df_product_3` DataFrame by calling the `isnull()` method and then the `sum()` method. This returns a Series containing the number of missing values for each column. If the output shows no missing values in either column, the fillna operation was successful.
# Code example 2 (SimpleImputer in scikit-learn)
imp_mean = SimpleImputer(missing_values=np.nan, strategy='median')
imp_values = imp_mean.fit_transform(df_product[['unit_price', 'unit_cost']])
df_product_3 = df_product.copy()
df_product_3[['unit_price', 'unit_cost']] = imp_values.round()
df_product_3.isnull().sum()
product_cd 0 category_major_cd 0 category_medium_cd 0 category_small_cd 0 unit_price 0 unit_cost 0 dtype: int64
Commentary :
This code performs imputation of missing values in the `unit_price` and `unit_cost` columns of the DataFrame `df_product` using scikit-learn's SimpleImputer.
The first line initializes an instance of SimpleImputer named `imp_mean`. This class imputes missing values with a given strategy, which in this case is the median of the non-missing values. The `missing_values` argument specifies the value to be treated as missing, which is NaN here.
The second line uses the `fit_transform` method of the `imp_mean` object to impute missing values in the `unit_price` and `unit_cost` columns of the `df_product` DataFrame. The method returns a NumPy array named `imp_values` that contains the imputed values.
The third line creates a copy of the `df_product` DataFrame named `df_product_3`. This is done to avoid modifying the original DataFrame.
The fourth line assigns the imputed values to the `unit_price` and `unit_cost` columns of the `df_product_3` DataFrame. The `round()` method is called on the `imp_values` array to round the values to the nearest integer.
The last line checks whether any missing values remain in the `df_product_3` DataFrame by calling the `isnull()` method and then the `sum()` method. If the output shows no missing values in either column, the imputation was successful.
P-083: Create new product data in which the missing values of unit_price and unit_cost are filled with the median value calculated for each product sub-category code (category_small_cd). Note that the median should be rounded to the nearest yen (either rounding half up or rounding half to even is acceptable). After completion, also check that there are no missing values for each item.
# Code example 1
df_tmp = (df_product.groupby('category_small_cd')
.agg(median_price=('unit_price', 'median'),
median_cost=('unit_cost', 'median')).reset_index())
df_product_4 = pd.merge(df_product, df_tmp, how='inner', on='category_small_cd')
df_product_4['unit_price'] = df_product_4[['unit_price', 'median_price']]. \
apply(lambda x: np.round(x[1]) if np.isnan(x[0]) else x[0], axis=1)
df_product_4['unit_cost'] = df_product_4[['unit_cost', 'median_cost']]. \
apply(lambda x: np.round(x[1]) if np.isnan(x[0]) else x[0], axis=1)
df_product_4.isnull().sum()
product_cd 0 category_major_cd 0 category_medium_cd 0 category_small_cd 0 unit_price 0 unit_cost 0 median_price 0 median_cost 0 dtype: int64
Commentary :
This code imputes missing values in the `unit_price` and `unit_cost` columns of the DataFrame `df_product` using the median value of each sub-category. The code achieves this in the following steps:
The first statement groups `df_product` by the `category_small_cd` column and computes the median of `unit_price` and `unit_cost` for each group. The resulting DataFrame is named `df_tmp` and has three columns: `category_small_cd`, `median_price`, and `median_cost`.
The second statement uses the `merge()` function of the Pandas library to merge `df_product` and `df_tmp` on the `category_small_cd` column. The resulting DataFrame, `df_product_4`, contains all the columns of `df_product` plus the `median_price` and `median_cost` columns from `df_tmp`.
The third statement fills the `unit_price` column of `df_product_4` by applying a lambda function to the `unit_price` and `median_price` columns. The lambda function checks whether `unit_price` is missing (i.e., NaN) and, if so, returns the rounded value of `median_price`; otherwise it returns the original value of `unit_price`.
The fourth statement fills the `unit_cost` column in the same way, using a similar lambda function on the `unit_cost` and `median_cost` columns.
The last line checks whether any missing values remain in `df_product_4` by calling the `isnull()` method and then the `sum()` method. If the output shows no missing values in either column, the imputation was successful.
# Code example 2 (use of MASK)
df_tmp = (df_product.groupby('category_small_cd')
.agg(median_price=('unit_price', 'median'),
median_cost=('unit_cost', 'median')).reset_index())
df_product_4 = df_product.merge(df_tmp, how='inner', on='category_small_cd')
df_product_4['unit_price'] = (df_product_4['unit_price']
.mask(df_product_4['unit_price'].isnull(),
df_product_4['median_price'].round()))
df_product_4['unit_cost'] = (df_product_4['unit_cost']
.mask(df_product_4['unit_cost'].isnull(),
df_product_4['median_cost'].round()))
df_product_4.isnull().sum()
product_cd 0 category_major_cd 0 category_medium_cd 0 category_small_cd 0 unit_price 0 unit_cost 0 median_price 0 median_cost 0 dtype: int64
Commentary :
This code imputes missing values in the `unit_price` and `unit_cost` columns of the DataFrame `df_product` using the median value of each sub-category. The code achieves this in the following steps:
The first statement groups `df_product` by the `category_small_cd` column and computes the median of `unit_price` and `unit_cost` for each group. The resulting DataFrame is named `df_tmp` and has three columns: `category_small_cd`, `median_price`, and `median_cost`.
The second statement merges `df_product` and `df_tmp` on the `category_small_cd` column. The resulting DataFrame, `df_product_4`, contains all the columns of `df_product` plus the `median_price` and `median_cost` columns from `df_tmp`.
The third statement fills the `unit_price` column of `df_product_4` using the `mask()` method of Pandas. The `mask()` method replaces the missing values (i.e., NaN) in `unit_price` with the rounded value of `median_price`.
The fourth statement fills the `unit_cost` column in the same way, replacing its missing values with the rounded `median_cost`.
The last line checks whether any missing values remain in `df_product_4` by calling the `isnull()` method and then the `sum()` method. If the output shows no missing values in either column, the imputation was successful.
# Code example 3 (utilising fillna and transform)
df_product_4 = df_product.copy()
for x in ['unit_price', 'unit_cost']:
    df_product_4[x] = (df_product_4[x]
                       .fillna(df_product_4.groupby('category_small_cd')[x]
                               .transform('median')
                               .round()))
df_product_4.isnull().sum()
product_cd 0 category_major_cd 0 category_medium_cd 0 category_small_cd 0 unit_price 0 unit_cost 0 dtype: int64
Commentary :
This code imputes missing values in the `unit_price` and `unit_cost` columns of the DataFrame `df_product` using the median value of each sub-category. The code achieves this in the following steps:
The first line creates a copy of the original DataFrame `df_product` and names it `df_product_4`.
The second line starts a for loop that iterates over the columns `unit_price` and `unit_cost`.
Inside the loop, the missing values in each column are filled. The `fillna()` method replaces the missing values with the median value of the respective sub-category, obtained by grouping `df_product_4` by `category_small_cd` and computing the per-group median with the `transform()` method. The `round()` method rounds the median to the nearest integer.
The last line checks whether any missing values remain in `df_product_4` by calling the `isnull()` method and then the `sum()` method. If the output shows no missing values in either column, the imputation was successful.
Thus, the code creates a new DataFrame `df_product_4` in which the missing values of `unit_price` and `unit_cost` have been replaced with the rounded per-category median.
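Since `transform()` is the key difference from the previous two examples, here is a tiny illustration (toy data, not from the exercise) of how `transform()` differs from `agg()`: it returns one value per original row, aligned with the original index, which is why its result can be passed straight to `fillna()`.
# Toy illustration: agg() returns one row per group, transform() one value per original row
toy = pd.DataFrame({'grp': ['a', 'a', 'b'], 'val': [1.0, 3.0, 10.0]})
print(toy.groupby('grp')['val'].agg('median'))        # a -> 2.0, b -> 10.0
print(toy.groupby('grp')['val'].transform('median'))  # 2.0, 2.0, 10.0 (same length as toy)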
P-084: For all customers in the customer data (df_customer), calculate the ratio of the 2019 sales amount to the sales amount over the whole period and create new data. However, if there are no sales results, treat the value as 0. Then extract the rows whose calculated ratio exceeds 0 and display 10 results. Also check that there are no missing values in the data created.
df_receipt_2019 = df_receipt.query('20190101 <= sales_ymd <= 20191231') \
.groupby('customer_id') \
.agg(amount_2019=('amount', 'sum')) \
.reset_index()
df_receipt_all = df_receipt.groupby('customer_id')\
.agg(amount_all=('amount', 'sum')) \
.reset_index()
df_sales_rate = df_customer[['customer_id']] \
.merge(df_receipt_2019, how='left', on='customer_id') \
.merge(df_receipt_all, how='left', on='customer_id')
df_sales_rate['amount_2019'] = df_sales_rate['amount_2019'].fillna(0)
df_sales_rate['amount_all'] = df_sales_rate['amount_all'].fillna(0)
df_sales_rate['amount_rate'] = \
df_sales_rate[['amount_2019','amount_all']] \
.apply(lambda x: 0 if x[0] == 0 else x[0] / x[1], axis=1)
df_sales_rate['amount_rate'] = df_sales_rate['amount_rate'].fillna(0)
df_sales_rate.query('amount_rate > 0').head(10)
customer_id | amount_2019 | amount_all | amount_rate | |
---|---|---|---|---|
2 | CS031415000172 | 2971.0 | 5088.0 | 0.583923 |
6 | CS015414000103 | 874.0 | 3122.0 | 0.279949 |
12 | CS011215000048 | 248.0 | 3444.0 | 0.072009 |
15 | CS029415000023 | 3767.0 | 5167.0 | 0.729050 |
21 | CS035415000029 | 5823.0 | 7504.0 | 0.775986 |
23 | CS023513000066 | 208.0 | 771.0 | 0.269780 |
24 | CS035513000134 | 463.0 | 1565.0 | 0.295847 |
27 | CS001515000263 | 216.0 | 216.0 | 1.000000 |
30 | CS006415000279 | 229.0 | 229.0 | 1.000000 |
32 | CS031415000106 | 215.0 | 7741.0 | 0.027774 |
df_sales_rate.isnull().sum()
customer_id 0 amount_2019 0 amount_all 0 amount_rate 0 dtype: int64
Commentary :
This code processes the Pandas DataFrame `df_receipt` to calculate each customer's 2019 sales as a share of their total sales. Here's a step-by-step explanation of what the code does:
The first statement creates a DataFrame called `df_receipt_2019` that filters the rows of `df_receipt` to sales from 2019 only (`20190101 <= sales_ymd <= 20191231`). It then groups the rows by `customer_id` and sums the `amount` column for each group. The resulting DataFrame has two columns: `customer_id` and `amount_2019`.
The second statement creates a DataFrame called `df_receipt_all` that groups `df_receipt` by `customer_id` and sums the `amount` column for each group. The resulting DataFrame has two columns: `customer_id` and `amount_all`.
The third statement creates a DataFrame called `df_sales_rate` that merges the `customer_id` column of `df_customer` with `df_receipt_2019` and `df_receipt_all`. Both merges are left joins, so all rows of `df_customer` are kept and only the matching rows of `df_receipt_2019` and `df_receipt_all` are brought in. The resulting DataFrame has three columns: `customer_id`, `amount_2019`, and `amount_all`.
The fourth and fifth statements fill any missing values in the `amount_2019` and `amount_all` columns with 0.
The sixth statement creates a new column called `amount_rate` in `df_sales_rate` by applying a lambda function to the `amount_2019` and `amount_all` columns. The lambda function returns 0 if `amount_2019` is 0, and the ratio of `amount_2019` to `amount_all` otherwise. The resulting `amount_rate` column represents each customer's 2019 sales as a share of their total sales.
The seventh statement fills any missing values in the `amount_rate` column with 0.
The last statement filters the rows of `df_sales_rate` to customers with a positive `amount_rate` and displays the first 10 rows.
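As a side note, the same ratio can be computed without a row-wise `apply()`. A minimal vectorised sketch (not the original answer; it relies on the fact that 2019 sales are a subset of all-period sales, so only 0/0 can occur):
# Vectorised alternative: 0/0 yields NaN, which fillna(0) turns into 0
df_sales_rate['amount_rate'] = (df_sales_rate['amount_2019']
                                .div(df_sales_rate['amount_all'])
                                .fillna(0))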
P-085: Create new customer data by linking geocode data (df_geocode) to all customers in the customer data (df_customer) using postcode (postal_cd). However, if multiple longitude (longitude) and latitude (latitude) information is associated with a single postcode (postal_cd), calculate and use the average values of longitude (longitude) and latitude (latitude). Also, display 10 results to check the creation results.
df_geocode_1 = df_geocode.groupby('postal_cd') \
.agg(m_longitude=('longitude', 'mean'),
m_latitude=('latitude', 'mean')).reset_index()
df_customer_1 = pd.merge(df_customer, df_geocode_1,
how='inner', on='postal_cd')
df_customer_1.head(10)
customer_id | customer_name | gender_cd | gender | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | m_longitude | m_latitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | CS021313000114 | 大野 あや子 | 1 | 女性 | 1981-04-29 | 37 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20150905 | 0-00000000-0 | 139.31779 | 35.41358 |
1 | CS021303000023 | 堀 一徳 | 0 | 男性 | 1980-04-05 | 38 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20160411 | 0-00000000-0 | 139.31779 | 35.41358 |
2 | CS021303000007 | 石倉 俊二 | 0 | 男性 | 1987-07-04 | 31 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20150707 | 0-00000000-0 | 139.31779 | 35.41358 |
3 | CS021313000183 | 草野 未來 | 1 | 女性 | 1986-12-21 | 32 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20170611 | 0-00000000-0 | 139.31779 | 35.41358 |
4 | CS021314000098 | 筒井 れいな | 1 | 女性 | 1985-09-21 | 33 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20160901 | 0-00000000-0 | 139.31779 | 35.41358 |
5 | CS021314000093 | 江藤 美佐 | 1 | 女性 | 1986-06-03 | 32 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20151015 | 0-00000000-0 | 139.31779 | 35.41358 |
6 | CS021413000049 | 大野 幸子 | 1 | 女性 | 1973-04-17 | 45 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20150703 | 0-00000000-0 | 139.31779 | 35.41358 |
7 | CS037613000071 | 六角 雅彦 | 9 | 不明 | 1952-04-01 | 66 | 136-0076 | 東京都江東区南砂********** | S13037 | 20150414 | 0-00000000-0 | 139.83502 | 35.67193 |
8 | CS015415000209 | 大谷 倫子 | 1 | 女性 | 1970-11-25 | 48 | 136-0076 | 東京都江東区南砂********** | S13015 | 20150909 | B-20090610-C | 139.83502 | 35.67193 |
9 | CS037614000045 | 長沢 麻緒 | 1 | 女性 | 1952-06-19 | 66 | 136-0076 | 東京都江東区南砂********** | S13037 | 20150209 | 5-20091207-6 | 139.83502 | 35.67193 |
Commentary :
This code processes two Pandas DataFrames, `df_geocode` and `df_customer`. Here's a step-by-step explanation of what the code does:
The first statement creates a DataFrame called `df_geocode_1` by grouping the rows of `df_geocode` by `postal_cd` (postcode) and calculating the mean longitude and latitude for each group. The resulting DataFrame has three columns: `postal_cd`, `m_longitude`, and `m_latitude`.
The second statement creates a DataFrame called `df_customer_1` by merging `df_customer` and `df_geocode_1` on the `postal_cd` column. The merge is an inner join, so only rows with matching values in both DataFrames are kept. The resulting DataFrame has all the columns of `df_customer` plus the `m_longitude` and `m_latitude` columns from `df_geocode_1`.
The third statement displays the first 10 rows of the resulting DataFrame (`df_customer_1`) using the `.head(10)` method. This shows a sample of the merged data: customer information together with the mean longitude and latitude of each customer's postcode.
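The question asks to link geocode data to all customers, so it is worth confirming that the inner join does not drop anyone. A small check (a sketch, not part of the original answer) using a left join and counting unmatched postcodes:
# If both counts are 0, every postal_cd in df_customer has a geocode entry
df_customer_left = pd.merge(df_customer, df_geocode_1, how='left', on='postal_cd')
print(df_customer_left[['m_longitude', 'm_latitude']].isnull().sum())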
P-086: For the customer data with latitude(latitude) and longitude(longitude) created in 085, combine it with shop data (df_store) using the member application shop code (application_store_cd) as key. Calculate the distance (in km) between the application shop and the customer address using the latitude(latitude) and longitude information (longitude) of the application shop and the latitude and longitude of the customer address (address), and display it together with the customer ID (customer_id), customer address and shop address (address). The following simple formula should be used as the calculation formula, but libraries using other highly accurate methods are also acceptable. Display 10 results.
Latitude (radians): φ, longitude (radians): λ. Distance: L = 6371 × arccos(sin φ1 × sin φ2 + cos φ1 × cos φ2 × cos(λ1 − λ2))
# Code example 1
def calc_distance(x1, y1, x2, y2):
    distance = 6371 * math.acos(math.sin(math.radians(x1))
                                * math.sin(math.radians(x2))
                                + math.cos(math.radians(x1))
                                * math.cos(math.radians(x2))
                                * math.cos(math.radians(y1) - math.radians(y2)))
    return distance
df_tmp = pd.merge(df_customer_1, df_store,
how='inner',
left_on='application_store_cd',
right_on='store_cd') \
.rename(columns={'address_x':'customer_address',
'address_y':'store_address'})
df_tmp['distance'] = df_tmp[['m_latitude',
'm_longitude',
'latitude',
'longitude']] \
.apply(lambda x: calc_distance(x[0], x[1], x[2], x[3]),
axis=1)
df_tmp[['customer_id', 'customer_address',
'store_address', 'distance']].head(10)
customer_id | customer_address | store_address | distance | |
---|---|---|---|---|
0 | CS021313000114 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
1 | CS021303000023 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
2 | CS021303000007 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
3 | CS021313000183 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
4 | CS021314000098 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
5 | CS021314000093 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
6 | CS021413000049 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
7 | CS021313000025 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
8 | CS021413000022 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
9 | CS021413000094 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
Commentary :
This code performs geospatial calculations and data processing on the Pandas DataFrames `df_customer_1` and `df_store`. Here's a step-by-step explanation of what the code does:
The first block defines a Python function called `calc_distance` that takes four arguments: `x1` and `y1` (latitude and longitude of point 1) and `x2` and `y2` (latitude and longitude of point 2). The function uses the spherical law of cosines given in the question to calculate the distance between the two points in kilometres, assuming a spherical earth with a radius of 6371 km. The result is returned as `distance`.
The second statement creates a DataFrame called `df_tmp` by merging `df_customer_1` and `df_store` on the `application_store_cd` and `store_cd` columns, respectively. The merge is an inner join, so only rows with matching values in both DataFrames are kept. The resulting DataFrame has all the columns of both DataFrames, with two columns renamed: `address_x` becomes `customer_address` and `address_y` becomes `store_address`.
The third statement creates a new column in `df_tmp` called `distance` by applying the `calc_distance` function to the `m_latitude`, `m_longitude`, `latitude`, and `longitude` columns with the `.apply()` method. The `axis=1` argument applies the function row-wise, and the resulting distance is stored in the `distance` column.
The fourth statement selects a subset of columns from `df_tmp` (`customer_id`, `customer_address`, `store_address`, and `distance`) using double square brackets and displays the first 10 rows with `.head(10)`. This shows a sample of the merged data: customer and store information with the corresponding distance between them.
# Code example 2
def calc_distance_numpy(x1, y1, x2, y2):
    x1_r = np.radians(x1)
    x2_r = np.radians(x2)
    y1_r = np.radians(y1)
    y2_r = np.radians(y2)
    return 6371 * np.arccos(np.sin(x1_r) * np.sin(x2_r)
                            + np.cos(x1_r) * np.cos(x2_r)
                            * np.cos(y1_r - y2_r))
df_tmp = df_customer_1.merge(df_store,
how='inner',
left_on='application_store_cd',
right_on='store_cd') \
.rename(columns={'address_x':'customer_address',
'address_y':'store_address'})
df_tmp['distance'] = calc_distance_numpy(df_tmp['m_latitude'],
df_tmp['m_longitude'],
df_tmp['latitude'],
df_tmp['longitude'])
df_tmp[['customer_id', 'customer_address',
'store_address', 'distance']].head(10)
customer_id | customer_address | store_address | distance | |
---|---|---|---|---|
0 | CS021313000114 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
1 | CS021303000023 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
2 | CS021303000007 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
3 | CS021313000183 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
4 | CS021314000098 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
5 | CS021314000093 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
6 | CS021413000049 | 神奈川県伊勢原市粟窪********** | 神奈川県伊勢原市伊勢原四丁目 | 1.394409 |
7 | CS021313000025 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
8 | CS021413000022 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
9 | CS021413000094 | 神奈川県伊勢原市伊勢原********** | 神奈川県伊勢原市伊勢原四丁目 | 0.474282 |
Commentary :
This code performs the same geospatial calculation as code example 1, but with NumPy. Here's a step-by-step explanation of what the code does:
The first block defines a Python function called `calc_distance_numpy` that takes four arguments: `x1` and `y1` (latitude and longitude of point 1) and `x2` and `y2` (latitude and longitude of point 2). The function uses the spherical law of cosines to calculate the distance between the two points in kilometres, assuming a spherical earth with a radius of 6371 km. This implementation uses NumPy functions, so it operates on whole columns at once and is more efficient than the previous row-wise implementation.
The second statement creates a DataFrame called `df_tmp` by merging `df_customer_1` and `df_store` on the `application_store_cd` and `store_cd` columns, respectively. The merge is an inner join, so only rows with matching values in both DataFrames are kept. The resulting DataFrame has all the columns of both DataFrames, with two columns renamed: `address_x` becomes `customer_address` and `address_y` becomes `store_address`.
The third statement creates a new column in `df_tmp` called `distance` by calling `calc_distance_numpy` directly on four columns of `df_tmp`. Instead of using the `.apply()` method, this implementation takes advantage of NumPy's ability to perform element-wise computations on arrays. The resulting distances are stored in the `distance` column.
The fourth statement selects a subset of columns from `df_tmp` (`customer_id`, `customer_address`, `store_address`, and `distance`) and displays the first 10 rows with `.head(10)`.
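The question also allows libraries with more accurate methods. A hedged sketch (assuming the third-party geopy package is installed, e.g. with !pip install geopy): geodesic() uses an ellipsoidal earth model rather than the spherical formula above, so its results will differ slightly.
# Optional cross-check with geopy's ellipsoidal distance (assumes geopy is installed)
from geopy.distance import geodesic
df_tmp['distance_geodesic'] = df_tmp.apply(
    lambda r: geodesic((r['m_latitude'], r['m_longitude']),
                       (r['latitude'], r['longitude'])).km, axis=1)
df_tmp[['customer_id', 'distance', 'distance_geodesic']].head(3)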
P-087: In the customer data (df_customer), the same customer is registered more than once, e.g. due to applications at different shops. Create de-duplicated (name-matched) customer data in which customers with the same name (customer_name) and postcode (postal_cd) are regarded as the same customer, so that there is one record per customer, and calculate the number of records in the customer data, the number of records after de-duplication, and the number of duplicates. For the same customer, the record with the highest total sales amount shall be retained; for customers with the same total sales amount or no sales record, the one with the lowest customer ID (customer_id) shall be retained.
df_receipt_tmp = df_receipt.groupby('customer_id') \
.agg(sum_amount=('amount','sum')).reset_index()
df_customer_u = pd.merge(df_customer, df_receipt_tmp,
how='left',
on='customer_id')
df_customer_u['sum_amount'] = df_customer_u['sum_amount'].fillna(0)
df_customer_u = df_customer_u.sort_values(['sum_amount', 'customer_id'],
ascending=[False, True])
df_customer_u.drop_duplicates(subset=['customer_name', 'postal_cd'],
keep='first', inplace=True)
print('df_customer_cnt:', len(df_customer),
'df_customer_u_cnt:', len(df_customer_u),
'diff:', len(df_customer) - len(df_customer_u))
df_customer_cnt: 21971
df_customer_u_cnt: 21941
diff: 30
Commentary :
This code performs the following tasks:
Group the `df_receipt` dataframe by `customer_id` and calculate the sum of `amount` for each customer. The resulting dataframe is stored in `df_receipt_tmp`.
Merge the `df_customer` dataframe with the `df_receipt_tmp` dataframe on the `customer_id` column. This is a left join, so all customers in `df_customer` are included in the resulting dataframe `df_customer_u`.
Replace any missing values in the `sum_amount` column with 0.
Sort the rows in `df_customer_u` by `sum_amount` in descending order, and then by `customer_id` in ascending order.
Remove any duplicates in `df_customer_u` based on the `customer_name` and `postal_cd` columns, keeping only the first occurrence of each duplicate; thanks to the preceding sort, that first occurrence is the record with the highest total sales and, in case of ties, the smallest customer ID. The result is stored back in `df_customer_u`.
Print the number of rows in the original `df_customer` dataframe (`df_customer_cnt`), the number of rows in the de-duplicated `df_customer_u` dataframe (`df_customer_u_cnt`), and the difference between the two (`diff`).
Overall, this code cleans the customer data into a new dataframe `df_customer_u` that holds each customer's total purchase amount (`sum_amount`), with duplicates removed based on the `customer_name` and `postal_cd` columns.
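A quick sanity check (a sketch, not part of the original answer): the printed diff should equal the number of (customer_name, postal_cd) pairs that occur more than once, counted beyond their first occurrence.
# Count duplicate registrations beyond the first occurrence of each (name, postcode) pair
print(df_customer.duplicated(subset=['customer_name', 'postal_cd']).sum())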
P-088: Based on the data created in 087, create customer data to which an integration ID is assigned. The integration ID shall be assigned according to the following specifications.
- Non-duplicated customers: set customer ID (customer_id)
- Duplicate customers: set the customer ID of the record extracted in the previous question
Also check the difference between the number of unique customer IDs and the number of unique integration IDs.
df_customer_n = pd.merge(df_customer,
df_customer_u[['customer_name',
'postal_cd', 'customer_id']],
how='inner', on =['customer_name', 'postal_cd'])
df_customer_n.rename(columns={'customer_id_x':'customer_id',
'customer_id_y':'integration_id'}, inplace=True)
print('Difference in number of IDs', len(df_customer_n['customer_id'].unique())
- len(df_customer_n['integration_id'].unique()))
Difference in number of IDs 30
Commentary :
This code merges two dataframes, `df_customer` and `df_customer_u`, to match customers based on their name and postal code.
The resulting merged dataframe is stored in a new dataframe called `df_customer_n`. The `merge` function joins the two dataframes, and the argument `on=['customer_name', 'postal_cd']` specifies that the merge is performed on these two columns.
The merged dataframe contains two customer ID columns, `customer_id_x` and `customer_id_y`. To avoid confusion, the `rename` function renames them to `customer_id` and `integration_id`, respectively.
The last line compares the number of unique values in the `customer_id` column with the number of unique values in the `integration_id` column. A non-zero difference is the number of duplicate registrations that were collapsed onto another record's ID; a difference of zero would mean there were no duplicates.
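To see which records were affected, a small follow-up (a sketch, not from the original answer) lists the rows whose integration ID differs from their original customer ID:
# Rows where a duplicate registration was mapped onto another record's customer ID
df_customer_n.query('customer_id != integration_id')[
    ['customer_id', 'integration_id', 'customer_name', 'postal_cd']].head(10)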
Note: df_customer_1 and df_customer_n are no longer used from here on, so they are deleted.
del df_customer_1
del df_customer_n
P-089: You want to split customers with sales records into training data and test data for building a forecasting model. Split the data randomly in the ratio of 8:2 respectively.
df_sales_customer = df_receipt.groupby('customer_id') \
.agg({'amount':sum}).reset_index()
df_sales_customer = df_sales_customer.query('amount > 0')
df_tmp = pd.merge(df_customer, df_sales_customer['customer_id'],
how='inner', on='customer_id')
df_train, df_test = train_test_split(df_tmp, test_size=0.2, random_state=71)
print('Percentage of training data: ', len(df_train) / len(df_tmp))
print('Test Data Percentage: ', len(df_test) / len(df_tmp))
Percentage of training data: 0.7999036840837949
Test Data Percentage: 0.20009631591620516
Commentary :
This code performs a train-test split of the customers based on their sales records.
First, it aggregates the sales of each customer from the `df_receipt` dataframe by grouping by `customer_id` and summing the `amount` column. The resulting dataframe is stored in `df_sales_customer`.
Next, it filters out the customers without a positive sales amount using the `query` method, keeping only rows with `amount > 0` in `df_sales_customer`.
Then, it performs an inner join of `df_customer` with the `customer_id` column of `df_sales_customer`, so that only customers with a sales record remain. The resulting dataframe is stored in `df_tmp`.
Finally, it splits `df_tmp` into training and test sets using the `train_test_split` function from scikit-learn, with a test size of 0.2 and a random state of 71, and prints the percentage of data in each set.
P-090: The receipt details data (df_receipt) has data from 1 Jan 2017 to 31 Oct 2019. Create three sets of data for building time-series models, 12 months for training and 6 months for testing, by aggregating the sales amount (amount) on a monthly basis.
# Code example 1 (home-made function)
df_ts_amount = df_receipt[['sales_ymd', 'amount']].copy()
df_ts_amount['sales_ym'] = df_ts_amount['sales_ymd'].astype('str').str[0:6]
df_ts_amount = df_ts_amount.groupby('sales_ym') \
.agg({'amount':'sum'}).reset_index()
# Wrapping the split in a function makes it easy to create many data sets over a long period, e.g. in a loop.
def split_data(df, train_size, test_size, slide_window, start_point):
    train_start = start_point * slide_window
    test_start = train_start + train_size
    return df[train_start:test_start], df[test_start:test_start + test_size]
df_train_1, df_test_1 = split_data(df_ts_amount, train_size=12,
test_size=6, slide_window=6, start_point=0)
df_train_2, df_test_2 = split_data(df_ts_amount, train_size=12,
test_size=6, slide_window=6, start_point=1)
df_train_3, df_test_3 = split_data(df_ts_amount, train_size=12,
test_size=6, slide_window=6, start_point=2)
# df_train_2 and df_train_3 display omitted
df_train_1
sales_ym | amount | |
---|---|---|
0 | 201701 | 902056 |
1 | 201702 | 764413 |
2 | 201703 | 962945 |
3 | 201704 | 847566 |
4 | 201705 | 884010 |
5 | 201706 | 894242 |
6 | 201707 | 959205 |
7 | 201708 | 954836 |
8 | 201709 | 902037 |
9 | 201710 | 905739 |
10 | 201711 | 932157 |
11 | 201712 | 939654 |
# df_test_2 and df_test_3 display omitted
df_test_1
sales_ym | amount | |
---|---|---|
12 | 201801 | 944509 |
13 | 201802 | 864128 |
14 | 201803 | 946588 |
15 | 201804 | 937099 |
16 | 201805 | 1004438 |
17 | 201806 | 1012329 |
Commentary :
The code performs the following tasks:
A dataframe `df_ts_amount` is created with the columns `sales_ymd` and `amount`, and a `sales_ym` column is derived from the first six characters of `sales_ymd` converted to a string.
`df_ts_amount` is then grouped by the `sales_ym` column, the `amount` column is aggregated with the `sum` function, and the index is reset. The result is stored back in `df_ts_amount`.
A function `split_data` is defined that takes a dataframe `df`, a train size, a test size, a slide window, and a start point as arguments, and returns two dataframes: a train and a test dataframe.
The `split_data` function is called three times with different start points to split `df_ts_amount` into train and test dataframes. The three sets are stored in `df_train_1`, `df_test_1`, `df_train_2`, `df_test_2`, `df_train_3`, and `df_test_3`.
The purpose of splitting the data into multiple sets is to use them as training and validation sets for a time-series forecasting model. By training on different time periods and evaluating on different test periods, overfitting is easier to detect and the estimate of the model's performance is more reliable.
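As noted in the code comment above, the function also lends itself to a loop. An equivalent sketch of the three explicit calls:
# Produce the three train/test pairs in a loop instead of three separate calls
splits = [split_data(df_ts_amount, train_size=12, test_size=6,
                     slide_window=6, start_point=i) for i in range(3)]
df_train_1, df_test_1 = splits[0]
df_train_2, df_test_2 = splits[1]
df_train_3, df_test_3 = splits[2]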
# Code example 2 (TimeSeriesSplit in scikit-learn)
tscv = TimeSeriesSplit(gap=0, max_train_size=12, n_splits=3, test_size=6)
# TimeSeriesSplit allocates the splits so that the most recent data is used,
# so the data period is adjusted here to match the SQL and R solution examples.
# This is unnecessary if you simply want to use the most recent data.
df_ts_amount = df_ts_amount.query('sales_ym <= "201906"')
series_list = []
for train_index, test_index in tscv.split(df_ts_amount):
    series_list.append((df_ts_amount.loc[train_index],
                        df_ts_amount.loc[test_index]))
df_train_1, df_test_1 = series_list[0]
df_train_2, df_test_2 = series_list[1]
df_train_3, df_test_3 = series_list[2]
# df_train_2 and df_train_3 display omitted
df_train_1
sales_ym | amount | |
---|---|---|
0 | 201701 | 902056 |
1 | 201702 | 764413 |
2 | 201703 | 962945 |
3 | 201704 | 847566 |
4 | 201705 | 884010 |
5 | 201706 | 894242 |
6 | 201707 | 959205 |
7 | 201708 | 954836 |
8 | 201709 | 902037 |
9 | 201710 | 905739 |
10 | 201711 | 932157 |
11 | 201712 | 939654 |
# df_test_2 and df_test_3 display omitted
df_test_1
sales_ym | amount | |
---|---|---|
12 | 201801 | 944509 |
13 | 201802 | 864128 |
14 | 201803 | 946588 |
15 | 201804 | 937099 |
16 | 201805 | 1004438 |
17 | 201806 | 1012329 |
Commentary :
This code performs time-series cross-validation on the pandas DataFrame `df_ts_amount`. It uses the `TimeSeriesSplit` class from the `sklearn.model_selection` module to generate three train-test splits. `TimeSeriesSplit` is a cross-validator that provides train/test indices for splitting time-series data samples. Four parameters are used here:
- `n_splits`: the number of splits
- `max_train_size`: the maximum size of the training data
- `test_size`: the size of the test data
- `gap`: the number of samples to exclude between the training and test data
The code sets `n_splits=3`, `max_train_size=12`, `test_size=6`, and `gap=0` to create three splits with a training period of 12 months and a testing period of 6 months.
The code then filters the data to include only sales up to June 2019 (`df_ts_amount.query('sales_ym <= "201906"')`) and creates a list `series_list` to store the train and test sets generated by `TimeSeriesSplit`. The for loop iterates over the splits and appends each one to `series_list`. Finally, the code assigns the first train-test split to `df_train_1` and `df_test_1`, the second to `df_train_2` and `df_test_2`, and the third to `df_train_3` and `df_test_3`.
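A short inspection (a sketch, not part of the original answer) makes the sliding 12-month/6-month window visible by printing the row-index range of each split:
# Print the row-index range of each training and test window
for i, (train_index, test_index) in enumerate(tscv.split(df_ts_amount), 1):
    print(f'split {i}: train {train_index.min()}-{train_index.max()}, '
          f'test {test_index.min()}-{test_index.max()}')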
P-091: For each customer in the customer data (df_customer), under-sample the number of customers with sales performance and the number of customers without sales performance so that the ratio is 1:1.
df_tmp = df_receipt.groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_tmp = pd.merge(df_customer, df_tmp, how='left', on='customer_id')
df_tmp['is_buy_flag'] = np.where(df_tmp['amount'].isnull(), 0, 1)
rs = RandomUnderSampler(random_state=71)
df_down_sampling, _ = rs.fit_resample(df_tmp, df_tmp.is_buy_flag)
print('Number of cases 0', len(df_down_sampling.query('is_buy_flag == 0')))
print('Number of cases 1', len(df_down_sampling.query('is_buy_flag == 1')))
Number of cases 0 8306
Number of cases 1 8306
Commentary :
This code performs random under-sampling to balance the binary target variable `is_buy_flag`.
First, the total purchase amount for each customer is computed with `groupby` and stored in `df_tmp`. Then `df_customer` and `df_tmp` are merged with a left join so that all customers are included, even those who have not made any purchases. The `is_buy_flag` column is created with `np.where`, assigning 1 to customers who made purchases and 0 to those who did not.
Next, the `RandomUnderSampler` class from the `imblearn` library is used to balance the target variable. The `fit_resample` method is applied to `df_tmp` with the `is_buy_flag` column as the target. This returns a new DataFrame, `df_down_sampling`, with an equal number of cases for both target values.
Finally, the number of cases with `is_buy_flag` values of 0 and 1 are printed to confirm that the random under-sampling was successful.
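A pandas-only alternative for reference (a sketch, assuming pandas 1.1 or later for GroupBy.sample; not the original answer): sample from each class as many rows as the smaller class contains, which gives the same 1:1 ratio without imbalanced-learn.
# Under-sample to a 1:1 ratio without imbalanced-learn
n_min = df_tmp['is_buy_flag'].value_counts().min()
df_down_sampling_pd = df_tmp.groupby('is_buy_flag').sample(n=n_min, random_state=71)
print(df_down_sampling_pd['is_buy_flag'].value_counts())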
P-092: Normalise the gender information in the customer data (df_customer) to third normal form.
df_gender_std = df_customer[['gender_cd', 'gender']].drop_duplicates()
df_customer_std = df_customer.drop(columns='gender')
# Check data content
df_customer_std.head(3)
customer_id | customer_name | gender_cd | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | |
---|---|---|---|---|---|---|---|---|---|---|
0 | CS021313000114 | 大野 あや子 | 1 | 1981-04-29 | 37 | 259-1113 | 神奈川県伊勢原市粟窪********** | S14021 | 20150905 | 0-00000000-0 |
1 | CS037613000071 | 六角 雅彦 | 9 | 1952-04-01 | 66 | 136-0076 | 東京都江東区南砂********** | S13037 | 20150414 | 0-00000000-0 |
2 | CS031415000172 | 宇多田 貴美子 | 1 | 1976-10-04 | 42 | 151-0053 | 東京都渋谷区代々木********** | S13031 | 20150529 | D-20100325-C |
# Check data content
df_gender_std.head(3)
gender_cd | gender | |
---|---|---|
0 | 1 | 女性 |
1 | 9 | 不明 |
5 | 0 | 男性 |
Commentary :
In this code, `df_gender_std` is created by selecting the `gender_cd` and `gender` columns from the `df_customer` dataframe and removing duplicate rows with the `drop_duplicates()` method. This produces a new dataframe with the unique combinations of `gender_cd` and `gender` values.
`df_customer_std` is created by dropping the `gender` column from the `df_customer` dataframe using the `drop()` method with the `columns` parameter set to `'gender'`. This produces a new dataframe with all the columns from `df_customer` except `gender`.
The purpose of this code is to split the customer data into two tables: one holding the unique `gender_cd`-to-`gender` mapping, and one holding all customer columns except `gender`. This removes the transitive dependency of `gender` on `gender_cd` and brings the data into third normal form.
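To confirm that the decomposition is lossless, a small sketch (not from the original answer) re-joins the two tables and recovers the gender name:
# The gender name can always be recovered by joining back on gender_cd
df_rejoined = df_customer_std.merge(df_gender_std, how='left', on='gender_cd')
df_rejoined[['customer_id', 'gender_cd', 'gender']].head(3)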
P-093: Product data (df_product) only holds the code values for each category, but not the category names. Combine with category data (df_category) and denormalise to create new product data that holds category names.
df_product_full = pd.merge(df_product, df_category[['category_small_cd',
'category_major_name',
'category_medium_name',
'category_small_name']],
how = 'inner', on = 'category_small_cd')
# Check data content
df_product_full.head(3)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name | |
---|---|---|---|---|---|---|---|---|---|
0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
This code performs an inner join between the dataframes `df_product` and `df_category` on the column `category_small_cd`. The resulting dataframe `df_product_full` contains all the columns from `df_product` plus the selected columns from `df_category`, matched on `category_small_cd`.
The added columns from `df_category` are `category_major_name`, `category_medium_name`, and `category_small_name`, which hold the category names for each product. Because the join uses `how='inner'`, only rows with matching values in both dataframes are kept. The `.head(3)` method displays the first 3 rows of the merged dataframe.
P-094: Output the product data with category names created in 093 to a file with the following specifications.
|File format|With/without header|Character encoding|
|:-:|:-:|:-:|
|CSV (comma separated)|With|UTF-8|
The path of the file output destination shall be as follows.
|Output|
|:-:|
|./data|
# Code example 1.
# Note: the answer notebook is located in a different directory from the question notebook, so the output path is '../data'.
df_product_full.to_csv('../data/P_df_product_full_UTF-8_header.csv',
encoding='UTF-8', index=False)
Commentary :
This code exports the Pandas DataFrame `df_product_full` to a CSV file at the path `../data/P_df_product_full_UTF-8_header.csv`. The exported CSV file uses UTF-8 encoding to support a wide range of characters, and the argument `index=False` specifies that the row index of the DataFrame should not be included in the exported file.
# Code example 2 (with BOM to prevent garbled Excel)
df_product_full.to_csv('../data/P_df_product_full_UTF-8BOM_header.csv',
encoding='utf_8_sig', index=False)
Commentary :
This line of code exports the `df_product_full` DataFrame to a CSV file named `P_df_product_full_UTF-8BOM_header.csv`.
The `to_csv()` method of a pandas DataFrame writes the contents of the DataFrame to a CSV file. It takes several parameters, including the file path and the encoding to use.
In this case, the path is set to `'../data/P_df_product_full_UTF-8BOM_header.csv'`, where the `'../data/'` prefix specifies the relative directory in which the file is saved.
The `encoding` parameter is set to `'utf_8_sig'`, a variant of UTF-8 that writes a byte order mark (BOM) at the beginning of the file. The BOM is a special marker that helps programs such as Excel recognise the encoding when opening the file.
The `index` parameter is set to `False`, so the DataFrame index is not included in the CSV file.
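A quick read-back check (a sketch, not part of the original answer): pandas can re-import the BOM variant with the same encoding name, so the round trip is easy to verify.
# Re-read the BOM file to confirm it loads correctly
pd.read_csv('../data/P_df_product_full_UTF-8BOM_header.csv', encoding='utf_8_sig').head(3)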
P-095: Output the product data with category names created in 093 to a file with the following specifications.
|File format|With/without header|Character encoding|
|:-:|:-:|:-:|
|CSV (comma separated)|With|CP932|
The path of the file output destination shall be as follows.
|Output|
|:-:|
|./data|
# Note: the answer notebook is located in a different directory from the question notebook, so the output path is '../data'.
df_product_full.to_csv('../data/P_df_product_full_CP932_header.csv',
encoding='CP932', index=False)
Commentary :
This line of code exports the pandas DataFrame `df_product_full` to a CSV file named "P_df_product_full_CP932_header.csv" in the "../data/" directory.
The exported CSV file uses the "CP932" character encoding, a Japanese character encoding also known as "Windows-31J". This encoding is commonly used in Japan for text data and is compatible with Windows operating systems.
The parameter `encoding='CP932'` specifies the character encoding to use when exporting the CSV file, and `index=False` specifies that the index column should not be included.
P-096: Output the product data with category names created in 093 to a file with the following specifications.
|File format|With/without header|Character encoding|
|:-:|:-:|:-:|
|CSV (comma separated)|Without|UTF-8|
The path of the file output destination shall be as follows.
|Output|
|:-:|
|./data|
# Note: the answer notebook is located in a different directory from the question notebook, so the output path is '../data'.
df_product_full.to_csv('../data/P_df_product_full_UTF-8_noh.csv',
header=False, encoding='UTF-8', index=False)
Commentary :
This code saves the `df_product_full` DataFrame to a CSV file called "P_df_product_full_UTF-8_noh.csv" in the "../data/" directory.
The first argument specifies the filename and path where the CSV file will be saved. The `header` argument is set to `False`, so the header row of the DataFrame is not included in the output file. The `encoding` argument is set to `'UTF-8'`, which specifies the character encoding to use when writing the file. Finally, `index` is set to `False`, so the index of the DataFrame is not included either.
Overall, this code exports the `df_product_full` DataFrame to a CSV file without a header row, using UTF-8 encoding.
P-097: Load the file in the following format created in 094, display three data items and check that they have been imported correctly.
|File format|With/without header|Character encoding|
|:-:|:-:|:-:|
|CSV (comma separated)|With|UTF-8|
# Note: the answer notebook is located in a different directory from the question notebook, so the file path is '../data'.
df_product_full = pd.read_csv('../data/P_df_product_full_UTF-8_header.csv',
dtype={'category_major_cd':str,
'category_medium_cd':str,
'category_small_cd':str},
encoding='UTF-8')
df_product_full.head(3)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name | |
---|---|---|---|---|---|---|---|---|---|
0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code above reads the CSV file `P_df_product_full_UTF-8_header.csv`, which was written in 094 with a header row and UTF-8 encoding.
`pd.read_csv('../data/P_df_product_full_UTF-8_header.csv', dtype={'category_major_cd':str, 'category_medium_cd':str, 'category_small_cd':str}, encoding='UTF-8')`: this reads the CSV file using its header row for the column names, and forces the three category code columns to the string data type so that leading zeros such as "04" are preserved.
`df_product_full.head(3)`: this displays the first 3 rows of the dataframe `df_product_full`, confirming that the file was imported correctly.
P-098: Load the file in the following format created in 096, display three data items and check that they are imported correctly.
|File format|With/without header|Character encoding|
|:-:|:-:|:-:|
|CSV (comma separated)|Without|UTF-8|
# Code example 1 (item names to be added later).
# Note: the answer notebook is located in a different directory from the question notebook, so the file path is '../data'.
df_product_full = pd.read_csv('../data/P_df_product_full_UTF-8_noh.csv',
dtype={1:str,
2:str,
3:str},
encoding='UTF-8', header=None)
df_product_full.columns = ['product_cd','category_major_cd',
'category_medium_cd', 'category_small_cd',
'unit_price','unit_cost','category_major_name',
'category_medium_name', 'category_small_name']
df_product_full.head(3)
product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name | |
---|---|---|---|---|---|---|---|---|---|
0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
This code reads the CSV file 'P_df_product_full_UTF-8_noh.csv' located in the '../data' directory. Because the file has no header row, `header=None` is passed, and the `dtype` parameter is keyed by column position: `{1:str, 2:str, 3:str}` forces the second, third and fourth columns (the category codes) to string type. The `encoding` parameter specifies that the file is encoded in UTF-8. The columns of the resulting DataFrame are initially numbered 0 through 8, so the next statement assigns the proper column names via `df_product_full.columns = [...]`.
The resulting DataFrame is assigned to the variable `df_product_full` and the first three rows are displayed using the `head` method.
# Code example 2 (define column names first)
c_names = ['product_cd','category_major_cd','category_medium_cd',
'category_small_cd','unit_price','unit_cost',
'category_major_name','category_medium_name','category_small_name']
df_product_full = pd.read_csv('../data/P_df_product_full_UTF-8_noh.csv',
names=c_names,
dtype={'category_major_cd':str,
'category_medium_cd':str,
'category_small_cd':str},
encoding='UTF-8', header=None)
df_product_full.head(3)
| | product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
| 1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
| 2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
This code reads a CSV file into a pandas DataFrame with customized column names using the `read_csv` function.
Here is an explanation of the code:
- `c_names`: A list of strings that defines the customized column names.
- `pd.read_csv()`: A pandas function that reads a CSV file and returns a DataFrame.
- `'../data/P_df_product_full_UTF-8_noh.csv'`: The file path of the CSV file to be read.
- `names=c_names`: Specifies the customized column names using the `c_names` list.
- `dtype={'category_major_cd':str, 'category_medium_cd':str, 'category_small_cd':str}`: Specifies that the data type of the columns `category_major_cd`, `category_medium_cd`, and `category_small_cd` should be `str`.
- `encoding='UTF-8'`: Specifies that the file encoding is 'UTF-8'.
- `header=None`: Specifies that there is no header row in the CSV file.
Finally, the resulting DataFrame is displayed using the `head()` method to show the first three rows.
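Both examples should yield exactly the same DataFrame. If you want to verify that explicitly, the optional sketch below loads the file both ways under the hypothetical names `df_a` and `df_b` and compares them; it assumes the same '../data/P_df_product_full_UTF-8_noh.csv' path as above.
# Minimal sketch: the two loading styles in P-098 produce identical DataFrames.
import pandas as pd

c_names = ['product_cd', 'category_major_cd', 'category_medium_cd',
           'category_small_cd', 'unit_price', 'unit_cost',
           'category_major_name', 'category_medium_name', 'category_small_name']

# Variant 1: read without a header, then assign column names afterwards.
df_a = pd.read_csv('../data/P_df_product_full_UTF-8_noh.csv',
                   dtype={1: str, 2: str, 3: str},
                   encoding='UTF-8', header=None)
df_a.columns = c_names

# Variant 2: pass the column names up front via `names`.
df_b = pd.read_csv('../data/P_df_product_full_UTF-8_noh.csv',
                   names=c_names,
                   dtype={'category_major_cd': str,
                          'category_medium_cd': str,
                          'category_small_cd': str},
                   encoding='UTF-8', header=None)

print(df_a.equals(df_b))  # expected: True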
P-099: Output the product data with category names created in 093 to a file with the following specifications.
| File format | With/without header | Character encoding |
|:--:|:--:|:--:|
| TSV (Tab Separated) | With | UTF-8 |
The path of the file output destination shall be as follows
| Output |
|:--:|
| ./data |
# Note that the answer file is stored in a different location from the question file, so the path is '../data'.
df_product_full.to_csv('../data/P_df_product_full_UTF-8_header.tsv',
                       sep='\t', encoding='UTF-8', index=False)
Commentary :
The code writes the contents of a pandas DataFrame called `df_product_full` to a TSV (Tab-Separated Values) file located at `'../data/P_df_product_full_UTF-8_header.tsv'`.
The `.to_csv()` method is called on the `df_product_full` DataFrame; it is the method pandas provides for exporting data to a CSV (Comma-Separated Values) or TSV file.
The first argument, `'../data/P_df_product_full_UTF-8_header.tsv'`, is the file path and name where the DataFrame should be saved. The file will be saved in the `../data/` directory with the name `P_df_product_full_UTF-8_header.tsv`.
The second argument, `sep='\t'`, specifies that the values in the file should be separated by tabs instead of commas.
The third argument, `encoding='UTF-8'`, specifies that the encoding of the file should be UTF-8, which is a standard encoding for Unicode characters.
The fourth argument, `index=False`, specifies that the DataFrame's index should not be included in the output file.
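As an optional check (not part of the original answer), you could read back the first line of the exported file and confirm that it is tab-delimited and starts with the header. The path assumes the '../data/' output location used above.
# Minimal sketch: confirm the output file is tab-separated and keeps its header.
with open('../data/P_df_product_full_UTF-8_header.tsv', encoding='UTF-8') as f:
    header_line = f.readline().rstrip('\n')

print('\t' in header_line)          # expected: True (tab-delimited)
print(header_line.split('\t')[:3])  # expected: the first three column names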
P-100: Load the file in the following format created in 099, display three items of data and check that they have been imported correctly.
| File format | With/without header | Character encoding |
|:--:|:--:|:--:|
| TSV (Tab Separated) | With | UTF-8 |
# Code example 1 (read_table).
# Note that the answer file is stored in a different location from the question file, so the path is '../data'.
df_product_full = pd.read_table('../data/P_df_product_full_UTF-8_header.tsv',
dtype={'category_major_cd':str,
'category_medium_cd':str,
'category_small_cd':str},
encoding='UTF-8')
df_product_full.head(3)
| | product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
| 1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
| 2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code reads data from a TSV (Tab-Separated Values) file located at `'../data/P_df_product_full_UTF-8_header.tsv'` into a pandas DataFrame called `df_product_full`.
The `pd.read_table()` function is called to read the data from the file. The first argument, `'../data/P_df_product_full_UTF-8_header.tsv'`, specifies the path and name of the file to read. Because `read_table` uses a tab as its default separator, no `sep` argument is needed.
The second argument, `dtype={'category_major_cd':str, 'category_medium_cd':str, 'category_small_cd':str}`, specifies the data types for certain columns: `category_major_cd`, `category_medium_cd`, and `category_small_cd` are read as strings (`str`).
The third argument, `encoding='UTF-8'`, specifies that the encoding of the file is UTF-8.
The resulting DataFrame, `df_product_full`, is then displayed with the `head(3)` method, which shows its first three rows.
# Code example 2 (read_csv)
df_product_full = pd.read_csv('../data/P_df_product_full_UTF-8_header.tsv',
dtype={'category_major_cd':str,
'category_medium_cd':str,
'category_small_cd':str},
sep='\t', encoding='UTF-8')
df_product_full.head(3)
| | product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost | category_major_name | category_medium_name | category_small_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | P040101001 | 04 | 0401 | 040101 | 198.0 | 149.0 | 惣菜 | 御飯類 | 弁当類 |
| 1 | P040101002 | 04 | 0401 | 040101 | 218.0 | 164.0 | 惣菜 | 御飯類 | 弁当類 |
| 2 | P040101003 | 04 | 0401 | 040101 | 230.0 | 173.0 | 惣菜 | 御飯類 | 弁当類 |
Commentary :
The code reads data from the same TSV (Tab-Separated Values) file, `'../data/P_df_product_full_UTF-8_header.tsv'`, into a pandas DataFrame called `df_product_full`, this time using `pd.read_csv()`.
The first argument, `'../data/P_df_product_full_UTF-8_header.tsv'`, specifies the path and name of the file to read.
The second argument, `dtype={'category_major_cd':str, 'category_medium_cd':str, 'category_small_cd':str}`, specifies the data types for certain columns: `category_major_cd`, `category_medium_cd`, and `category_small_cd` are read as strings (`str`).
The third argument, `sep='\t'`, specifies that the values in the file are separated by tabs instead of commas.
The fourth argument, `encoding='UTF-8'`, specifies that the encoding of the file is UTF-8.
The resulting DataFrame, `df_product_full`, is then displayed with the `head(3)` method, which shows its first three rows.
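The two examples are interchangeable: `pd.read_table` is essentially `pd.read_csv` with a tab separator by default. The optional sketch below loads the file both ways under the hypothetical names `df_tbl` and `df_csv` and checks that the results match.
# Minimal sketch: read_table and read_csv with sep='\t' load the same data.
import pandas as pd

dtype_spec = {'category_major_cd': str,
              'category_medium_cd': str,
              'category_small_cd': str}

df_tbl = pd.read_table('../data/P_df_product_full_UTF-8_header.tsv',
                       dtype=dtype_spec, encoding='UTF-8')
df_csv = pd.read_csv('../data/P_df_product_full_UTF-8_header.tsv',
                     dtype=dtype_spec, sep='\t', encoding='UTF-8')

print(df_tbl.equals(df_csv))  # expected: True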
This is the end of the 100 knocks. Well done!