参照(Reference) : 「データサイエンティスト協会スキル定義委員」の「データサイエンス100本ノック（構造化データ加工編）」

The Data Scientist Society Github :

GitHub - The-Japan-DataScientist-Society/100knocks-preprocess: データサイエンス100本ノック（構造化データ加工編）

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Data Science 100 Knocks (Structured Data Processing) URL :

100knocks-preprocess/docker/work/answer/ans_preprocess_knock_Python.ipynb at master · The-Japan-DataScientist-Society/100knocks-preprocess

データサイエンス100本ノック（構造化データ加工編）. Contribute to The-Japan-DataScientist-Society/100knocks-preprocess development by creating a...

Note: This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL.
The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers.
Since this data was created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

You can get preprocess_knock files to practice python, SQL, R from my github account

ChanhiYasutomi - Repositories

ChanhiYasutomi has 20 repositories available. Follow their code on GitHub.

Getting Started

First, run the following cells
Import the necessary libraries and read data from the database (PostgreSQL)
The following cell imports libraries that are expected to be used, such as pandas.
If there are other libraries you wish to use, install them as needed (you can also install them by “!pip install library name”).
The process can be divided into multiple times.
Names, addresses, etc. are dummy data and are not real.

In [ ]:

import os
import pandas as pd
import numpy as np
from datetime import datetime, date
from dateutil.relativedelta import relativedelta
import math
import psycopg2
from sqlalchemy import create_engine
from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from imblearn.under_sampling import RandomUnderSampler


if 'PG_PORT' in os.environ:
    pgconfig = {
        'host': 'db',
        'port': os.environ['PG_PORT'],
        'database': os.environ['PG_DATABASE'],
        'user': os.environ['PG_USER'],
        'password': os.environ['PG_PASSWORD'],
    }


    # Connector for pd.read_sql
    conn = psycopg2.connect(**pgconfig)

    df_customer = pd.read_sql(sql='select * from customer', con=conn)
    df_category = pd.read_sql(sql='select * from category', con=conn)
    df_product = pd.read_sql(sql='select * from product', con=conn)
    df_receipt = pd.read_sql(sql='select * from receipt', con=conn)
    df_store = pd.read_sql(sql='select * from store', con=conn)
    df_geocode = pd.read_sql(sql='select * from geocode', con=conn)

else:
    if not os.path.exists('data/'):
        !git clone https://github.com/The-Japan-DataScientist-Society/100knocks-preprocess
        os.chdir('100knocks-preprocess/docker/work/')

    dtype = {
        'customer_id': str,
        'gender_cd': str,
        'postal_cd': str,
        'application_store_cd': str,
        'status_cd': str,
        'category_major_cd': str,
        'category_medium_cd': str,
        'category_small_cd': str,
        'product_cd': str,
        'store_cd': str,
        'prefecture_cd': str,
        'tel_no': str,
        'postal_cd': str,
        'street': str
    }

    df_customer = pd.read_csv("data/customer.csv", dtype=dtype)
    df_category = pd.read_csv("data/category.csv", dtype=dtype)
    df_product = pd.read_csv("data/product.csv", dtype=dtype)
    df_receipt = pd.read_csv("data/receipt.csv", dtype=dtype)
    df_store = pd.read_csv("data/store.csv", dtype=dtype)
    df_geocode = pd.read_csv("data/geocode.csv", dtype=dtype)

fatal: destination path '100knocks-preprocess' already exists and is not an empty directory.

Exercise

P-001: Display the first 10 items of all items from the receipt details data (df_receipt) and visually check what data is held.

In [ ]:

df_receipt.head(10)

Out[ ]:

	sales_ymd	sales_epoch	store_cd	receipt_no	receipt_sub_no	customer_id	product_cd	quantity	amount
0	20181103	1541203200	S14006	112	1	CS006214000001	P070305012	1	158
1	20181118	1542499200	S13008	1132	2	CS008415000097	P070701017	1	81
2	20170712	1499817600	S14028	1102	1	CS028414000014	P060101005	1	170
3	20190205	1549324800	S14042	1132	1	ZZ000000000000	P050301001	1	25
4	20180821	1534809600	S14025	1102	2	CS025415000050	P060102007	1	90
5	20190605	1559692800	S13003	1112	1	CS003515000195	P050102002	1	138
6	20181205	1543968000	S14024	1102	2	CS024514000042	P080101005	1	30
7	20190922	1569110400	S14040	1102	1	CS040415000178	P070501004	1	128
8	20170504	1493856000	S13020	1112	2	ZZ000000000000	P071302010	1	770
9	20191010	1570665600	S14027	1102	1	CS027514000015	P071101003	1	680

Commentary : 

The code df_receipt.head(10) is used to display the first 10 rows of a Pandas DataFrame called df_receipt.

In Pandas, a DataFrame is a two-dimensional labeled data structure with columns of potentially different data types. It is similar to a spreadsheet or SQL table.

The .head() method in Pandas is used to display the first n rows of a DataFrame. By default, it displays the first 5 rows. In this case, .head(10) is used to display the first 10 rows of the DataFrame.

P-002: Display 10 items from the receipt details data (df_receipt), specifying the columns in the order of sales_date (sales_ymd), customer_id (customer_id), product_code (product_cd) and sales amount (amount).

In [ ]:

df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']].head(10)

Out[ ]:

	sales_ymd	customer_id	product_cd	amount
0	20181103	CS006214000001	P070305012	158
1	20181118	CS008415000097	P070701017	81
2	20170712	CS028414000014	P060101005	170
3	20190205	ZZ000000000000	P050301001	25
4	20180821	CS025415000050	P060102007	90
5	20190605	CS003515000195	P050102002	138
6	20181205	CS024514000042	P080101005	30
7	20190922	CS040415000178	P070501004	128
8	20170504	ZZ000000000000	P071302010	770
9	20191010	CS027514000015	P071101003	680

Commentary : 

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']].head(10) is used to display the first 10 rows of a subset of columns from a Pandas DataFrame called df_receipt.

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] is used to select a subset of columns from the DataFrame df_receipt. Specifically, it selects the columns named sales_ymd, customer_id, product_cd, and amount.

The resulting DataFrame contains only these selected columns. The .head(10) method is then used to display the first 10 rows of this subset DataFrame.

P-003: From the receipt details data (df_receipt), specify the columns in order of sales_date (sales_ymd), customer_id (customer_id), product_code (product_cd) and sales_amount (amount) and display 10 items. However, extract while renaming the items from sales_ymd to sales_date.

In [ ]:

df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']]. \
                        rename(columns={'sales_ymd': 'sales_date'}).head(10)

Out[ ]:

	sales_date	customer_id	product_cd	amount
0	20181103	CS006214000001	P070305012	158
1	20181118	CS008415000097	P070701017	81
2	20170712	CS028414000014	P060101005	170
3	20190205	ZZ000000000000	P050301001	25
4	20180821	CS025415000050	P060102007	90
5	20190605	CS003515000195	P050102002	138
6	20181205	CS024514000042	P080101005	30
7	20190922	CS040415000178	P070501004	128
8	20170504	ZZ000000000000	P071302010	770
9	20191010	CS027514000015	P071101003	680

Commentary : 

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']].rename(columns={'sales_ymd': 'sales_date'}).head(10) is used to select a subset of columns from a Pandas DataFrame called df_receipt, rename one of the columns, and then display the first 10 rows of the resulting DataFrame.

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] is used to select a subset of columns from the DataFrame df_receipt. Specifically, it selects the columns named sales_ymd, customer_id, product_cd, and amount.

The .rename() method is then used to rename the sales_ymd column to sales_date. This method returns a new DataFrame with the renamed column. The argument columns={'sales_ymd': 'sales_date'} specifies a dictionary that maps the old column name to the new column name.

Finally, the .head(10) method is used to display the first 10 rows of the resulting DataFrame.

P-004: From receipt details data (df_receipt), specify columns in the order of sales_date (sales_ymd), customer ID (customer_id), product code (product_cd) and sales amount (amount) and extract data satisfying the following conditions.

Customer ID (customer_id) is “CS018205000001”

In [ ]:

# Code example 1 (when using query)
df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']]. \
                                query('customer_id == "CS018205000001"')

Out[ ]:

	sales_ymd	customer_id	product_cd	amount
36	20180911	CS018205000001	P071401012	2200
9843	20180414	CS018205000001	P060104007	600
21110	20170614	CS018205000001	P050206001	990
27673	20170614	CS018205000001	P060702015	108
27840	20190216	CS018205000001	P071005024	102
28757	20180414	CS018205000001	P071101002	278
39256	20190226	CS018205000001	P070902035	168
58121	20190924	CS018205000001	P060805001	495
68117	20190226	CS018205000001	P071401020	2200
72254	20180911	CS018205000001	P071401005	1100
88508	20190216	CS018205000001	P040101002	218
91525	20190924	CS018205000001	P091503001	280

Commentary : 

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']].query('customer_id == "CS018205000001"') is used to select a subset of rows from a Pandas DataFrame called df_receipt, where the customer_id column is equal to a specific value (CS018205000001).

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] is used to select a subset of columns from the DataFrame df_receipt. Specifically, it selects the columns named sales_ymd, customer_id, product_cd, and amount.

The .query() method is then used to filter the rows of the DataFrame based on a specific condition. In this case, the condition is that the customer_id column must be equal to the value "CS018205000001". The backslash (\) at the end of the first line indicates that the code is continued on the next line for better readability.

The resulting DataFrame contains only the rows where the customer_id column matches the specified value.

In [ ]:

# Code example 1 (without query)
df = df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']]
df[df['customer_id'] == 'CS018205000001']

Out[ ]:

	sales_ymd	customer_id	product_cd	amount
36	20180911	CS018205000001	P071401012	2200
9843	20180414	CS018205000001	P060104007	600
21110	20170614	CS018205000001	P050206001	990
27673	20170614	CS018205000001	P060702015	108
27840	20190216	CS018205000001	P071005024	102
28757	20180414	CS018205000001	P071101002	278
39256	20190226	CS018205000001	P070902035	168
58121	20190924	CS018205000001	P060805001	495
68117	20190226	CS018205000001	P071401020	2200
72254	20180911	CS018205000001	P071401005	1100
88508	20190216	CS018205000001	P040101002	218
91525	20190924	CS018205000001	P091503001	280

Commentary : 

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']].query('customer_id == "CS018205000001"') is used to select a subset of rows from a Pandas DataFrame called df_receipt, where the customer_id column is equal to a specific value (CS018205000001).

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] is used to select a subset of columns from the DataFrame df_receipt. Specifically, it selects the columns named sales_ymd, customer_id, product_cd, and amount.

The .query() method is then used to filter the rows of the DataFrame based on a specific condition. In this case, the condition is that the customer_id column must be equal to the value "CS018205000001". The backslash (\) at the end of the first line indicates that the code is continued on the next line for better readability.

The resulting DataFrame contains only the rows where the customer_id column matches the specified value.

P-005: Extract data from receipt details data (df_receipt), specifying columns in the order of sales_date (sales_ymd), customer ID (customer_id), product code (product_cd) and sales amount (amount) and satisfying all the following conditions.

Customer ID (customer_id) is “CS018205000001”.

Sales amount (amount) is 1,000 or more.

In [ ]:

df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] \
            .query('customer_id == "CS018205000001" & amount >= 1000')

Out[ ]:

	sales_ymd	customer_id	product_cd	amount
36	20180911	CS018205000001	P071401012	2200
68117	20190226	CS018205000001	P071401020	2200
72254	20180911	CS018205000001	P071401005	1100

Commentary :
 
The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] \ .query('customer_id == "CS018205000001" & amount >= 1000') is used to select a subset of rows from a Pandas DataFrame called df_receipt, where the customer_id column is equal to a specific value (CS018205000001) and the amount column is greater than or equal to 1000.

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] is used to select a subset of columns from the DataFrame df_receipt. Specifically, it selects the columns named sales_ymd, customer_id, product_cd, and amount.

The .query() method is then used to filter the rows of the DataFrame based on multiple conditions. The & symbol is used to indicate a logical AND operation between the two conditions. The first condition is that the customer_id column must be equal to the value "CS018205000001". The second condition is that the amount column must be greater than or equal to 1000.

The resulting DataFrame contains only the rows where both conditions are satisfied.

P-006: Extract data from receipt details data (df_receipt), specifying the columns in the order of sales_date (sales_ymd), customer ID (customer_id), product code (product_cd), sales quantity (quantity) and sales amount (amount) and satisfying all the following conditions.

Customer ID (customer_id) is “CS018205000001”.

Sales amount (amount) is 1,000 or more or sales quantity (quantity) is 5 or more.

In [ ]:

df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'quantity', 'amount']].\
    query('customer_id == "CS018205000001" & (amount >= 1000 | quantity >=5)')

Out[ ]:

	sales_ymd	customer_id	product_cd	quantity	amount
36	20180911	CS018205000001	P071401012	1	2200
9843	20180414	CS018205000001	P060104007	6	600
21110	20170614	CS018205000001	P050206001	5	990
68117	20190226	CS018205000001	P071401020	1	2200
72254	20180911	CS018205000001	P071401005	1	1100

Commentary : 

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'quantity', 'amount']].query('customer_id == "CS018205000001" & (amount >= 1000 | quantity >=5)') is used to select a subset of rows from a Pandas DataFrame called df_receipt, where the customer_id column is equal to a specific value (CS018205000001) and either the amount column is greater than or equal to 1000 or the quantity column is greater than or equal to 5.

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'quantity', 'amount']] is used to select a subset of columns from the DataFrame df_receipt. Specifically, it selects the columns named sales_ymd, customer_id, product_cd, quantity, and amount.

The .query() method is then used to filter the rows of the DataFrame based on multiple conditions. The & symbol is used to indicate a logical AND operation between the two conditions. The first condition is that the customer_id column must be equal to the value "CS018205000001". The second condition is enclosed in parentheses and contains two sub-conditions separated by the | symbol, which indicates a logical OR operation. The first sub-condition is that the amount column must be greater than or equal to 1000. The second sub-condition is that the quantity column must be greater than or equal to 5.

The resulting DataFrame contains only the rows where both the customer_id is equal to the specified value and either the amount is greater than or equal to 1000 or the quantity is greater than or equal to 5.

P-007: From the receipt details data (df_receipt), specify the columns in order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd) and sales amount (amount) and extract data that satisfy all the following conditions.

Customer ID (customer_id) is “CS018205000001”.

Sales amount (amount) is between 1,000 and 2,000.

In [ ]:

df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] \
    .query('customer_id == "CS018205000001" & 1000 <= amount <= 2000')

Out[ ]:

	sales_ymd	customer_id	product_cd	amount
72254	20180911	CS018205000001	P071401005	1100

Commentary : 

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] \ .query('customer_id == "CS018205000001" & 1000 <= amount <= 2000') is used to select a subset of rows from a Pandas DataFrame called df_receipt, where the customer_id column is equal to a specific value (CS018205000001) and the amount column is between 1000 and 2000 (inclusive).

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] is used to select a subset of columns from the DataFrame df_receipt. Specifically, it selects the columns named sales_ymd, customer_id, product_cd, and amount.

The .query() method is then used to filter the rows of the DataFrame based on multiple conditions. The & symbol is used to indicate a logical AND operation between the two conditions. The first condition is that the customer_id column must be equal to the value "CS018205000001". The second condition specifies a range of values for the amount column, using the <= and >= comparison operators.

The resulting DataFrame contains only the rows where both the customer_id is equal to the specified value and the amount is between 1000 and 2000 (inclusive).

P-008: From the receipt details data (df_receipt), specify the columns in order of sales date (sales_ymd), customer ID (customer_id), product code (product_cd) and sales amount (amount), and extract data satisfying all the following conditions.

Customer ID (customer_id) is “CS018205000001”.

Product code (product_cd) is other than “P071401019”.

In [ ]:

df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] \
    .query('customer_id == "CS018205000001" & product_cd != "P071401019"')

Out[ ]:

	sales_ymd	customer_id	product_cd	amount
36	20180911	CS018205000001	P071401012	2200
9843	20180414	CS018205000001	P060104007	600
21110	20170614	CS018205000001	P050206001	990
27673	20170614	CS018205000001	P060702015	108
27840	20190216	CS018205000001	P071005024	102
28757	20180414	CS018205000001	P071101002	278
39256	20190226	CS018205000001	P070902035	168
58121	20190924	CS018205000001	P060805001	495
68117	20190226	CS018205000001	P071401020	2200
72254	20180911	CS018205000001	P071401005	1100
88508	20190216	CS018205000001	P040101002	218
91525	20190924	CS018205000001	P091503001	280

Commentary : 

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] \ .query('customer_id == "CS018205000001" & product_cd != "P071401019"') is used to select a subset of rows from a Pandas DataFrame called df_receipt, where the customer_id column is equal to a specific value (CS018205000001) and the product_cd column is not equal to a specific value (P071401019).

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] is used to select a subset of columns from the DataFrame df_receipt. Specifically, it selects the columns named sales_ymd, customer_id, product_cd, and amount.

The .query() method is then used to filter the rows of the DataFrame based on multiple conditions. The & symbol is used to indicate a logical AND operation between the two conditions. The first condition is that the customer_id column must be equal to the value "CS018205000001". The second condition is that the product_cd column must not be equal to the value "P071401019", which is achieved using the != comparison operator.

The resulting DataFrame contains only the rows where both the customer_id is equal to the specified value and the product_cd is not equal to the specified value.

P-009: In the following process, rewrite OR to AND without changing the output result.

df_store.query('not(prefecture_cd == "13" | floor_area > 900)')

In [ ]:

df_store.query('prefecture_cd != "13" & floor_area <= 900')

Out[ ]:

	store_cd	store_name	prefecture_cd	prefecture	address	address_kana	tel_no	longitude	latitude	floor_area
18	S14046	北山田店	14	神奈川県	神奈川県横浜市都筑区北山田一丁目	カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ	045-123-4049	139.5916	35.56189	831.0
20	S14011	日吉本町店	14	神奈川県	神奈川県横浜市港北区日吉本町四丁目	カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ	045-123-4033	139.6316	35.54655	890.0
38	S12013	習志野店	12	千葉県	千葉県習志野市芝園一丁目	チバケンナラシノシシバゾノイッチョウメ	047-123-4002	140.0220	35.66122	808.0

Commentary : 

The code df_store.query('prefecture_cd != "13" & floor_area <= 900') is used to filter rows from a Pandas DataFrame called df_store, where the prefecture_cd column is not equal to a specific value (13) AND the floor_area column is less than or equal to a specific value (900).

The .query() method is used to filter the rows of the DataFrame based on a Boolean expression. In this case, the Boolean expression is 'prefecture_cd != "13" & floor_area <= 900'.

The != operator is used to check if each value in the prefecture_cd column is not equal to the value "13". The & operator is used to indicate a logical AND operation between this condition and the second condition, which checks if the values in the floor_area column are less than or equal to 900.

The resulting DataFrame contains only the rows where both conditions are True.

P-010: From the shop data (df_store), extract all items whose shop code (store_cd) starts with “S14” and display 10 items.

In [ ]:

df_store.query("store_cd.str.startswith('S14')", engine='python').head(10)

Out[ ]:

	store_cd	store_name	prefecture_cd	prefecture	address	address_kana	tel_no	longitude	latitude	floor_area
2	S14010	菊名店	14	神奈川県	神奈川県横浜市港北区菊名一丁目	カナガワケンヨコハマシコウホククキクナイッチョウメ	045-123-4032	139.6326	35.50049	1732.0
3	S14033	阿久和店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4043	139.4961	35.45918	1495.0
4	S14036	相模原中央店	14	神奈川県	神奈川県相模原市中央二丁目	カナガワケンサガミハラシチュウオウニチョウメ	042-123-4045	139.3716	35.57327	1679.0
7	S14040	長津田店	14	神奈川県	神奈川県横浜市緑区長津田みなみ台五丁目	カナガワケンヨコハマシミドリクナガツタミナミダイゴチョウメ	045-123-4046	139.4994	35.52398	1548.0
9	S14050	阿久和西店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4053	139.4961	35.45918	1830.0
12	S14028	二ツ橋店	14	神奈川県	神奈川県横浜市瀬谷区二ツ橋町	カナガワケンヨコハマシセヤクフタツバシチョウ	045-123-4042	139.4963	35.46304	1574.0
16	S14012	本牧和田店	14	神奈川県	神奈川県横浜市中区本牧和田	カナガワケンヨコハマシナカクホンモクワダ	045-123-4034	139.6582	35.42156	1341.0
18	S14046	北山田店	14	神奈川県	神奈川県横浜市都筑区北山田一丁目	カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ	045-123-4049	139.5916	35.56189	831.0
19	S14022	逗子店	14	神奈川県	神奈川県逗子市逗子一丁目	カナガワケンズシシズシイッチョウメ	046-123-4036	139.5789	35.29642	1838.0
20	S14011	日吉本町店	14	神奈川県	神奈川県横浜市港北区日吉本町四丁目	カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ	045-123-4033	139.6316	35.54655	890.0

Commentary : 

The code df_store.query("store_cd.str.startswith('S14')", engine='python').head(10) is used to filter rows from a Pandas DataFrame called df_store, where the store_cd column starts with a specific string (S14).

The .query() method is used to filter the rows of the DataFrame based on a Boolean expression. In this case, the Boolean expression is "store_cd.str.startswith('S14')". This expression is written in Python syntax because the engine parameter is set to 'python'. The expression checks whether the store_cd column starts with the string 'S14'.

The str.startswith() method is a string method in Python that returns True if the string starts with the specified prefix. In this case, the str.startswith('S14') method is applied to each element in the store_cd column.

The resulting DataFrame contains only the rows where the store_cd column starts with the string 'S14'.

The .head(10) method is used to return only the first 10 rows of the resulting DataFrame.

P-011: From the customer data (df_customer), extract all items whose customer ID (customer_id) ends with 1, and display 10 items.

In [ ]:

df_customer.query("customer_id.str.endswith('1')", engine='python').head(10)

Out[ ]:

	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
1	CS037613000071	六角雅彦	9	不明	1952-04-01	66	136-0076	東京都江東区南砂**********	S13037	20150414	0-00000000-0
3	CS028811000001	堀井かおり	1	女性	1933-03-27	86	245-0016	神奈川県横浜市泉区和泉町**********	S14028	20160115	0-00000000-0
14	CS040412000191	川井郁恵	1	女性	1977-01-05	42	226-0021	神奈川県横浜市緑区北八朔町**********	S14040	20151101	1-20091025-4
31	CS028314000011	小菅あおい	1	女性	1983-11-26	35	246-0038	神奈川県横浜市瀬谷区宮沢**********	S14028	20151123	1-20080426-5
56	CS039212000051	藤島恵梨香	1	女性	1997-02-03	22	166-0001	東京都杉並区阿佐谷北**********	S13039	20171121	1-20100215-4
59	CS015412000111	松居奈月	1	女性	1972-10-04	46	136-0071	東京都江東区亀戸**********	S13015	20150629	0-00000000-0
63	CS004702000041	野島洋	0	男性	1943-08-24	75	176-0022	東京都練馬区向山**********	S13004	20170218	0-00000000-0
74	CS041515000001	栗田千夏	1	女性	1967-01-02	52	206-0001	東京都多摩市和田**********	S13041	20160422	E-20100803-F
85	CS029313000221	北条ひかり	1	女性	1987-06-19	31	279-0011	千葉県浦安市美浜**********	S12029	20180810	0-00000000-0
102	CS034312000071	望月奈央	1	女性	1980-09-20	38	213-0026	神奈川県川崎市高津区久末**********	S14034	20160106	0-00000000-0

Commentary : 

This code is selecting a subset of data from a Pandas DataFrame named df_customer. The subset is filtered based on a condition specified in the query() method.

The condition is defined by the string "customer_id.str.endswith('1')", which is passed as an argument to the query() method. This condition filters the rows of the df_customer DataFrame to only include those where the customer_id column ends with the character '1'.

The engine='python' parameter is optional and specifies that the Python engine should be used to evaluate the condition. This is necessary when the condition contains string operations, as the default engine for query() method (pandas.eval) may not support all string operations.

Finally, the head(10) method is used to return the first 10 rows of the filtered DataFrame. This is optional and is used here to limit the output to a manageable size.

P-012: From shop data (df_store), display all items with “横浜市(Yokohama-city)” in the address.

In [ ]:

df_store.query("address.str.contains('横浜市')", engine='python')

Out[ ]:

	store_cd	store_name	prefecture_cd	prefecture	address	address_kana	tel_no	longitude	latitude	floor_area
2	S14010	菊名店	14	神奈川県	神奈川県横浜市港北区菊名一丁目	カナガワケンヨコハマシコウホククキクナイッチョウメ	045-123-4032	139.6326	35.50049	1732.0
3	S14033	阿久和店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4043	139.4961	35.45918	1495.0
7	S14040	長津田店	14	神奈川県	神奈川県横浜市緑区長津田みなみ台五丁目	カナガワケンヨコハマシミドリクナガツタミナミダイゴチョウメ	045-123-4046	139.4994	35.52398	1548.0
9	S14050	阿久和西店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4053	139.4961	35.45918	1830.0
12	S14028	二ツ橋店	14	神奈川県	神奈川県横浜市瀬谷区二ツ橋町	カナガワケンヨコハマシセヤクフタツバシチョウ	045-123-4042	139.4963	35.46304	1574.0
16	S14012	本牧和田店	14	神奈川県	神奈川県横浜市中区本牧和田	カナガワケンヨコハマシナカクホンモクワダ	045-123-4034	139.6582	35.42156	1341.0
18	S14046	北山田店	14	神奈川県	神奈川県横浜市都筑区北山田一丁目	カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ	045-123-4049	139.5916	35.56189	831.0
20	S14011	日吉本町店	14	神奈川県	神奈川県横浜市港北区日吉本町四丁目	カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ	045-123-4033	139.6316	35.54655	890.0
26	S14048	中川中央店	14	神奈川県	神奈川県横浜市都筑区中川中央二丁目	カナガワケンヨコハマシツヅキクナカガワチュウオウニチョウメ	045-123-4051	139.5758	35.54912	1657.0
40	S14042	新山下店	14	神奈川県	神奈川県横浜市中区新山下二丁目	カナガワケンヨコハマシナカクシンヤマシタニチョウメ	045-123-4047	139.6593	35.43894	1044.0
52	S14006	葛が谷店	14	神奈川県	神奈川県横浜市都筑区葛が谷	カナガワケンヨコハマシツヅキククズガヤ	045-123-4031	139.5633	35.53573	1886.0

Commentary : 

Specifically, the condition is defined by the string "address.str.contains('横浜市')", which is passed as an argument to query(). This string is a boolean expression that checks if the address column of df_store contains the substring '横浜市'.

The str.contains() method is a Pandas string method that returns a boolean array indicating whether each element of a given string column contains a specified substring. In this case, it is applied to the address column of df_store.

The engine='python' argument specifies that the filtering operation should be performed using Python's built-in eval() function, rather than the default numexpr engine. This is necessary when using certain types of string operations in the query expression, such as str.contains(), which are not supported by the numexpr engine.

P-013: From the customer data (df_customer), extract all data whose status code (status_cd) begins with the letters A to F and display 10 items.

In [ ]:

df_customer.query("status_cd.str.contains(r'^[A-F]')", 
                  engine='python').head(10)

Out[ ]:

	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
2	CS031415000172	宇多田貴美子	1	女性	1976-10-04	42	151-0053	東京都渋谷区代々木**********	S13031	20150529	D-20100325-C
6	CS015414000103	奥野陽子	1	女性	1977-08-09	41	136-0073	東京都江東区北砂**********	S13015	20150722	B-20100609-B
12	CS011215000048	芦田沙耶	1	女性	1992-02-01	27	223-0062	神奈川県横浜市港北区日吉本町**********	S14011	20150228	C-20100421-9
15	CS029415000023	梅田里穂	1	女性	1976-01-17	43	279-0043	千葉県浦安市富士見**********	S12029	20150610	D-20100918-E
21	CS035415000029	寺沢真希	9	不明	1977-09-27	41	158-0096	東京都世田谷区玉川台**********	S13035	20141220	F-20101029-F
32	CS031415000106	宇野由美子	1	女性	1970-02-26	49	151-0053	東京都渋谷区代々木**********	S13031	20150201	F-20100511-E
33	CS029215000025	石倉美帆	1	女性	1993-09-28	25	279-0022	千葉県浦安市今川**********	S12029	20150708	B-20100820-C
40	CS033605000005	猪股雄太	0	男性	1955-12-05	63	246-0031	神奈川県横浜市瀬谷区瀬谷**********	S14033	20150425	F-20100917-E
44	CS033415000229	板垣菜々美	1	女性	1977-11-07	41	246-0021	神奈川県横浜市瀬谷区二ツ橋町**********	S14033	20150712	F-20100326-E
53	CS008415000145	黒谷麻緒	1	女性	1977-06-27	41	157-0067	東京都世田谷区喜多見**********	S13008	20150829	F-20100622-F

Commentary : 

df_customer is a DataFrame object.

.query() is a method provided by Pandas for filtering rows based on a specific condition. The condition is specified as a string argument.

The condition in this case is "status_cd.str.contains(r'^[A-F]')". This uses the str.contains() method to check whether the status_cd column of the DataFrame contains any strings that start with the letters A through F (inclusive). The r before the regular expression pattern '^[A-F]' indicates a raw string literal, which ensures that the backslash character (\) is interpreted literally and not as an escape character.

The engine='python' argument specifies that the filtering should be done using Python's built-in regular expression engine, rather than the default engine used by Pandas. This can be useful in cases where the default engine doesn't support certain regex features or is too slow.

The .head(10) method is used to return only the first 10 rows of the filtered DataFrame.

So, overall, this code is filtering the df_customer DataFrame to only include rows where the status_cd column starts with a letter from A to F, and then returning the first 10 rows of the resulting DataFrame.

P-014: From the customer data (df_customer), extract all data items whose status code (status_cd) ends with the numbers 1 to 9, and display 10 items.

In [ ]:

# A regex option can also be added (if False, it is treated as a string as is, not as a regular expression).
df_customer.query("status_cd.str.contains(r'[1-9]$', regex=True)", 
                  engine='python').head(10)

Out[ ]:

	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
4	CS001215000145	田崎美紀	1	女性	1995-03-29	24	144-0055	東京都大田区仲六郷**********	S13001	20170605	6-20090929-2
9	CS033513000180	安斎遥	1	女性	1962-07-11	56	241-0823	神奈川県横浜市旭区善部町**********	S14033	20150728	6-20080506-5
12	CS011215000048	芦田沙耶	1	女性	1992-02-01	27	223-0062	神奈川県横浜市港北区日吉本町**********	S14011	20150228	C-20100421-9
14	CS040412000191	川井郁恵	1	女性	1977-01-05	42	226-0021	神奈川県横浜市緑区北八朔町**********	S14040	20151101	1-20091025-4
16	CS009315000023	皆川文世	1	女性	1980-04-15	38	154-0012	東京都世田谷区駒沢**********	S13009	20150319	5-20080322-1
22	CS015315000033	福士璃奈子	1	女性	1983-03-17	36	135-0043	東京都江東区塩浜**********	S13015	20141024	4-20080219-3
23	CS023513000066	神戸そら	1	女性	1961-12-17	57	210-0005	神奈川県川崎市川崎区東田町**********	S14023	20150915	5-20100524-9
24	CS035513000134	市川美帆	1	女性	1960-03-27	59	156-0053	東京都世田谷区桜**********	S13035	20150227	8-20100711-9
27	CS001515000263	高松夏空	1	女性	1962-11-09	56	144-0051	東京都大田区西蒲田**********	S13001	20160812	1-20100804-1
28	CS040314000027	鶴田きみまろ	9	不明	1986-03-26	33	226-0027	神奈川県横浜市緑区長津田**********	S14040	20150122	2-20080426-4

Commentary : 

df_customer is a DataFrame object.

.query() is a method provided by Pandas for filtering rows based on a specific condition. The condition is specified as a string argument.

The condition in this case is "status_cd.str.contains(r'[1-9]$', regex=True)". This uses the str.contains() method to check whether the status_cd column of the DataFrame contains any strings that end with a digit from 1 to 9 (inclusive). The $ character represents the end of the string, and the regex=True argument indicates that the pattern should be interpreted as a regular expression.

The engine='python' argument specifies that the filtering should be done using Python's built-in regular expression engine, rather than the default engine used by Pandas. This can be useful in cases where the default engine doesn't support certain regex features or is too slow.

The .head(10) method is used to return only the first 10 rows of the filtered DataFrame.So, overall, this code is filtering the df_customer DataFrame to only include rows where the status_cd column ends with a digit from 1 to 9, and then returning the first 10 rows of the resulting DataFrame.

P-015: From the customer data (df_customer), extract all data items whose status code (status_cd) begins with the letters A to F and ends with the numbers 1 to 9, and display 10 items.

In [ ]:

df_customer.query("status_cd.str.contains(r'^[A-F].*[1-9]$')", 
                  engine='python').head(10)

Out[ ]:

	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
12	CS011215000048	芦田沙耶	1	女性	1992-02-01	27	223-0062	神奈川県横浜市港北区日吉本町**********	S14011	20150228	C-20100421-9
68	CS022513000105	島村貴美子	1	女性	1962-03-12	57	249-0002	神奈川県逗子市山の根**********	S14022	20150320	A-20091115-7
71	CS001515000096	水野陽子	9	不明	1960-11-29	58	144-0053	東京都大田区蒲田本町**********	S13001	20150614	A-20100724-7
122	CS013615000053	西脇季衣	1	女性	1953-10-18	65	261-0026	千葉県千葉市美浜区幕張西**********	S12013	20150128	B-20100329-6
144	CS020412000161	小宮薫	1	女性	1974-05-21	44	174-0042	東京都板橋区東坂下**********	S13020	20150822	B-20081021-3
178	CS001215000097	竹中あさみ	1	女性	1990-07-25	28	146-0095	東京都大田区多摩川**********	S13001	20170315	A-20100211-2
252	CS035212000007	内村恵梨香	1	女性	1990-12-04	28	152-0023	東京都目黒区八雲**********	S13035	20151013	B-20101018-6
259	CS002515000386	野田コウ	1	女性	1963-05-30	55	185-0013	東京都国分寺市西恋ケ窪**********	S13002	20160410	C-20100127-8
293	CS001615000372	稲垣寿々花	1	女性	1956-10-29	62	144-0035	東京都大田区南蒲田**********	S13001	20170403	A-20100104-1
297	CS032512000121	松井知世	1	女性	1962-09-04	56	210-0011	神奈川県川崎市川崎区富士見**********	S13032	20150727	A-20100103-5

Commentary : 

This code uses the Pandas library in Python to filter rows from a DataFrame named df_customer. The filter is based on a condition specified in the query() method, which takes a string argument representing a Boolean expression.

status_cd: This is assumed to be a column in the df_customer DataFrame.

.: This is the dot notation used to access a method or attribute of an object in Python.

str: This refers to the string methods of the status_cd column.

contains(): This is a string method that checks whether a given substring or regular expression is present in each element of the column.

r'^[A-F].*[1-9]$': This is a regular expression pattern that matches strings that start with a letter A through F (inclusive) and end with a digit from 1 through 9 (inclusive). The r prefix before the pattern indicates that it is a raw string, which means that backslashes are treated as literal backslashes rather than escape characters.

The query() method is also specifying an engine argument with the value 'python'. This is because the default engine used by Pandas is numexpr, which can be faster for some types of queries but may not support all types of regular expressions. Using the 'python' engine ensures that the regular expression is handled by the Python regex module.

Finally, the .head(10) method call limits the output to the first 10 rows that match the filter condition.

P-016: From the shop data (df_store), display all items of data where the telephone number (tel_no) is 3 digits – 3 digits – 4 digits.

In [ ]:

df_store.query("tel_no.str.contains(r'^[0-9]{3}-[0-9]{3}-[0-9]{4}$')", 
               engine='python')

Out[ ]:

	store_cd	store_name	prefecture_cd	prefecture	address	address_kana	tel_no	longitude	latitude	floor_area
0	S12014	千草台店	12	千葉県	千葉県千葉市稲毛区千草台一丁目	チバケンチバシイナゲクチグサダイイッチョウメ	043-123-4003	140.1180	35.63559	1698.0
1	S13002	国分寺店	13	東京都	東京都国分寺市本多二丁目	トウキョウトコクブンジシホンダニチョウメ	042-123-4008	139.4802	35.70566	1735.0
2	S14010	菊名店	14	神奈川県	神奈川県横浜市港北区菊名一丁目	カナガワケンヨコハマシコウホククキクナイッチョウメ	045-123-4032	139.6326	35.50049	1732.0
3	S14033	阿久和店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4043	139.4961	35.45918	1495.0
4	S14036	相模原中央店	14	神奈川県	神奈川県相模原市中央二丁目	カナガワケンサガミハラシチュウオウニチョウメ	042-123-4045	139.3716	35.57327	1679.0
7	S14040	長津田店	14	神奈川県	神奈川県横浜市緑区長津田みなみ台五丁目	カナガワケンヨコハマシミドリクナガツタミナミダイゴチョウメ	045-123-4046	139.4994	35.52398	1548.0
9	S14050	阿久和西店	14	神奈川県	神奈川県横浜市瀬谷区阿久和西一丁目	カナガワケンヨコハマシセヤクアクワニシイッチョウメ	045-123-4053	139.4961	35.45918	1830.0
11	S13052	森野店	13	東京都	東京都町田市森野三丁目	トウキョウトマチダシモリノサンチョウメ	042-123-4030	139.4383	35.55293	1087.0
12	S14028	二ツ橋店	14	神奈川県	神奈川県横浜市瀬谷区二ツ橋町	カナガワケンヨコハマシセヤクフタツバシチョウ	045-123-4042	139.4963	35.46304	1574.0
16	S14012	本牧和田店	14	神奈川県	神奈川県横浜市中区本牧和田	カナガワケンヨコハマシナカクホンモクワダ	045-123-4034	139.6582	35.42156	1341.0
18	S14046	北山田店	14	神奈川県	神奈川県横浜市都筑区北山田一丁目	カナガワケンヨコハマシツヅキクキタヤマタイッチョウメ	045-123-4049	139.5916	35.56189	831.0
19	S14022	逗子店	14	神奈川県	神奈川県逗子市逗子一丁目	カナガワケンズシシズシイッチョウメ	046-123-4036	139.5789	35.29642	1838.0
20	S14011	日吉本町店	14	神奈川県	神奈川県横浜市港北区日吉本町四丁目	カナガワケンヨコハマシコウホククヒヨシホンチョウヨンチョウメ	045-123-4033	139.6316	35.54655	890.0
21	S13016	小金井店	13	東京都	東京都小金井市本町一丁目	トウキョウトコガネイシホンチョウイッチョウメ	042-123-4015	139.5094	35.70018	1399.0
22	S14034	川崎野川店	14	神奈川県	神奈川県川崎市宮前区野川	カナガワケンカワサキシミヤマエクノガワ	044-123-4044	139.5998	35.57693	1318.0
26	S14048	中川中央店	14	神奈川県	神奈川県横浜市都筑区中川中央二丁目	カナガワケンヨコハマシツヅキクナカガワチュウオウニチョウメ	045-123-4051	139.5758	35.54912	1657.0
27	S12007	佐倉店	12	千葉県	千葉県佐倉市上志津	チバケンサクラシカミシヅ	043-123-4001	140.1452	35.71872	1895.0
28	S14026	辻堂西海岸店	14	神奈川県	神奈川県藤沢市辻堂西海岸二丁目	カナガワケンフジサワシツジドウニシカイガンニチョウメ	046-123-4040	139.4466	35.32464	1732.0
29	S13041	八王子店	13	東京都	東京都八王子市大塚	トウキョウトハチオウジシオオツカ	042-123-4026	139.4235	35.63787	810.0
31	S14049	川崎大師店	14	神奈川県	神奈川県川崎市川崎区中瀬三丁目	カナガワケンカワサキシカワサキクナカゼサンチョウメ	044-123-4052	139.7327	35.53759	962.0
32	S14023	川崎店	14	神奈川県	神奈川県川崎市川崎区本町二丁目	カナガワケンカワサキシカワサキクホンチョウニチョウメ	044-123-4037	139.7028	35.53599	1804.0
33	S13018	清瀬店	13	東京都	東京都清瀬市松山一丁目	トウキョウトキヨセシマツヤマイッチョウメ	042-123-4017	139.5178	35.76885	1220.0
35	S14027	南藤沢店	14	神奈川県	神奈川県藤沢市南藤沢	カナガワケンフジサワシミナミフジサワ	046-123-4041	139.4896	35.33762	1521.0
36	S14021	伊勢原店	14	神奈川県	神奈川県伊勢原市伊勢原四丁目	カナガワケンイセハラシイセハラヨンチョウメ	046-123-4035	139.3129	35.40169	962.0
37	S14047	相模原店	14	神奈川県	神奈川県相模原市千代田六丁目	カナガワケンサガミハラシチヨダロクチョウメ	042-123-4050	139.3748	35.55959	1047.0
38	S12013	習志野店	12	千葉県	千葉県習志野市芝園一丁目	チバケンナラシノシシバゾノイッチョウメ	047-123-4002	140.0220	35.66122	808.0
40	S14042	新山下店	14	神奈川県	神奈川県横浜市中区新山下二丁目	カナガワケンヨコハマシナカクシンヤマシタニチョウメ	045-123-4047	139.6593	35.43894	1044.0
42	S12030	八幡店	12	千葉県	千葉県市川市八幡三丁目	チバケンイチカワシヤワタサンチョウメ	047-123-4005	139.9240	35.72318	1162.0
44	S14025	大和店	14	神奈川県	神奈川県大和市下和田	カナガワケンヤマトシシモワダ	046-123-4039	139.4680	35.43414	1011.0
45	S14045	厚木店	14	神奈川県	神奈川県厚木市中町二丁目	カナガワケンアツギシナカチョウニチョウメ	046-123-4048	139.3651	35.44182	980.0
47	S12029	東野店	12	千葉県	千葉県浦安市東野一丁目	チバケンウラヤスシヒガシノイッチョウメ	047-123-4004	139.8968	35.65086	1101.0
49	S12053	高洲店	12	千葉県	千葉県浦安市高洲五丁目	チバケンウラヤスシタカスゴチョウメ	047-123-4006	139.9176	35.63755	1555.0
51	S14024	三田店	14	神奈川県	神奈川県川崎市多摩区三田四丁目	カナガワケンカワサキシタマクミタヨンチョウメ	044-123-4038	139.5424	35.60770	972.0
52	S14006	葛が谷店	14	神奈川県	神奈川県横浜市都筑区葛が谷	カナガワケンヨコハマシツヅキククズガヤ	045-123-4031	139.5633	35.53573	1886.0

Commentary : 

This code uses the Pandas library in Python to filter rows from a DataFrame named df_store. The filter is based on a condition specified in the query() method, which takes a string argument representing a Boolean expression.

tel_no: This is assumed to be a column in the df_store DataFrame.

.: This is the dot notation used to access a method or attribute of an object in Python.

str: This refers to the string methods of the tel_no column.

contains(): This is a string method that checks whether a given substring or regular expression is present in each element of the column.

r'^[0-9]{3}-[0-9]{3}-[0-9]{4}$': This is a regular expression pattern that matches strings that have the format of a North American phone number, which consists of three digits, a hyphen, three more digits, another hyphen, and four more digits. The r prefix before the pattern indicates that it is a raw string, which means that backslashes are treated as literal backslashes rather than escape characters. The curly braces {} indicate that the preceding pattern should be repeated a specified number of times, and the $ at the end signifies the end of the string.

The query() method is also specifying an engine argument with the value 'python'. This is because the default engine used by Pandas is numexpr, which can be faster for some types of queries but may not support all types of regular expressions. Using the 'python' engine ensures that the regular expression is handled by the Python regex module.

The resulting filter will keep only those rows from the df_store DataFrame that have a valid North American phone number in the tel_no column.

P-017: Sort customer data (df_customer) by birth_day(birth_day) in order of age and display all 10 items from the top.

In [ ]:

df_customer.sort_values('birth_day').head(10)

Out[ ]:

	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
18817	CS003813000014	村山菜々美	1	女性	1928-11-26	90	182-0007	東京都調布市菊野台**********	S13003	20160214	0-00000000-0
12328	CS026813000004	吉村朝陽	1	女性	1928-12-14	90	251-0043	神奈川県藤沢市辻堂元町**********	S14026	20150723	0-00000000-0
15682	CS018811000003	熊沢美里	1	女性	1929-01-07	90	204-0004	東京都清瀬市野塩**********	S13018	20150403	0-00000000-0
15302	CS027803000004	内村拓郎	0	男性	1929-01-12	90	251-0031	神奈川県藤沢市鵠沼藤が谷**********	S14027	20151227	0-00000000-0
1681	CS013801000003	天野拓郎	0	男性	1929-01-15	90	274-0824	千葉県船橋市前原東**********	S12013	20160120	0-00000000-0
7511	CS001814000022	鶴田里穂	1	女性	1929-01-28	90	144-0045	東京都大田区南六郷**********	S13001	20161012	A-20090415-7
2378	CS016815000002	山元美紀	1	女性	1929-02-22	90	184-0005	東京都小金井市桜町**********	S13016	20150629	C-20090923-C
4680	CS009815000003	中田里穂	1	女性	1929-04-08	89	154-0014	東京都世田谷区新町**********	S13009	20150421	D-20091021-E
16070	CS005813000015	金谷恵梨香	1	女性	1929-04-09	89	165-0032	東京都中野区鷺宮**********	S13005	20150506	0-00000000-0
6305	CS012813000013	宇野南朋	1	女性	1929-04-09	89	231-0806	神奈川県横浜市中区本牧町**********	S14012	20150712	0-00000000-0

Commentary : 

This code uses the Pandas library in Python to sort rows of a DataFrame named df_customer based on the values in the birth_day column in ascending order. The resulting sorted DataFrame is then limited to the first 10 rows using the head() method.

df_customer: This is the name of the DataFrame that is being sorted.

sort_values(): This is a Pandas method that is used to sort a DataFrame based on the values in one or more columns. In this case, the birth_day column is specified as the sort key.

'birth_day': This is the name of the column that is used as the sort key.

.head(10): This is a method call that limits the output to the first 10 rows of the sorted DataFrame.

The resulting output will be the first 10 rows of the df_customer DataFrame, sorted in ascending order based on the values in the birth_day column. This will allow you to see the oldest 10 customers in the dataset, assuming that birth_day represents the customers' birth dates.

P-018: Sort customer data (df_customer) by birth_day(birth_day) in descending order, displaying all 10 items from the top.

In [ ]:

df_customer.sort_values('birth_day', ascending=False).head(10)

Out[ ]:

	customer_id	customer_name	gender_cd	gender	birth_day	age	postal_cd	address	application_store_cd	application_date	status_cd
15639	CS035114000004	大村美里	1	女性	2007-11-25	11	156-0053	東京都世田谷区桜**********	S13035	20150619	6-20091205-6
7468	CS022103000002	福山はじめ	9	不明	2007-10-02	11	249-0006	神奈川県逗子市逗子**********	S14022	20160909	0-00000000-0
10745	CS002113000009	柴田真悠子	1	女性	2007-09-17	11	184-0014	東京都小金井市貫井南町**********	S13002	20160304	0-00000000-0
19811	CS004115000014	松井京子	1	女性	2007-08-09	11	165-0031	東京都中野区上鷺宮**********	S13004	20161120	1-20081231-1
7039	CS002114000010	山内遥	1	女性	2007-06-03	11	184-0015	東京都小金井市貫井北町**********	S13002	20160920	6-20100510-1
3670	CS025115000002	小柳夏希	1	女性	2007-04-18	11	245-0018	神奈川県横浜市泉区上飯田町**********	S14025	20160116	D-20100913-D
12493	CS002113000025	広末まなみ	1	女性	2007-03-30	12	184-0015	東京都小金井市貫井北町**********	S13002	20171030	0-00000000-0
15977	CS033112000003	長野美紀	1	女性	2007-03-22	12	245-0051	神奈川県横浜市戸塚区名瀬町**********	S14033	20150606	0-00000000-0
5716	CS007115000006	福岡瞬	1	女性	2007-03-10	12	285-0845	千葉県佐倉市西志津**********	S12007	20151118	F-20101016-F
15097	CS014113000008	矢口莉緒	1	女性	2007-03-05	12	260-0041	千葉県千葉市中央区東千葉**********	S12014	20150622	3-20091108-6

Commentary : 

This code uses the Pandas library in Python to sort rows of a DataFrame named df_customer based on the values in the birth_day column in descending order. The resulting sorted DataFrame is then limited to the first 10 rows using the head() method.

df_customer: This is the name of the DataFrame that is being sorted.

sort_values(): This is a Pandas method that is used to sort a DataFrame based on the values in one or more columns. In this case, the birth_day column is specified as the sort key.

'birth_day': This is the name of the column that is used as the sort key.

ascending=False: This argument is used to specify that the sort order should be in descending order rather than the default ascending order.

.head(10): This is a method call that limits the output to the first 10 rows of the sorted DataFrame.

The resulting output will be the first 10 rows of the df_customer DataFrame, sorted in descending order based on the values in the birth_day column. This will allow you to see the youngest 10 customers in the dataset, assuming that birth_day represents the customers' birth dates.

P-019: Rank the receipt details data (df_receipt) in descending order of sales amount per item (amount) and display 10 items from the top. The items shall display the customer ID (customer_id), sales amount (amount) and the rank assigned. If the sales amount (amount) is equal, the same rank shall be assigned.

In [ ]:

df_tmp = pd.concat([df_receipt[['customer_id', 'amount']] 
                     ,df_receipt['amount'].rank(method='min', 
                                                ascending=False)], axis=1)

df_tmp.columns = ['customer_id', 'amount', 'ranking']

df_tmp.sort_values('ranking').head(10)

Out[ ]:

	customer_id	amount	ranking
1202	CS011415000006	10925	1.0
62317	ZZ000000000000	6800	2.0
54095	CS028605000002	5780	3.0
4632	CS015515000034	5480	4.0
72747	ZZ000000000000	5480	4.0
10320	ZZ000000000000	5480	4.0
97294	CS021515000089	5440	7.0
28304	ZZ000000000000	5440	7.0
92246	CS009415000038	5280	9.0
68553	CS040415000200	5280	9.0

Commentary : 

The given code is performing the following operations:

Concatenating two dataframes df_receipt[['customer_id', 'amount']] and df_receipt['amount'].rank(method='min', ascending=False) along the columns axis (axis=1) using the pd.concat() function. The resulting dataframe is assigned to the variable df_tmp. The first dataframe df_receipt[['customer_id', 'amount']] contains two columns 'customer_id' and 'amount' from a larger dataframe df_receipt, whereas the second dataframe is the rank of the 'amount' column calculated using the 'min' method in descending order.

Renaming the columns of the concatenated dataframe df_tmp to 'customer_id', 'amount', and 'ranking' using the df_tmp.columns attribute.

Sorting the df_tmp dataframe by the 'ranking' column in ascending order using the sort_values() method and returning the top 10 rows using the head(10) method.

Overall, the code is concatenating the 'customer_id' and 'amount' columns from the df_receipt dataframe with the rank of 'amount' calculated using the 'min' method in descending order. The resulting dataframe is sorted by the 'ranking' column and returns the top 10 rows with the highest rank, which represents the customers who spent the most on their purchases.

P-020: For receipt details data (df_receipt), assign ranks in descending order of sales amount (amount) per item and display the first 10 items. Items shall display the customer ID (customer_id), sales amount (amount) and the rank assigned. Even if the sales amount (amount) is equal, a different rank should be assigned.

In [ ]:

df_tmp = pd.concat([df_receipt[['customer_id', 'amount']] 
                     ,df_receipt['amount'].rank(method='first', 
                                                ascending=False)], axis=1)

df_tmp.columns = ['customer_id', 'amount', 'ranking']

df_tmp.sort_values('ranking').head(10)

Out[ ]:

	customer_id	amount	ranking
1202	CS011415000006	10925	1.0
62317	ZZ000000000000	6800	2.0
54095	CS028605000002	5780	3.0
4632	CS015515000034	5480	4.0
10320	ZZ000000000000	5480	5.0
72747	ZZ000000000000	5480	6.0
28304	ZZ000000000000	5440	7.0
97294	CS021515000089	5440	8.0
596	CS015515000083	5280	9.0
11275	CS017414000114	5280	10.0

Commentary : 

The given code is performing the following operations:

Concatenating two dataframes df_receipt[['customer_id', 'amount']] and df_receipt['amount'].rank(method='first', ascending=False) along the columns axis (axis=1) using the pd.concat() function. The resulting dataframe is assigned to the variable df_tmp. The first dataframe df_receipt[['customer_id', 'amount']] contains two columns 'customer_id' and 'amount' from a larger dataframe df_receipt, whereas the second dataframe is the rank of the 'amount' column calculated using the 'first' method in descending order.

Renaming the columns of the concatenated dataframe df_tmp to 'customer_id', 'amount', and 'ranking' using the df_tmp.columns attribute.

Sorting the df_tmp dataframe by the 'ranking' column in ascending order using the sort_values() method and returning the top 10 rows using the head(10) method.

Overall, the code is concatenating the 'customer_id' and 'amount' columns from the df_receipt dataframe with the rank of 'amount' calculated using the 'first' method in descending order. The resulting dataframe is sorted by the 'ranking' column and returns the top 10 rows with the highest rank, which represents the customers who spent the most on their purchases. The only difference between this code and the previous one is the rank method used. The 'first' method is used here, which assigns the same rank to the equal values of 'amount' column whereas the 'min' method assigns the lowest rank to the equal values of 'amount' column.

Data Science 100 Knocks (Structured Data Processing) - Python

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.

Data Science 100 Knocks (Structured Data Processing) - Python Part1 (Q1 to Q20)

Data Science 100 Knocks (Structured Data Processing) - Python Part2 (Q21 to Q40)

Data Science 100 Knocks (Structured Data Processing) - Python Part3 (Q41 to Q60)

Data Science 100 Knocks (Structured Data Processing) - Python Part4 (Q61 to Q80)

Data Science 100 Knocks (Structured Data Processing) - Python Part5 (Q81 to Q100)

Data Science 100 Knocks (Structured Data Processing)

This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, SQL, R, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.