Data Science 100 Knocks (Structured Data Processing) – Python Part5 (Q81 to Q100)

Articles in English
 
You can get the preprocess_knock files to practice Python, SQL, and R from my GitHub account
 
ChanhiYasutomi - Repositories
 
Commentary :

This code performs two operations on a pandas DataFrame named df_product:

It fills missing values in the unit_price and unit_cost columns with the rounded mean of each column using the fillna() method, assigning the result to a new DataFrame, df_product_2. The rounded mean is calculated with NumPy's np.round() and np.nanmean() functions, and fillna() takes a dictionary as its argument, where the keys are column names and the values are the fill values.

It then checks for any remaining missing values by calling isnull() on the updated DataFrame and summing the result per column with sum(), which returns the number of missing values in each column of df_product_2.

The updated DataFrame df_product_2 will have missing values in the unit_price and unit_cost columns replaced with their respective rounded mean values, and the output of the second line will show how many missing values remain in the DataFrame after the fillna operation.
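As a rough illustration, the operation described above might look like this (a minimal sketch, assuming df_product is already loaded and the column names match the commentary):

import numpy as np

# Fill missing unit_price / unit_cost with the rounded, NaN-aware column mean
df_product_2 = df_product.fillna({
    'unit_price': np.round(np.nanmean(df_product['unit_price'])),
    'unit_cost': np.round(np.nanmean(df_product['unit_cost']))
})

# Count remaining missing values per column
df_product_2.isnull().sum()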
 
 
Commentary :

The code first fills the missing values in two columns, "unit_price" and "unit_cost", of the DataFrame named "df_product". It fills the missing values using the fillna method with the following arguments:

The first argument is a dictionary with two keys, "unit_price" and "unit_cost", each mapping to a corresponding value. These values are computed using the NumPy library's nanmedian function applied to the respective columns of the DataFrame. nanmedian calculates the median value of the column after ignoring any missing values represented by NaN.

Only these two columns appear in the dictionary, so missing values in any other columns of the DataFrame are left untouched.

The result of this operation is a new DataFrame called "df_product_3".

The next line of code checks if there are still any missing values in the "df_product_3" DataFrame by calling the isnull() method and then the sum() method. This returns a Series object that contains the number of missing values for each column of the DataFrame. If the output shows that there are no missing values in either column, it confirms that the fillna operation was successful.
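A minimal sketch of this median-based variant, under the same assumptions about df_product:

import numpy as np

# Fill missing values with the NaN-aware median of each column
df_product_3 = df_product.fillna({
    'unit_price': np.nanmedian(df_product['unit_price']),
    'unit_cost': np.nanmedian(df_product['unit_cost'])
})

# Confirm that no missing values remain
df_product_3.isnull().sum()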
 
 
Commentary :

This code performs imputation of missing values in the "unit_price" and "unit_cost" columns of the DataFrame "df_product" using the median value of each category. The code achieves this in the following steps:

The first line groups the DataFrame "df_product" by the "category_small_cd" column and computes the median values of "unit_price" and "unit_cost" for each group. The resulting DataFrame is named "df_tmp" and has three columns: "category_small_cd", "median_price", and "median_cost".

The second line uses the merge() method of the Pandas library to merge the "df_product" and "df_tmp" DataFrames based on the "category_small_cd" column. The resulting DataFrame, "df_product_4", contains all the columns of "df_product" and the "median_price" and "median_cost" columns from "df_tmp".

The third line overwrites the "unit_price" column of "df_product_4" by applying a lambda function row-wise to the "unit_price" and "median_price" columns. The lambda checks whether "unit_price" is missing (i.e., NaN) and, if so, returns the rounded value of "median_price"; otherwise, it returns the original value of "unit_price". This operation fills the missing values in the "unit_price" column.

The fourth line overwrites the "unit_cost" column in the same way, applying a similar lambda function to the "unit_cost" and "median_cost" columns. This operation fills the missing values in the "unit_cost" column.

The last line checks if there are still any missing values in the "df_product_4" DataFrame by calling the isnull() method and then the sum() method. If the output shows that there are no missing values in either column, it confirms that the imputation operation was successful.
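A sketch of the category-median imputation described above; the named-aggregation style and the exact form of the lambda are assumptions for illustration, not necessarily the original code:

import numpy as np
import pandas as pd

# Median unit_price / unit_cost per category_small_cd
df_tmp = df_product.groupby('category_small_cd') \
    .agg(median_price=('unit_price', 'median'),
         median_cost=('unit_cost', 'median')) \
    .reset_index()

# Attach the category medians to every product row
df_product_4 = pd.merge(df_product, df_tmp, on='category_small_cd', how='inner')

# Replace NaN with the (rounded) category median, row by row
df_product_4['unit_price'] = df_product_4[['unit_price', 'median_price']].apply(
    lambda x: np.round(x['median_price']) if np.isnan(x['unit_price']) else x['unit_price'],
    axis=1)
df_product_4['unit_cost'] = df_product_4[['unit_cost', 'median_cost']].apply(
    lambda x: np.round(x['median_cost']) if np.isnan(x['unit_cost']) else x['unit_cost'],
    axis=1)

# Confirm that no missing values remain
df_product_4.isnull().sum()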
 
 
Commentary :

This code performs imputation of missing values in the "unit_price" and "unit_cost" columns of the DataFrame "df_product" using the median value of each category. The code achieves this in the following steps:

The first line groups the DataFrame "df_product" by the "category_small_cd" column and computes the median values of "unit_price" and "unit_cost" for each group. The resulting DataFrame is named "df_tmp" and has three columns: "category_small_cd", "median_price", and "median_cost".

The second line uses the merge() method of the Pandas library to merge the "df_product" and "df_tmp" DataFrames based on the "category_small_cd" column. The resulting DataFrame, "df_product_4", contains all the columns of "df_product" and the "median_price" and "median_cost" columns from "df_tmp".

The third line overwrites the "unit_price" column of "df_product_4" using the Pandas mask() method. mask() replaces values where a condition is True, so the missing (i.e., NaN) entries of "unit_price" are replaced with the rounded value of "median_price". This operation fills the missing values in the "unit_price" column.

The fourth line overwrites the "unit_cost" column in the same way, filling its missing values with the rounded value of "median_cost".

The last line checks if there are still any missing values in the "df_product_4" DataFrame by calling the isnull() method and then the sum() method. If the output shows that there are no missing values in either column, it confirms that the imputation operation was successful.
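A sketch of the same imputation using mask(); the aggregation and merge steps are written as in the previous sketch and are likewise assumptions:

import pandas as pd

# Median unit_price / unit_cost per category_small_cd
df_tmp = df_product.groupby('category_small_cd') \
    .agg(median_price=('unit_price', 'median'),
         median_cost=('unit_cost', 'median')) \
    .reset_index()

df_product_4 = pd.merge(df_product, df_tmp, on='category_small_cd', how='inner')

# mask() replaces values where the condition is True
df_product_4['unit_price'] = df_product_4['unit_price'].mask(
    df_product_4['unit_price'].isnull(), df_product_4['median_price'].round())
df_product_4['unit_cost'] = df_product_4['unit_cost'].mask(
    df_product_4['unit_cost'].isnull(), df_product_4['median_cost'].round())

# Confirm that no missing values remain
df_product_4.isnull().sum()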
 
 
Commentary :

This code performs some geospatial calculations and data processing on three Pandas DataFrames called df_customer_1, df_store, and df_tmp. Here's a step-by-step explanation of what the code does:

The first line defines a Python function called calc_distance that takes four arguments: x1 and y1 (the latitude and longitude of point 1) and x2 and y2 (the latitude and longitude of point 2). The function uses the Haversine formula to calculate the distance between the two points in kilometers, assuming a spherical Earth with a radius of 6,371 km, and returns the result as distance.

The second line creates a new DataFrame called df_tmp by merging df_customer_1 and df_store on the application_store_cd and store_cd columns, respectively. The merge is an inner join, so only rows with matching values in both DataFrames are kept. The result contains all the columns of both DataFrames, with the overlapping address columns renamed: address_x (the customer's address) becomes customer_address and address_y (the store's address) becomes store_address.

The third line creates a new column in df_tmp called distance by applying the calc_distance function to the m_latitude, m_longitude, latitude, and longitude columns using the .apply() method. The .apply() method is used with the axis=1 argument to apply the function row-wise. The resulting distance value is stored in the distance column.

The fourth line selects a subset of columns from df_tmp (customer_id, customer_address, store_address, and distance) using double square brackets and displays the first 10 rows of the resulting DataFrame using the .head(10) method. This shows a sample of the merged data that includes customer and store information with the corresponding distances between them.
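A sketch of the whole calculation, assuming, as the commentary states, that the customer's geocoded coordinates are in m_latitude/m_longitude and the store's in latitude/longitude; the Haversine implementation shown is one possible version, not necessarily the original:

import math
import pandas as pd

def calc_distance(x1, y1, x2, y2):
    # Haversine formula: x = latitude, y = longitude (degrees), Earth radius 6,371 km
    x1, y1, x2, y2 = map(math.radians, [x1, y1, x2, y2])
    a = math.sin((x2 - x1) / 2) ** 2 + \
        math.cos(x1) * math.cos(x2) * math.sin((y2 - y1) / 2) ** 2
    return 6371 * 2 * math.asin(math.sqrt(a))

# Inner join of customers and the store they applied at; the suffixes _x/_y come
# from the overlapping 'address' column and are renamed afterwards
df_tmp = pd.merge(df_customer_1, df_store,
                  left_on='application_store_cd', right_on='store_cd',
                  how='inner') \
    .rename(columns={'address_x': 'customer_address', 'address_y': 'store_address'})

# Distance between the customer's geocoded address and the store, row by row
df_tmp['distance'] = df_tmp.apply(
    lambda row: calc_distance(row['m_latitude'], row['m_longitude'],
                              row['latitude'], row['longitude']),
    axis=1)

df_tmp[['customer_id', 'customer_address', 'store_address', 'distance']].head(10)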
 
 
Commentary :

The code is performing the following tasks:

The df_ts_amount DataFrame is created with the columns 'sales_ymd', 'amount', and 'sales_ym'. The 'sales_ym' column is derived by converting 'sales_ymd' to a string and taking its first six characters, i.e. the year and month.

The df_ts_amount DataFrame is then grouped by the 'sales_ym' column, the 'amount' column is aggregated with the 'sum' function, and the index is reset. The result is stored back in the same variable, df_ts_amount.

A function split_data is defined that takes a DataFrame df, a train size, a test size, a slide window, and a start point as arguments, and returns two DataFrames: a train set and a test set.

The split_data function is called three times with different arguments each time to split the df_ts_amount dataframe into train and test dataframes. Three sets of train and test dataframes are created and stored in df_train_1, df_test_1, df_train_2, df_test_2, df_train_3, and df_test_3.

The purpose of splitting the data into multiple sets is probably to use them as training and validation sets for a time series forecasting model. By splitting the data into different sets, the model can be trained on different time periods, and the performance can be evaluated on different test periods. This helps to avoid overfitting and to get a more accurate estimate of the model's performance.
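A sketch of the whole procedure; the source DataFrame (here called df_receipt) and the concrete window sizes are assumptions for illustration only:

# df_receipt and the train/test sizes below are assumed, not taken from the original code
df_ts_amount = df_receipt[['sales_ymd', 'amount']].copy()
df_ts_amount['sales_ym'] = df_ts_amount['sales_ymd'].astype('str').str[0:6]
df_ts_amount = df_ts_amount.groupby('sales_ym').agg({'amount': 'sum'}).reset_index()

def split_data(df, train_size, test_size, slide_window, start_point):
    # Slice the time-ordered frame into a train window followed by a test window
    train_start = start_point * slide_window
    test_start = train_start + train_size
    return df[train_start:test_start], df[test_start:test_start + test_size]

df_train_1, df_test_1 = split_data(df_ts_amount, train_size=12, test_size=6,
                                   slide_window=6, start_point=0)
df_train_2, df_test_2 = split_data(df_ts_amount, train_size=12, test_size=6,
                                   slide_window=6, start_point=1)
df_train_3, df_test_3 = split_data(df_ts_amount, train_size=12, test_size=6,
                                   slide_window=6, start_point=2)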
 
 
Commentary : 

This code exports the Pandas DataFrame df_product_full to a CSV file with a specified file path and name ../data/P_df_product_full_UTF-8_header.csv. The exported CSV file will use UTF-8 encoding to support a wide range of characters, and the argument index=False specifies that the row index of the DataFrame should not be included in the exported CSV file.
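The call probably looks like this:

# Export without the row index, using UTF-8 encoding
df_product_full.to_csv('../data/P_df_product_full_UTF-8_header.csv',
                       encoding='UTF-8', index=False)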
 
 
Commentary :

This code reads a CSV file named 'P_df_product_full_UTF-8_noh.csv' located in the '../data' directory. It specifies the column names using the 'names' parameter as a list of strings called 'c_names'. The 'dtype' parameter is used to specify the data type of the columns named 'category_major_cd', 'category_medium_cd', and 'category_small_cd' as 'str' (string). The 'encoding' parameter specifies that the CSV file is encoded using 'UTF-8'. Finally, the 'header' parameter is set to None to indicate that the file does not have a header row.

The resulting dataframe is assigned to the variable 'df_product_full' and the first three rows are displayed using the 'head' method.
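A sketch of the call; the concrete column names in c_names are assumed here for illustration:

import pandas as pd

# Column names in file order (assumed for this sketch)
c_names = ['product_cd', 'category_major_cd', 'category_medium_cd',
           'category_small_cd', 'unit_price', 'unit_cost']

df_product_full = pd.read_csv('../data/P_df_product_full_UTF-8_noh.csv',
                              names=c_names,
                              dtype={'category_major_cd': str,
                                     'category_medium_cd': str,
                                     'category_small_cd': str},
                              encoding='UTF-8', header=None)
df_product_full.head(3)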
 
 
 
Commentary :

The code reads data from a TSV (Tab-Separated Values) file located at '../data/P_df_product_full_UTF-8_header.tsv' into a pandas DataFrame called df_product_full.

The pd.read_table() function is called to read the data from the file. The first argument to the function, '../data/P_df_product_full_UTF-8_header.tsv', specifies the path and name of the file to read.

The second argument to the function, dtype={'category_major_cd':str, 'category_medium_cd':str, 'category_small_cd':str}, specifies the data types for certain columns in the DataFrame. Specifically, the columns category_major_cd, category_medium_cd, and category_small_cd are being set to be of type string (str).

The third argument to the function, encoding='UTF-8', specifies that the encoding of the file is UTF-8.

The resulting DataFrame, df_product_full, is then displayed with the head(3) method, which shows the first three rows of the DataFrame.
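The call likely looks like this:

import pandas as pd

# Read the tab-separated file, keeping the category code columns as strings
df_product_full = pd.read_table('../data/P_df_product_full_UTF-8_header.tsv',
                                dtype={'category_major_cd': str,
                                       'category_medium_cd': str,
                                       'category_small_cd': str},
                                encoding='UTF-8')
df_product_full.head(3)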
 
 

 

Data Science 100 Knocks (Structured Data Processing) - Python
Data Science 100 Knocks (Structured Data Processing) - Python Part1 (Q1 to Q20)
Data Science 100 Knocks (Structured Data Processing) - Python Part2 (Q21 to Q40)
Data Science 100 Knocks (Structured Data Processing) - Python Part3 (Q41 to Q60)
Data Science 100 Knocks (Structured Data Processing) - Python Part4 (Q61 to Q80)
Data Science 100 Knocks (Structured Data Processing) - Python Part5 (Q81 to Q100)
Data Science 100 Knocks (Structured Data Processing)

These are ipynb files originally created by The Data Scientist Society (データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated these files is to spread this practice, which is useful for everyone who wants to practice Python, SQL, and R, from beginners to advanced engineers. Since the data was created for Japanese users, you may face language problems when practicing. But do not worry, it will not affect much.
