Data Science 100 Knocks (Structured Data Processing) – Python Part2 (Q21 to Q40)

Articles in English
 
You can get the preprocess_knock files to practice Python, SQL, and R from my GitHub account:
 
ChanhiYasutomi - Repositories
 
 
Commentary :

The code len(df_receipt) returns the length of the DataFrame df_receipt.

Here's an explanation of each component:

len(): This is a built-in Python function that returns the length of an object, such as a list or a string.

df_receipt: This is a variable that represents a DataFrame object in pandas. The name df_receipt could be any valid variable name.

The combination of len() and df_receipt: By passing the DataFrame df_receipt as an argument to the len() function, we are asking Python to return the number of rows in df_receipt.

In summary, the code is simply returning the number of rows in a DataFrame.
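As a quick illustration, here is a minimal sketch with a toy stand-in for df_receipt (the real dataset is loaded elsewhere in the exercises), showing len() alongside the related shape attribute:

```python
import pandas as pd

# A tiny stand-in for df_receipt; column names follow the text
df_receipt = pd.DataFrame({'customer_id': ['CS001', 'CS002', 'ZZ000'],
                           'amount': [100, 200, 300]})

print(len(df_receipt))   # number of rows: 3
print(df_receipt.shape)  # (rows, columns): (3, 2)
```

Note that len() counts rows only; df.shape gives both dimensions if you need the column count as well.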
 
 
 
 
 
 
 
Commentary :

The code np.percentile(df_receipt['amount'], q=np.arange(1, 5) * 25) computes the quartiles of the values in the amount column of the df_receipt DataFrame using the np.percentile() function from the NumPy library.

Here's an explanation of each component:

np.percentile(): This is a NumPy function that computes the percentiles of an array. Its first argument is the array of values for which percentiles are to be calculated, and the q parameter specifies which percentiles to compute.

df_receipt['amount']: This is a pandas Series object that represents the amount column of the df_receipt DataFrame. The df_receipt DataFrame is assumed to exist in the current environment.

q=np.arange(1, 5) * 25: This is a NumPy array that specifies the percentiles to compute. np.arange(1, 5) generates [1, 2, 3, 4], so multiplying by 25 yields [25, 50, 75, 100]: the 25th, 50th, 75th, and 100th percentiles of the distribution.

The code computes the quartiles of the amount column of the df_receipt DataFrame, which are the values that divide the distribution into four equal parts. The first quartile (Q1) is the 25th percentile, the second quartile (Q2) is the 50th percentile (also known as the median), and the third quartile (Q3) is the 75th percentile; the 100th percentile is simply the maximum value. The output of the code is an array of these values, which can be used to summarize the distribution of the amount column.
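A minimal sketch with toy data, showing what the q=np.arange(1, 5) * 25 expression produces and the resulting percentile values (NumPy's default linear interpolation):

```python
import numpy as np

# Toy stand-in for df_receipt['amount']
amounts = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

q = np.arange(1, 5) * 25           # [25, 50, 75, 100]
result = np.percentile(amounts, q=q)
print(result)                       # Q1, median, Q3, and the maximum
```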
 
 
Commentary : 

The code df_receipt[~df_receipt['customer_id'].str.startswith("Z")].groupby('customer_id').amount.sum().mean() does the following:

df_receipt[~df_receipt['customer_id'].str.startswith("Z")]: This filters the df_receipt DataFrame to exclude rows where the customer_id column starts with the letter "Z". The tilde (~) character is used as a negation operator to invert the boolean values returned by the str.startswith() method.

.groupby('customer_id').amount.sum(): This groups the filtered DataFrame by the customer_id column, and calculates the sum of the amount column for each group.

.mean(): This calculates the mean of the resulting Series of per-customer totals, i.e. the average total amount spent by a customer (excluding customers whose customer_id starts with "Z").

In summary, this code computes the average total amount spent per customer among customers whose customer_id does not start with the letter "Z" in the df_receipt DataFrame. This may be useful for understanding the purchasing behavior of regular customers.
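The chain above can be sketched with toy data (the customer IDs and amounts here are made up for illustration):

```python
import pandas as pd

# Toy receipts: two regular customers plus one "Z"-prefixed ID to be excluded
df_receipt = pd.DataFrame({
    'customer_id': ['CS001', 'CS001', 'CS002', 'Z0000'],
    'amount': [100, 200, 300, 999],
})

# Filter out "Z" IDs, total per customer, then average the totals
result = (df_receipt[~df_receipt['customer_id'].str.startswith("Z")]
          .groupby('customer_id').amount.sum()
          .mean())
print(result)  # CS001 total = 300, CS002 total = 300, mean = 300.0
```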
 
 
In [ ]:
df_store_tmp = df_store.copy()
df_product_tmp = df_product.copy()

# Add a constant key column to both so that every row matches every row
df_store_tmp['key'] = 0
df_product_tmp['key'] = 0

# Merge on the constant key: a cross join (Cartesian product) of the two
len(pd.merge(df_store_tmp, df_product_tmp, how='outer', on='key'))

 

Commentary :

This code computes the Cartesian product (cross join) of the two dataframes df_store and df_product by merging on a constant key.

First, two temporary dataframes df_store_tmp and df_product_tmp are created as copies of df_store and df_product, respectively. Then, a new column called key is added to both dataframes with a constant value of 0. This is done because pd.merge() requires a common column to merge on; since every row shares the same key value, every row of one dataframe matches every row of the other.

Next, the two dataframes are merged using the pd.merge() function, with the how='outer' argument indicating an outer join and on='key' specifying the common column to merge on. The resulting dataframe contains every combination of a row from df_store with a row from df_product.

Lastly, len() is used to get the number of rows in the merged dataframe. Because every row carries the same key value, the inner, left, right, and outer joins all produce the same result here: the full Cartesian product, whose row count is len(df_store) * len(df_product).
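The constant-key trick can be verified with toy data; note that pandas 1.2 and later also supports how='cross', which produces the same Cartesian product without the helper column (the column names below are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for df_store and df_product
df_store = pd.DataFrame({'store_cd': ['S001', 'S002', 'S003']})
df_product = pd.DataFrame({'product_cd': ['P01', 'P02']})

# Constant-key trick: every row matches every row
store_tmp = df_store.copy()
product_tmp = df_product.copy()
store_tmp['key'] = 0
product_tmp['key'] = 0
n_key = len(pd.merge(store_tmp, product_tmp, how='outer', on='key'))

# Equivalent cross join in pandas >= 1.2
n_cross = len(pd.merge(df_store, df_product, how='cross'))

print(n_key, n_cross)  # both equal 3 * 2 = 6
```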
 
Data Science 100 Knocks (Structured Data Processing) - Python
This is an ipynb file originally created by The Data Scientist Society(データサイエンティスト協会スキル定義委員) and translated from Japanese to English by DeepL. The reason I updated this file is to spread this practice, which is useful for everyone who wants to practice Python, from beginners to advanced engineers. Since this data is created for Japanese, you may face language problems when practicing. But do not worry, it will not affect much.
Data Science 100 Knocks (Structured Data Processing) - Python Part1 (Q1 to Q20)
Data Science 100 Knocks (Structured Data Processing) - Python Part2 (Q21 to Q40)
Data Science 100 Knocks (Structured Data Processing) - Python Part3 (Q41 to Q60)
Data Science 100 Knocks (Structured Data Processing) - Python Part4 (Q61 to Q80)
Data Science 100 Knocks (Structured Data Processing) - Python Part5 (Q81 to Q100)
Data Science 100 Knocks (Structured Data Processing)
