Data Science 100 Knocks (Structured Data Processing) – Python Part1 (Q1 to Q20)

Articles in English
 
You can get the preprocess_knock files to practice Python, SQL, and R from my GitHub account
 
ChanhiYasutomi - Repositories
 
 
Commentary : 

The code df_receipt.head(10) is used to display the first 10 rows of a Pandas DataFrame called df_receipt.

In Pandas, a DataFrame is a two-dimensional labeled data structure with columns of potentially different data types. It is similar to a spreadsheet or SQL table.

The .head() method in Pandas is used to display the first n rows of a DataFrame. By default, it displays the first 5 rows. In this case, .head(10) is used to display the first 10 rows of the DataFrame.
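For reference, a minimal sketch of the call this commentary describes (df_receipt is assumed to be the receipt DataFrame loaded earlier in the notebook, e.g. with pandas.read_csv):

    # df_receipt: receipt DataFrame loaded earlier in the notebook
    df_receipt.head(10)  # returns the first 10 rows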
 
 
Commentary : 

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']].head(10) is used to display the first 10 rows of a subset of columns from a Pandas DataFrame called df_receipt.

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] is used to select a subset of columns from the DataFrame df_receipt. Specifically, it selects the columns named sales_ymd, customer_id, product_cd, and amount.

The resulting DataFrame contains only these selected columns. The .head(10) method is then used to display the first 10 rows of this subset DataFrame.
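A minimal sketch of the column selection described above (df_receipt assumed loaded as before):

    # Select four columns, then show the first 10 rows of the subset
    df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']].head(10)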
 
 
 
Commentary : 

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']].query('customer_id == "CS018205000001"') is used to select a subset of rows from a Pandas DataFrame called df_receipt, where the customer_id column is equal to a specific value (CS018205000001).

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] is used to select a subset of columns from the DataFrame df_receipt. Specifically, it selects the columns named sales_ymd, customer_id, product_cd, and amount.

The .query() method is then used to filter the rows of the DataFrame based on a specific condition. In this case, the condition is that the customer_id column must be equal to the value "CS018205000001". In the notebook the expression is split across two lines, with a backslash (\) at the end of the first line indicating that the code continues on the next line for better readability.

The resulting DataFrame contains only the rows where the customer_id column matches the specified value.
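A minimal sketch of the full expression, written with the line continuation mentioned above (df_receipt assumed loaded as before):

    # Keep four columns, then filter to a single customer
    df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] \
        .query('customer_id == "CS018205000001"')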
 
 

 

 
 
Commentary :

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] \ .query('customer_id == "CS018205000001" & amount >= 1000') is used to select a subset of rows from a Pandas DataFrame called df_receipt, where the customer_id column is equal to a specific value (CS018205000001) and the amount column is greater than or equal to 1000.

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] is used to select a subset of columns from the DataFrame df_receipt. Specifically, it selects the columns named sales_ymd, customer_id, product_cd, and amount.

The .query() method is then used to filter the rows of the DataFrame based on multiple conditions. The & symbol is used to indicate a logical AND operation between the two conditions. The first condition is that the customer_id column must be equal to the value "CS018205000001". The second condition is that the amount column must be greater than or equal to 1000.

The resulting DataFrame contains only the rows where both conditions are satisfied.
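A minimal sketch of the two-condition filter (df_receipt assumed loaded as before):

    # customer_id must match AND amount must be at least 1000
    df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] \
        .query('customer_id == "CS018205000001" & amount >= 1000')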
 
Commentary : 

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'quantity', 'amount']].query('customer_id == "CS018205000001" & (amount >= 1000 | quantity >=5)') is used to select a subset of rows from a Pandas DataFrame called df_receipt, where the customer_id column is equal to a specific value (CS018205000001) and either the amount column is greater than or equal to 1000 or the quantity column is greater than or equal to 5.

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'quantity', 'amount']] is used to select a subset of columns from the DataFrame df_receipt. Specifically, it selects the columns named sales_ymd, customer_id, product_cd, quantity, and amount.

The .query() method is then used to filter the rows of the DataFrame based on multiple conditions. The & symbol is used to indicate a logical AND operation between the two conditions. The first condition is that the customer_id column must be equal to the value "CS018205000001". The second condition is enclosed in parentheses and contains two sub-conditions separated by the | symbol, which indicates a logical OR operation. The first sub-condition is that the amount column must be greater than or equal to 1000. The second sub-condition is that the quantity column must be greater than or equal to 5.

The resulting DataFrame contains only the rows where both the customer_id is equal to the specified value and either the amount is greater than or equal to 1000 or the quantity is greater than or equal to 5.
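A minimal sketch of the AND/OR combination (df_receipt assumed loaded as before):

    # customer_id must match AND (amount >= 1000 OR quantity >= 5)
    df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'quantity', 'amount']] \
        .query('customer_id == "CS018205000001" & (amount >= 1000 | quantity >= 5)')
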
Commentary : 

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] \ .query('customer_id == "CS018205000001" & 1000 <= amount <= 2000') is used to select a subset of rows from a Pandas DataFrame called df_receipt, where the customer_id column is equal to a specific value (CS018205000001) and the amount column is between 1000 and 2000 (inclusive).

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] is used to select a subset of columns from the DataFrame df_receipt. Specifically, it selects the columns named sales_ymd, customer_id, product_cd, and amount.

The .query() method is then used to filter the rows of the DataFrame based on multiple conditions. The & symbol is used to indicate a logical AND operation between the two conditions. The first condition is that the customer_id column must be equal to the value "CS018205000001". The second condition specifies a range of values for the amount column, using the <= and >= comparison operators.

The resulting DataFrame contains only the rows where both the customer_id is equal to the specified value and the amount is between 1000 and 2000 (inclusive).
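A minimal sketch of the range condition (df_receipt assumed loaded as before); query() accepts the chained comparison directly:

    # amount between 1000 and 2000, inclusive, for a single customer
    df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] \
        .query('customer_id == "CS018205000001" & 1000 <= amount <= 2000')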
 
Commentary : 

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] \ .query('customer_id == "CS018205000001" & product_cd != "P071401019"') is used to select a subset of rows from a Pandas DataFrame called df_receipt, where the customer_id column is equal to a specific value (CS018205000001) and the product_cd column is not equal to a specific value (P071401019).

The code df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] is used to select a subset of columns from the DataFrame df_receipt. Specifically, it selects the columns named sales_ymd, customer_id, product_cd, and amount.

The .query() method is then used to filter the rows of the DataFrame based on multiple conditions. The & symbol is used to indicate a logical AND operation between the two conditions. The first condition is that the customer_id column must be equal to the value "CS018205000001". The second condition is that the product_cd column must not be equal to the value "P071401019", which is achieved using the != comparison operator.

The resulting DataFrame contains only the rows where both the customer_id is equal to the specified value and the product_cd is not equal to the specified value.
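A minimal sketch of the exclusion filter (df_receipt assumed loaded as before):

    # customer_id must match AND product_cd must NOT equal the given code
    df_receipt[['sales_ymd', 'customer_id', 'product_cd', 'amount']] \
        .query('customer_id == "CS018205000001" & product_cd != "P071401019"')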
 
Commentary : 

The code df_store.query("store_cd.str.startswith('S14')", engine='python').head(10) is used to filter rows from a Pandas DataFrame called df_store, where the store_cd column starts with a specific string (S14).

The .query() method is used to filter the rows of the DataFrame based on a Boolean expression. In this case, the Boolean expression is "store_cd.str.startswith('S14')". Because the expression uses the .str accessor, the engine parameter is set to 'python' so that it is evaluated with plain Python rather than the default numexpr engine. The expression checks whether the store_cd column starts with the string 'S14'.

The str.startswith() call is the pandas string accessor method (Series.str.startswith()), which returns True for each element that starts with the specified prefix. In this case, str.startswith('S14') is applied to each element in the store_cd column.

The resulting DataFrame contains only the rows where the store_cd column starts with the string 'S14'.

The .head(10) method is used to return only the first 10 rows of the resulting DataFrame.
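A minimal sketch of the prefix filter (df_store is assumed to be the store DataFrame loaded earlier in the notebook):

    # Keep stores whose store_cd starts with 'S14'; engine='python' is required for .str methods
    df_store.query("store_cd.str.startswith('S14')", engine='python').head(10)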
 
Commentary : 

This code is selecting a subset of data from a Pandas DataFrame named df_customer. The subset is filtered based on a condition specified in the query() method.

The condition is defined by the string "customer_id.str.endswith('1')", which is passed as an argument to the query() method. This condition filters the rows of the df_customer DataFrame to only include those where the customer_id column ends with the character '1'.

The engine='python' parameter specifies that the expression should be evaluated with plain Python. This is necessary when the condition uses string accessor methods such as .str.endswith(), which the default numexpr engine behind query() does not support.

Finally, the head(10) method is used to return the first 10 rows of the filtered DataFrame. This is optional and is used here to limit the output to a manageable size.
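A minimal sketch of the suffix filter (df_customer is assumed to be the customer DataFrame loaded earlier in the notebook):

    # Keep customers whose customer_id ends with '1'
    df_customer.query("customer_id.str.endswith('1')", engine='python').head(10)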
 
 
Commentary : 

Specifically, the condition is defined by the string "address.str.contains('横浜市')", which is passed as an argument to query(). This string is a boolean expression that checks if the address column of df_store contains the substring '横浜市'.

The str.contains() method is a Pandas string method that returns a boolean array indicating whether each element of a given string column contains a specified substring. In this case, it is applied to the address column of df_store.

The engine='python' argument specifies that the expression should be evaluated with plain Python rather than the default numexpr engine. This is necessary when the query expression uses string operations such as str.contains(), which are not supported by the numexpr engine.
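A minimal sketch of the substring filter (df_store assumed loaded as before):

    # Keep stores whose address contains '横浜市' (Yokohama City)
    df_store.query("address.str.contains('横浜市')", engine='python')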
 
 
Commentary : 

df_customer is a DataFrame object.

.query() is a method provided by Pandas for filtering rows based on a specific condition. The condition is specified as a string argument.

The condition in this case is "status_cd.str.contains(r'^[A-F]')". This uses the str.contains() method to check whether each value of the status_cd column starts with one of the letters A through F (inclusive). The r before the regular expression pattern '^[A-F]' indicates a raw string literal, which ensures that any backslashes are interpreted literally rather than as escape characters; this pattern contains none, but the prefix is a good habit when writing regular expressions.

The engine='python' argument specifies that the query expression should be evaluated with plain Python rather than the default numexpr engine, which does not support the .str accessor methods used here. The regular expression itself is always handled by Python's re module.

The .head(10) method is used to return only the first 10 rows of the filtered DataFrame.

So, overall, this code is filtering the df_customer DataFrame to only include rows where the status_cd column starts with a letter from A to F, and then returning the first 10 rows of the resulting DataFrame.
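A minimal sketch of this regex filter (df_customer assumed loaded as before):

    # status_cd must start with one of the letters A-F
    df_customer.query("status_cd.str.contains(r'^[A-F]')", engine='python').head(10)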
 
 

 

Commentary : 

df_customer is a DataFrame object.

.query() is a method provided by Pandas for filtering rows based on a specific condition. The condition is specified as a string argument.

The condition in this case is "status_cd.str.contains(r'[1-9]$', regex=True)". This uses the str.contains() method to check whether the status_cd column of the DataFrame contains any strings that end with a digit from 1 to 9 (inclusive). The $ character represents the end of the string, and the regex=True argument indicates that the pattern should be interpreted as a regular expression.

The engine='python' argument specifies that the query expression should be evaluated with plain Python rather than the default numexpr engine, which does not support the .str accessor methods used here.

The .head(10) method is used to return only the first 10 rows of the filtered DataFrame. So, overall, this code filters the df_customer DataFrame to only include rows where the status_cd column ends with a digit from 1 to 9, and then returns the first 10 rows of the resulting DataFrame.
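A minimal sketch of this regex filter (df_customer assumed loaded as before):

    # status_cd must end with a digit 1-9
    df_customer.query("status_cd.str.contains(r'[1-9]$', regex=True)", engine='python').head(10)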
 
Commentary : 

This code uses the Pandas library in Python to filter rows from a DataFrame named df_customer. The filter is based on a condition specified in the query() method, which takes a string argument representing a Boolean expression.

status_cd: This is assumed to be a column in the df_customer DataFrame.

.: This is the dot notation used to access a method or attribute of an object in Python.

str: This refers to the string methods of the status_cd column.

contains(): This is a string method that checks whether a given substring or regular expression is present in each element of the column.

r'^[A-F].*[1-9]$': This is a regular expression pattern that matches strings that start with a letter A through F (inclusive) and end with a digit from 1 through 9 (inclusive). The r prefix before the pattern indicates that it is a raw string, which means that backslashes are treated as literal backslashes rather than escape characters.

The query() method also specifies an engine argument with the value 'python'. The default engine used by Pandas is numexpr, which can be faster for some types of queries but does not support the .str accessor methods used here. With engine='python' the expression is evaluated with plain Python, and the regular expression itself is handled by Python's re module.

Finally, the .head(10) method call limits the output to the first 10 rows that match the filter condition.
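A minimal sketch of the combined pattern (df_customer assumed loaded as before):

    # status_cd must start with A-F and end with 1-9
    df_customer.query("status_cd.str.contains(r'^[A-F].*[1-9]$', regex=True)", engine='python').head(10)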
 
Commentary : 

This code uses the Pandas library in Python to filter rows from a DataFrame named df_store. The filter is based on a condition specified in the query() method, which takes a string argument representing a Boolean expression.

tel_no: This is assumed to be a column in the df_store DataFrame.

.: This is the dot notation used to access a method or attribute of an object in Python.

str: This refers to the string methods of the tel_no column.

contains(): This is a string method that checks whether a given substring or regular expression is present in each element of the column.

r'^[0-9]{3}-[0-9]{3}-[0-9]{4}$': This is a regular expression pattern that matches strings in the format XXX-XXX-XXXX: three digits, a hyphen, three more digits, another hyphen, and four more digits. The r prefix before the pattern indicates that it is a raw string, which means that backslashes are treated as literal backslashes rather than escape characters. The curly braces {} specify how many times the preceding pattern must be repeated, the ^ anchors the match to the start of the string, and the $ anchors it to the end.

The query() method also specifies an engine argument with the value 'python'. The default engine used by Pandas is numexpr, which can be faster for some types of queries but does not support the .str accessor methods used here. With engine='python' the expression is evaluated with plain Python, and the regular expression itself is handled by Python's re module.

The resulting filter keeps only those rows from the df_store DataFrame whose tel_no column matches the XXX-XXX-XXXX phone number format.
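A minimal sketch of the phone-number filter (df_store assumed loaded as before):

    # tel_no must match the XXX-XXX-XXXX pattern exactly
    df_store.query("tel_no.str.contains(r'^[0-9]{3}-[0-9]{3}-[0-9]{4}$')", engine='python')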
 
Commentary : 

This code uses the Pandas library in Python to sort rows of a DataFrame named df_customer based on the values in the birth_day column in ascending order. The resulting sorted DataFrame is then limited to the first 10 rows using the head() method.

df_customer: This is the name of the DataFrame that is being sorted.

sort_values(): This is a Pandas method that is used to sort a DataFrame based on the values in one or more columns. In this case, the birth_day column is specified as the sort key.

'birth_day': This is the name of the column that is used as the sort key.

.head(10): This is a method call that limits the output to the first 10 rows of the sorted DataFrame.

The resulting output will be the first 10 rows of the df_customer DataFrame, sorted in ascending order based on the values in the birth_day column. This will allow you to see the oldest 10 customers in the dataset, assuming that birth_day represents the customers' birth dates.
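A minimal sketch of the ascending sort (df_customer assumed loaded as before):

    # Earliest birth dates first, i.e. the oldest customers
    df_customer.sort_values('birth_day').head(10)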
 
 
Commentary : 

This code uses the Pandas library in Python to sort rows of a DataFrame named df_customer based on the values in the birth_day column in descending order. The resulting sorted DataFrame is then limited to the first 10 rows using the head() method.

df_customer: This is the name of the DataFrame that is being sorted.

sort_values(): This is a Pandas method that is used to sort a DataFrame based on the values in one or more columns. In this case, the birth_day column is specified as the sort key.

'birth_day': This is the name of the column that is used as the sort key.

ascending=False: This argument is used to specify that the sort order should be in descending order rather than the default ascending order.

.head(10): This is a method call that limits the output to the first 10 rows of the sorted DataFrame.

The resulting output will be the first 10 rows of the df_customer DataFrame, sorted in descending order based on the values in the birth_day column. This will allow you to see the youngest 10 customers in the dataset, assuming that birth_day represents the customers' birth dates.
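A minimal sketch of the descending sort (df_customer assumed loaded as before):

    # Latest birth dates first, i.e. the youngest customers
    df_customer.sort_values('birth_day', ascending=False).head(10)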
 
 
Commentary : 

The given code is performing the following operations:

Concatenating the dataframe df_receipt[['customer_id', 'amount']] and the Series df_receipt['amount'].rank(method='min', ascending=False) along the columns axis (axis=1) using the pd.concat() function. The resulting dataframe is assigned to the variable df_tmp. The first object contains the 'customer_id' and 'amount' columns from the larger dataframe df_receipt, whereas the second is the rank of the 'amount' column calculated with the 'min' method in descending order.

Renaming the columns of the concatenated dataframe df_tmp to 'customer_id', 'amount', and 'ranking' using the df_tmp.columns attribute.

Sorting the df_tmp dataframe by the 'ranking' column in ascending order using the sort_values() method and returning the top 10 rows using the head(10) method.

Overall, the code concatenates the 'customer_id' and 'amount' columns from the df_receipt dataframe with the rank of 'amount' calculated using the 'min' method in descending order. The resulting dataframe is sorted by the 'ranking' column, and the top 10 rows are returned; these are the receipt rows with the largest amounts.
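A minimal sketch of the ranking step (df_receipt assumed loaded as before):

    import pandas as pd

    # Rank amounts in descending order; tied amounts share the same (minimum) rank
    df_tmp = pd.concat([df_receipt[['customer_id', 'amount']],
                        df_receipt['amount'].rank(method='min', ascending=False)], axis=1)
    df_tmp.columns = ['customer_id', 'amount', 'ranking']
    df_tmp.sort_values('ranking').head(10)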
 
 

 

Commentary : 

The given code is performing the following operations:

Concatenating the dataframe df_receipt[['customer_id', 'amount']] and the Series df_receipt['amount'].rank(method='first', ascending=False) along the columns axis (axis=1) using the pd.concat() function. The resulting dataframe is assigned to the variable df_tmp. The first object contains the 'customer_id' and 'amount' columns from the larger dataframe df_receipt, whereas the second is the rank of the 'amount' column calculated with the 'first' method in descending order.

Renaming the columns of the concatenated dataframe df_tmp to 'customer_id', 'amount', and 'ranking' using the df_tmp.columns attribute.

Sorting the df_tmp dataframe by the 'ranking' column in ascending order using the sort_values() method and returning the top 10 rows using the head(10) method.

Overall, the code concatenates the 'customer_id' and 'amount' columns from the df_receipt dataframe with the rank of 'amount' calculated using the 'first' method in descending order. The resulting dataframe is sorted by the 'ranking' column, and the top 10 rows with the highest rank, i.e. the receipt rows with the largest amounts, are returned. The only difference from the previous code is the rank method: the 'first' method assigns distinct, consecutive ranks to equal values of the 'amount' column in the order they appear, whereas the 'min' method assigns the same (lowest) rank to all equal values.
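A minimal sketch of the same ranking with method='first' (df_receipt assumed loaded as before):

    import pandas as pd

    # Tied amounts receive distinct, consecutive ranks in order of appearance
    df_tmp = pd.concat([df_receipt[['customer_id', 'amount']],
                        df_receipt['amount'].rank(method='first', ascending=False)], axis=1)
    df_tmp.columns = ['customer_id', 'amount', 'ranking']
    df_tmp.sort_values('ranking').head(10)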

 

Data Science 100 Knocks (Structured Data Processing) - Python
This is an ipynb file originally created by The Data Scientist Society (データサイエンティスト協会スキル定義委員) and translated from Japanese to English with DeepL. I republished it to spread this practice, which is useful for everyone who wants to practice Python, SQL, and R, from beginners to advanced engineers. Since the data was created for a Japanese audience, you may run into some language issues while practicing, but they will not affect the exercises much.
Data Science 100 Knocks (Structured Data Processing) - Python Part1 (Q1 to Q20)
Data Science 100 Knocks (Structured Data Processing) - Python Part2 (Q21 to Q40)
Data Science 100 Knocks (Structured Data Processing) - Python Part3 (Q41 to Q60)
Data Science 100 Knocks (Structured Data Processing) - Python Part4 (Q61 to Q80)
Data Science 100 Knocks (Structured Data Processing) - Python Part5 (Q81 to Q100)
Data Science 100 Knocks (Structured Data Processing)
