๋ฐ์ดํ„ฐ ์‚ฌ์ด์–ธ์Šค 100๋ฒˆ์˜ ๋…ธํฌ(๊ตฌ์กฐํ™” ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌํŽธ) โ€“ Python Part 4 (Q61 to Q80)

๋ฐ์ดํ„ฐ ์‚ฌ์ด์–ธ์Šค
ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋™์ž‘์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python')
df_receipt๋ผ๋Š” DataFrame์—์„œ customer_id ์—ด์ด "Z"๋กœ ์‹œ์ž‘ํ•˜์ง€ ์•Š๋Š” ํ–‰์„ ๋ชจ๋‘ ์„ ํƒํ•œ๋‹ค. ์™„์„ฑ๋œ DataFrame์€ df_sales_amount๋กœ ์ €์žฅ๋œ๋‹ค.

df_sales_amount = df_sales_amount.groupby('customer_id').agg({'amount':'sum'}).reset_index()
df_sales_amount์˜ DataFrame์„ customer_id๋กœ ๊ทธ๋ฃนํ™”ํ•˜๊ณ , agg ๋ฉ”์„œ๋“œ๋กœ ๊ฐ ๊ทธ๋ฃน์˜ ๊ธˆ์•ก ์—ด์˜ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ DataFrame์€ df_sales_amount๋กœ ์ €์žฅ๋ฉ๋‹ˆ๋‹ค.

df_sales_amount['log_amount'] = np.log10(df_sales_amount['amount'] + 0.5)
df_sales_amount DataFrame์˜ amount ์—ด์˜ ๋Œ€์ˆ˜(ํ•˜๋‹จ 10)๋ฅผ ๊ณ„์‚ฐํ•˜๋Š”๋ฐ, 0 ๋˜๋Š” ์Œ์ˆ˜ ๊ฐ’์˜ ๋Œ€์ˆ˜๋ฅผ ์ทจํ•˜์ง€ ์•Š๊ธฐ ์œ„ํ•ด ๋จผ์ € amount ๊ฐ’์— 0.5๋ฅผ ๋”ํ•˜๊ณ  ์žˆ๋‹ค. ๊ฒฐ๊ณผ ๋Œ€์ˆ˜๋Š” log_amount๋ผ๋Š” ์ƒˆ๋กœ์šด ์ปฌ๋Ÿผ์— ์ €์žฅ๋ฉ๋‹ˆ๋‹ค.

df_sales_amount.head(10)
head ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฒฐ๊ณผ df_sales_amount DataFrame์˜ ์ฒ˜์Œ 10์ค„์„ ํ‘œ์‹œํ•ฉ๋‹ˆ๋‹ค.

์š”์•ฝํ•˜๋ฉด, ์ด ์ฝ”๋“œ๋Š” ๊ณ ๊ฐ ID๊ฐ€ "Z"๋กœ ์‹œ์ž‘ํ•˜์ง€ ์•Š๋Š” DataFrame์—์„œ ํ–‰์„ ์„ ํƒํ•˜๊ณ , ๊ทธ ํ–‰์„ ๊ณ ๊ฐ ID๋กœ ๊ทธ๋ฃนํ™”ํ•˜๊ณ , ๊ฐ ๊ณ ๊ฐ์˜ ํŒ๋งค ๊ธˆ์•ก์˜ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ , ํŒ๋งค ๊ธˆ์•ก์˜ ๋Œ€์ˆ˜(10์„ ๊ธฐ์ค€์œผ๋กœ)๋ฅผ ์ทจํ•˜์—ฌ ๊ฒฐ๊ณผ DataFrame์˜ ์ฒ˜์Œ 10๊ฐœ์˜ ํ–‰์„ ํ‘œ์‹œํ•ฉ๋‹ˆ๋‹ค.
ย 
ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” df_receipt๋ผ๋Š” pandas์˜ DataFrame์— ๋Œ€ํ•ด ๋ช‡ ๊ฐ€์ง€ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

๋จผ์ € customer_id ์—ด์ด "Z"๋กœ ์‹œ์ž‘ํ•˜๋Š” ํ–‰์„ ํ•„ํ„ฐ๋งํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” str.startswith() ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์กฐ๊ฑด๋ถ€๋กœ query() ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, engine='python' ์ธ์ˆ˜๋Š” ๋ฌธ์ž์—ด ์กฐ์ž‘์— ๊ธฐ๋ณธ pandas ํŒŒ์„œ๊ฐ€ ์•„๋‹Œ Python ํŒŒ์„œ๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ ์œ„ํ•ด ์ œ๊ณต๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

์–ป์–ด์ง„ DataFrame์€ groupby() ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ customer_id ์ปฌ๋Ÿผ์œผ๋กœ ๊ทธ๋ฃนํ™”๋œ๋‹ค.

agg() ๋ฉ”์„œ๋“œ๋Š” ๊ฐ ๊ทธ๋ฃน์˜ ๊ธˆ์•ก ์ปฌ๋Ÿผ์— ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋œ๋‹ค. ์ด ๊ฒฝ์šฐ sum() ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ๊ณ ๊ฐ์˜ ๋งค์ถœ ๊ธˆ์•ก์˜ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

๊ทธ๋Ÿฐ ๋‹ค์Œ reset_index()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฒฐ๊ณผ DataFrame์„ ์žฌ์„ค์ •ํ•˜๊ณ  ์ƒˆ๋กœ์šด ์ธ๋ฑ์Šค๋ฅผ ๊ฐ–๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.

๋งˆ์ง€๋ง‰์œผ๋กœ, amount ์—ด์˜ ์ž์—ฐ๋Œ€์ˆ˜์— 0.5๋ฅผ ๋”ํ•œ ๊ฐ’(0 ๋˜๋Š” ์Œ์ˆ˜ ๊ฐ’์˜ ๋Œ€์ˆ˜๋ฅผ ์ทจํ•˜์ง€ ์•Š๊ธฐ ์œ„ํ•ด)์„ ์ทจํ•˜์—ฌ log_amount๋ผ๋Š” ์ƒˆ๋กœ์šด ์—ด์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

๊ฒฐ๊ณผ DataFrame์˜ ์ฒ˜์Œ 10๊ฐœ์˜ ํ–‰์€ head() ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ‘œ์‹œํ•ฉ๋‹ˆ๋‹ค.
ย 
ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋™์ž‘์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

df_tmp = df_product.copy()
df_product๋ผ๋Š” DataFrame์˜ ๋ณต์‚ฌ๋ณธ์„ ๋งŒ๋“ค์–ด df_tmp๋ผ๋Š” ์ƒˆ๋กœ์šด ๋ณ€์ˆ˜์— ๋Œ€์ž…ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๋ณต์‚ฌ๋ณธ์€ ์›๋ณธ DataFrame์„ ๋ณ€๊ฒฝํ•˜์ง€ ์•Š๋„๋ก ๋งŒ๋“ค์–ด์กŒ์Šต๋‹ˆ๋‹ค.

df_tmp['unit_profit'] = df_tmp['unit_price'] - df_tmp['unit_cost'].
์ด๋Š” ๋‹จ๊ฐ€์—์„œ ๋‹จ๊ฐ€๋ฅผ ๋นผ์„œ ๊ฐ ์ƒํ’ˆ์˜ ๋‹จ์œ„๋‹น ์ด์ต์„ ๊ณ„์‚ฐํ•˜๊ณ , ๊ทธ ๊ฒฐ๊ณผ๋ฅผ unit_profit์ด๋ผ๋Š” ์ƒˆ๋กœ์šด ์ปฌ๋Ÿผ์— ์ €์žฅํ•œ๋‹ค. ์™„์„ฑ๋œ DataFrame์€ df_tmp๋กœ ์ €์žฅ๋ฉ๋‹ˆ๋‹ค.

df_tmp.head(10)
head ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ df_tmp DataFrame์˜ ์ฒ˜์Œ 10๊ฐœ์˜ ํ–‰์„ ํ‘œ์‹œํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

์š”์•ฝํ•˜๋ฉด, ์ด ์ฝ”๋“œ๋Š” DataFrame์˜ ๋ณต์‚ฌ๋ณธ์„ ๋งŒ๋“ค๊ณ , ๋‹จ๊ฐ€์—์„œ ๋‹จ๊ฐ€๋ฅผ ๋นผ์„œ ๊ฐ ์ƒํ’ˆ์˜ ๋‹จ์œ„๋‹น ์ด์ต์„ ๊ณ„์‚ฐํ•˜๊ณ , ๋ณต์‚ฌํ•œ DataFrame์— ๋‹จ์œ„๋‹น ์ด์ต์ด ์žˆ๋Š” ์ƒˆ๋กœ์šด ์—ด์„ ์ถ”๊ฐ€ํ•˜๊ณ , ๊ฒฐ๊ณผ DataFrame์˜ ์ฒ˜์Œ 10์ค„์„ ํ‘œ์‹œํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

ย 

ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋™์ž‘์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

df_tmp = df_product.copy()
df_product๋ผ๋Š” DataFrame์˜ ๋ณต์‚ฌ๋ณธ์„ ๋งŒ๋“ค์–ด df_tmp๋ผ๋Š” ์ƒˆ๋กœ์šด ๋ณ€์ˆ˜์— ๋Œ€์ž…ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๋ณต์‚ฌ๋ณธ์€ ์›๋ณธ DataFrame์„ ๋ณ€๊ฒฝํ•˜์ง€ ์•Š๋„๋ก ๋งŒ๋“ค์–ด์กŒ์Šต๋‹ˆ๋‹ค.

df_tmp['unit_profit_rate'] = (df_tmp['unit_price'] - df_tmp['unit_cost']) / df_tmp['unit_price'].
์ด์ต(๋‹จ๊ฐ€์—์„œ ๋‹จ๊ฐ€๋ฅผ ๋บ€ ๊ฒƒ)์„ ๋‹จ๊ฐ€๋กœ ๋‚˜๋ˆ„์–ด ๊ฐ ์ƒํ’ˆ์˜ ๋‹จ์œ„๋‹น ์ด์ต๋ฅ ์„ ๊ณ„์‚ฐํ•˜๊ณ , ๊ทธ ๊ฒฐ๊ณผ๋ฅผ unit_profit_rate๋ผ๋Š” ์ƒˆ๋กœ์šด ์ปฌ๋Ÿผ์— ์ €์žฅํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์™„์„ฑ๋œ DataFrame์€ df_tmp๋กœ ์ €์žฅ๋ฉ๋‹ˆ๋‹ค.

df_tmp['unit_profit_rate'].mean(skipna=True)
df_tmp DataFrame์˜ unit_profit_rate ์—ด์˜ ํ‰๊ท ๊ฐ’์„ mean ๋ฐฉ์‹์œผ๋กœ ๊ณ„์‚ฐํ•˜๊ณ , ๊ฒฐ์†๊ฐ’(NaN)์ด ์žˆ์œผ๋ฉด ๊ฑด๋„ˆ๋›ฐ๊ณ  ๊ฒฐ๊ณผ๋ฅผ ํ‘œ์‹œํ•œ๋‹ค.

์š”์•ฝํ•˜๋ฉด, DataFrame์˜ ๋ณต์‚ฌ๋ณธ์„ ์ƒ์„ฑํ•˜๊ณ , ์ด์ต(๋‹จ๊ฐ€์—์„œ ๋‹จ๊ฐ€๋ฅผ ๋บ€ ๊ธˆ์•ก)์„ ๋‹จ๊ฐ€๋กœ ๋‚˜๋ˆ„์–ด ๊ฐ ์ƒํ’ˆ์˜ ๋‹จ์œ„๋‹น ์ด์ต๋ฅ ์„ ๊ณ„์‚ฐํ•˜๊ณ , ๋ณต์‚ฌํ•œ DataFrame์— ๋‹จ์œ„๋‹น ์ด์ต๋ฅ  ์—ด์„ ์ƒˆ๋กœ ์ถ”๊ฐ€ํ•˜๊ณ , ์ด์ต๋ฅ  ์—ด์˜ ํ‰๊ท ์„ ๊ณ„์‚ฐํ•˜๊ณ , ๊ฒฐ๊ณผ์˜ ํ‰๊ท ๊ฐ’์„ ํ‘œ์‹œํ•˜๋Š” ์ฝ”๋“œ์ž…๋‹ˆ๋‹ค.
ย 
ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋™์ž‘์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

df_tmp = df_product[['product_cd', 'unit_price', 'unit_cost']].copy()
df_tmp๋ผ๋Š” ์ƒˆ๋กœ์šด DataFrame์„ ์ƒ์„ฑํ•˜์—ฌ product_cd, unit_price, unit_cost๋ผ๋Š” ์„ธ ๊ฐœ์˜ ์—ด๋งŒ ํฌํ•จํ•˜๋„๋ก ํ•˜๊ณ , ์›๋ž˜์˜ df_product DataFrame์—์„œ ์ด ์—ด๋“ค์„ ๋ณต์‚ฌํ•œ๋‹ค.

df_tmp['new_price'] = np.floor(df_tmp['unit_cost'] / 0.7)
์ด๋Š” ๋‹จ๊ฐ€๋ฅผ 0.7(๋งˆํฌ์—… ๊ณ„์ˆ˜ 1.43)๋กœ ๋‚˜๋ˆ„๊ณ  floor ๋ฉ”์„œ๋“œ๋กœ ์†Œ์ˆ˜์  ์ดํ•˜๋ฅผ ๋ฐ˜์˜ฌ๋ฆผํ•˜์—ฌ ๊ฐ ์ƒํ’ˆ์˜ ์ƒˆ๋กœ์šด ๊ฐ€๊ฒฉ์„ ๊ณ„์‚ฐํ•˜๊ณ , ๊ทธ ๊ฒฐ๊ณผ๋ฅผ new_price๋ผ๋Š” ์ƒˆ๋กœ์šด ์ปฌ๋Ÿผ์— ์ €์žฅํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ DataFrame์€ df_tmp๋กœ ์ €์žฅ๋œ๋‹ค.

df_tmp['new_profit_rate'] = (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price'].
์ด์ „ ๋‹จ๊ณ„์—์„œ ๊ณ„์‚ฐํ•œ ์ƒˆ๋กœ์šด ๊ฐ€๊ฒฉ์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ์ƒํ’ˆ์˜ ๋‹จ์œ„๋‹น ์ˆ˜์ต๋ฅ ์„ ๊ณ„์‚ฐํ•˜๊ณ , ๊ทธ ๊ฒฐ๊ณผ๋ฅผ new_profit_rate๋ผ๋Š” ์ƒˆ๋กœ์šด ์ปฌ๋Ÿผ์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ DataFrame์€ df_tmp๋กœ ์ €์žฅ๋ฉ๋‹ˆ๋‹ค.

df_tmp.head(10)
head ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ df_tmp DataFrame์˜ ์ฒ˜์Œ 10๊ฐœ์˜ ํ–‰์„ ํ‘œ์‹œํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

์š”์•ฝํ•˜๋ฉด, ์ด ์ฝ”๋“œ๋Š” ์„ธ ๊ฐœ์˜ ์—ด์„ ๊ฐ€์ง„ ์ƒˆ๋กœ์šด DataFrame์„ ๋งŒ๋“ค๊ณ , ๋งˆํฌ์—… ๊ณ„์ˆ˜ 1.43์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ฐ ์ƒํ’ˆ์˜ ์ƒˆ๋กœ์šด ๊ฐ€๊ฒฉ์„ ๊ณ„์‚ฐํ•˜๊ณ , ์ƒˆ๋กœ์šด ๊ฐ€๊ฒฉ์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ์ƒํ’ˆ์˜ ๋‹จ์œ„๋‹น ์ˆ˜์ต๋ฅ ์„ ๊ณ„์‚ฐํ•˜๊ณ , ๋ณต์‚ฌํ•œ DataFrame์— ์ƒˆ๋กœ์šด ๋‘ ๊ฐœ์˜ ์—ด์„ ์ถ”๊ฐ€ํ•˜๊ณ , ๊ฒฐ๊ณผ์˜ DataFrame์˜ ์ฒ˜์Œ 10๊ฐœ์˜ ํ–‰์„ ํ‘œ์‹œํ•ฉ๋‹ˆ๋‹ค.

ย 

ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋™์ž‘์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

df_tmp = df_product[['product_cd', 'unit_price', 'unit_cost']].copy()
df_tmp๋ผ๋Š” ์ƒˆ๋กœ์šด DataFrame์„ ์ƒ์„ฑํ•˜์—ฌ product_cd, unit_price, unit_cost ์„ธ ๊ฐœ์˜ ์—ด๋งŒ ํฌํ•จํ•˜๋„๋ก ํ•˜๊ณ , ์›๋ž˜์˜ df_product DataFrame์—์„œ ์ด ์—ด๋“ค์„ ๋ณต์‚ฌํ•œ๋‹ค.

df_tmp['new_price'] = np.round(df_tmp['unit_cost'] / 0.7)
์ด๊ฒƒ์€ ๋‹จ๊ฐ€๋ฅผ 0.7(๋งˆํฌ์—… ๊ณ„์ˆ˜ 1.43)๋กœ ๋‚˜๋ˆ„์–ด ๊ฐ ์ƒํ’ˆ์˜ ์ƒˆ๋กœ์šด ๊ฐ€๊ฒฉ์„ ๊ณ„์‚ฐํ•˜๊ณ , round ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ์ •์ˆ˜๋กœ ๋ฐ˜์˜ฌ๋ฆผํ•˜์—ฌ ๊ฒฐ๊ณผ๋ฅผ new_price๋ผ๋Š” ์ƒˆ๋กœ์šด ์ปฌ๋Ÿผ์— ์ €์žฅํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ DataFrame์€ df_tmp๋กœ ์ €์žฅ๋ฉ๋‹ˆ๋‹ค.

df_tmp['new_profit_rate'] = (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price'].
์ด์ „ ๋‹จ๊ณ„์—์„œ ๊ณ„์‚ฐํ•œ ์ƒˆ๋กœ์šด ๊ฐ€๊ฒฉ์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ์ƒํ’ˆ์˜ ๋‹จ์œ„๋‹น ์ˆ˜์ต๋ฅ ์„ ๊ณ„์‚ฐํ•˜๊ณ , ๊ทธ ๊ฒฐ๊ณผ๋ฅผ new_profit_rate๋ผ๋Š” ์ƒˆ๋กœ์šด ์ปฌ๋Ÿผ์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ DataFrame์€ df_tmp๋กœ ์ €์žฅ๋ฉ๋‹ˆ๋‹ค.

df_tmp.head(10)
head ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ df_tmp DataFrame์˜ ์ฒ˜์Œ 10๊ฐœ์˜ ํ–‰์„ ํ‘œ์‹œํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

์š”์•ฝํ•˜๋ฉด, ์ด ์ฝ”๋“œ๋Š” ์„ธ ๊ฐœ์˜ ์—ด์„ ๊ฐ€์ง„ ์ƒˆ๋กœ์šด DataFrame์„ ์ƒ์„ฑํ•˜๊ณ , ๋งˆํฌ์—… ๊ณ„์ˆ˜ 1.43๊ณผ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ์ •์ˆ˜๋กœ ๋ฐ˜์˜ฌ๋ฆผํ•˜์—ฌ ๊ฐ ์ œํ’ˆ์˜ ์ƒˆ๋กœ์šด ๊ฐ€๊ฒฉ์„ ๊ณ„์‚ฐํ•˜๊ณ , ์ƒˆ๋กœ์šด ๊ฐ€๊ฒฉ์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ์ œํ’ˆ์˜ ๋‹จ์œ„๋‹น ์ˆ˜์ต๋ฅ ์„ ๊ณ„์‚ฐํ•˜๊ณ , ๋ณต์‚ฌํ•œ DataFrame์— ์ƒˆ๋กœ์šด ๋‘ ๊ฐœ์˜ ์—ด์„ ์ถ”๊ฐ€ํ•˜๊ณ , ๊ฒฐ๊ณผ DataFrame์˜ ์ฒ˜์Œ 10๊ฐœ์˜ ํ–‰์„ ํ‘œ์‹œํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

ย 

ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” df_product๋ผ๋Š” pandas DataFrame ๊ฐ์ฒด๋ฅผ ์กฐ์ž‘ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

์•„๋ž˜ ์ฝ”๋“œ๋ฅผ ํ•œ ์ค„์”ฉ ๋ถ„ํ•ดํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

df_tmp = df_product[['product_cd', 'unit_price', 'unit_cost']].copy(): ์ด ๋ผ์ธ์€ df_product์˜ ํ•˜์œ„ ์ง‘ํ•ฉ์œผ๋กœ "product_cd", "unit_price", "unit_cost"์˜ ์—ด๋งŒ ๋ฅผ ํฌํ•จํ•˜๋Š” df_tmp๋ผ๋Š” ์ƒˆ๋กœ์šด DataFrame ๊ฐ์ฒด๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. .copy() ๋ฉ”์„œ๋“œ๋Š” ์›๋ณธ df_product๋ฅผ ์ฐธ์กฐํ•˜๋Š” ๋Œ€์‹  DataFrame์˜ ์ƒˆ๋กœ์šด ๋ณต์‚ฌ๋ณธ์„ ์ƒ์„ฑํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋œ๋‹ค.

df_tmp['new_price'] = np.ceiling(df_tmp['unit_cost'] / 0.7): ์ด ํ–‰์€ df_tmp์— 'new_price'๋ผ๋Š” ์ƒˆ๋กœ์šด ์—ด์„ ๊ณ„์‚ฐํ•œ๋‹ค. ์ด๋Š” 'unit_cost' ์—ด์„ 0.7๋กœ ๋‚˜๋ˆˆ ํ›„ numpy์˜ ceiling() ํ•จ์ˆ˜๋กœ ์ƒํ•œ์„ ๊ตฌํ•œ ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด ์ด์œค์œจ 30%๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์ƒˆ๋กœ์šด ๊ฐ€๊ฒฉ์ด ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค.

df_tmp['new_profit_rate'] = \ (df_tmp['new_price'] - df_tmp['unit_cost']) / df_tmp['new_price']: ์ด ํ–‰์€ df_tmp์— 'new_profit_rate'๋ผ๋Š” ์ƒˆ๋กœ์šด ์—ด์„ ๋งŒ๋“ค๊ณ  'new_price' ์—ด์— 'unit_cost'๋ฅผ ์ž…๋ ฅํ•œ๋‹ค. new_price' ์—ด์—์„œ 'unit_cost' ์—ด์„ ๋นผ๊ณ  'new_price' ์—ด๋กœ ๋‚˜๋ˆˆ ๊ฒฐ๊ณผ์ด๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์ƒˆ๋กœ์šด ๊ฐ€๊ฒฉ์— ๋Œ€ํ•œ ์ˆ˜์ต๋ฅ ์ด ๊ณ„์‚ฐ๋œ๋‹ค.

df_tmp.head(10): df_tmp DataFrame์˜ ์ฒ˜์Œ 10๊ฐœ์˜ ํ–‰์„ ์ถœ๋ ฅํ•œ๋‹ค. ์ด ํ–‰์€ ์ด์ „ ํ–‰์˜ ๊ณ„์‚ฐ์ด ์ œ๋Œ€๋กœ ์ด๋ฃจ์–ด์กŒ๋Š”์ง€ ํ™•์ธํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋œ๋‹ค.
ย 
ํ•ด์„ค:

์ด ์ฝ”๋“œ์—์„œ๋Š” df_product๋ผ๋Š” ์ด๋ฆ„์˜ pandas DataFrame ๊ฐ์ฒด๋„ ์กฐ์ž‘ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

์•„๋ž˜ ์ฝ”๋“œ๋ฅผ ํ•œ ์ค„์”ฉ ๋ถ„ํ•ดํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

df_tmp = df_product[['product_cd', 'unit_price']].copy(): ์ด ์ค„์€ df_product์˜ ํ•˜์œ„ ์ง‘ํ•ฉ์œผ๋กœ "product_cd"์™€ "unit_price" ์ปฌ๋Ÿผ๋งŒ ํฌํ•จํ•˜๋Š” df_tmp๋ผ๋Š” ์ƒˆ๋กœ์šด DataFrame ๊ฐ์ฒด๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. .copy() ๋ฉ”์„œ๋“œ๋Š” ์›๋ž˜์˜ df_product๋ฅผ ์ฐธ์กฐํ•˜๋Š” ๋Œ€์‹  DataFrame์˜ ์ƒˆ๋กœ์šด ๋ณต์‚ฌ๋ณธ์„ ์ƒ์„ฑํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

df_tmp['tax_price'] = np.floor(df_tmp['unit_price'] * 1.1): ์ด ํ–‰์€ df_tmp์— 'tax_price'๋ผ๋Š” ์ƒˆ๋กœ์šด ์—ด์„ ๊ณ„์‚ฐํ•œ๋‹ค. ์ด ์—ด์€ 'unit_price' ์—ด์— 1.1์„ ๊ณฑํ•˜๊ณ  ๊ทธ ๊ฒฐ๊ณผ์˜ ๋ฐ”๋‹ฅ์„ numpy์˜ floor() ํ•จ์ˆ˜๋กœ ๊ตฌํ•œ ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด 10%์˜ ์„ธ๊ธˆ์ด ์ถ”๊ฐ€๋œ ์ƒˆ๋กœ์šด ๊ฐ€๊ฒฉ์ด ๊ณ„์‚ฐ๋œ๋‹ค.

df_tmp.head(10): ์ด ํ–‰์€ df_tmp์˜ DataFrame์˜ ์ฒ˜์Œ 10๊ฐœ์˜ ํ–‰์„ ์ถœ๋ ฅํ•œ๋‹ค. ์ด์ „ ํ–‰์˜ ๊ณ„์‚ฐ์ด ์ œ๋Œ€๋กœ ์ด๋ฃจ์–ด์กŒ๋Š”์ง€ ํ™•์ธํ•˜๋Š” ๋ฐ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

ย 

ํ•ด์„ค:

์ด ์ฝ”๋“œ์—์„œ๋Š” pandas์˜ DataFrame ๊ฐ์ฒด์ธ df_receipt์™€ df_product๋„ ์กฐ์ž‘ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

์•„๋ž˜ ์ฝ”๋“œ๋ฅผ ํ•œ ์ค„์”ฉ ๋ถ„ํ•ดํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

df_tmp_1 = df_receipt.groupby('customer_id').agg({'amount':'sum'}).reset_index().rename(columns={'amount':'sum_all'}): ์ด ์ค„์€ df_receipt DataFrame์„ "customer_id"๋กœ ๋ณ€๊ฒฝํ•ฉ๋‹ˆ๋‹ค. receipt DataFrame์„ "customer_id"๋กœ ๊ทธ๋ฃนํ™”ํ•˜์—ฌ ๊ทธ๋ฃน๋ณ„๋กœ "amount" ์—ด์˜ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ , ๊ทธ ๊ฒฐ๊ณผ DataFrame์˜ ์ธ๋ฑ์Šค๋ฅผ ์žฌ์„ค์ •ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ดํ›„ .rename() ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฒฐ๊ณผ DataFrame์˜ ์ด๋ฆ„์„ "sum_all"๋กœ ๋ณ€๊ฒฝํ•ฉ๋‹ˆ๋‹ค.

df_tmp_2 = pd.merge(df_receipt, df_product.query('category_major_cd == "07"'), how='inner', on='product_cd').groupby('customer_id').agg ({'amount': sum'}).reset_index().rename(columns={'amount': 'sum_07' }): ์ด ๋ผ์ธ์€ ๋จผ์ € df_product๋ฅผ ํ•„ํ„ฐ๋งํ•˜์—ฌ category_major_cd ์—ด์ด "07"๊ณผ ๊ฐ™์€ ํ–‰๋งŒ ํฌํ•จํ•˜๋„๋ก ํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ ์ด ํ•„ํ„ฐ๋ง๋œ DataFrame์„ ๋‚ด๋ถ€ ๋ฐ”์ธ๋”ฉ์„ ์‚ฌ์šฉํ•˜์—ฌ "product_cd" ์—ด์˜ df_receipt์™€ ๊ฒฐํ•ฉํ•œ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ ๊ฒฐ๊ณผ DataFrame์„ 'customer_id'๋กœ ๊ทธ๋ฃนํ™”ํ•˜์—ฌ ๊ทธ๋ฃน๋ณ„๋กœ 'amount' ์—ด์˜ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ์ธ๋ฑ์Šค๋ฅผ ์žฌ์„ค์ •ํ•œ๋‹ค. ์™„์„ฑ๋œ DataFrame์€ .rename() ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ "sum_07"๋กœ ์ด๋ฆ„์ด ๋ณ€๊ฒฝ๋œ๋‹ค.

df_tmp_3 = pd.merge(df_tmp_1, df_tmp_2, how='inner', on='customer_id'): ์ด ํ–‰์€ ๋‚ด๋ถ€ ๊ฒฐํ•ฉ์„ ์‚ฌ์šฉํ•˜์—ฌ df_tmp_1๊ณผ df_tmp_2๋ฅผ 'customer_id' ์ปฌ๋Ÿผ์œผ๋กœ ๋ณ‘ํ•ฉํ•œ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๊ฐ ๊ณ ๊ฐ์˜ 'sum_all' ์ปฌ๋Ÿผ๊ณผ 'sum_07' ์ปฌ๋Ÿผ์„ ํฌํ•จํ•˜๋Š” ์ƒˆ๋กœ์šด DataFrame์ด ์ƒ์„ฑ๋œ๋‹ค.

df_tmp_3['sales_rate'] = df_tmp_3['sum_07'] / df_tmp_3['sum_all']: ์ด ํ–‰์€ df_tmp_3์˜ ์ƒˆ๋กœ์šด ์—ด 'sales_rate'๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์œผ๋กœ, 'sum_07' ์—ด์„ 'sum_all' ์—ด๋กœ ๋‚˜๋ˆˆ ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค. ์ด๋Š” ๊ณ ๊ฐ์˜ ์ด ๊ตฌ๋งค์•ก ์ค‘ '07' ์นดํ…Œ๊ณ ๋ฆฌ์—์„œ ๊ตฌ๋งคํ•œ ๊ธˆ์•ก์˜ ๋น„์œจ์„ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

df_tmp_3.head(10): df_tmp_3 DataFrame์˜ ์ฒ˜์Œ 10๊ฐœ์˜ ํ–‰์„ ์ถœ๋ ฅํ•œ๋‹ค. ์ด ํ–‰์€ ์ด์ „ ํ–‰์˜ ๊ณ„์‚ฐ์ด ์ œ๋Œ€๋กœ ์ด๋ฃจ์–ด์กŒ๋Š”์ง€ ํ™•์ธํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋œ๋‹ค.

ย 

ย 
ํ•ด์„ค:

์ด ์ฝ”๋“œ์—์„œ๋Š” pandas์˜ DataFrame ๊ฐ์ฒด์ธ df_receipt์™€ df_product๋„ ์กฐ์ž‘ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

์•„๋ž˜ ์ฝ”๋“œ๋ฅผ ํ•œ ์ค„์”ฉ ๋ถ„ํ•ดํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

df_temp = df_receipt.merge(df_product, how='left', on='product_cd').groupby(['customer_id', 'category_major_cd']) ['amount'].sum(). unstack(): ์ด ํ–‰์€ ๋จผ์ € ์™ผ์ชฝ ๊ฒฐํ•ฉ์„ ์‚ฌ์šฉํ•˜์—ฌ df_receipt์™€ df_product๋ฅผ "product_cd" ์ปฌ๋Ÿผ์œผ๋กœ ๊ฒฐํ•ฉํ•œ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ ๊ฒฐ๊ณผ DataFrame์„ "customer_id"์™€ "category_major_cd"๋กœ ๊ทธ๋ฃนํ™”ํ•˜์—ฌ ๊ฐ ๊ทธ๋ฃน์˜ "amount" ์—ด์˜ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ , ๊ฒฐ๊ณผ ์‹œ๋ฆฌ์ฆˆ๋ฅผ ์–ธ์Šคํƒํ•˜์—ฌ "customer_id"๋ฅผ ์ธ๋ฑ์Šค๋กœ, "category_major_cd"๋ฅผ ์—ด ์ด๋ฆ„์œผ๋กœ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค. category_major_cd", "amount" ์—ด์„ ๊ฐ’์œผ๋กœ ํ•˜๋Š” DataFrame์„ ๋งŒ๋“ค์—ˆ์Šต๋‹ˆ๋‹ค.

df_temp = df_temp[df_temp['07'] > 0]: ์ด ํ–‰์€ df_temp DataFrame์„ ํ•„ํ„ฐ๋งํ•˜์—ฌ '07' ์—ด์˜ ๊ฐ’์ด 0๋ณด๋‹ค ํฐ ํ–‰๋งŒ ํฌํ•จํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด '07' ์นดํ…Œ๊ณ ๋ฆฌ์—์„œ ์•„๋ฌด๊ฒƒ๋„ ๊ตฌ๋งคํ•˜์ง€ ์•Š์€ ๊ณ ๊ฐ์ด ์ œ๊ฑฐ๋ฉ๋‹ˆ๋‹ค.

df_temp['sum_all'] = df_temp.sum(axis=1): ์ด ํ–‰์€ df_temp์— 'sum_all'์ด๋ผ๋Š” ์ƒˆ๋กœ์šด ์—ด์„ ๊ณ„์‚ฐํ•˜์—ฌ ๊ฐ ํ–‰์˜ ๋ชจ๋“  ์—ด์˜ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๊ฐ ๊ณ ๊ฐ์ด ๋ชจ๋“  ์นดํ…Œ๊ณ ๋ฆฌ์—์„œ ์ง€์ถœํ•œ ๊ธˆ์•ก์˜ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

df_temp['sales_rate'] = df_temp['07'] / df_temp['sum_all']: ์ด ํ–‰์€ df_temp์— 'sales_rate'๋ผ๋Š” ์ƒˆ๋กœ์šด ์—ด์„ ๊ณ„์‚ฐํ•œ๋‹ค. ์ด๋Š” ๊ณ ๊ฐ์˜ ์ด ๊ตฌ๋งค ๊ธˆ์•ก ์ค‘ '07' ์นดํ…Œ๊ณ ๋ฆฌ์—์„œ ๊ตฌ๋งคํ•œ ๊ธˆ์•ก์˜ ๋น„์œจ์„ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

df_temp.columns.name = '': ์ด ํ–‰์€ df_temp DataFrame์—์„œ ์—ด ์ด๋ฆ„์„ ์‚ญ์ œํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

df_temp = df_temp.reset_index(): ์ด ํ–‰์€ df_temp์˜ ์ธ๋ฑ์Šค๋ฅผ ์žฌ์„ค์ •ํ•˜๊ณ  0๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๋Š” ์ •์ˆ˜ ๋ฒ”์œ„๋ฅผ ๊ฐ€์ง„ 'index'๋ผ๋Š” ์ƒˆ๋กœ์šด ์—ด์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.

df_temp.head(10): df_temp ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์˜ ์ฒ˜์Œ 10๊ฐœ์˜ ํ–‰์„ ์ถœ๋ ฅํ•œ๋‹ค. ์ด ํ–‰์€ ์ด์ „ ํ–‰์˜ ๊ณ„์‚ฐ์ด ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์ˆ˜ํ–‰๋˜์—ˆ๋Š”์ง€ ํ™•์ธํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

ย 

ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” df_receipt๋ผ๋Š” DataFrame๊ณผ df_customer๋ผ๋Š” ๋˜ ๋‹ค๋ฅธ DataFrame์— ๋Œ€ํ•ด ๋ฐ์ดํ„ฐ ์กฐ์ž‘์„ ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๊ณ ๊ฐ์˜ ์‹ ์ฒญ์ผ๊ณผ ๊ตฌ๋งค์ผ ์‚ฌ์ด์˜ ์ผ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

๋‹ค์Œ์€ ์ฝ”๋“œ์˜ ๋‹จ๊ณ„๋ณ„ ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค.

df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates(): df_receipt DataFrame์—์„œ "customer_id"์™€ "sales_ymd" ์ปฌ๋Ÿผ๋งŒ ์„ ํƒํ•ด์„œ df_tmp๋ผ๋Š” ์ƒˆ๋กœ์šด DataFrame์„ ์ƒ์„ฑํ•˜๊ณ  ์ค‘๋ณต๋œ ํ–‰์„ ์‚ญ์ œํ•ฉ๋‹ˆ๋‹ค. ๊ทธ ๊ฒฐ๊ณผ, ๊ณ ๊ฐ ID์™€ ๋งค์ถœ ๋‚ ์งœ์˜ ์กฐํ•ฉ์ด ๊ณ ์œ ํ•œ DataFrame์ด ๋งŒ๋“ค์–ด์กŒ๋‹ค.

df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']], how='inner', on='customer_id'): df_tmp์™€ df_customer DataFrame์˜ ๋‚ด๋ถ€ ๊ฒฐํ•ฉ์„ ์ˆ˜ํ–‰ํ•˜๊ณ , df_customer DataFrame์—์„œ 'customer_id' ์—ด๊ณผ 'application_date' ์—ด๋งŒ ์„ ํƒํ•œ๋‹ค. ๊ฒฐ๊ณผ DataFrame์—๋Š” df_tmp์™€ df_customer์— ๋ชจ๋‘ ์กด์žฌํ•˜๋Š” ๊ณ ๊ฐ ID๋ฅผ ๊ฐ€์ง„ ํ–‰๋งŒ ํฌํ•จํ•˜๊ฒŒ ๋œ๋‹ค.

df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str')): df_tmp์˜ "sales_ymd" ์—ด์„ pd.to_datetime() ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ datetime ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค. ์ด ํ•จ์ˆ˜๋Š” ์ปฌ๋Ÿผ์„ ๋ฌธ์ž์—ด ํ˜•์‹์—์„œ ๊ณ„์‚ฐ์— ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” datetime ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค.

df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date']): df_tmp์˜ "application_date" ์นผ๋Ÿผ์„ pd.to_datetime() ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ datetime ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค. ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค. ์ด ํ•จ์ˆ˜๋Š” ์ปฌ๋Ÿผ์„ ๋ฌธ์ž์—ด ํ˜•์‹์—์„œ ๊ณ„์‚ฐ์— ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” datetime ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค.

df_tmp['elapsed_days'] = df_tmp['sales_ymd'] - df_tmp['application_date']: "sales_ymd"์™€ "application_date" ์—ด์˜ ์ฐจ์ด๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ , ๊ทธ ๊ฒฐ๊ณผ๋ฅผ df_tmp์˜ ์ƒˆ๋กœ์šด ์—ด "elapsed _days"์— ๋Œ€์ž…ํ•œ๋‹ค. ์ด ๊ณ„์‚ฐ์˜ ๊ฒฐ๊ณผ๋กœ ๋‘ ๋‚ ์งœ์˜ ์ฐจ์ด๋ฅผ ์ผ, ์‹œ๊ฐ„, ๋ถ„, ์ดˆ ๋‹จ์œ„๋กœ ํ‘œํ˜„ํ•œ timedelta ๊ฐ์ฒด๊ฐ€ ์ƒ์„ฑ๋œ๋‹ค.

df_tmp['elapsed_days'] = df_tmp['elapsed_days'].dt.days: "elapsed_days" ์—ด์—์„œ ์ผ์ˆ˜๋งŒ ๊ฐ€์ ธ์™€์„œ ๊ทธ ๊ฒฐ๊ณผ๋ฅผ ๊ฐ™์€ ์—ด์— ๋Œ€์ž…ํ•˜์—ฌ ๋ฐ˜ํ™˜ํ•œ๋‹ค. ๊ฐ์ฒด๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์ง€๋งŒ, ์ผ์ˆ˜๋งŒ ๋ณด๊ด€ํ•˜๊ณ  ์‹ถ๊ธฐ ๋•Œ๋ฌธ์— ์ด ๋‹จ๊ณ„๊ฐ€ ํ•„์š”ํ•˜๋‹ค.

df_tmp.head(10): df_tmp์˜ ์ฒ˜์Œ 10์ค„์„ ํ‘œ์‹œํ•˜์—ฌ ๊ณ„์‚ฐ์ด ์ œ๋Œ€๋กœ ์ด๋ฃจ์–ด์กŒ๋Š”์ง€ ํ™•์ธํ•œ๋‹ค.

์ „๋ฐ˜์ ์œผ๋กœ ์ด ์ฝ”๋“œ๋Š” ๊ณ ๊ฐ์˜ ํ–‰๋™์„ ๋ถ„์„ํ•˜๊ณ  ๊ณ ๊ฐ์ด ๋ฉค๋ฒ„์‹ญ์ด๋‚˜ ๊ณ„์ •์„ ์‹ ์ฒญํ•œ ํ›„ ๊ตฌ๋งคํ•˜๊ธฐ๊นŒ์ง€ ๊ฑธ๋ฆฌ๋Š” ์‹œ๊ฐ„์„ ํŒŒ์•…ํ•˜๋Š” ๋ฐ ์œ ์šฉํ•˜๋‹ค.
ย 
ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” ์ง€๋‚œ๋ฒˆ ์„ค๋ช…ํ•œ ์ฝ”๋“œ์™€ ๋น„์Šทํ•˜์ง€๋งŒ, ๊ณ ๊ฐ์˜ ์‹ ์ฒญ์ผ๋กœ๋ถ€ํ„ฐ ๊ตฌ๋งค์ผ๊นŒ์ง€์˜ ์ผ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ๊ฒฝ๊ณผํ•œ ๊ฐœ์›” ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ์ฝ”๋“œ์ž…๋‹ˆ๋‹ค.

๋‹ค์Œ์€ ์ฝ”๋“œ๋ฅผ ๋‹จ๊ณ„๋ณ„๋กœ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.

df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates(): df_receipt DataFrame์—์„œ customer_id์™€ sales_ymd ์—ด๋งŒ ์„ ํƒํ•˜๊ณ  ์ค‘๋ณต๋œ ํ–‰์„ ์‚ญ์ œํ•˜์—ฌ ์ƒˆ๋กœ์šด df_tmp๋ผ๋Š” DataFrame์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๊ณ ๊ฐ ID์™€ ํŒ๋งค ๋‚ ์งœ์˜ ์กฐํ•ฉ์ด ๊ณ ์œ ํ•œ DataFrame์ด ๋งŒ๋“ค์–ด์ง‘๋‹ˆ๋‹ค.

df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']], how='inner', on='customer_id'): df_tmp์™€ df_customer DataFrame์˜ ๋‚ด๋ถ€ ๊ฒฐํ•ฉ์„ ์ˆ˜ํ–‰ํ•˜๊ณ , df_customer DataFrame์—์„œ 'customer_id' ์—ด๊ณผ 'application_date' ์—ด๋งŒ ์„ ํƒํ•œ๋‹ค. ๊ฒฐ๊ณผ DataFrame์—๋Š” df_tmp์™€ df_customer์— ๋ชจ๋‘ ์กด์žฌํ•˜๋Š” ๊ณ ๊ฐ ID๋ฅผ ๊ฐ€์ง„ ํ–‰๋งŒ ํฌํ•จํ•˜๊ฒŒ ๋œ๋‹ค.

df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str')): df_tmp์˜ "sales_ymd" ์—ด์„ pd.to_datetime() ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ datetime ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค.

df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date']): df_tmp์˜ "application_date" ์ปฌ๋Ÿผ์„ pd.to_datetime() ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ datetime ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค.

df_tmp['elapsed_months'] = df_tmp[['sales_ymd', 'application_date']]. \ ฮ“ apply(lambda x: relativedelta(x[0], x[1]).years * 12 + ฮ“ relativedelta(x[0], x[1]).months, axis=1): "sales_ymd"์™€ "application_date" ์—ด์˜ ์ฐจ์ด๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค, ๊ฒฐ๊ณผ๋ฅผ df_tmp์˜ "elapsed_months"๋ผ๋Š” ์ƒˆ๋กœ์šด ์—ด์— ๋Œ€์ž…ํ•œ๋‹ค. ์ด ๊ณ„์‚ฐ์€ dateutil ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ relativedelta ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‘ datetime ๊ฐ’์˜ ์ฐจ์ด๋ฅผ ๋…„ ๋˜๋Š” ์›” ๋‹จ์œ„๋กœ ๊ณ„์‚ฐํ•˜๊ณ , apply() ๋ฉ”์„œ๋“œ์—์„œ "sales_ymd"์™€ "application_date" ์—ด์˜ ๊ฐ ํ–‰์— ์ด ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•˜๊ณ  ๊ฒฝ๊ณผํ•œ ๊ฐœ์›”์˜ ํ•ฉ๊ณ„๋ฅผ ๋ฐ˜ํ™˜ํ•œ๋‹ค.

df_tmp.head(10): df_tmp์˜ ์ฒ˜์Œ 10๊ฐœ์˜ ํ–‰์„ ํ‘œ์‹œํ•˜์—ฌ ๊ณ„์‚ฐ์ด ์ œ๋Œ€๋กœ ์ด๋ฃจ์–ด์กŒ๋Š”์ง€ ํ™•์ธํ•œ๋‹ค.

์ „์ฒด์ ์œผ๋กœ ์ด ์ฝ”๋“œ๋Š” ๊ณ ๊ฐ์˜ ํ–‰๋™์„ ๋ถ„์„ํ•˜๊ณ  ๊ณ ๊ฐ์ด ํšŒ์› ๊ฐ€์ž… ๋˜๋Š” ๊ณ„์ •์„ ์‹ ์ฒญํ•œ ํ›„ ๊ตฌ๋งคํ•˜๊ธฐ๊นŒ์ง€์˜ ๊ธฐ๊ฐ„์„ ๊ฒฝ๊ณผ ๊ฐœ์›” ์ˆ˜๋กœ ํŒ๋‹จํ•˜๋Š” ๋ฐ ํŽธ๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

ย 

ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” ์•ž์„œ ์„ค๋ช…ํ•œ ์˜ˆ์ œ์™€ ๋น„์Šทํ•˜์ง€๋งŒ, ๊ณ ๊ฐ์˜ ์‹ ์ฒญ์ผ๋กœ๋ถ€ํ„ฐ ๊ตฌ๋งค์ผ๊นŒ์ง€์˜ ๊ฒฝ๊ณผ ์ผ์ˆ˜๋‚˜ ๊ฒฝ๊ณผ ์›”์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ๊ฒฝ๊ณผ ๋…„์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ์ฝ”๋“œ์ž…๋‹ˆ๋‹ค.

์•„๋ž˜ ์ฝ”๋“œ๋ฅผ ์ˆœ์„œ๋Œ€๋กœ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.

df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates(): df_receipt DataFrame์—์„œ customer_id์™€ sales_ymd๋ผ๋Š” ์ปฌ๋Ÿผ๋งŒ ์„ ํƒํ•˜์—ฌ ์ค‘๋ณต๋˜๋Š” ํ–‰์„ ๋ชจ๋‘ ์‚ญ์ œํ•˜๊ณ  ํ•˜์—ฌ df_tmp๋ผ๋Š” ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๊ณ ๊ฐ ID์™€ ํŒ๋งค์ผ ์กฐํ•ฉ์ด ๊ณ ์œ ํ•œ DataFrame์ด ๋งŒ๋“ค์–ด์ง„๋‹ค.

df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']], how='inner', on='customer_id'): df_tmp์™€ df_customer DataFrame์˜ ๋‚ด๋ถ€ ๊ฒฐํ•ฉ์„ ์ˆ˜ํ–‰ํ•˜๊ณ , df_customer DataFrame์—์„œ 'customer_id' ์—ด๊ณผ 'application_date' ์—ด๋งŒ ์„ ํƒํ•œ๋‹ค. ๊ฒฐ๊ณผ DataFrame์—๋Š” df_tmp์™€ df_customer์— ๋ชจ๋‘ ์กด์žฌํ•˜๋Š” ๊ณ ๊ฐ ID๋ฅผ ๊ฐ€์ง„ ํ–‰๋งŒ ํฌํ•จํ•˜๊ฒŒ ๋œ๋‹ค.

df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str')): df_tmp์˜ "sales_ymd" ์—ด์„ pd.to_datetime() ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ datetime ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค.

df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date']): df_tmp์˜ "application_date" ์ปฌ๋Ÿผ์„ pd.to_datetime() ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ datetime ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค.

df_tmp['elapsed_years'] = df_tmp[['sales_ymd', 'application_date']]. \ ฮ“ apply(lambda x: relativedelta(x[0], x[1]).years, axis=1): "sales_ymd"์™€ "application_date" ์—ด์˜ ์ฐจ์ด๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ๊ทธ ๊ฒฐ๊ณผ๋ฅผ df_tmp์˜ ์ƒˆ๋กœ์šด ์—ด "elapsed_years"์— ๋Œ€์ž…ํ•œ๋‹ค. ์— ๋Œ€์ž…ํ•œ๋‹ค. ์ด ๊ณ„์‚ฐ์€ dateutil ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ relativedelta ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‘ datetime ๊ฐ’์˜ ์ฐจ์ด๋ฅผ ์—ฐ ๋‹จ์œ„๋กœ ๊ณ„์‚ฐํ•˜๊ณ , apply() ๋ฉ”์„œ๋“œ๋Š” "sales_ymd"์™€ "application_date" ์—ด์˜ ๊ฐ ํ–‰์— ์ด ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•˜์—ฌ ์ด ๊ฒฝ๊ณผ ์—ฐ์ˆ˜๋ฅผ ๋ฐ˜ํ™˜ํ•œ๋‹ค.

df_tmp.head(10): df_tmp์˜ ์ฒ˜์Œ 10๊ฐœ์˜ ํ–‰์„ ํ‘œ์‹œํ•˜์—ฌ ๊ณ„์‚ฐ์ด ์ œ๋Œ€๋กœ ์ด๋ฃจ์–ด์กŒ๋Š”์ง€ ํ™•์ธํ•œ๋‹ค.

์ „์ฒด์ ์œผ๋กœ ์ด ์ฝ”๋“œ๋Š” ๊ณ ๊ฐ์˜ ํ–‰๋™์„ ๋ถ„์„ํ•˜๊ณ  ๊ณ ๊ฐ์ด ํšŒ์› ๊ฐ€์ž…์ด๋‚˜ ๊ณ„์ •์„ ์‹ ์ฒญํ•œ ํ›„ ๊ตฌ๋งคํ•˜๊ธฐ๊นŒ์ง€์˜ ๊ธฐ๊ฐ„์„ ๊ฒฝ๊ณผ ๋…„์ˆ˜๋กœ ํŒ๋‹จํ•˜๋Š” ๋ฐ ์œ ์šฉํ•˜๋‹ค.

ย 

ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” datetime์˜ ๊ฐ’์„ ์œ ๋‹‰์Šค ํƒ€์ž„์Šคํƒฌํ”„๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ๊ณ ๊ฐ์˜ ์‹ ์ฒญ์ผ๊ณผ ๊ตฌ๋งค์ผ ์‚ฌ์ด์˜ ๊ฒฝ๊ณผ ์‹œ๊ฐ„(์ดˆ)์„ ๊ณ„์‚ฐํ•˜๋Š” ์ฝ”๋“œ์ž…๋‹ˆ๋‹ค.

๋‹ค์Œ์€ ์ฝ”๋“œ์˜ ๋‹จ๊ณ„๋ณ„ ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค.

df_tmp = df_receipt[['customer_id', 'sales_ymd']].drop_duplicates(): df_receipt DataFrame์—์„œ "customer_id"์™€ "sales_ymd" ์ปฌ๋Ÿผ๋งŒ ์„ ํƒํ•ด์„œ df_tmp๋ผ๋Š” ์ƒˆ๋กœ์šด DataFrame์„ ์ƒ์„ฑํ•˜๊ณ , ์ค‘๋ณต๋œ ํ–‰์€ ์‚ญ์ œํ•ฉ๋‹ˆ๋‹ค. ๊ทธ ๊ฒฐ๊ณผ ๊ณ ๊ฐ ID์™€ ํŒ๋งค์ผ ์กฐํ•ฉ์ด ๊ณ ์œ ํ•œ DataFrame์ด ๋งŒ๋“ค์–ด์กŒ๋‹ค.

df_tmp = pd.merge(df_tmp, df_customer[['customer_id', 'application_date']], how='inner', on='customer_id'): df_tmp์™€ df_customer DataFrame์˜ ๋‚ด๋ถ€ ๊ฒฐํ•ฉ์„ ์ˆ˜ํ–‰ํ•˜๊ณ , df_customer DataFrame์—์„œ 'customer_id' ์—ด๊ณผ 'application_date' ์—ด๋งŒ ์„ ํƒํ•œ๋‹ค. ๊ฒฐ๊ณผ DataFrame์—๋Š” df_tmp์™€ df_customer์— ๋ชจ๋‘ ์กด์žฌํ•˜๋Š” ๊ณ ๊ฐ ID๋ฅผ ๊ฐ€์ง„ ํ–‰๋งŒ ํฌํ•จํ•˜๊ฒŒ ๋œ๋‹ค.

df_tmp['sales_ymd'] = pd.to_datetime(df_tmp['sales_ymd'].astype('str')): df_tmp์˜ "sales_ymd" ์—ด์„ pd.to_datetime() ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ datetime ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค.

df_tmp['application_date'] = pd.to_datetime(df_tmp['application_date']): df_tmp์˜ "application_date" ์ปฌ๋Ÿผ์„ pd.to_datetime() ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ datetime ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค.

df_tmp['elapsed_epoch'] = df_tmp['sales_ymd'].view(np.int64) - \ df_tmp['application_date'].view(np.int64): "sales_ymd" ์™€ "application_date" ์—ด์˜ ์ฐจ์ด๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค. date" ์—ด์˜ ์ฐจ์ด๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ๊ทธ ๊ฒฐ๊ณผ๋ฅผ df_tmp์˜ "elapsed_epoch" ์—ด์— ํ• ๋‹นํ•œ๋‹ค. ์ด ๊ณ„์‚ฐ์€ view() ๋ฉ”์„œ๋“œ์—์„œ datetime ๊ฐ’์„ int64 ํ˜•์‹(Unix ํƒ€์ž„์Šคํƒฌํ”„)์œผ๋กœ ๋ณ€ํ™˜ํ•˜๊ณ , ๋‘ ๊ฐ’์„ ์š”์†Œ๋ณ„๋กœ ๊ฐ์‚ฐํ•˜์—ฌ ๋‚˜๋…ธ์ดˆ ๋‹จ์œ„์˜ ๊ฒฝ๊ณผ ์‹œ๊ฐ„์„ ๊ตฌํ•˜๊ณ  ์žˆ๋‹ค.

df_tmp['elapsed_epoch'] = df_tmp['elapsed_epoch'] / 10**9: "elapsed_epoch" ์—ด์„ 10^9๋กœ ๋‚˜๋ˆ„์–ด ๊ฒฝ๊ณผ ์‹œ๊ฐ„์„ ๋‚˜๋…ธ์ดˆ์—์„œ ์ดˆ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค.

df_tmp.head(10): df_tmp์˜ ์ฒ˜์Œ 10์ค„์„ ํ‘œ์‹œํ•˜์—ฌ ๊ณ„์‚ฐ์ด ์ œ๋Œ€๋กœ ์ด๋ฃจ์–ด์กŒ๋Š”์ง€ ํ™•์ธํ•œ๋‹ค.

์ „์ฒด์ ์œผ๋กœ ์ด ์ฝ”๋“œ๋Š” ๊ณ ๊ฐ์˜ ํ–‰๋™์„ ๋ถ„์„ํ•˜๊ณ  ๊ณ ๊ฐ์ด ํšŒ์› ๊ฐ€์ž…์ด๋‚˜ ๊ณ„์ •์„ ์‹ ์ฒญํ•œ ํ›„ ๊ตฌ๋งคํ•˜๊ธฐ๊นŒ์ง€์˜ ์‹œ๊ฐ„์„ ์ดˆ ๋‹จ์œ„๋กœ ํŒ๋‹จํ•˜๋Š” ๋ฐ ์œ ์šฉํ•˜๋‹ค.
ย 
ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋‹จ๊ณ„๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

๊ธฐ์กด DataFrame df_receipt์—์„œ sales_ymd ์ปฌ๋Ÿผ์„ ์„ ํƒํ•˜๊ณ  ํ•ด๋‹น ์ปฌ๋Ÿผ์˜ ๋ณต์‚ฌ๋ณธ์„ ์ƒ์„ฑํ•˜์—ฌ df_tmp๋ผ๋Š” ์ƒˆ๋กœ์šด DataFrame์„ ์ƒ์„ฑํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

๊ทธ๋Ÿฐ ๋‹ค์Œ df_tmp์˜ sales_ymd ์ปฌ๋Ÿผ์€ pd.to_datetime ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ pandas์˜ datetime ํ˜•์‹์œผ๋กœ ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ์š”์ผ ๋“ฑ์˜ ์ •๋ณด๋ฅผ ์ถ”์ถœํ•  ์ˆ˜ ์žˆ๋Š” ๋“ฑ ๋ณด๋‹ค ์œ ์—ฐํ•œ ๋ฐฉ์‹์œผ๋กœ ๋‚ ์งœ๋ฅผ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๊ฒŒ ๋œ๋‹ค.

df_tmp์—๋Š” apply ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ elapsed_days๋ผ๋Š” ์ƒˆ๋กœ์šด ์ปฌ๋Ÿผ์ด ์ƒ์„ฑ๋ฉ๋‹ˆ๋‹ค. ์ด ๋ฉ”์„œ๋“œ๋Š” sales_ymd ์—ด์˜ ๊ฐ ํ–‰์— ๋žŒ๋‹ค ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•˜๊ณ , ์ด ๊ฒฝ์šฐ ์š”์ผ์„ ์ •์ˆ˜ ๊ฐ’์œผ๋กœ ๋ฐ˜ํ™˜ํ•œ๋‹ค(์›”์š”์ผ์€ 0, ์ผ์š”์ผ์€ 6์ด๋‹ค).

apply ๋ฉ”์„œ๋“œ์™€ ๋˜ ๋‹ค๋ฅธ ๋žŒ๋‹ค ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ df_tmp์— new column named monday๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. ์ด ํ•จ์ˆ˜๋Š” sales_ymd ์—ด์˜ ๊ฐ ํ–‰์„ ๋ฐ›์•„ ์ฃผ์ดˆ๋ถ€ํ„ฐ ๊ฒฝ๊ณผํ•œ ์ผ์ˆ˜(x.weekday()์—์„œ ์–ป์€)๋ฅผ ๋นผ๊ณ , ๊ทธ ๊ฒฐ๊ณผ ๋‚ ์งœ๋ฅผ ๊ฐ€์žฅ ์ตœ๊ทผ ์›”์š”์ผ๋กœ ๋ฐ˜์˜ฌ๋ฆผํ•˜์—ฌ ๋ฐ˜ํ™˜ํ•œ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด df_tmp์˜ ๋ชจ๋“  ๋‚ ์งœ๊ฐ€ ์›”์š”์ผ์„ ๊ธฐ์ ์œผ๋กœ ํ•œ ์ฃผ ๋‹จ์œ„๋กœ ํšจ๊ณผ์ ์œผ๋กœ ๊ทธ๋ฃนํ™”๋ฉ๋‹ˆ๋‹ค.

๋งˆ์ง€๋ง‰์œผ๋กœ head ๋ฉ”์„œ๋“œ๊ฐ€ df_tmp์—์„œ ํ˜ธ์ถœ๋˜์–ด ๊ฒฐ๊ณผ DataFrame์˜ ์ฒ˜์Œ 10ํ–‰์ด ํ‘œ์‹œ๋ฉ๋‹ˆ๋‹ค.

์ „์ฒด์ ์œผ๋กœ ์ด ์ฝ”๋“œ๋Š” ํŒ๋งค์ผ ์—ด์„ ์ฒ˜๋ฆฌํ•˜์—ฌ ๋ถ„์„ ๋˜๋Š” ์ง‘๊ณ„ ๋ชฉ์ ์œผ๋กœ ์œ ์šฉํ•œ ์ •๋ณด๋ฅผ ์ถ”์ถœํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ DataFrame์— ์š”์ผ๊ณผ ํ•ด๋‹น ์ฃผ์˜ ์‹œ์ž‘์ผ์„ ๋‚˜ํƒ€๋‚ด๋Š” ์ƒˆ๋กœ์šด ์—ด์„ ์ƒ์„ฑํ•˜์—ฌ ์ฃผ ๋˜๋Š” ์š”์ผ๋ณ„๋กœ ํŒ๋งค ๋ฐ์ดํ„ฐ๋ฅผ ๊ทธ๋ฃนํ™”ํ•˜๊ฑฐ๋‚˜ ์ง‘๊ณ„ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ย 

ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” df_customer๋ผ๋Š” pandas์˜ DataFrame์—์„œ ํ–‰์˜ ํ•˜์œ„ ์ง‘ํ•ฉ์„ ์ƒ˜ํ”Œ๋งํ•˜์—ฌ ๊ทธ ํ•˜์œ„ ์ง‘ํ•ฉ์˜ ์ฒซ 10๊ฐœ์˜ ํ–‰์„ ํ‘œ์‹œํ•˜๋Š” ์ฝ”๋“œ์ž…๋‹ˆ๋‹ค.

์•„๋ž˜๋Š” ์ฝ”๋“œ์˜ ๊ฐ ๋ถ€๋ถ„์ด ํ•˜๋Š” ์ผ์ด๋‹ค.

df_customer๋Š” ๊ณ ๊ฐ ๋ฐ์ดํ„ฐ๊ฐ€ ํฌํ•จ๋œ pandas์˜ DataFrame์œผ๋กœ, ์•„๋งˆ๋„ ๋งŽ์€ ํ–‰๊ณผ ์—ด์ด ์žˆ์„ ๊ฒƒ์ด๋‹ค.

sample ๋ฉ”์„œ๋“œ๋Š” df_customer์— ๋Œ€ํ•ด frac=0.01์ด๋ผ๋Š” ์ธ์ˆ˜๋กœ ํ˜ธ์ถœ๋œ๋‹ค. ์ด ๋ฉ”์„œ๋“œ๋Š” DataFrame์˜ ์ผ๋ถ€ ํ–‰์„ ๋ฌด์ž‘์œ„๋กœ ์„ ํƒํ•œ๋‹ค. ์—ฌ๊ธฐ์„œ frac์€ ์ƒ˜ํ”Œ๋งํ•  ํ–‰์˜ ๋น„์œจ์„ ๋‚˜ํƒ€๋‚ธ๋‹ค. ์ด ๊ฒฝ์šฐ frac=0.01์€ df_customer์˜ ํ–‰ ์ค‘ 1%๊ฐ€ ๋ฌด์ž‘์œ„๋กœ ์„ ํƒ๋จ์„ ์˜๋ฏธํ•œ๋‹ค.

head ๋ฉ”์„œ๋“œ๋Š” df_customer ๊ฒฐ๊ณผ์˜ ํ•˜์œ„ ์ง‘ํ•ฉ์— ๋Œ€ํ•ด ์ธ์ˆ˜ 10์œผ๋กœ ํ˜ธ์ถœ๋œ๋‹ค. ์ด ๋ฉ”์„œ๋“œ๋Š” ํ•˜์œ„ ์ง‘ํ•ฉ์˜ ์ฒ˜์Œ 10๊ฐœ์˜ ํ–‰์„ ๋ฐ˜ํ™˜ํ•˜๊ณ  ์ด๋ฅผ ํ‘œ์‹œํ•ฉ๋‹ˆ๋‹ค.

์ฆ‰, ์ด ์ฝ”๋“œ๋Š” ์ „์ฒด์ ์œผ๋กœ df_customer์—์„œ 1%์˜ ๋ฌด์ž‘์œ„ ์ƒ˜ํ”Œ์„ ์„ ํƒํ•˜๊ณ  ๊ทธ ์ƒ˜ํ”Œ์˜ ์ฒ˜์Œ 10๊ฐœ์˜ ํ–‰์„ ํ‘œ์‹œํ•˜๊ณ  ์žˆ๋‹ค. ์ด๋Š” ์ „์ฒด ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ์กฐ์ž‘ํ•˜์ง€ ์•Š๊ณ ๋„ ๋ฐ์ดํ„ฐ์˜ ์ž‘์€ ํ•˜์œ„ ์ง‘ํ•ฉ์— ๋Œ€ํ•ด ๋ฐ์ดํ„ฐ๋ฅผ ํƒ์ƒ‰ํ•˜๊ฑฐ๋‚˜ ๋น ๋ฅธ ๋ถ„์„์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐ ์œ ์šฉํ•˜๋‹ค.

ย 

ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” ๋‹ค์Œ ๋‹จ๊ณ„๋ฅผ ์ˆ˜ํ–‰ํ•œ๋‹ค.

train_test_split์€ scikit-learn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ ํ•จ์ˆ˜๋กœ, ๋ฐ์ดํ„ฐ ์„ธํŠธ๋ฅผ ๋ฌด์ž‘์œ„๋กœ ๋‘ ๊ฐœ์˜ ํ•˜์œ„ ์ง‘ํ•ฉ('ํ›ˆ๋ จ' ํ•˜์œ„ ์ง‘ํ•ฉ๊ณผ 'ํ…Œ์ŠคํŠธ' ํ•˜์œ„ ์ง‘ํ•ฉ)์œผ๋กœ ๋ถ„ํ• ํ•˜๋Š” ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค. ์ด ์ฝ”๋“œ์—์„œ๋Š” df_customer๋ผ๋Š” pandas์˜ DataFrame์— ์ ์šฉํ•˜๊ณ , ์ธ์ˆ˜ test_size=0.1๋กœ ๋ฐ์ดํ„ฐ์˜ 10%๋ฅผ ํ…Œ์ŠคํŠธ์šฉ์œผ๋กœ ํ™•๋ณดํ•  ๊ฒƒ์„ ์ง€์ •ํ•˜๊ณ , ์ธ์ˆ˜ stratify=df_customer['gender_cd']๋กœ gender_cd ์—ด์˜ ๊ฐ’์œผ๋กœ ๊ณ„์ธตํ™”ํ•  ๊ฒƒ์„ ์ง€์ •ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ํ•จ์ˆ˜๋Š” ๋‘ ๊ฐœ์˜ DataFrame์„ ๋ฐ˜ํ™˜ํ•˜์ง€๋งŒ, ์ฒซ ๋ฒˆ์งธ DataFrame์€ ๋ฌด์‹œ๋œ๋‹ค(_๊ฐ€ ํ• ๋‹น๋˜๋Š”๋ฐ, ์ด๋Š” ํŒŒ์ด์ฌ์˜ ๊ด€ํ–‰์œผ๋กœ, ์‚ฌ์šฉ๋˜์ง€ ์•Š์„ ๋ณ€์ˆ˜๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฒƒ์ด๋‹ค).

train_test_split์ด ๋ฐ˜ํ™˜ํ•˜๋Š” ๋‘ ๋ฒˆ์งธ DataFrame์€ df_tmp๋ผ๋Š” ์ƒˆ๋กœ์šด ๋ณ€์ˆ˜์— ํ• ๋‹น๋œ๋‹ค.

df_tmp์— ๋Œ€ํ•ด gender_cd๋ผ๋Š” ์ธ์ˆ˜๋กœ groupby ๋ฉ”์„œ๋“œ๊ฐ€ ํ˜ธ์ถœ๋œ๋‹ค. ์ด ๋ฉ”์„œ๋“œ๋Š” df_tmp์˜ ํ–‰์„ gender_cd ์—ด์˜ ๊ฐ’์œผ๋กœ ๊ทธ๋ฃนํ™”ํ•œ๋‹ค.

agg ๋ฉ”์„œ๋“œ๋Š” groupby ์—ฐ์‚ฐ ๊ฒฐ๊ณผ์— ๋Œ€ํ•ด {'customer_id' : 'count'}๋ผ๋Š” ์ธ์ˆ˜๋กœ ํ˜ธ์ถœ๋œ๋‹ค. ์ด ๋ฉ”์„œ๋“œ๋Š” ๊ฐ ๊ทธ๋ฃน์˜ customer_id ์ปฌ๋Ÿผ์— ํ•จ์ˆ˜(์—ฌ๊ธฐ์„œ๋Š” count)๋ฅผ ์ ์šฉํ•˜๊ณ , ๊ฐ ๊ทธ๋ฃน์˜ ๊ฒฐ๊ณผ ์นด์šดํŠธ๋ฅผ ๊ฐ€์ง„ DataFrame์„ ๋ฐ˜ํ™˜ํ•œ๋‹ค.

์™„์„ฑ๋œ DataFrame์ด ํ‘œ์‹œ๋œ๋‹ค.

์ฆ‰, ์ด ์ฝ”๋“œ๋Š” ์ „์ฒด์ ์œผ๋กœ pandas์˜ DataFrame์„ ํ›ˆ๋ จ์šฉ๊ณผ ํ…Œ์ŠคํŠธ์šฉ์œผ๋กœ ๋ฌด์ž‘์œ„๋กœ ๋ถ„ํ• ํ•˜๊ณ  ํŠน์ • ์—ด(gender_cd)์˜ ๊ฐ’์— ๋”ฐ๋ผ ๊ณ„์ธตํ™”ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ํ…Œ์ŠคํŠธ์šฉ ์„œ๋ธŒ์…‹์—์„œ ์ƒˆ๋กœ์šด DataFrame์„ ์ƒ์„ฑํ•˜์—ฌ gender_cd ์—ด๋กœ ๊ทธ๋ฃนํ™”ํ•˜์—ฌ ๊ฐ ๊ทธ๋ฃน์˜ customer_id ์—ด์˜ ๊ณ ์œ ํ•œ ๊ฐ’์˜ ๊ฐœ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ํ…Œ์ŠคํŠธ์šฉ ์„œ๋ธŒ์…‹์˜ ๋ฐ์ดํ„ฐ์—์„œ ์„ฑ๋ณ„์— ๋”ฐ๋ฅธ ๊ณ ๊ฐ ๋ถ„ํฌ๋ฅผ ํŒŒ์•…ํ•˜๋Š” ๋ฐ ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค.
ย 
ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” df_receipt๋ผ๋Š” pandas DataFrame ๊ฐ์ฒด์— ๋Œ€ํ•ด ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ๋ฐ ํ•„ํ„ฐ๋ง์„ ์ˆ˜ํ–‰ํ•˜๊ณ  ์žˆ๋‹ค.

๋‹ค์Œ์€ ์ฝ”๋“œ์˜ ๋‹จ๊ณ„๋ณ„ ์„ค๋ช…์ž…๋‹ˆ๋‹ค.

df_sales_amount = df_receipt.groupby('customer_id').agg({'amount':'sum'}).reset_index(): ์ด ํ–‰์€ df_receipt์˜ ํ–‰์„ customer_id ์—ด์˜ ๊ฐ’์œผ๋กœ ๊ทธ๋ฃนํ™”ํ•˜๊ณ , ๊ฐ ๊ทธ๋ฃน์˜ amount ์—ด์˜ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋ฃน์˜ amount ์—ด์˜ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ์žˆ๋‹ค. ๊ฒฐ๊ณผ DataFrame df_sales_amount๋Š” ๊ณ ์œ ํ•œ customer_id ๊ฐ’๋งˆ๋‹ค 1ํ–‰, customer_id์™€ amount์˜ 2์—ด์„ ๊ฐ€์ง€๋ฉฐ, reset_index() ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ทธ๋ฃนํ™” ํ›„ ์ธ๋ฑ์Šค๋ฅผ ์ •์ˆ˜๊ฐ’์œผ๋กœ ์žฌ์„ค์ •ํ•˜์—ฌ ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

df_sales_amount['log_sum_amount'] = np.log(df_sales_amount['amount'] + 0.5): ์ด ํ–‰์€ df_sales_amount์— log_sum_amount๋ผ๋Š” ์ƒˆ๋กœ์šด ์ปฌ๋Ÿผ์„ ์ƒ์„ฑํ•˜๊ณ , ๊ฐ ๊ณ ๊ฐ๋ณ„ ๊ธˆ์•ก์˜ ํ•ฉ๊ณ„ ์˜ ์ž์—ฐ๋Œ€์ˆ˜, ๊ทธ๋ฆฌ๊ณ  0์˜ ๋Œ€์ˆ˜๋ฅผ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด 0.5๋ฅผ ๋”ํ•œ ๊ฐ’์„ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.

df_sales_amount['log_sum_amount_ss'] = preprocessing.scale(df_sales_amount['log_sum_amount']): ์ด ํ–‰์€ df_sales_amount์— log_sum_amount_ss ๋ผ๋Š” ์ƒˆ๋กœ์šด ์—ด์„ ๋งŒ๋“ค๊ณ , scikit-learn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ preprocessing.scale ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ log_sum_amount๋ฅผ ์Šค์ผ€์ผ๋งํ•œ ๊ฐ’์„ ์ €์žฅํ•œ๋‹ค. ์ด ํ•จ์ˆ˜๋Š” ์ž…๋ ฅ ๋ฐฐ์—ด์˜ ํ‰๊ท ์„ ๋นผ๊ณ  ํ‘œ์ค€ํŽธ์ฐจ๋กœ ๋‚˜๋ˆ„์–ด ํ‰๊ท ์ด 0์ด๊ณ  ๋ถ„์‚ฐ์ด 0์ธ ๋‹จ์œ„์˜ ์ƒˆ๋กœ์šด ๋ฐฐ์—ด์„ ๋งŒ๋“ ๋‹ค.

df_sales_amount.query('abs(log_sum_amount_ss) > 3').head(10): ์ด ํ–‰์€ log_sum_amount_ss์˜ ์ ˆ๋Œ€๊ฐ’์ด 3๋ณด๋‹ค ํฐ ํ–‰๋งŒ ๋‚จ๋„๋ก df_sales_amount๋ฅผ ํ•„ํ„ฐ๋งํ•˜๊ณ  ๊ฒฐ๊ณผ์˜ DataFrame์˜ ์ฒ˜์Œ 10๊ฐœ์˜ ํ–‰์„ ์„ ํƒํ•œ๋‹ค. ์ด ํ•„ํ„ฐ๋Š” ์ด๋งค์ถœ์•ก์˜ ํ‰๊ท ๊ฐ’์—์„œ ํ‘œ์ค€ํŽธ์ฐจ๊ฐ€ 3 ์ด์ƒ ๋–จ์–ด์ ธ ์žˆ๋Š” ๊ณ ๊ฐ์„ ์„ ํƒํ•˜์—ฌ ์ด์ƒ๊ฐ’์ผ ๊ฐ€๋Šฅ์„ฑ์„ ํ‘œ์‹œํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, query ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฌธ์ž์—ด ํ‘œํ˜„์‹(์ด ๊ฒฝ์šฐ 'abs(log_sum_amount_ss) > 3')์„ ๊ธฐ์ค€์œผ๋กœ ํ–‰์„ ํ•„ํ„ฐ๋งํ•˜๊ณ  ์žˆ๋‹ค.

ย 

ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” df_receipt๋ผ๋Š” pandas DataFrame ๊ฐ์ฒด์— ๋Œ€ํ•ด ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ๋ฐ ํ•„ํ„ฐ๋ง์„ ์ˆ˜ํ–‰ํ•˜๊ณ  ์žˆ๋‹ค.

๋‹ค์Œ์€ ์ฝ”๋“œ์˜ ๋‹จ๊ณ„๋ณ„ ์„ค๋ช…์ž…๋‹ˆ๋‹ค.

df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python').groupby('customer_id').agg({'amount':'sum '}).reset_index(): ์ด ํ–‰์€ ๋จผ์ € df_receipt๋ฅผ ํ•„ํ„ฐ๋งํ•˜์—ฌ customer_id ์—ด์ด 'Z'๋กœ ์‹œ์ž‘ํ•˜๋Š” ํ–‰์„ ์ œ์™ธํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, query ๋ฉ”์„œ๋“œ๋Š” ๋ฌธ์ž์—ด ํ‘œํ˜„์‹์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ–‰์„ ํ•„ํ„ฐ๋งํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋œ๋‹ค. python' ์ธ์ˆ˜๋Š” ์ฟผ๋ฆฌ์— Python ์—”์ง„์„ ์‚ฌ์šฉํ•˜๋„๋ก ์ง€์ •ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋˜๋ฉฐ, str.startswith ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ฒฐ๊ณผ DataFrame์€ customer_id๋กœ ๊ทธ๋ฃนํ™”๋˜๋ฉฐ, ๊ทธ๋ฃน๋ณ„๋กœ ๊ธˆ์•ก ์—ด์˜ ํ•ฉ๊ณ„๊ฐ€ ๊ณ„์‚ฐ๋œ๋‹ค. ๊ฒฐ๊ณผ DataFrame df_sales_amount๋Š” ๊ณ ์œ ํ•œ customer_id ๊ฐ’๋งˆ๋‹ค ํ•œ ์ค„, customer_id์™€ amount ๋‘ ๊ฐœ์˜ ์ปฌ๋Ÿผ์„ ๊ฐ€์ง€๋ฉฐ, reset_index() ๋ฉ”์„œ๋“œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ทธ๋ฃนํ™” ํ›„ ์ธ๋ฑ์Šค๋ฅผ ์ •์ˆ˜ ๊ฐ’์œผ๋กœ ์žฌ์„ค์ • ํ•˜๊ณ  ์žˆ๋‹ค.

pct25 = np.percentile(df_sales_amount['amount'], q=25): ์ด ํ–‰์€ numpy.percentile ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ df_sales_amount์˜ amount ์ปฌ๋Ÿผ์˜ 25๋ฒˆ์งธ ๋ฐฑ๋ถ„์œ„์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ , ๊ทธ ๊ฒฐ๊ณผ๋ฅผ ๋ณ€์ˆ˜ pct25์— ๋Œ€์ž…ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

pct75 = np.percentile(df_sales_amount['amount'], q=75): ์ด ํ–‰์€ numpy.percentile ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ df_sales_amount์˜ amount ์—ด์˜ 75๋ฒˆ์งธ ๋ฐฑ๋ถ„์œ„์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ๊ทธ ๊ฒฐ๊ณผ๋ฅผ pct75 ๋ณ€์ˆ˜์— ๋Œ€์ž…ํ•œ๋‹ค. ๋ณ€์ˆ˜์— ๋Œ€์ž…ํ•ฉ๋‹ˆ๋‹ค.

iqr = pct75 - pct25: ์ด ํ–‰์€ 75๋ฐฑ๋ถ„์œ„์ˆ˜(pct75)์—์„œ 25๋ฐฑ๋ถ„์œ„์ˆ˜(pct25)๋ฅผ ๋นผ์„œ df_sales_amount์˜ ๊ธˆ์•ก ์—ด์˜ ์‚ฌ๋ถ„์œ„์ˆ˜ ๋ฒ”์œ„(IQR)๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ๊ทธ ๊ฒฐ๊ณผ๋ฅผ iqr ๋ณ€์ˆ˜์— ๋Œ€์ž…ํ•˜๊ณ  ์žˆ๋‹ค.

amount_low = pct25 - (iqr * 1.5): ์ด ํ–‰์€ amount_low = pct25 - (iqr * 1.5) ๊ณต์‹์„ ์‚ฌ์šฉํ•˜์—ฌ ํŒ๋งค ๊ธˆ์•ก์˜ "์ •์ƒ" ๋ฒ”์œ„์˜ ํ•˜ํ•œ์„ ๊ณ„์‚ฐํ•œ๋‹ค. ์ด๊ฒƒ์€ ์ผ๋ฐ˜์ ์œผ๋กœ ๋ฐ•์Šคํ˜• ํ”Œ๋กฏ์˜ ์•„๋ž˜์ชฝ "์ˆ˜์—ผ"์œผ๋กœ ์•Œ๋ ค์ ธ ์žˆ๋‹ค.

amount_high = pct75 + (iqr * 1.5). ์ด ํ–‰์€ amount_high = pct75 + (iqr * 1.5) ๊ณต์‹์„ ์‚ฌ์šฉํ•˜์—ฌ ํŒ๋งค ๊ธˆ์•ก์˜ "์ •์ƒ" ๋ฒ”์œ„์˜ ์ƒํ•œ์„ ๊ณ„์‚ฐํ•œ๋‹ค. ์ด๊ฒƒ์€ ์ผ๋ฐ˜์ ์œผ๋กœ ์ƒ์ž ์ˆ˜์—ผ ๋„ํ‘œ์˜ ์ƒํ•œ "์ˆ˜์—ผ"์œผ๋กœ ์•Œ๋ ค์ ธ ์žˆ๋‹ค.

df_sales_amount.query('amount < @amount_low or @amount_high < amount').head(10): ์ด ํ–‰์€ amount ์—ด์ด amount_low๋ณด๋‹ค ์ž‘๊ฑฐ๋‚˜ amount_high๋ณด๋‹ค ํฐ ํ–‰๋งŒ ๋‚จ๊ฒจ๋‘๋„๋ก df sales_amount๋ฅผ ํ•„ํ„ฐ๋งํ•˜์—ฌ ๊ฒฐ๊ณผ DataFrame์˜ ์ฒ˜์Œ 10๊ฐœ์˜ ํ–‰์„ ์„ ํƒํ•œ๋‹ค. ์ด ํ•„ํ„ฐ๋Š” ๋งค์ถœ ๊ธˆ์•ก ํ•ฉ๊ณ„๊ฐ€ ๋งค์ถœ ๊ธˆ์•ก์˜ '์ •์ƒ' ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ๊ณ ๊ฐ์„ ์„ ํƒํ•˜๋Š” ๊ฒƒ์œผ๋กœ, ๋ฐ•์Šค ํ”Œ๋กฏ์˜ ์•„๋ž˜์ชฝ ์ˆ˜์—ผ๊ณผ ์œ„์ชฝ ์ˆ˜์—ผ ์‚ฌ์ด์˜ ๋ฒ”์œ„๋กœ ์ •์˜๋˜๋ฉฐ, Python์˜ ๋ณ€์ˆ˜ amount_low์™€ amount_high๋ฅผ ์ฐธ์กฐํ•˜๊ธฐ ์œ„ํ•ด @ ๊ธฐํ˜ธ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ฌธ์ž์—ด ํ‘œํ˜„์‹์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ๋‹ค. ํ–‰์„ ํ•„ํ„ฐ๋งํ•˜๊ธฐ ์œ„ํ•ด query ๋ฉ”์„œ๋“œ๊ฐ€ ์‚ฌ์šฉ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
ย 
ย 
ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” pandas์˜ DataFrame df_product์˜ ๊ฐ ์—ด์˜ ๊ฒฐ์†(null) ๊ฐ’ ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. ์•„๋ž˜์—์„œ ๋‹จ๊ณ„๋ณ„๋กœ ์„ค๋ช…ํ•œ๋‹ค.

df_product: ์‚ฌ์šฉ ์ค‘์ธ DataFrame์˜ ์ด๋ฆ„์ž…๋‹ˆ๋‹ค.

isnull(): ์ด ๋ฉ”์„œ๋“œ๋Š” df_product์™€ ๋™์ผํ•œ ๋ชจ์–‘์˜ DataFrame์„ ๋ฐ˜ํ™˜ํ•˜๋ฉฐ, ๊ฐ ์š”์†Œ๋Š” df_product์˜ ํ•ด๋‹น ์š”์†Œ๊ฐ€ null์ธ์ง€ ์—ฌ๋ถ€๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” boolean ๊ฐ’์ด๋‹ค.

sum(): isnull()์ด ๋ฐ˜ํ™˜ํ•˜๋Š” boolean DataFrame์— ์ ์šฉ๋˜์–ด ๊ฐ ์—ด์˜ boolean ๊ฐ’์˜ ํ•ฉ์„ ๋ฐ˜ํ™˜ํ•˜๋Š” ๋ฉ”์„œ๋“œ์ด๋ฉฐ, boolean ๊ฐ’์€ ์ •์ˆ˜๋กœ ์ทจ๊ธ‰๋˜๋ฏ€๋กœ ์‚ฌ์‹ค์ƒ df_product์˜ ๊ฐ ์—ด์˜ null ๊ฐ’์˜ ๊ฐœ์ˆ˜๋ฅผ ์„ธ๋Š” ๊ฒƒ๊ณผ ๊ฐ™๋‹ค.

๋”ฐ๋ผ์„œ df_product.isnull().sum()์€ ์ธ๋ฑ์Šค๊ฐ€ df_product์˜ ์—ด ์ด๋ฆ„, ๊ฐ’์ด ๊ฐ ์—ด์˜ NULL ๊ฐ’ ์ˆ˜์ธ Series ๊ฐ์ฒด๋ฅผ ๋ฐ˜ํ™˜ํ•œ๋‹ค. ์ด๋Š” ๋ˆ„๋ฝ๋œ ๊ฐ’์ด ์žˆ๋Š” ์—ด๊ณผ ๊ฐ ์—ด์— ๋ช‡ ๊ฐœ์˜ ๋ˆ„๋ฝ๋œ ๊ฐ’์ด ์žˆ๋Š”์ง€๋ฅผ ํŒŒ์•…ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

ย 

ํ•ด์„ค:

์ด ์ฝ”๋“œ๋Š” pandas์˜ DataFrame df_product์—์„œ NULL(๊ฒฐ์†) ๊ฐ’์„ ํฌํ•จํ•˜๋Š” ๋ชจ๋“  ํ–‰์„ ์ œ๊ฑฐํ•˜๊ณ , NULL์ด ์•„๋‹Œ ํ–‰๋งŒ ํฌํ•จํ•˜๋Š” ์ƒˆ๋กœ์šด DataFrame df_product_1์„ ์ƒ์„ฑํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. ์•„๋ž˜์—์„œ ๋‹จ๊ณ„๋ณ„๋กœ ์„ค๋ช…ํ•œ๋‹ค.

df_product: ์‚ฌ์šฉํ•  ์›๋ณธ DataFrame์˜ ์ด๋ฆ„์ด๋‹ค.

df_product.copy(): ์ด ๋ฉ”์„œ๋“œ๋Š” ์›๋ณธ DataFrame์˜ ๋ณต์‚ฌ๋ณธ์„ ์ƒ์„ฑํ•˜์—ฌ ์›๋ณธ DataFrame์ด ๋ณ€๊ฒฝ๋˜์ง€ ์•Š๋„๋ก ํ•œ๋‹ค.

df_product_1: ์ƒ์„ฑ๋˜๋Š” ์ƒˆ๋กœ์šด DataFrame์˜ ์ด๋ฆ„์ด๋‹ค.

dropna(). ์ด ๋ฉ”์„œ๋“œ๋Š” df_product_1์—์„œ null ๊ฐ’์„ ํฌํ•จํ•œ ๋ชจ๋“  ํ–‰์„ ์‚ญ์ œํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋˜๋ฉฐ, inplace=True ์ธ์ˆ˜๋Š” ์ƒˆ๋กœ์šด DataFrame์„ ๋ฐ˜ํ™˜ํ•˜๋Š” ๋Œ€์‹  df_product_1์„ ๊ทธ ์ž๋ฆฌ์—์„œ ๋ณ€๊ฒฝํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋œ๋‹ค.

print('Before deletion:', len(df_product)): null ๊ฐ’์„ ์‚ญ์ œํ•˜๊ธฐ ์ „ ์›๋ž˜ DataFrame df_product์˜ ํ–‰ ์ˆ˜๋ฅผ ํ‘œ์‹œํ•œ๋‹ค.

print('After deletion:', len(df_product_1)): ์ด ์ค„์€ ๋„ ๊ฐ’์ด ์‚ญ์ œ๋œ ํ›„ ์ƒˆ๋กœ์šด DataFrame df_product_1์˜ ํ–‰ ์ˆ˜๋ฅผ ํ‘œ์‹œํ•ฉ๋‹ˆ๋‹ค.

๋”ฐ๋ผ์„œ ์ด ์ฝ”๋“œ์—์„œ๋Š” ๋จผ์ € ์›๋ณธ DataFrame df_product์˜ ๋ณต์‚ฌ๋ณธ์„ ๋งŒ๋“ค๊ณ , ๋ณต์‚ฌํ•œ DataFrame์—์„œ NULL ๊ฐ’์„ ํฌํ•จํ•œ ๋ชจ๋“  ํ–‰์„ ์‚ญ์ œํ•˜์—ฌ ์ƒˆ๋กœ์šด DataFrame df_product_1์„ ๋งŒ๋“ค๊ณ , ๋งˆ์ง€๋ง‰์œผ๋กœ NULL ๊ฐ’ ์‚ญ์ œ ์ „๊ณผ ์‚ญ์ œ ํ›„์˜ ์›๋ณธ DataFrame๊ณผ ์ƒˆ๋กœ์šด DataFrame์˜ ํ–‰ ์ˆ˜๋ฅผ ํ‘œ์‹œํ•ฉ๋‹ˆ๋‹ค. DataFrame๊ณผ ์ƒˆ๋กœ์šด DataFrame ๋‚ด์˜ ํ–‰ ์ˆ˜๋ฅผ ํ‘œ์‹œํ•œ๋‹ค. ์ด ์ฝ”๋“œ๋Š” NULL ๊ฐ’์œผ๋กœ ์ธํ•ด ์‚ญ์ œ๋œ ํ–‰ ์ˆ˜๋ฅผ ํ™•์ธํ•˜๊ณ  NULL ๊ฐ’์„ ํฌํ•จํ•˜์ง€ ์•Š๋Š” ๊นจ๋—ํ•œ DataFrame์„ ์ƒ์„ฑํ•˜๋Š” ๋ฐ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

ย 

ย 
ย 
ย 

Comment