Data Science 100 Knocks (Structured Data Processing) - Python Part 3 (Q41 to Q60)

This article explains the solutions to Data Science 100 Knocks (Structured Data Processing), Python Part 3 (Q41 to Q60).

Explanation:

This code computes a company's daily sales totals, places a copy of those totals shifted back by one day next to them in a new DataFrame, and so creates new columns holding the previous day's figures. It then calculates the difference between each day's sales amount and the previous day's and stores it in a new column, 'diff_amount'.

Each line is explained below.

df_sales_amount_by_date = df_receipt[['sales_ymd', 'amount']].groupby('sales_ymd').sum().reset_index(): Selects the 'sales_ymd' and 'amount' columns from the 'df_receipt' DataFrame and groups the rows by 'sales_ymd'. It then sums the 'amount' column within each 'sales_ymd' group and resets the index, producing a new DataFrame 'df_sales_amount_by_date' with the columns 'sales_ymd' and 'amount'.

df_sales_amount_by_date = pd.concat([df_sales_amount_by_date, df_sales_amount_by_date.shift()], axis=1): Concatenates 'df_sales_amount_by_date' along the column axis (that is, horizontally) with a copy of itself whose rows have been shifted down by one. The result is a new DataFrame with four columns; the two added columns hold the previous row's date and amount, and they receive the names 'lag_ymd' and 'lag_amount' on the next line.

df_sales_amount_by_date.columns = ['sales_ymd','amount','lag_ymd','lag_amount']: Renames the columns of 'df_sales_amount_by_date' to more meaningful names.

df_sales_amount_by_date['diff_amount'] = df_sales_amount_by_date['amount'] - df_sales_amount_by_date['lag_amount']: Subtracts the 'lag_amount' column from the 'amount' column and stores the result in a new column, 'diff_amount', of 'df_sales_amount_by_date'.

df_sales_amount_by_date.head(10): Displays the first 10 rows of 'df_sales_amount_by_date'.
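
The steps above can be tried on a very small table. The sketch below uses an invented four-row stand-in for df_receipt (the values are assumptions made up for illustration, not data from the exercise) and follows the same sequence of calls.

```python
import pandas as pd

# Hypothetical miniature stand-in for df_receipt (values invented for illustration).
df_receipt = pd.DataFrame({
    "sales_ymd": [20181101, 20181101, 20181102, 20181103],
    "amount": [100, 50, 200, 120],
})

# Daily totals, one row per sales date.
df_sales_amount_by_date = (
    df_receipt[["sales_ymd", "amount"]].groupby("sales_ymd").sum().reset_index()
)

# Put a copy shifted down by one row next to the original.
df_sales_amount_by_date = pd.concat(
    [df_sales_amount_by_date, df_sales_amount_by_date.shift()], axis=1
)
df_sales_amount_by_date.columns = ["sales_ymd", "amount", "lag_ymd", "lag_amount"]

# Change from the previous row; the first row has no predecessor, so it is NaN.
df_sales_amount_by_date["diff_amount"] = (
    df_sales_amount_by_date["amount"] - df_sales_amount_by_date["lag_amount"]
)
print(df_sales_amount_by_date)
```

Because shift() moves rows rather than calendar days, the "previous day" is really the previous date that has sales in the table.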

Explanation:

This code performs a lag analysis on the sales data.

First, it groups the sales data by date and sums the sales amount for each date. The result is stored in df_sales_amount_by_date.

Next, it runs a for loop three times, with i going from 1 to 3.

In each iteration, it concatenates df_sales_amount_by_date with a copy of itself shifted by i rows, creating a new DataFrame called df_tmp. Each row of df_tmp therefore contains the sales data for one date together with the sales data from i dates earlier.

When i is 1, the resulting DataFrame is stored in df_lag. Otherwise, it is appended to df_lag.

The columns of df_lag are renamed to sales_ymd, amount, lag_ymd, and lag_amount so that they describe the current date and the lagged date.

Finally, rows containing NaN (which arise in the first i rows of each shifted copy) are dropped, the sales amounts are converted to integers, the DataFrame is sorted by sales_ymd and lag_ymd, and the first 10 rows are displayed.

This shows, for each date, the total sales amount alongside the totals from 1, 2, and 3 dates earlier, so the user can carry out a lag analysis of the sales data.
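
The long-format lag table described above can be sketched as follows. The five daily totals are invented for illustration, and the shifted frames are collected in a list and concatenated once, which is a small variation on storing the first frame and appending the rest.

```python
import pandas as pd

# Hypothetical daily totals (values invented for illustration).
df_sales_amount_by_date = pd.DataFrame({
    "sales_ymd": [20181101, 20181102, 20181103, 20181104, 20181105],
    "amount": [150, 200, 120, 90, 300],
})

frames = []
for i in range(1, 4):  # i = 1, 2, 3
    # Pair every row with the row i positions earlier.
    df_tmp = pd.concat(
        [df_sales_amount_by_date, df_sales_amount_by_date.shift(i)], axis=1
    )
    df_tmp.columns = ["sales_ymd", "amount", "lag_ymd", "lag_amount"]
    frames.append(df_tmp)

# Stack the three frames vertically, drop rows without a lag, and tidy up.
df_lag = (
    pd.concat(frames, axis=0)
    .dropna()
    .astype(int)
    .sort_values(["sales_ymd", "lag_ymd"])
    .reset_index(drop=True)
)
print(df_lag.head(10))
```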

Explanation:

The code performs the following steps.

It selects the sales_ymd and amount columns from the df_receipt DataFrame, groups by sales_ymd, and applies sum() to the amount column to compute the total sales for each date. The resulting DataFrame is stored in df_sales_amount_by_date.

A new DataFrame, df_lag, is created as a copy of df_sales_amount_by_date.

A loop runs three times (range(1, 4), so i takes the values 1 to 3). In each iteration, shift() lags df_sales_amount_by_date by i periods (dates). The shifted DataFrame is concatenated horizontally with df_lag using pd.concat(), and the result is stored back in df_lag.

A variable named columns is created to hold the names of the columns added to df_lag.

The column names of df_lag are updated from the columns list.

dropna() removes the rows with missing values, and astype() converts the DataFrame to integers. Finally, sort_values() sorts the DataFrame by sales_ymd in ascending order, and head() returns the first 10 rows.

In summary, to help identify trends in sales performance, this code builds a DataFrame df_lag that contains each date's sales amount together with the sales amounts of the three preceding dates, sorted by sales date.
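
A minimal sketch of the wide layout, using the same kind of invented daily totals. The column names lag_ymd_1, lag_amount_1, and so on are assumptions chosen for this example; the text above does not state the names used in the original answer.

```python
import pandas as pd

# Hypothetical daily totals (values invented for illustration).
df_sales_amount_by_date = pd.DataFrame({
    "sales_ymd": [20181101, 20181102, 20181103, 20181104, 20181105],
    "amount": [150, 200, 120, 90, 300],
})

# Start from a copy and attach one shifted pair of columns per lag.
df_lag = df_sales_amount_by_date.copy()
for i in range(1, 4):
    df_lag = pd.concat([df_lag, df_sales_amount_by_date.shift(i)], axis=1)

# Assumed naming scheme for the added columns.
columns = ["sales_ymd", "amount"]
for i in range(1, 4):
    columns += [f"lag_ymd_{i}", f"lag_amount_{i}"]
df_lag.columns = columns

# Only rows that have all three lags survive dropna().
df_lag = df_lag.dropna().astype(int).sort_values("sales_ymd")
print(df_lag.head(10))
```

Compared with the long layout, each date appears once, at the cost of losing the first three dates entirely.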

Explanation:

This code does the following.

It merges the two DataFrames df_receipt and df_customer with an inner join on their common column customer_id and stores the result in a new DataFrame, df_tmp.

It creates a new column, era, in df_tmp. era is computed by dividing each customer's age by 10, rounding down, and multiplying by 10, which gives the age decade.

It uses pd.pivot_table() to sum the amount column of df_tmp for every unique combination of era and gender_cd and stores the result as a new DataFrame, df_sales_summary. The resulting DataFrame has the era values as its index and the gender_cd values as its columns.

It renames the columns of df_sales_summary to more readable names. The first column is era, followed by male, female, and unknown, each holding the total amount for the corresponding gender_cd value.

Explanation:

This code performs the following operations.

It joins the two DataFrames df_receipt and df_customer on their common column 'customer_id' using an inner join. The resulting DataFrame is stored in a new variable, df_tmp.

It creates a new 'era' column in df_tmp by dividing the 'age' column by 10, taking the floor, and multiplying by 10. This column gives each customer's age group in steps of ten years.

It uses pd.pivot_table() to create a pivot table called df_sales_summary. The pivot table is based on df_tmp, with rows indexed by 'era', columns grouped by 'gender_cd', and values computed from the 'amount' column with the sum aggregation. The resulting table shows the total sales amount for each gender within each age group.

It renames the columns of df_sales_summary so that 'male', 'female', and 'unknown' denote the total sales of each gender within each age group.
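
The merge, decade calculation, and pivot can be sketched on invented data. The customer and receipt rows are assumptions for illustration, and the gender codes '0', '1', and '9' are assumed to mean male, female, and unknown, matching the order of the renamed columns.

```python
import math

import pandas as pd

# Hypothetical stand-ins for df_receipt and df_customer.
df_receipt = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C3", "C4"],
    "amount": [100, 200, 300, 400, 500],
})
df_customer = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4"],
    "gender_cd": ["0", "1", "9", "1"],
    "age": [25, 34, 38, 29],
})

df_tmp = pd.merge(df_receipt, df_customer, how="inner", on="customer_id")

# Age decade: 25 -> 20, 34 -> 30, and so on.
df_tmp["era"] = df_tmp["age"].apply(lambda x: math.floor(x / 10) * 10)

# Rows: era, columns: gender_cd, cells: summed amount.
df_sales_summary = pd.pivot_table(
    df_tmp, index="era", columns="gender_cd", values="amount", aggfunc="sum"
).reset_index()
df_sales_summary.columns = ["era", "male", "female", "unknown"]
print(df_sales_summary)
```

Combinations with no sales show up as NaN in the pivot table rather than 0.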

Explanation:

This code transforms the pandas DataFrame df_sales_summary by setting era as the index, stacking the female, male, and unknown columns into a single column, and resetting the index. It then replaces the stacked labels with the codes 01 (female), 00 (male), and 99 (unknown). Finally, it renames the column level_1 to gender_cd and the column 0 to amount.

The steps are explained one by one below.

df_sales_summary.set_index('era'): Sets the era column as the index of df_sales_summary. This keeps era as the row key so that the data can easily be grouped or pivoted by it.

.stack(): Stacks the female, male, and unknown columns into a single column and adds a new index level for the stacked columns.

.reset_index(): Resets the index of the DataFrame to the default integer index.

.replace({'female':'01','male':'00','unknown':'99'}): Replaces the gender labels with the corresponding codes.

.rename(columns={'level_1':'gender_cd', 0: 'amount'}): Renames the level_1 column, which contains the original names of the stacked columns, to gender_cd, and renames the column 0 to amount.
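
The chain above can be sketched on a two-row summary table. The amounts are invented for illustration, and the result is stored in a name chosen for this example, df_long.

```python
import pandas as pd

# Hypothetical wide summary table (values invented for illustration).
df_sales_summary = pd.DataFrame({
    "era": [20, 30],
    "male": [300, 100],
    "female": [500, 300],
    "unknown": [50, 400],
})

df_long = (
    df_sales_summary.set_index("era")
    .stack()                      # one row per (era, original column name)
    .reset_index()                # columns become: era, level_1, 0
    .replace({"female": "01", "male": "00", "unknown": "99"})
    .rename(columns={"level_1": "gender_cd", 0: "amount"})
)
print(df_long)
```

The names level_1 and 0 are what reset_index() produces when the stacked index level and the value column have no names of their own.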

Explanation:

This code creates a new DataFrame by concatenating two columns of the existing df_customer DataFrame: customer_id and birth_day (a date).

pd.to_datetime() is applied to the birth_day column to convert it to datetime, and dt.strftime('%Y%m%d') then formats each datetime as a string in '%Y%m%d' form (year, month, day).

The new DataFrame therefore contains a column that represents each customer's date of birth as a string. Finally, head() displays the first 10 rows of the new DataFrame.
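
A minimal sketch of the conversion, assuming birth_day is stored as text in YYYY-MM-DD form; the two customers are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_customer.
df_customer = pd.DataFrame({
    "customer_id": ["C1", "C2"],
    "birth_day": ["1981-04-29", "1952-04-01"],
})

df_out = pd.concat(
    [
        df_customer["customer_id"],
        # Parse to datetime, then format as an 8-character YYYYMMDD string.
        pd.to_datetime(df_customer["birth_day"]).dt.strftime("%Y%m%d"),
    ],
    axis=1,
)
print(df_out.head(10))
```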

Explanation:

This code snippet uses pd.concat() to combine two columns of the pandas DataFrame df_customer ('customer_id' and 'application_date') into a new DataFrame.

pd.to_datetime() is used to convert the 'application_date' column to datetime objects. It is a convenient way to turn date strings in a variety of formats into pandas' standard datetime type.

The resulting DataFrame has two columns, 'customer_id' and 'application_date', with the dates shown in datetime form.

Explanation:

This code combines two pandas objects with the pd.concat function.

df_receipt[['receipt_no', 'receipt_sub_no']]: Selects the receipt_no and receipt_sub_no columns from the df_receipt DataFrame.

pd.to_datetime(df_receipt['sales_ymd'].astype('str')): Converts the sales_ymd column of df_receipt to a string and then to datetime.

The axis=1 argument of pd.concat indicates that the pieces are joined horizontally, column by column. The resulting DataFrame has the columns receipt_no, receipt_sub_no, and sales_ymd (in datetime form), and its first 10 rows are displayed with .head(10).
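
The sketch below shows why the astype('str') step is there, assuming sales_ymd holds integers such as 20181103; the receipt rows are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_receipt with integer YYYYMMDD dates.
df_receipt = pd.DataFrame({
    "receipt_no": [112, 1132],
    "receipt_sub_no": [1, 2],
    "sales_ymd": [20181103, 20190101],
})

df_out = pd.concat(
    [
        df_receipt[["receipt_no", "receipt_sub_no"]],
        # Convert the integers to text first so they are parsed as dates.
        pd.to_datetime(df_receipt["sales_ymd"].astype("str")),
    ],
    axis=1,
)
print(df_out.head(10))
```

Passing the raw integers to pd.to_datetime would treat them as offsets from the Unix epoch rather than as calendar dates, which is why the values are converted to strings first.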

Explanation:

This code performs the following operations.

It selects two columns, receipt_no and receipt_sub_no, from the df_receipt DataFrame.

It converts the sales_epoch column to datetime with the pandas to_datetime() function, setting the unit parameter to 's' to indicate that the input is in seconds.

The resulting datetime column is renamed to 'sales_ymd' with the rename() method.

Finally, the pandas concat() function joins the two pieces along the column axis (axis=1).

The resulting DataFrame has three columns, receipt_no, receipt_sub_no, and sales_ymd, where sales_ymd is the date and time of the sale in datetime form. head(10) is used to display the first 10 rows of the result.

Explanation:

This code creates a new DataFrame by concatenating the receipt_no and receipt_sub_no columns of df_receipt with the year of the sales_epoch column, which is first converted to datetime objects with to_datetime. Setting the unit parameter to 's' indicates that the sales_epoch values are Unix timestamps, that is, seconds elapsed since January 1, 1970.

The code is broken down piece by piece below.

pd.concat: Joins two pandas objects along the column axis.

df_receipt[['receipt_no', 'receipt_sub_no']]: Selects the receipt_no and receipt_sub_no columns from the df_receipt DataFrame.

pd.to_datetime(df_receipt['sales_epoch'], unit='s'): Converts the values of the sales_epoch column to datetime objects, with the unit parameter set to 's'.

dt.year: Extracts the year component of the datetime objects.

rename('sales_year'): Renames the resulting column to sales_year.

axis=1: Specifies that the concatenation is done column-wise.

.head(10): Returns the first 10 rows of the resulting DataFrame.
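
The pieces listed above fit together as in the following sketch. The two epoch values are invented for illustration; they correspond to midnight UTC on 2018-11-03 and 2019-01-01.

```python
import pandas as pd

# Hypothetical stand-in for df_receipt with Unix timestamps in seconds.
df_receipt = pd.DataFrame({
    "receipt_no": [112, 1132],
    "receipt_sub_no": [1, 2],
    "sales_epoch": [1541203200, 1546300800],
})

# Seconds since 1970-01-01 -> datetime -> year, renamed for the output table.
sales_year = (
    pd.to_datetime(df_receipt["sales_epoch"], unit="s").dt.year.rename("sales_year")
)

df_out = pd.concat(
    [df_receipt[["receipt_no", "receipt_sub_no"]], sales_year], axis=1
)
print(df_out.head(10))
```

Replacing .dt.year with .dt.strftime('%m') or .dt.strftime('%d') would give the zero-padded month or day as a string instead.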

Explanation:

This code takes the sales_epoch column of the df_receipt DataFrame, which stores the number of seconds since January 1, 1970 (also known as Unix time), and converts it with pd.to_datetime() to pandas datetime objects with one-second precision. The resulting datetime values are given the new column name sales_day.

The strftime() method is then applied to the sales_day column to extract the day component of each date as a two-digit, zero-padded string (the %d format). The resulting strings are concatenated with the receipt_no and receipt_sub_no columns using pd.concat(), producing a new DataFrame with three columns.

Finally, head() is called on the resulting DataFrame to display its first 10 rows. In short, this code extracts the day from the sales_epoch column of df_receipt and combines it with the receipt numbers to create a new DataFrame.

Explanation:

This code performs the following operations.

It uses the query method on df_receipt to select the rows whose customer_id does not start with 'Z' and assigns them to df_sales_amount.

It selects the customer_id and amount columns from df_sales_amount and uses groupby to compute the total amount for each customer_id. The result is assigned back to df_sales_amount.

It adds a new column, sales_flg, to df_sales_amount by calling apply with a lambda function that returns 1 when the amount is greater than 2000 and 0 otherwise.

It displays the first 10 rows of the resulting DataFrame df_sales_amount.

The resulting DataFrame df_sales_amount thus contains customer_id, each customer's total spending, and a binary flag, sales_flg, indicating whether the customer's spending is high (greater than 2000).

Explanation:

This code builds a summary table of customer sales. Each line is explained below.

df_sales_amount = df_receipt.query('not customer_id.str.startswith("Z")', engine='python'): This line filters the receipt data to exclude customers whose ID starts with "Z". The filter is applied with the query method, and engine='python' is specified so that the .str method call in the expression is evaluated by Python itself, which avoids the error or warning that the default engine can otherwise produce. The filtered data is assigned to a new DataFrame named df_sales_amount.

df_sales_amount = df_sales_amount[['customer_id', 'amount']].groupby('customer_id').sum().reset_index(): This line groups df_sales_amount by customer_id, sums the amount column within each group, and resets the index so that customer_id becomes a regular column again. The result is a DataFrame holding the total sales amount for each customer.

df_sales_amount['sales_flg'] = np.where(df_sales_amount['amount'] > 2000, 1, 0): This line adds a new column named sales_flg to df_sales_amount. Its value depends on whether the customer's total sales amount (amount) is greater than 2000. The np.where function applies the condition and assigns 1 or 0 to the new column.

df_sales_amount.head(10): This line displays the first 10 rows of df_sales_amount. The displayed table shows customer_id, the total sales amount, and sales_flg, which is 1 when the total is greater than 2000 and 0 when it is 2000 or less.
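
A minimal sketch of the flagging logic on invented receipts. To stay independent of the query engine, the filter is written here as a boolean mask with str.startswith, which is an alternative to the query() call discussed above rather than the original line.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_receipt (IDs and amounts invented for illustration).
df_receipt = pd.DataFrame({
    "customer_id": ["CS001", "CS001", "CS002", "ZZ000"],
    "amount": [1500, 800, 2000, 9999],
})

# Alternative to query(): keep rows whose customer_id does not start with "Z".
df_sales_amount = df_receipt[~df_receipt["customer_id"].str.startswith("Z")]

# Total spending per customer.
df_sales_amount = (
    df_sales_amount[["customer_id", "amount"]].groupby("customer_id").sum().reset_index()
)

# 1 when the total is strictly greater than 2000, otherwise 0.
df_sales_amount["sales_flg"] = np.where(df_sales_amount["amount"] > 2000, 1, 0)
print(df_sales_amount.head(10))
```

Note that a total of exactly 2000 is flagged 0, because the comparison is strict.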

Explanation:

This code performs the following operations.

It copies the "customer_id" and "postal_cd" columns from the "df_customer" DataFrame and assigns them to a new DataFrame named "df_tmp".

It creates a new column, "postal_flg", in "df_tmp" based on the first three digits of the "postal_cd" column. If the first three digits are between 100 and 209, "postal_flg" is set to 1; otherwise it is set to 0.

It joins "df_tmp" and "df_receipt" on the "customer_id" column with an inner join.

It groups the resulting DataFrame by the "postal_flg" column and uses "nunique()" to count the unique customer IDs within each group.

It returns the resulting DataFrame, which shows the number of unique customer IDs grouped by "postal_flg".

In summary, this code counts how many customers with a purchase history live in a particular postal-code area and how many live elsewhere.

Explanation:

This code creates a new DataFrame, df_tmp, with a postal_flg column indicating whether each customer's postal code falls within a particular range. The code is explained step by step below.

df_tmp = df_customer[['customer_id', 'postal_cd']].copy(): Creates a new DataFrame, df_tmp, containing only the customer_id and postal_cd columns of df_customer.

df_tmp['postal_flg'] = np.where(df_tmp['postal_cd'].str[0:3].astype(int).between(100, 209), 1, 0): Creates a new postal_flg column in df_tmp with np.where. str[0:3] takes the first three characters of the postal code as a string, astype(int) converts them to an integer, and between checks whether that integer lies in the range 100 to 209, both ends included. If it does, 1 is assigned to postal_flg; otherwise 0 is assigned.

pd.merge(df_tmp, df_receipt, how='inner', on='customer_id'): Merges df_tmp and df_receipt on the customer_id column, keeping only the rows for customers present in both DataFrames.

groupby('postal_flg').agg({'customer_id':'nunique'}): Groups the merged DataFrame by postal_flg and uses nunique to count the unique customer_id values in each group. It returns a DataFrame with one row per postal_flg value, showing the number of customers in each group.
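
The four steps can be sketched on a handful of invented rows. The postal codes below are assumptions chosen so that the boundary values 100 and 209 both appear.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df_customer and df_receipt.
df_customer = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3"],
    "postal_cd": ["100-0001", "209-0022", "259-1113"],
})
df_receipt = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C3"],
    "amount": [1, 2, 3, 4],
})

df_tmp = df_customer[["customer_id", "postal_cd"]].copy()

# First three digits as an integer; between() includes both 100 and 209.
df_tmp["postal_flg"] = np.where(
    df_tmp["postal_cd"].str[0:3].astype(int).between(100, 209), 1, 0
)

# Only customers with at least one receipt survive the inner join.
result = (
    pd.merge(df_tmp, df_receipt, how="inner", on="customer_id")
    .groupby("postal_flg")
    .agg({"customer_id": "nunique"})
)
print(result)
```

Counting with nunique matters here: C1 has two receipts but is counted once.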

Explanation:

This code derives a prefecture code from the address column of the df_customer DataFrame and creates a new DataFrame, df_customer_tmp, with the columns 'customer_id', 'address', and 'prefecture_cd'.

The code first uses copy() to copy the 'customer_id' and 'address' columns of df_customer and assigns them to the variable df_customer_tmp.

df_customer_tmp = df_customer[['customer_id', 'address']].copy()

Next, it uses the str accessor to extract the first three characters of the 'address' column and the map() method to assign the matching code to a new column, 'prefecture_cd'.

df_customer_tmp['prefecture_cd'] = df_customer['address'].str[0:3].map({'埼玉県': '11', '千葉県': '12', '東京都': '13', '神奈川': '14'})

The map() method maps each prefecture name to its prefecture code. Because only the first three characters are taken, the key for 神奈川県 is the three-character prefix 神奈川. The resulting DataFrame df_customer_tmp has the columns 'customer_id', 'address', and 'prefecture_cd'.

Explanation:

This code creates a new DataFrame, df_customer_tmp, by selecting the customer_id and address columns from the existing DataFrame df_customer.

Next, it uses a regular expression to extract the prefecture name from the address column. Specifically, str.extract() extracts the substring that matches the pattern (^.*?[都道府県]). This means "any characters from the start of the string up to the first occurrence of '都', '道', '府', or '県'".

The extracted prefecture name is then mapped to its prefecture code to create a new column, prefecture_cd, in df_customer_tmp. The mapping is done with map() and a dictionary whose keys are prefecture names and whose values are the corresponding codes.

As a result, df_customer_tmp has three columns, customer_id, address, and prefecture_cd, where prefecture_cd is a two-character string holding the prefecture code. head(10) is used to display the first 10 rows of the DataFrame.
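
A minimal sketch of the regular-expression approach. The four addresses are invented for illustration, and the dictionary covers only the four prefectures named in the slicing example earlier in this article.

```python
import pandas as pd

# Hypothetical stand-in for df_customer (addresses invented for illustration).
df_customer = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4"],
    "address": ["埼玉県さいたま市", "千葉県千葉市", "東京都新宿区", "神奈川県横浜市"],
})

df_customer_tmp = df_customer[["customer_id", "address"]].copy()

# Lazily match from the start up to the first 都, 道, 府, or 県.
prefecture = df_customer["address"].str.extract(r"(^.*?[都道府県])")[0]

df_customer_tmp["prefecture_cd"] = prefecture.map(
    {"埼玉県": "11", "千葉県": "12", "東京都": "13", "神奈川県": "14"}
)
print(df_customer_tmp.head(10))
```

Unlike the three-character slice, the pattern captures the full name 神奈川県. It would stop early on a name such as 京都府, where 都 appears before 府, so the pattern is only safe for prefecture names where that cannot happen.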

Explanation:

This code computes the quartiles of the per-customer sales amounts and assigns each customer to one of four groups according to where their sales amount falls.

The code is analyzed step by step below.

df_sales_amount = df_receipt[['customer_id', 'amount']].groupby('customer_id').sum().reset_index(): Groups the df_receipt DataFrame by customer_id and sums the amount column to compute each customer's total sales amount. The resulting DataFrame has two columns, customer_id and amount.

pct25 = np.quantile(df_sales_amount['amount'], 0.25); pct50 = np.quantile(df_sales_amount['amount'], 0.5); pct75 = np.quantile(df_sales_amount['amount'], 0.75): Compute the 25th, 50th, and 75th percentiles of the sales-amount distribution.

def pct_group(x): ... Defines a function, pct_group, that takes a sales amount x and returns an integer from 1 to 4 indicating the quartile to which x belongs.

df_sales_amount['pct_group'] = df_sales_amount['amount'].apply(pct_group): Applies pct_group to every value of the amount column of df_sales_amount and assigns the resulting quartile number to a new column named pct_group.

df_sales_amount.head(10): Displays the first 10 rows of the resulting DataFrame, which has three columns: customer_id, amount, and pct_group.
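
The body of pct_group is elided above, so the version below is only one plausible reading: it assumes each group includes its lower boundary and excludes its upper one. The eight customer totals are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical per-customer totals (values invented for illustration).
df_sales_amount = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8"],
    "amount": [100, 200, 300, 400, 500, 600, 700, 800],
})

pct25 = np.quantile(df_sales_amount["amount"], 0.25)
pct50 = np.quantile(df_sales_amount["amount"], 0.5)
pct75 = np.quantile(df_sales_amount["amount"], 0.75)


def pct_group(x):
    """Return 1-4 for the quartile of x (boundary handling is an assumption)."""
    if x < pct25:
        return 1
    elif x < pct50:
        return 2
    elif x < pct75:
        return 3
    else:
        return 4


df_sales_amount["pct_group"] = df_sales_amount["amount"].apply(pct_group)
print(df_sales_amount.head(10))
```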

Explanation:

This code creates a DataFrame 'df_temp' that holds each customer's total sales amount, and adds a new column, 'pct_group', holding the percentile group to which the customer belongs based on that total.

The first line groups the 'df_receipt' DataFrame by customer_id to compute each customer's total sales amount and resets the index so that customer_id becomes a column again.

The next four lines compute the 25th, 50th, and 75th percentiles and the maximum of the sales amounts from the 'df_sales_amount' DataFrame, which is identical to the 'df_temp' DataFrame created above.

The following line divides the sales amounts into four bins at the percentile values just defined and creates a new column, 'quantile', in 'df_temp'.

The last line creates a new column, 'pct_group', in 'df_temp' by grouping the data on 'quantile' and assigning each customer a group number (1 to 4) according to the percentile group they belong to.

Note that the right edge of the last bin is defined as 'pct_max+0.1' so that the maximum sales amount is included in the last percentile group.

Explanation:

This code performs the following operations.

It groups the df_receipt DataFrame by customer_id and uses the sum() method of the grouped object to compute the total amount for each group. The result is stored in a new DataFrame named df_temp.

It uses the qcut() function to divide the amount values of df_temp into four bins containing equal numbers of rows. The retbins=True option returns the bin edges, which are stored in the variable bins.

A new column named quantile is created in df_temp to record which bin each amount falls into.

It uses the ngroup() method to assign a group number to each bin in df_temp['quantile']. Because ngroup() returns zero-based values, 1 is added to the result.

reset_index() is called on df_temp to replace the index with a new range index.

Finally, the first 10 rows of df_temp and the bin edges are shown with display() and print() respectively.
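
The qcut-based steps can be sketched as follows, using eight invented receipts, one per customer. observed=True is passed to groupby here only to keep the behaviour explicit for categorical keys; it is an addition made for this sketch, not something stated in the text above.

```python
import pandas as pd

# Hypothetical stand-in for df_receipt (values invented for illustration).
df_receipt = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8"],
    "amount": [100, 200, 300, 400, 500, 600, 700, 800],
})

# Total amount per customer, indexed by customer_id.
df_temp = df_receipt.groupby("customer_id")[["amount"]].sum()

# Four quantile-based bins; retbins=True also returns the bin edges.
df_temp["quantile"], bins = pd.qcut(df_temp["amount"], 4, retbins=True)

# ngroup() numbers the bins from 0, so add 1 to get groups 1-4.
df_temp["pct_group"] = df_temp.groupby("quantile", observed=True).ngroup() + 1

df_temp = df_temp.reset_index()
print(df_temp.head(10))
print("quantiles:", bins)
```

With qcut the bin edges come from the data, so there is no need to compute the percentiles or pad the maximum by hand.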

Explanation:

This code creates a new DataFrame, df_customer_era, containing the columns customer_id, birth_day, and era.

The customer_id and birth_day columns are taken from the original df_customer DataFrame.

The era column is computed from the age column of df_customer. For each age value, the code rounds down to the decade (a multiple of 10), and if the result is greater than 60 the era is set to 60.

The apply method applies a function (here a lambda function) to each element of the age column, and the decade is computed with the min and math.floor functions. The resulting values are assigned to the era column of df_customer_era.

Finally, head is called to display the first 10 rows of the resulting DataFrame.

Explanation:

This code creates a new DataFrame, df_customer_era, with the columns customer_id, birth_day, and era. The customer_id and birth_day columns are copied from the original df_customer DataFrame.

The era column is created by binning the customers' ages into age groups with pd.cut(). The bins parameter specifies the edges of each bin, and np.inf is used to define an open-ended bin for customers aged 60 and over. The right parameter is set to False so that each interval is closed on the left and open on the right. A customer who is exactly 30 years old is therefore placed in the [30, 40) bin.

For each row of the DataFrame, the customer's age is placed in the matching bin. The resulting DataFrame contains the columns customer_id, birth_day, and era, where era holds each customer's age bin.
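
A minimal sketch of the binning, assuming bin edges at every ten years up to 60 followed by np.inf. The ages and birth dates are invented for illustration and include the boundary cases 30 and 60.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_customer.
df_customer = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4", "C5"],
    "birth_day": ["2009-05-01", "1988-02-10", "1979-07-21", "1958-11-30", "1933-01-15"],
    "age": [9, 30, 39, 60, 85],
})

df_customer_era = df_customer[["customer_id", "birth_day"]].copy()

# right=False makes every bin closed on the left: [0, 10), [10, 20), ...
df_customer_era["era"] = pd.cut(
    df_customer["age"],
    bins=[0, 10, 20, 30, 40, 50, 60, np.inf],
    right=False,
)
print(df_customer_era.head(10))
```

With the default right=True, an age of exactly 30 would land in (20, 30] instead, which is why the parameter matters for decade boundaries.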

Explanation:

This code creates a new column, "era", in the DataFrame "df_customer_era" to represent the decade of each customer's age. The value is obtained by dividing the age by 10, rounding down, and multiplying by 10. If the result is greater than 60, it is capped at 60.

The code then creates a new column, "gender_era", by concatenating the gender code with the value of the "era" column. The "gender_cd" column contains the gender code (for example, "0" for male and "1" for female), and the "era" value is cast to a string and zero-padded to a width of 2.

Finally, the code displays the first 10 rows of the resulting DataFrame.
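
The capped decade and the combined code can be sketched as follows. The three customers are invented for illustration, with ages chosen to show the zero padding and the cap at 60.

```python
import math

import pandas as pd

# Hypothetical stand-in for df_customer.
df_customer = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3"],
    "birth_day": ["2013-03-03", "1981-06-15", "1936-09-09"],
    "gender_cd": ["0", "1", "9"],
    "age": [5, 37, 82],
})

df_customer_era = df_customer[["customer_id", "birth_day"]].copy()

# Round down to the decade and cap at 60: 5 -> 0, 37 -> 30, 82 -> 60.
df_customer_era["era"] = df_customer["age"].apply(
    lambda x: min(math.floor(x / 10) * 10, 60)
)

# Gender code followed by the decade padded to two digits, e.g. "1" + "30".
df_customer_era["gender_era"] = (
    df_customer["gender_cd"] + df_customer_era["era"].astype("str").str.zfill(2)
)
print(df_customer_era.head(10))
```

The zero padding keeps every code three characters long, so the under-ten group becomes "000" rather than "00".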

Explanation:

This code creates dummy variables for the categorical variable 'gender_cd' in the DataFrame 'df_customer'. The resulting DataFrame has one column for each category of 'gender_cd' (presumably male and female, among others), and each row has 1 in the column matching the 'gender_cd' value of the original customer record and 0 in all the other dummy columns.

The pandas 'get_dummies' function is used for this. It takes a DataFrame and a list of columns to dummy-encode. Here it is called on 'df_customer', with the 'gender_cd' column given in the 'columns' parameter.

The resulting DataFrame has one column per value of 'gender_cd' (presumably 0 for male and 1 for female or the reverse, depending on the encoding used in the source data). Each row has 1 in the column that corresponds to its 'gender_cd' value and 0 in all the others. The first 10 rows of the result are returned.

Explanation:

This code creates dummy variables for the categorical variable gender_cd in the df_customer DataFrame. It uses the pandas pd.get_dummies() function to create a binary indicator variable for each unique category of gender_cd.

The drop_first=True argument drops the first category ('0'), which then serves as the reference category. This is done to avoid multicollinearity among the dummy variables.

The prefix='gen' and prefix_sep='#' arguments add a prefix to the column names of the dummy variables. Here the prefix is 'gen' and the separator is '#'.

The resulting DataFrame has one column for each remaining unique value of gender_cd. A value of 1 means the customer belongs to that category and 0 means they do not. The column names have the form gen#<gender_cd>, where <gender_cd> is a unique value of the original gender_cd column.
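
A minimal sketch of the call with all three arguments, on four invented customers. The result is stored in a name chosen for this example, df_dummies.

```python
import pandas as pd

# Hypothetical stand-in for df_customer.
df_customer = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4"],
    "gender_cd": ["0", "1", "9", "1"],
})

df_dummies = pd.get_dummies(
    df_customer[["customer_id", "gender_cd"]],
    columns=["gender_cd"],
    drop_first=True,   # drops the first category, "0"
    prefix="gen",
    prefix_sep="#",
)
print(df_dummies.head(10))
```

Depending on the pandas version, the indicator columns may be shown as 0/1 or as False/True; calling .astype(int) on them gives 0/1 in either case.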

Explanation:

This code performs the following operations.

It filters out the rows of df_receipt whose customer_id starts with "Z".

It groups the remaining rows by customer_id and uses the agg method to compute the sum of the amount column for each group.

It resets the index of the resulting DataFrame so that customer_id becomes a regular column.

It applies scikit-learn's preprocessing.scale function to the amount column to compute the z-score of each customer's total sales.

It adds a new column named std_amount to the DataFrame and stores the z-scores computed in the previous step.

As a result, the DataFrame df_sales_amount contains one row per customer, with the customer ID, the total sales amount, and std_amount, the z-score of the sales.

Explanation:

This code standardizes (also called z-score normalization) the 'amount' column of the 'df_sales_amount' DataFrame.

The code first uses the 'query' method of the pandas DataFrame to filter out the transactions of customers whose ID starts with 'Z'. It then groups the remaining transactions by 'customer_id' and uses the 'agg' method to compute the total amount spent by each customer. Finally, it resets the index of the resulting DataFrame.

Next, the code creates an instance of the 'StandardScaler' class from the 'preprocessing' module of scikit-learn. 'StandardScaler' is used for standardization: it scales data so that the mean is 0 and the standard deviation is 1. The 'fit' method of the 'scaler' instance is then used to compute the mean and standard deviation of the 'amount' column of 'df_sales_amount'.

Finally, the 'transform' method of the 'scaler' instance performs the actual standardization of the 'amount' column of 'df_sales_amount', and the result is stored in a new column named 'std_amount'.

The resulting DataFrame 'df_sales_amount' contains 'customer_id', the total 'amount' spent by each customer, and the standardized 'std_amount' column.
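
To make the arithmetic behind the scaler concrete, the sketch below computes z-scores by hand with pandas instead of calling scikit-learn. It assumes the population standard deviation (ddof=0), which is the convention scikit-learn's standardization uses; the three totals are invented for illustration.

```python
import pandas as pd

# Hypothetical per-customer totals (values invented for illustration).
df_sales_amount = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3"],
    "amount": [100.0, 200.0, 300.0],
})

# Equivalent of fit(): learn the mean and the population standard deviation.
mean = df_sales_amount["amount"].mean()
std = df_sales_amount["amount"].std(ddof=0)

# Equivalent of transform(): subtract the mean and divide by the deviation.
df_sales_amount["std_amount"] = (df_sales_amount["amount"] - mean) / std
print(df_sales_amount)
```

pandas' own std() defaults to the sample standard deviation (ddof=1), so leaving out ddof=0 would give slightly smaller z-scores than the scaler.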

Explanation:

This code preprocesses the pandas DataFrame df_receipt. Each step of the code is explained below.

It uses the query() method with engine='python' to filter out the rows whose customer ID starts with "Z". The resulting DataFrame is assigned to df_sales_amount.

It groups df_sales_amount by customer ID with groupby() and computes the sum of the amount column for each group with agg(). The resulting DataFrame is assigned to df_sales_amount.

It resets the index of df_sales_amount with reset_index(), turning customer_id into a regular column.

It applies the minmax_scale() function from scikit-learn's preprocessing module to the amount column of df_sales_amount, scaling the values to the range 0 to 1.

In short, this code filters out the rows of particular customers (those whose ID starts with 'Z'), computes each remaining customer's total sales amount, and scales those totals to between 0 and 1 with min-max scaling.

Explanation:

This code performs feature scaling on the per-customer sales amounts of a retail store. Each part is explained below.

df_sales_amount: Creates a new DataFrame containing customer_id and the total amount spent by each customer, obtained by grouping the df_receipt DataFrame by customer_id and summing the amount in each group.

scaler: Initializes a MinMaxScaler object from scikit-learn's preprocessing module, which scales data to the range [0, 1].

scaler.fit(): Fits the scaler object to the amount column of df_sales_amount, computing the minimum and maximum of the data.

df_sales_amount['scale_amount']: Creates a new column in df_sales_amount to store the scaled amounts.

scaler.transform(): Scales each customer's amount using the minimum and maximum learned by the scaler object.

df_sales_amount.head(10): Displays the first 10 rows of the updated df_sales_amount DataFrame, including the scaled amounts in the new scale_amount column.

Overall, this code applies feature scaling to bring every customer's amount into the same range, which makes the values comparable and easier to analyze.
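
The arithmetic behind min-max scaling can be written out by hand. The sketch below uses pandas only and three invented totals; it illustrates the formula rather than replacing the scikit-learn calls described above.

```python
import pandas as pd

# Hypothetical per-customer totals (values invented for illustration).
df_sales_amount = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3"],
    "amount": [100.0, 200.0, 500.0],
})

# Equivalent of fit(): learn the minimum and maximum.
amount_min = df_sales_amount["amount"].min()
amount_max = df_sales_amount["amount"].max()

# Equivalent of transform(): map the minimum to 0 and the maximum to 1.
df_sales_amount["scale_amount"] = (
    (df_sales_amount["amount"] - amount_min) / (amount_max - amount_min)
)
print(df_sales_amount.head(10))
```

Because the result depends only on the minimum and maximum, a single extreme customer can compress everyone else's values toward 0.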
