Cohort Analysis and RFM Segmentation#
Cohort: a group of subjects sharing a defining characteristic.
It can be viewed across time, product version, etc.
Cohorts are used in medicine, psychology, econometrics, ecology and many other areas to perform a cross-section (compare differences across subjects) at intervals through time.
Types of cohorts
Time cohorts are customers who signed up for a product or service during a particular time frame. Analyzing these cohorts shows the customers’ behavior depending on the time they started using the company’s products or services. The time frame may be daily, monthly or quarterly.
Behavior cohorts are customers who purchased a product or subscribed to a service in the past. They group customers by the type of product or service they signed up for. Customers who signed up for basic-level services might have different needs than those who signed up for advanced services. Understanding the needs of the various cohorts can help a company design custom-made services or products for particular segments.
Size cohorts refer to the various sizes of customers who purchase a company’s products or services. This categorization can be based on the amount of spending in some period after acquisition, or on the product type on which the customer spent most of their order amount in a given period.
Imports#
import pandas as pd
import scipy as sp
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import numpy as np
import pickle
import joblib
import itertools
import datetime as dt
from sklearn.linear_model import LogisticRegression
from aiking.data.external import *
sns.set()
Read and clean the dataset#
# path = untar_data("kaggle_datasets::jihyeseo/online-retail-data-set-from-uci-ml-repo"); path
# !ls /Landmark2/pdo/aiking/archive/ # -d /Landmark2/pdo/aiking/data/online-retail-data-set-from-uci-ml-repo
# !ls /Landmark2/pdo/aiking/data/
df = pd.read_excel('Online Retail.xlsx'); df.head()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
df.columns
Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
'UnitPrice', 'CustomerID', 'Country'],
dtype='object')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 InvoiceNo 541909 non-null object
1 StockCode 541909 non-null object
2 Description 540455 non-null object
3 Quantity 541909 non-null int64
4 InvoiceDate 541909 non-null datetime64[ns]
5 UnitPrice 541909 non-null float64
6 CustomerID 406829 non-null float64
7 Country 541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB
df.isnull().sum()
InvoiceNo 0
StockCode 0
Description 1454
Quantity 0
InvoiceDate 0
UnitPrice 0
CustomerID 135080
Country 0
dtype: int64
df = df.dropna(subset=['CustomerID']); df.isnull().sum().sum()
0
df.duplicated().sum()
5225
df = df.drop_duplicates()
df.describe()
Quantity | UnitPrice | CustomerID | |
---|---|---|---|
count | 401604.000000 | 401604.000000 | 401604.000000 |
mean | 12.183273 | 3.474064 | 15281.160818 |
std | 250.283037 | 69.764035 | 1714.006089 |
min | -80995.000000 | 0.000000 | 12346.000000 |
25% | 2.000000 | 1.250000 | 13939.000000 |
50% | 5.000000 | 1.950000 | 15145.000000 |
75% | 12.000000 | 3.750000 | 16784.000000 |
max | 80995.000000 | 38970.000000 | 18287.000000 |
df = df[(df['Quantity']>0) & (df['UnitPrice']>0)]; df.describe()
Quantity | UnitPrice | CustomerID | |
---|---|---|---|
count | 392692.000000 | 392692.000000 | 392692.000000 |
mean | 13.119702 | 3.125914 | 15287.843865 |
std | 180.492832 | 22.241836 | 1713.539549 |
min | 1.000000 | 0.001000 | 12346.000000 |
25% | 2.000000 | 1.250000 | 13955.000000 |
50% | 6.000000 | 1.950000 | 15150.000000 |
75% | 12.000000 | 3.750000 | 16791.000000 |
max | 80995.000000 | 8142.750000 | 18287.000000 |
df.shape
(392692, 8)
Cohort Analysis#
We need to create three labels (a vectorized sketch follows this list)
Invoice period - year & month of a single transaction (InvoiceMonth)
Cohort group - year & month of the customer's first purchase (CohortMonth)
Cohort period / cohort index - the customer's stage in their lifetime as an integer: the number of months elapsed since the first purchase, plus one, so that the first month has index 1 (CohortIndex)
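A minimal sketch of the same three labels using pandas Periods, as an alternative to the .apply/.dt arithmetic used in the cells below (it assumes df has the InvoiceDate and CustomerID columns shown above; the variable names are illustrative):
# Sketch (alternative, not the notebook's cells): month-level labels via pandas Periods
invoice_month = df['InvoiceDate'].dt.to_period('M')                      # invoice period
cohort_month = invoice_month.groupby(df['CustomerID']).transform('min')  # first purchase month
cohort_index = ((invoice_month.dt.year - cohort_month.dt.year) * 12
                + (invoice_month.dt.month - cohort_month.dt.month) + 1)  # 1 = first month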
df.nunique()
InvoiceNo 18532
StockCode 3665
Description 3877
Quantity 301
InvoiceDate 17282
UnitPrice 440
CustomerID 4338
Country 37
dtype: int64
sample = df[df['CustomerID'].isin([12347.0, 18283.0, 18287.0])].reset_index(drop=True); sample
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | |
---|---|---|---|---|---|---|---|---|
0 | 537626 | 85116 | BLACK CANDELABRA T-LIGHT HOLDER | 12 | 2010-12-07 14:57:00 | 2.10 | 12347.0 | Iceland |
1 | 537626 | 22375 | AIRLINE BAG VINTAGE JET SET BROWN | 4 | 2010-12-07 14:57:00 | 4.25 | 12347.0 | Iceland |
2 | 537626 | 71477 | COLOUR GLASS. STAR T-LIGHT HOLDER | 12 | 2010-12-07 14:57:00 | 3.25 | 12347.0 | Iceland |
3 | 537626 | 22492 | MINI PAINT SET VINTAGE | 36 | 2010-12-07 14:57:00 | 0.65 | 12347.0 | Iceland |
4 | 537626 | 22771 | CLEAR DRAWER KNOB ACRYLIC EDWARDIAN | 12 | 2010-12-07 14:57:00 | 1.25 | 12347.0 | Iceland |
... | ... | ... | ... | ... | ... | ... | ... | ... |
968 | 581180 | 20719 | WOODLAND CHARLOTTE BAG | 10 | 2011-12-07 15:52:00 | 0.85 | 12347.0 | Iceland |
969 | 581180 | 21265 | PINK GOOSE FEATHER TREE 60CM | 12 | 2011-12-07 15:52:00 | 1.95 | 12347.0 | Iceland |
970 | 581180 | 23271 | CHRISTMAS TABLE SILVER CANDLE SPIKE | 16 | 2011-12-07 15:52:00 | 0.83 | 12347.0 | Iceland |
971 | 581180 | 23506 | MINI PLAYING CARDS SPACEBOY | 20 | 2011-12-07 15:52:00 | 0.42 | 12347.0 | Iceland |
972 | 581180 | 23508 | MINI PLAYING CARDS DOLLY GIRL | 20 | 2011-12-07 15:52:00 | 0.42 | 12347.0 | Iceland |
973 rows × 8 columns
sample['InvoiceMonth'] = sample['InvoiceDate'].apply(lambda x: dt.datetime(x.year, x.month, 1)); sample.head()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | InvoiceMonth | |
---|---|---|---|---|---|---|---|---|---|
0 | 537626 | 85116 | BLACK CANDELABRA T-LIGHT HOLDER | 12 | 2010-12-07 14:57:00 | 2.10 | 12347.0 | Iceland | 2010-12-01 |
1 | 537626 | 22375 | AIRLINE BAG VINTAGE JET SET BROWN | 4 | 2010-12-07 14:57:00 | 4.25 | 12347.0 | Iceland | 2010-12-01 |
2 | 537626 | 71477 | COLOUR GLASS. STAR T-LIGHT HOLDER | 12 | 2010-12-07 14:57:00 | 3.25 | 12347.0 | Iceland | 2010-12-01 |
3 | 537626 | 22492 | MINI PAINT SET VINTAGE | 36 | 2010-12-07 14:57:00 | 0.65 | 12347.0 | Iceland | 2010-12-01 |
4 | 537626 | 22771 | CLEAR DRAWER KNOB ACRYLIC EDWARDIAN | 12 | 2010-12-07 14:57:00 | 1.25 | 12347.0 | Iceland | 2010-12-01 |
# sample['CohortMonth'] =
sample['CohortMonth'] = sample.groupby('CustomerID')['InvoiceMonth'].transform('min'); sample.head()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | InvoiceMonth | CohortMonth | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 537626 | 85116 | BLACK CANDELABRA T-LIGHT HOLDER | 12 | 2010-12-07 14:57:00 | 2.10 | 12347.0 | Iceland | 2010-12-01 | 2010-12-01 |
1 | 537626 | 22375 | AIRLINE BAG VINTAGE JET SET BROWN | 4 | 2010-12-07 14:57:00 | 4.25 | 12347.0 | Iceland | 2010-12-01 | 2010-12-01 |
2 | 537626 | 71477 | COLOUR GLASS. STAR T-LIGHT HOLDER | 12 | 2010-12-07 14:57:00 | 3.25 | 12347.0 | Iceland | 2010-12-01 | 2010-12-01 |
3 | 537626 | 22492 | MINI PAINT SET VINTAGE | 36 | 2010-12-07 14:57:00 | 0.65 | 12347.0 | Iceland | 2010-12-01 | 2010-12-01 |
4 | 537626 | 22771 | CLEAR DRAWER KNOB ACRYLIC EDWARDIAN | 12 | 2010-12-07 14:57:00 | 1.25 | 12347.0 | Iceland | 2010-12-01 | 2010-12-01 |
sample['CohortIndex'] = (sample['InvoiceMonth'].dt.year-sample['CohortMonth'].dt.year)*12\
+ sample['InvoiceMonth'].dt.month-sample['CohortMonth'].dt.month\
+ 1; sample.head()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | InvoiceMonth | CohortMonth | CohortIndex | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 537626 | 85116 | BLACK CANDELABRA T-LIGHT HOLDER | 12 | 2010-12-07 14:57:00 | 2.10 | 12347.0 | Iceland | 2010-12-01 | 2010-12-01 | 1 |
1 | 537626 | 22375 | AIRLINE BAG VINTAGE JET SET BROWN | 4 | 2010-12-07 14:57:00 | 4.25 | 12347.0 | Iceland | 2010-12-01 | 2010-12-01 | 1 |
2 | 537626 | 71477 | COLOUR GLASS. STAR T-LIGHT HOLDER | 12 | 2010-12-07 14:57:00 | 3.25 | 12347.0 | Iceland | 2010-12-01 | 2010-12-01 | 1 |
3 | 537626 | 22492 | MINI PAINT SET VINTAGE | 36 | 2010-12-07 14:57:00 | 0.65 | 12347.0 | Iceland | 2010-12-01 | 2010-12-01 | 1 |
4 | 537626 | 22771 | CLEAR DRAWER KNOB ACRYLIC EDWARDIAN | 12 | 2010-12-07 14:57:00 | 1.25 | 12347.0 | Iceland | 2010-12-01 | 2010-12-01 | 1 |
sample_data = sample.groupby(['CohortMonth', 'CohortIndex'])['CustomerID'].nunique().reset_index();
sample_data
CohortMonth | CohortIndex | CustomerID | |
---|---|---|---|
0 | 2010-12-01 | 1 | 1 |
1 | 2010-12-01 | 2 | 1 |
2 | 2010-12-01 | 5 | 1 |
3 | 2010-12-01 | 7 | 1 |
4 | 2010-12-01 | 9 | 1 |
5 | 2010-12-01 | 11 | 1 |
6 | 2010-12-01 | 13 | 1 |
7 | 2011-01-01 | 1 | 1 |
8 | 2011-01-01 | 2 | 1 |
9 | 2011-01-01 | 4 | 1 |
10 | 2011-01-01 | 5 | 1 |
11 | 2011-01-01 | 6 | 1 |
12 | 2011-01-01 | 7 | 1 |
13 | 2011-01-01 | 9 | 1 |
14 | 2011-01-01 | 10 | 1 |
15 | 2011-01-01 | 11 | 1 |
16 | 2011-01-01 | 12 | 1 |
17 | 2011-05-01 | 1 | 1 |
18 | 2011-05-01 | 6 | 1 |
sample_data.pivot(index='CohortMonth',
columns='CohortIndex',
values='CustomerID')
CohortIndex | 1 | 2 | 4 | 5 | 6 | 7 | 9 | 10 | 11 | 12 | 13 |
---|---|---|---|---|---|---|---|---|---|---|---|
CohortMonth | |||||||||||
2010-12-01 | 1.0 | 1.0 | NaN | 1.0 | NaN | 1.0 | 1.0 | NaN | 1.0 | NaN | 1.0 |
2011-01-01 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN |
2011-05-01 | 1.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN |
df['InvoiceMonth'] = df['InvoiceDate'].apply(lambda x: dt.datetime(x.year, x.month, 1))
# df['InvoiceMonth'].unique()
df['CohortMonth'] = df.groupby('CustomerID')['InvoiceMonth'].transform('min')
# df['CohortMonth'].unique()
df['CohortIndex'] = (df['InvoiceMonth'].dt.year-df['CohortMonth'].dt.year)*12\
+ df['InvoiceMonth'].dt.month-df['CohortMonth'].dt.month\
+ 1; df.head()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | InvoiceMonth | CohortMonth | CohortIndex | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12-01 | 1 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12-01 | 1 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12-01 | 1 |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12-01 | 1 |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12-01 | 1 |
df.CohortIndex.unique()
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13])
Retention Rate Table#
cohort_counts = df.groupby(['CohortMonth', 'CohortIndex'])[['CustomerID']].count().unstack(); cohort_counts
CustomerID | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CohortIndex | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
CohortMonth | |||||||||||||
2010-12-01 | 25670.0 | 10111.0 | 8689.0 | 11121.0 | 9628.0 | 11946.0 | 11069.0 | 11312.0 | 11316.0 | 14098.0 | 13399.0 | 21677.0 | 7173.0 |
2011-01-01 | 10877.0 | 2191.0 | 3012.0 | 2290.0 | 3603.0 | 3214.0 | 2776.0 | 2844.0 | 3768.0 | 4987.0 | 6248.0 | 1334.0 | NaN |
2011-02-01 | 8826.0 | 1388.0 | 1909.0 | 2487.0 | 2266.0 | 2012.0 | 2241.0 | 2720.0 | 2940.0 | 2916.0 | 451.0 | NaN | NaN |
2011-03-01 | 11349.0 | 1421.0 | 2598.0 | 2372.0 | 2435.0 | 2103.0 | 2942.0 | 3528.0 | 4214.0 | 967.0 | NaN | NaN | NaN |
2011-04-01 | 7185.0 | 1398.0 | 1284.0 | 1296.0 | 1343.0 | 2007.0 | 1869.0 | 2130.0 | 513.0 | NaN | NaN | NaN | NaN |
2011-05-01 | 6041.0 | 1075.0 | 906.0 | 917.0 | 1493.0 | 2329.0 | 1949.0 | 764.0 | NaN | NaN | NaN | NaN | NaN |
2011-06-01 | 5646.0 | 905.0 | 707.0 | 1511.0 | 1738.0 | 2545.0 | 616.0 | NaN | NaN | NaN | NaN | NaN | NaN |
2011-07-01 | 4938.0 | 501.0 | 1314.0 | 1336.0 | 1760.0 | 517.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-08-01 | 4818.0 | 1591.0 | 2831.0 | 2801.0 | 899.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-09-01 | 8225.0 | 2336.0 | 2608.0 | 862.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-10-01 | 11500.0 | 3499.0 | 869.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-11-01 | 10821.0 | 1100.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-12-01 | 961.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
cohort_size = cohort_counts.iloc[:,0]; cohort_size
CohortMonth
2010-12-01 25670.0
2011-01-01 10877.0
2011-02-01 8826.0
2011-03-01 11349.0
2011-04-01 7185.0
2011-05-01 6041.0
2011-06-01 5646.0
2011-07-01 4938.0
2011-08-01 4818.0
2011-09-01 8225.0
2011-10-01 11500.0
2011-11-01 10821.0
2011-12-01 961.0
Name: (CustomerID, 1), dtype: float64
retention = (cohort_counts.divide(cohort_size, axis=0)*100).round(3)
retention.columns = retention.columns.droplevel()
retention.index = retention.index.astype('str')
retention
CohortIndex | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CohortMonth | |||||||||||||
2010-12-01 | 100.0 | 39.388 | 33.849 | 43.323 | 37.507 | 46.537 | 43.120 | 44.067 | 44.083 | 54.920 | 52.197 | 84.445 | 27.943 |
2011-01-01 | 100.0 | 20.143 | 27.691 | 21.054 | 33.125 | 29.549 | 25.522 | 26.147 | 34.642 | 45.849 | 57.442 | 12.264 | NaN |
2011-02-01 | 100.0 | 15.726 | 21.629 | 28.178 | 25.674 | 22.796 | 25.391 | 30.818 | 33.311 | 33.039 | 5.110 | NaN | NaN |
2011-03-01 | 100.0 | 12.521 | 22.892 | 20.901 | 21.456 | 18.530 | 25.923 | 31.086 | 37.131 | 8.521 | NaN | NaN | NaN |
2011-04-01 | 100.0 | 19.457 | 17.871 | 18.038 | 18.692 | 27.933 | 26.013 | 29.645 | 7.140 | NaN | NaN | NaN | NaN |
2011-05-01 | 100.0 | 17.795 | 14.998 | 15.180 | 24.714 | 38.553 | 32.263 | 12.647 | NaN | NaN | NaN | NaN | NaN |
2011-06-01 | 100.0 | 16.029 | 12.522 | 26.762 | 30.783 | 45.076 | 10.910 | NaN | NaN | NaN | NaN | NaN | NaN |
2011-07-01 | 100.0 | 10.146 | 26.610 | 27.055 | 35.642 | 10.470 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-08-01 | 100.0 | 33.022 | 58.759 | 58.136 | 18.659 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-09-01 | 100.0 | 28.401 | 31.708 | 10.480 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-10-01 | 100.0 | 30.426 | 7.557 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-11-01 | 100.0 | 10.165 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-12-01 | 100.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
fig, ax = plt.subplots(1,1, figsize=(12,8))
sns.heatmap(data=retention, annot=True, fmt='0.0f', ax=ax, vmin=0.0, vmax=100, cmap="BuPu_r").set(title='Retention Rate')
plt.show()
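Note that cohort_counts above counts invoice lines per cohort and index. To express retention as the share of unique customers still active, as in the sample walk-through, a sketch along these lines could be used instead (cohort_customers and customer_retention are illustrative names):
# Sketch: retention based on unique customers rather than invoice-line counts
cohort_customers = df.groupby(['CohortMonth', 'CohortIndex'])['CustomerID'].nunique().unstack()
customer_retention = (cohort_customers.divide(cohort_customers.iloc[:, 0], axis=0)*100).round(3)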
Average Quantity Purchased#
cohort_counts = df.groupby(['CohortMonth', 'CohortIndex'])[['Quantity']].mean().unstack(); cohort_counts
Quantity | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CohortIndex | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
CohortMonth | |||||||||||||
2010-12-01 | 12.117180 | 15.670062 | 15.725860 | 15.931121 | 13.625364 | 14.922736 | 16.113199 | 15.638083 | 18.207405 | 17.700028 | 19.047018 | 13.599991 | 15.383382 |
2011-01-01 | 17.471086 | 13.471931 | 12.707503 | 15.283843 | 12.845407 | 15.388923 | 14.974063 | 14.991561 | 11.628981 | 10.623621 | 9.597151 | 10.184408 | NaN |
2011-02-01 | 11.201903 | 13.740634 | 19.032478 | 12.045838 | 12.335834 | 12.330517 | 13.571174 | 13.401471 | 10.965646 | 12.416324 | 13.390244 | NaN | NaN |
2011-03-01 | 9.962552 | 11.741027 | 13.310624 | 10.120573 | 13.756057 | 13.014265 | 13.456492 | 13.851474 | 11.324869 | 9.700103 | NaN | NaN | NaN |
2011-04-01 | 10.043702 | 10.417740 | 9.772586 | 11.870370 | 11.962770 | 8.691579 | 10.001070 | 9.678404 | 7.567251 | NaN | NaN | NaN | NaN |
2011-05-01 | 11.457044 | 9.745116 | 14.224062 | 12.757906 | 11.217013 | 8.758695 | 10.764495 | 113.763089 | NaN | NaN | NaN | NaN | NaN |
2011-06-01 | 10.664896 | 14.727072 | 10.869873 | 13.663137 | 10.690449 | 9.960707 | 9.506494 | NaN | NaN | NaN | NaN | NaN | NaN |
2011-07-01 | 9.921021 | 13.750499 | 7.398021 | 8.178144 | 6.213636 | 7.164410 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-08-01 | 10.083230 | 6.199246 | 5.440127 | 6.150660 | 7.056730 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-09-01 | 12.138359 | 6.316353 | 8.090107 | 8.959397 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-10-01 | 8.999304 | 7.275221 | 8.492520 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-11-01 | 7.906848 | 10.027273 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-12-01 | 15.185224 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
avg_quantity = cohort_counts
avg_quantity.columns = avg_quantity.columns.droplevel()
avg_quantity.index = avg_quantity.index.astype('str')
avg_quantity
CohortIndex | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CohortMonth | |||||||||||||
2010-12-01 | 12.117180 | 15.670062 | 15.725860 | 15.931121 | 13.625364 | 14.922736 | 16.113199 | 15.638083 | 18.207405 | 17.700028 | 19.047018 | 13.599991 | 15.383382 |
2011-01-01 | 17.471086 | 13.471931 | 12.707503 | 15.283843 | 12.845407 | 15.388923 | 14.974063 | 14.991561 | 11.628981 | 10.623621 | 9.597151 | 10.184408 | NaN |
2011-02-01 | 11.201903 | 13.740634 | 19.032478 | 12.045838 | 12.335834 | 12.330517 | 13.571174 | 13.401471 | 10.965646 | 12.416324 | 13.390244 | NaN | NaN |
2011-03-01 | 9.962552 | 11.741027 | 13.310624 | 10.120573 | 13.756057 | 13.014265 | 13.456492 | 13.851474 | 11.324869 | 9.700103 | NaN | NaN | NaN |
2011-04-01 | 10.043702 | 10.417740 | 9.772586 | 11.870370 | 11.962770 | 8.691579 | 10.001070 | 9.678404 | 7.567251 | NaN | NaN | NaN | NaN |
2011-05-01 | 11.457044 | 9.745116 | 14.224062 | 12.757906 | 11.217013 | 8.758695 | 10.764495 | 113.763089 | NaN | NaN | NaN | NaN | NaN |
2011-06-01 | 10.664896 | 14.727072 | 10.869873 | 13.663137 | 10.690449 | 9.960707 | 9.506494 | NaN | NaN | NaN | NaN | NaN | NaN |
2011-07-01 | 9.921021 | 13.750499 | 7.398021 | 8.178144 | 6.213636 | 7.164410 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-08-01 | 10.083230 | 6.199246 | 5.440127 | 6.150660 | 7.056730 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-09-01 | 12.138359 | 6.316353 | 8.090107 | 8.959397 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-10-01 | 8.999304 | 7.275221 | 8.492520 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-11-01 | 7.906848 | 10.027273 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2011-12-01 | 15.185224 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
avg_quantity.max().max()
113.7630890052356
fig, ax = plt.subplots(1,1, figsize=(12,8))
sns.heatmap(data=avg_quantity, annot=True, fmt='0.0f', ax=ax, vmin=0.0, vmax=20, cmap="BuGn_r").set(title='Avg Quantity')
plt.show()
RFM Analysis#
The RFM values can be grouped in several ways:
1. Percentiles, e.g. quantiles
2. Pareto 80/20 cut
3. Custom - based on business knowledge
We are going to implement percentile-based grouping.
Process of calculating percentiles (see the sketch after this list):
Sort customers based on a given metric
Break customers into a pre-defined number of groups of equal size
Assign a label to each group
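A minimal sketch of this percentile-based grouping with pd.qcut; the spend series and the labels below are purely illustrative:
# Sketch: break an illustrative metric into 4 equal-sized groups (quartiles) and label them
spend = pd.Series([10, 25, 40, 55, 70, 85, 100, 115])
labels = range(1, 5)                          # 1 = lowest quartile, 4 = highest
groups = pd.qcut(spend, q=4, labels=labels)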
Recency, Frequency and Monetary Value Calculations
Recency
When was the last order?
Number of days since the last purchase / last visit / last login
Frequency
Number of purchases in a given period (3, 6 or 12 months)
How many times, or how often, the customer used the company's product
A bigger value => a more engaged customer
Not necessarily a VIP (it needs to be combined with monetary value for that)
Monetary
Total amount of money spent in the period selected above
Helps differentiate MVPs / VIPs
df['TotalSum'] = df['UnitPrice']*df['Quantity']; df.head()
InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | InvoiceMonth | CohortMonth | CohortIndex | TotalSum | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12-01 | 1 | 15.30 |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12-01 | 1 | 20.34 |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12-01 | 1 | 22.00 |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12-01 | 1 | 20.34 |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 2010-12-01 | 2010-12-01 | 1 | 20.34 |
snapshot_date = df['InvoiceDate'].max()+ dt.timedelta(days=1); snapshot_date
Timestamp('2011-12-10 12:50:00')
rfm = df.groupby(['CustomerID']).agg({'InvoiceDate': lambda x: (snapshot_date -x.max()).days,
'InvoiceNo':'count',
'TotalSum':'sum'})
rfm
InvoiceDate | InvoiceNo | TotalSum | |
---|---|---|---|
CustomerID | |||
12346.0 | 326 | 1 | 77183.60 |
12347.0 | 2 | 182 | 4310.00 |
12348.0 | 75 | 31 | 1797.24 |
12349.0 | 19 | 73 | 1757.55 |
12350.0 | 310 | 17 | 334.40 |
... | ... | ... | ... |
18280.0 | 278 | 10 | 180.60 |
18281.0 | 181 | 7 | 80.82 |
18282.0 | 8 | 12 | 178.05 |
18283.0 | 4 | 721 | 2045.53 |
18287.0 | 43 | 70 | 1837.28 |
4338 rows × 3 columns
rfm.rename(columns={'InvoiceDate':'Recency',
'InvoiceNo': 'Frequency',
'TotalSum': 'MonetaryValue'},
inplace = True,
)
rfm
Recency | Frequency | MonetaryValue | |
---|---|---|---|
CustomerID | |||
12346.0 | 326 | 1 | 77183.60 |
12347.0 | 2 | 182 | 4310.00 |
12348.0 | 75 | 31 | 1797.24 |
12349.0 | 19 | 73 | 1757.55 |
12350.0 | 310 | 17 | 334.40 |
... | ... | ... | ... |
18280.0 | 278 | 10 | 180.60 |
18281.0 | 181 | 7 | 80.82 |
18282.0 | 8 | 12 | 178.05 |
18283.0 | 4 | 721 | 2045.53 |
18287.0 | 43 | 70 | 1837.28 |
4338 rows × 3 columns
Recency
Give a better (higher) rating to customers who have been active more recently
Frequency & Monetary Value
Use the opposite labeling from recency: higher values get higher labels, since we want customers to spend more money & visit more often (see the sketch below)
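A hedged sketch of this scoring applied to the rfm frame built above; the R/F/M column names, the choice of quartiles and the rank() workaround for tied frequencies are assumptions, not the notebook's final implementation:
# Sketch: quartile scores, with reversed labels for Recency
r_labels = range(4, 0, -1)                # most recent customers get the highest score
f_labels = m_labels = range(1, 5)         # higher frequency/spend gets a higher score
rfm['R'] = pd.qcut(rfm['Recency'], q=4, labels=r_labels)
rfm['F'] = pd.qcut(rfm['Frequency'].rank(method='first'), q=4, labels=f_labels)  # rank breaks ties
rfm['M'] = pd.qcut(rfm['MonetaryValue'], q=4, labels=m_labels)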
Now let’s see the magic happen