Cohort Analysis and RFM Segmentation#

  • Cohort: a group of subjects sharing a defining characteristic

  • Can be viewed across time, product version, etc.

  • Cohorts are used in medicine, psychology, econometrics, ecology and many other areas to perform a cross-section (compare differences across subjects) at intervals through time.

  • Types of cohorts

    • Time cohorts are customers who signed up for a product or service during a particular time frame. Analyzing these cohorts shows the customers’ behavior depending on when they started using the company’s products or services. The time frame may be monthly, quarterly, or even daily.

    • Behavior cohorts group customers by the type of product or service they signed up for in the past. Customers who signed up for basic-level services might have different needs than those who signed up for advanced services. Understanding the needs of the various cohorts can help a company design custom-made services or products for particular segments.

    • Size cohorts group customers by how much they spend on the company’s products or services. This categorization can be based on the amount spent in some period after acquisition, or on the product category the customer spent most of their order amount on over a given period.
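In pandas terms, each cohort type is just a different grouping key. Below is a minimal sketch, assuming a hypothetical customers frame with signup_date, plan and first_year_spend columns (these names are illustrative, not from the dataset used later):

import pandas as pd

# Hypothetical customer table used only to illustrate the three cohort types
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'signup_date': pd.to_datetime(['2021-01-05', '2021-01-20', '2021-02-02', '2021-02-15']),
    'plan': ['basic', 'advanced', 'basic', 'advanced'],
    'first_year_spend': [120.0, 950.0, 60.0, 400.0],
})

customers['time_cohort'] = customers['signup_date'].dt.to_period('M')        # time cohort: signup month
customers['behavior_cohort'] = customers['plan']                             # behavior cohort: service tier signed up for
customers['size_cohort'] = pd.cut(customers['first_year_spend'],
                                  bins=[0, 100, 500, float('inf')],
                                  labels=['small', 'medium', 'large'])        # size cohort: spend band
customers.groupby('time_cohort')['customer_id'].nunique()                    # e.g. customers per monthly time cohort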

Imports#

import pandas as pd
import scipy as sp
import seaborn as sns
import matplotlib.pyplot as plt 
import sklearn
import numpy as np
import pickle
import joblib
import itertools
import datetime as dt
from sklearn.linear_model import LogisticRegression
from aiking.data.external import *
sns.set()

Read and clean the dataset#

# path = untar_data("kaggle_datasets::jihyeseo/online-retail-data-set-from-uci-ml-repo"); path
# !ls /Landmark2/pdo/aiking/archive/ # -d /Landmark2/pdo/aiking/data/online-retail-data-set-from-uci-ml-repo
# !ls /Landmark2/pdo/aiking/data/
df = pd.read_excel('Online Retail.xlsx'); df.head()
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
df.columns
Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB
df.isnull().sum()
InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64
df = df.dropna(subset=['CustomerID']); df.isnull().sum().sum()
0
df.duplicated().sum()
5225
df = df.drop_duplicates()
df.describe()
Quantity UnitPrice CustomerID
count 401604.000000 401604.000000 401604.000000
mean 12.183273 3.474064 15281.160818
std 250.283037 69.764035 1714.006089
min -80995.000000 0.000000 12346.000000
25% 2.000000 1.250000 13939.000000
50% 5.000000 1.950000 15145.000000
75% 12.000000 3.750000 16784.000000
max 80995.000000 38970.000000 18287.000000
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]; df.describe()
Quantity UnitPrice CustomerID
count 392692.000000 392692.000000 392692.000000
mean 13.119702 3.125914 15287.843865
std 180.492832 22.241836 1713.539549
min 1.000000 0.001000 12346.000000
25% 2.000000 1.250000 13955.000000
50% 6.000000 1.950000 15150.000000
75% 12.000000 3.750000 16791.000000
max 80995.000000 8142.750000 18287.000000
df.shape
(392692, 8)

Cohort Analysis#

We need to create three labels:

  • Invoice period - the year and month of a single transaction

  • Cohort group - the year and month of the customer’s first purchase

  • Cohort period / cohort index - the customer’s stage in their lifetime, as an integer: the number of months since the first purchase (the first month has index 1)
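For example, a customer whose first purchase (CohortMonth) was December 2010 and who orders again in March 2011 gets cohort index (2011 - 2010) * 12 + (3 - 12) + 1 = 4, i.e. the fourth month of their lifetime. A quick check with the same formula used below:

first_purchase = dt.datetime(2010, 12, 1)   # CohortMonth
invoice = dt.datetime(2011, 3, 1)           # InvoiceMonth
(invoice.year - first_purchase.year) * 12 + (invoice.month - first_purchase.month) + 1   # -> 4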

df.nunique()
InvoiceNo      18532
StockCode       3665
Description     3877
Quantity         301
InvoiceDate    17282
UnitPrice        440
CustomerID      4338
Country           37
dtype: int64
sample = df[df['CustomerID'].isin([12347.0, 18283.0, 18287.0])].reset_index(drop=True); sample
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 537626 85116 BLACK CANDELABRA T-LIGHT HOLDER 12 2010-12-07 14:57:00 2.10 12347.0 Iceland
1 537626 22375 AIRLINE BAG VINTAGE JET SET BROWN 4 2010-12-07 14:57:00 4.25 12347.0 Iceland
2 537626 71477 COLOUR GLASS. STAR T-LIGHT HOLDER 12 2010-12-07 14:57:00 3.25 12347.0 Iceland
3 537626 22492 MINI PAINT SET VINTAGE 36 2010-12-07 14:57:00 0.65 12347.0 Iceland
4 537626 22771 CLEAR DRAWER KNOB ACRYLIC EDWARDIAN 12 2010-12-07 14:57:00 1.25 12347.0 Iceland
... ... ... ... ... ... ... ... ...
968 581180 20719 WOODLAND CHARLOTTE BAG 10 2011-12-07 15:52:00 0.85 12347.0 Iceland
969 581180 21265 PINK GOOSE FEATHER TREE 60CM 12 2011-12-07 15:52:00 1.95 12347.0 Iceland
970 581180 23271 CHRISTMAS TABLE SILVER CANDLE SPIKE 16 2011-12-07 15:52:00 0.83 12347.0 Iceland
971 581180 23506 MINI PLAYING CARDS SPACEBOY 20 2011-12-07 15:52:00 0.42 12347.0 Iceland
972 581180 23508 MINI PLAYING CARDS DOLLY GIRL 20 2011-12-07 15:52:00 0.42 12347.0 Iceland

973 rows × 8 columns

sample['InvoiceMonth'] = sample['InvoiceDate'].apply(lambda x: dt.datetime(x.year, x.month, 1)); sample.head()
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceMonth
0 537626 85116 BLACK CANDELABRA T-LIGHT HOLDER 12 2010-12-07 14:57:00 2.10 12347.0 Iceland 2010-12-01
1 537626 22375 AIRLINE BAG VINTAGE JET SET BROWN 4 2010-12-07 14:57:00 4.25 12347.0 Iceland 2010-12-01
2 537626 71477 COLOUR GLASS. STAR T-LIGHT HOLDER 12 2010-12-07 14:57:00 3.25 12347.0 Iceland 2010-12-01
3 537626 22492 MINI PAINT SET VINTAGE 36 2010-12-07 14:57:00 0.65 12347.0 Iceland 2010-12-01
4 537626 22771 CLEAR DRAWER KNOB ACRYLIC EDWARDIAN 12 2010-12-07 14:57:00 1.25 12347.0 Iceland 2010-12-01
sample['CohortMonth'] = sample.groupby('CustomerID')['InvoiceMonth'].transform('min'); sample.head()
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceMonth CohortMonth
0 537626 85116 BLACK CANDELABRA T-LIGHT HOLDER 12 2010-12-07 14:57:00 2.10 12347.0 Iceland 2010-12-01 2010-12-01
1 537626 22375 AIRLINE BAG VINTAGE JET SET BROWN 4 2010-12-07 14:57:00 4.25 12347.0 Iceland 2010-12-01 2010-12-01
2 537626 71477 COLOUR GLASS. STAR T-LIGHT HOLDER 12 2010-12-07 14:57:00 3.25 12347.0 Iceland 2010-12-01 2010-12-01
3 537626 22492 MINI PAINT SET VINTAGE 36 2010-12-07 14:57:00 0.65 12347.0 Iceland 2010-12-01 2010-12-01
4 537626 22771 CLEAR DRAWER KNOB ACRYLIC EDWARDIAN 12 2010-12-07 14:57:00 1.25 12347.0 Iceland 2010-12-01 2010-12-01
sample['CohortIndex'] = (sample['InvoiceMonth'].dt.year-sample['CohortMonth'].dt.year)*12\
                       + sample['InvoiceMonth'].dt.month-sample['CohortMonth'].dt.month\
                       + 1; sample.head()
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceMonth CohortMonth CohortIndex
0 537626 85116 BLACK CANDELABRA T-LIGHT HOLDER 12 2010-12-07 14:57:00 2.10 12347.0 Iceland 2010-12-01 2010-12-01 1
1 537626 22375 AIRLINE BAG VINTAGE JET SET BROWN 4 2010-12-07 14:57:00 4.25 12347.0 Iceland 2010-12-01 2010-12-01 1
2 537626 71477 COLOUR GLASS. STAR T-LIGHT HOLDER 12 2010-12-07 14:57:00 3.25 12347.0 Iceland 2010-12-01 2010-12-01 1
3 537626 22492 MINI PAINT SET VINTAGE 36 2010-12-07 14:57:00 0.65 12347.0 Iceland 2010-12-01 2010-12-01 1
4 537626 22771 CLEAR DRAWER KNOB ACRYLIC EDWARDIAN 12 2010-12-07 14:57:00 1.25 12347.0 Iceland 2010-12-01 2010-12-01 1
sample_data = sample.groupby(['CohortMonth', 'CohortIndex'])['CustomerID'].nunique().reset_index()
sample_data
CohortMonth CohortIndex CustomerID
0 2010-12-01 1 1
1 2010-12-01 2 1
2 2010-12-01 5 1
3 2010-12-01 7 1
4 2010-12-01 9 1
5 2010-12-01 11 1
6 2010-12-01 13 1
7 2011-01-01 1 1
8 2011-01-01 2 1
9 2011-01-01 4 1
10 2011-01-01 5 1
11 2011-01-01 6 1
12 2011-01-01 7 1
13 2011-01-01 9 1
14 2011-01-01 10 1
15 2011-01-01 11 1
16 2011-01-01 12 1
17 2011-05-01 1 1
18 2011-05-01 6 1
sample_data.pivot(index='CohortMonth', 
                  columns='CohortIndex',
                  values='CustomerID')
CohortIndex 1 2 4 5 6 7 9 10 11 12 13
CohortMonth
2010-12-01 1.0 1.0 NaN 1.0 NaN 1.0 1.0 NaN 1.0 NaN 1.0
2011-01-01 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 NaN
2011-05-01 1.0 NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN
df['InvoiceMonth'] = df['InvoiceDate'].apply(lambda x: dt.datetime(x.year, x.month, 1))
# df['InvoiceMonth'].unique()
df['CohortMonth'] = df.groupby('CustomerID')['InvoiceMonth'].transform('min')
# df['CohortMonth'].unique()
df['CohortIndex'] = (df['InvoiceMonth'].dt.year-df['CohortMonth'].dt.year)*12\
                       + df['InvoiceMonth'].dt.month-df['CohortMonth'].dt.month\
                       + 1; df.head()
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceMonth CohortMonth CohortIndex
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 2010-12-01 2010-12-01 1
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 2010-12-01 2010-12-01 1
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 2010-12-01 2010-12-01 1
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 2010-12-01 2010-12-01 1
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 2010-12-01 2010-12-01 1
df.CohortIndex.unique()
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13])

Retention Rate Table#

cohort_counts = df.groupby(['CohortMonth', 'CohortIndex'])[['CustomerID']].count().unstack(); cohort_counts
CustomerID
CohortIndex 1 2 3 4 5 6 7 8 9 10 11 12 13
CohortMonth
2010-12-01 25670.0 10111.0 8689.0 11121.0 9628.0 11946.0 11069.0 11312.0 11316.0 14098.0 13399.0 21677.0 7173.0
2011-01-01 10877.0 2191.0 3012.0 2290.0 3603.0 3214.0 2776.0 2844.0 3768.0 4987.0 6248.0 1334.0 NaN
2011-02-01 8826.0 1388.0 1909.0 2487.0 2266.0 2012.0 2241.0 2720.0 2940.0 2916.0 451.0 NaN NaN
2011-03-01 11349.0 1421.0 2598.0 2372.0 2435.0 2103.0 2942.0 3528.0 4214.0 967.0 NaN NaN NaN
2011-04-01 7185.0 1398.0 1284.0 1296.0 1343.0 2007.0 1869.0 2130.0 513.0 NaN NaN NaN NaN
2011-05-01 6041.0 1075.0 906.0 917.0 1493.0 2329.0 1949.0 764.0 NaN NaN NaN NaN NaN
2011-06-01 5646.0 905.0 707.0 1511.0 1738.0 2545.0 616.0 NaN NaN NaN NaN NaN NaN
2011-07-01 4938.0 501.0 1314.0 1336.0 1760.0 517.0 NaN NaN NaN NaN NaN NaN NaN
2011-08-01 4818.0 1591.0 2831.0 2801.0 899.0 NaN NaN NaN NaN NaN NaN NaN NaN
2011-09-01 8225.0 2336.0 2608.0 862.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-10-01 11500.0 3499.0 869.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-11-01 10821.0 1100.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-01 961.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
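Note that count() here tallies transaction rows (invoice lines) per cohort month and cohort index, so the retention rates derived below measure transaction activity relative to the first month. A retention table based on distinct customers, mirroring the nunique() approach used in the sample above, might look like this sketch:

# Distinct customers per cohort month and cohort index (sketch)
customer_counts = df.groupby(['CohortMonth', 'CohortIndex'])['CustomerID'].nunique().unstack()
customer_retention = (customer_counts.divide(customer_counts.iloc[:, 0], axis=0) * 100).round(3)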
cohort_size = cohort_counts.iloc[:,0]; cohort_size
CohortMonth
2010-12-01    25670.0
2011-01-01    10877.0
2011-02-01     8826.0
2011-03-01    11349.0
2011-04-01     7185.0
2011-05-01     6041.0
2011-06-01     5646.0
2011-07-01     4938.0
2011-08-01     4818.0
2011-09-01     8225.0
2011-10-01    11500.0
2011-11-01    10821.0
2011-12-01      961.0
Name: (CustomerID, 1), dtype: float64
retention = (cohort_counts.divide(cohort_size, axis=0)*100).round(3)
retention.columns = retention.columns.droplevel()
retention.index = retention.index.astype('str')
retention
CohortIndex 1 2 3 4 5 6 7 8 9 10 11 12 13
CohortMonth
2010-12-01 100.0 39.388 33.849 43.323 37.507 46.537 43.120 44.067 44.083 54.920 52.197 84.445 27.943
2011-01-01 100.0 20.143 27.691 21.054 33.125 29.549 25.522 26.147 34.642 45.849 57.442 12.264 NaN
2011-02-01 100.0 15.726 21.629 28.178 25.674 22.796 25.391 30.818 33.311 33.039 5.110 NaN NaN
2011-03-01 100.0 12.521 22.892 20.901 21.456 18.530 25.923 31.086 37.131 8.521 NaN NaN NaN
2011-04-01 100.0 19.457 17.871 18.038 18.692 27.933 26.013 29.645 7.140 NaN NaN NaN NaN
2011-05-01 100.0 17.795 14.998 15.180 24.714 38.553 32.263 12.647 NaN NaN NaN NaN NaN
2011-06-01 100.0 16.029 12.522 26.762 30.783 45.076 10.910 NaN NaN NaN NaN NaN NaN
2011-07-01 100.0 10.146 26.610 27.055 35.642 10.470 NaN NaN NaN NaN NaN NaN NaN
2011-08-01 100.0 33.022 58.759 58.136 18.659 NaN NaN NaN NaN NaN NaN NaN NaN
2011-09-01 100.0 28.401 31.708 10.480 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-10-01 100.0 30.426 7.557 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-11-01 100.0 10.165 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-01 100.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
fig, ax = plt.subplots(1,1, figsize=(12,8))
sns.heatmap(data=retention, annot=True, fmt='0.0f', ax =ax, vmin = 0.0,vmax = 100,cmap="BuPu_r").set(title='Retention Rate')
plt.show()
[Figure: Retention Rate heatmap]

Average Quantity Purchased#

cohort_counts = df.groupby(['CohortMonth', 'CohortIndex'])[['Quantity']].mean().unstack(); cohort_counts
Quantity
CohortIndex 1 2 3 4 5 6 7 8 9 10 11 12 13
CohortMonth
2010-12-01 12.117180 15.670062 15.725860 15.931121 13.625364 14.922736 16.113199 15.638083 18.207405 17.700028 19.047018 13.599991 15.383382
2011-01-01 17.471086 13.471931 12.707503 15.283843 12.845407 15.388923 14.974063 14.991561 11.628981 10.623621 9.597151 10.184408 NaN
2011-02-01 11.201903 13.740634 19.032478 12.045838 12.335834 12.330517 13.571174 13.401471 10.965646 12.416324 13.390244 NaN NaN
2011-03-01 9.962552 11.741027 13.310624 10.120573 13.756057 13.014265 13.456492 13.851474 11.324869 9.700103 NaN NaN NaN
2011-04-01 10.043702 10.417740 9.772586 11.870370 11.962770 8.691579 10.001070 9.678404 7.567251 NaN NaN NaN NaN
2011-05-01 11.457044 9.745116 14.224062 12.757906 11.217013 8.758695 10.764495 113.763089 NaN NaN NaN NaN NaN
2011-06-01 10.664896 14.727072 10.869873 13.663137 10.690449 9.960707 9.506494 NaN NaN NaN NaN NaN NaN
2011-07-01 9.921021 13.750499 7.398021 8.178144 6.213636 7.164410 NaN NaN NaN NaN NaN NaN NaN
2011-08-01 10.083230 6.199246 5.440127 6.150660 7.056730 NaN NaN NaN NaN NaN NaN NaN NaN
2011-09-01 12.138359 6.316353 8.090107 8.959397 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-10-01 8.999304 7.275221 8.492520 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-11-01 7.906848 10.027273 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-01 15.185224 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
avg_quantity = cohort_counts
avg_quantity.columns = avg_quantity.columns.droplevel()
avg_quantity.index = avg_quantity.index.astype('str')
avg_quantity
CohortIndex 1 2 3 4 5 6 7 8 9 10 11 12 13
CohortMonth
2010-12-01 12.117180 15.670062 15.725860 15.931121 13.625364 14.922736 16.113199 15.638083 18.207405 17.700028 19.047018 13.599991 15.383382
2011-01-01 17.471086 13.471931 12.707503 15.283843 12.845407 15.388923 14.974063 14.991561 11.628981 10.623621 9.597151 10.184408 NaN
2011-02-01 11.201903 13.740634 19.032478 12.045838 12.335834 12.330517 13.571174 13.401471 10.965646 12.416324 13.390244 NaN NaN
2011-03-01 9.962552 11.741027 13.310624 10.120573 13.756057 13.014265 13.456492 13.851474 11.324869 9.700103 NaN NaN NaN
2011-04-01 10.043702 10.417740 9.772586 11.870370 11.962770 8.691579 10.001070 9.678404 7.567251 NaN NaN NaN NaN
2011-05-01 11.457044 9.745116 14.224062 12.757906 11.217013 8.758695 10.764495 113.763089 NaN NaN NaN NaN NaN
2011-06-01 10.664896 14.727072 10.869873 13.663137 10.690449 9.960707 9.506494 NaN NaN NaN NaN NaN NaN
2011-07-01 9.921021 13.750499 7.398021 8.178144 6.213636 7.164410 NaN NaN NaN NaN NaN NaN NaN
2011-08-01 10.083230 6.199246 5.440127 6.150660 7.056730 NaN NaN NaN NaN NaN NaN NaN NaN
2011-09-01 12.138359 6.316353 8.090107 8.959397 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-10-01 8.999304 7.275221 8.492520 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-11-01 7.906848 10.027273 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2011-12-01 15.185224 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
avg_quantity.max().max()
113.7630890052356
fig, ax = plt.subplots(1,1, figsize=(12,8))
sns.heatmap(data=avg_quantity, annot=True, fmt='0.0f', ax =ax, vmin = 0.0,vmax = 20,cmap="BuGn_r").set(title='Avg Quantity')
plt.show()
[Figure: Avg Quantity heatmap]

RFM Analysis#

The RFM values can be grouped in several ways:

1. Percentiles, e.g. quantiles

2. Pareto 80/20 cut

3. Custom, based on business knowledge

We are going to implement percentile-based grouping.

Process of calculating percentiles:

  • Sort customers on the chosen metric

  • Break customers into a pre-defined number of groups of equal size

  • Assign a label to each group

Recency, Frequency and Monetary Value calculations:

  • Recency

    • When was the last order?

    • Number of days since the last purchase / last visit / last login

  • Frequency

    • Number of purchases in a given period (e.g. 3, 6 or 12 months)

    • How much, or how often, the customer used the company’s product

    • A bigger value means a more engaged customer

    • Not necessarily a VIP (frequency must be combined with monetary value for that)

  • Monetary

    • Total amount of money spent in the period selected above

    • Differentiates the MVPs/VIPs from other customers

df['TotalSum'] = df['UnitPrice']*df['Quantity']; df.head()
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country InvoiceMonth CohortMonth CohortIndex TotalSum
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom 2010-12-01 2010-12-01 1 15.30
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 2010-12-01 2010-12-01 1 20.34
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 2010-12-01 2010-12-01 1 22.00
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 2010-12-01 2010-12-01 1 20.34
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 2010-12-01 2010-12-01 1 20.34
snapshot_date = df['InvoiceDate'].max()+ dt.timedelta(days=1); snapshot_date
Timestamp('2011-12-10 12:50:00')
rfm = df.groupby(['CustomerID']).agg({'InvoiceDate': lambda x: (snapshot_date -x.max()).days, 
                                      'InvoiceNo':'count',
                                      'TotalSum':'sum'})
rfm
InvoiceDate InvoiceNo TotalSum
CustomerID
12346.0 326 1 77183.60
12347.0 2 182 4310.00
12348.0 75 31 1797.24
12349.0 19 73 1757.55
12350.0 310 17 334.40
... ... ... ...
18280.0 278 10 180.60
18281.0 181 7 80.82
18282.0 8 12 178.05
18283.0 4 721 2045.53
18287.0 43 70 1837.28

4338 rows × 3 columns

rfm.rename(columns={'InvoiceDate':'Recency',
                    'InvoiceNo': 'Frequency',
                    'TotalSum': 'MonetaryValue'},
           inplace = True,
        )
rfm
Recency Frequency MonetaryValue
CustomerID
12346.0 326 1 77183.60
12347.0 2 182 4310.00
12348.0 75 31 1797.24
12349.0 19 73 1757.55
12350.0 310 17 334.40
... ... ... ...
18280.0 278 10 180.60
18281.0 181 7 80.82
18282.0 8 12 178.05
18283.0 4 721 2045.53
18287.0 43 70 1837.28

4338 rows × 3 columns

Recency

  • A better rating goes to customers who have been active more recently, i.e. lower recency values get higher labels

Frequency & Monetary Value

  • Rated in the opposite direction: higher values get higher labels, since we want customers who spend more money and visit more often
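A rough sketch of what such percentile labelling could look like, assuming quartile scores built with pd.qcut on the rfm table above (the exact scheme applied next may differ):

# Quartile scores (sketch): Recency labels run 4..1 so the most recent customers score highest
r_labels = [4, 3, 2, 1]
f_labels = m_labels = [1, 2, 3, 4]
rfm_sketch = rfm.assign(
    R=pd.qcut(rfm['Recency'], q=4, labels=r_labels),
    F=pd.qcut(rfm['Frequency'].rank(method='first'), q=4, labels=f_labels),   # rank breaks ties before binning
    M=pd.qcut(rfm['MonetaryValue'], q=4, labels=m_labels),
)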

Now let’s see the magic happen