Customer Analytics#

Introduction#

What will we cover in the course?

  • KYC (Know Your Customer)

  • Purchase Probability

  • Brand Probability

  • Quantity to be purchased

import tensorflow as tf
import seaborn 
import pickle 
import sklearn
import torch
import pandas as pd
torch.cuda.is_available()
True

Dataset#

Demographic Data for Segmentation#

df = pd.read_csv("segmentation data.csv")
df.head()
          ID  Sex  Marital status  Age  Education  Income  Occupation  Settlement size
0  100000001    0               0   67          2  124670           1                2
1  100000002    1               1   22          1  150773           1                2
2  100000003    0               0   49          1   89210           0                0
3  100000004    0               0   45          1  171565           1                1
4  100000005    0               0   53          1  149031           1                1
df.nunique()
ID                 2000
Sex                   2
Marital status        2
Age                  58
Education             4
Income             1982
Occupation            3
Settlement size       3
dtype: int64
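The unique counts above already hint at which columns are coded categories (2 to 4 levels) and which are numeric (ID, Age, Income). A minimal sketch of that split follows. The helper name `split_by_cardinality` and the `max_levels` cutoff are hypothetical, and the sample frame holds only the five rows printed by `df.head()`.

```python
import pandas as pd

# First five rows of the segmentation data, copied from df.head() above.
sample = pd.DataFrame({
    "ID": [100000001, 100000002, 100000003, 100000004, 100000005],
    "Sex": [0, 1, 0, 0, 0],
    "Marital status": [0, 1, 0, 0, 0],
    "Age": [67, 22, 49, 45, 53],
    "Education": [2, 1, 1, 1, 1],
    "Income": [124670, 150773, 89210, 171565, 149031],
    "Occupation": [1, 1, 0, 1, 1],
    "Settlement size": [2, 2, 0, 1, 1],
})


def split_by_cardinality(frame, max_levels):
    """Return (categorical, numerical) column lists based on unique counts.

    max_levels is a hypothetical cutoff chosen by the analyst; the dataset
    itself does not define it.
    """
    counts = frame.nunique()
    categorical = counts[counts <= max_levels].index.tolist()
    numerical = counts[counts > max_levels].index.tolist()
    return categorical, numerical


# The 5-row sample needs a small cutoff; on the full 2000-row table a value
# such as 10 would separate the same two groups.
categorical_cols, numerical_cols = split_by_cardinality(sample, max_levels=3)
print(categorical_cols)
print(numerical_cols)
```

A cardinality cutoff is only a heuristic. The legend remains the authoritative description of each column's type.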
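pd.set_option('display.max_colwidth', None)  # None disables truncation of long cell text
df_legend = (pd.read_excel("segmentation data legend.xlsx", skiprows=2, header=1)
             .dropna(how="all", axis=1)   # drop fully empty columns
             .dropna(how="all", axis=0)   # drop fully empty rows
             .fillna(""))
df_legend.style.hide(axis="index")  # pandas >= 1.4; Styler.hide_index() was removed in pandas 2.0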
Variable         Data type    Range      Description
ID               numerical    Integer    Shows a unique identificator of a customer.
Sex              categorical  {0,1}      Biological sex (gender) of a customer. In this dataset there are only 2 different options.
                              0          male
                              1          female
Marital status   categorical  {0,1}      Marital status of a customer.
                              0          single
                              1          non-single (divorced / separated / married / widowed)
Age              numerical    Integer    The age of the customer in years, calculated as current year minus the year of birth of the customer at the time of creation of the dataset
                              18         Min value (the lowest age observed in the dataset)
                              76         Max value (the highest age observed in the dataset)
Education        categorical  {0,1,2,3}  Level of education of the customer
                              0          other / unknown
                              1          high school
                              2          university
                              3          graduate school
Income           numerical    Real       Self-reported annual income in US dollars of the customer.
                              35832      Min value (the lowest income observed in the dataset)
                              309364     Max value (the highest income observed in the dataset)
Occupation       categorical  {0,1,2}    Category of occupation of the customer.
                              0          unemployed / unskilled
                              1          skilled employee / official
                              2          management / self-employed / highly qualified employee / officer
Settlement size  categorical  {0,1,2}    The size of the city that the customer lives in.
                              0          small city
                              1          mid-sized city
                              2          big city
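The legend above can be turned into lookup dictionaries so that the integer codes print as readable labels. This is a sketch rather than part of the original notebook. The names `LABELS` and `decode` are hypothetical, the "non-single" label is shortened from the legend text, and the sample rows are the first two rows shown by `df.head()`.

```python
import pandas as pd

# Code-to-label dictionaries transcribed from the legend above.
LABELS = {
    "Sex": {0: "male", 1: "female"},
    "Marital status": {0: "single", 1: "non-single"},
    "Education": {
        0: "other / unknown",
        1: "high school",
        2: "university",
        3: "graduate school",
    },
    "Occupation": {
        0: "unemployed / unskilled",
        1: "skilled employee / official",
        2: "management / self-employed / highly qualified employee / officer",
    },
    "Settlement size": {0: "small city", 1: "mid-sized city", 2: "big city"},
}


def decode(frame, labels=LABELS):
    """Return a copy of frame with coded columns replaced by their labels."""
    out = frame.copy()
    for column, mapping in labels.items():
        if column in out.columns:
            out[column] = out[column].map(mapping)
    return out


# First two rows of the segmentation data, copied from df.head() above.
sample = pd.DataFrame({
    "ID": [100000001, 100000002],
    "Sex": [0, 1],
    "Marital status": [0, 1],
    "Age": [67, 22],
    "Education": [2, 1],
    "Income": [124670, 150773],
    "Occupation": [1, 1],
    "Settlement size": [2, 2],
})

decoded = decode(sample)
print(decoded)
```

Decoding into a copy keeps the numeric codes available for modelling, while the labelled version is easier to read in tables and plots.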

Purchase History#

df = pd.read_csv("purchase data.csv")
df.head()
          ID  Day  Incidence  Brand  Quantity  Last_Inc_Brand  Last_Inc_Quantity  Price_1  Price_2  Price_3  ...  Promotion_3  Promotion_4  Promotion_5  Sex  Marital status  Age  Education  Income  Occupation  Settlement size
0  200000001    1          0      0         0               0                  0     1.59     1.87     2.01  ...            0            0            0    0               0   47          1  110866           1                0
1  200000001   11          0      0         0               0                  0     1.51     1.89     1.99  ...            0            0            0    0               0   47          1  110866           1                0
2  200000001   12          0      0         0               0                  0     1.51     1.89     1.99  ...            0            0            0    0               0   47          1  110866           1                0
3  200000001   16          0      0         0               0                  0     1.52     1.89     1.98  ...            0            0            0    0               0   47          1  110866           1                0
4  200000001   18          0      0         0               0                  0     1.52     1.89     1.99  ...            0            0            0    0               0   47          1  110866           1                0

5 rows × 24 columns

df.shape
(58693, 24)
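In the rows printed above, customer 200000001 carries the same seven demographic values on every visit. If that holds for all customers (an assumption the head alone does not prove), the purchase table can be collapsed into one demographic row per customer. The sketch below uses a hypothetical helper `customer_table`. Its toy input has two rows taken from the head output plus one made-up second customer.

```python
import pandas as pd

DEMOGRAPHIC_COLS = [
    "Sex", "Marital status", "Age", "Education",
    "Income", "Occupation", "Settlement size",
]


def customer_table(purchases):
    """Collapse visit-level rows into one demographic row per customer ID."""
    per_customer = purchases[["ID"] + DEMOGRAPHIC_COLS].drop_duplicates()
    # If any attribute changed between visits, the same ID appears twice here.
    if per_customer["ID"].duplicated().any():
        raise ValueError("demographics vary within at least one customer ID")
    return per_customer.set_index("ID")


toy = pd.DataFrame({
    # Customer 200000001 is copied from df.head(); 200000002 is made up.
    "ID": [200000001, 200000001, 200000002],
    "Day": [1, 11, 5],
    "Sex": [0, 0, 1],
    "Marital status": [0, 0, 1],
    "Age": [47, 47, 30],
    "Education": [1, 1, 2],
    "Income": [110866, 110866, 95000],
    "Occupation": [1, 1, 1],
    "Settlement size": [0, 0, 2],
})

customers = customer_table(toy)
print(customers)
```

Raising an error instead of silently keeping the first row makes a violated assumption visible before any segmentation is built on the result.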
df.nunique()
ID                   500
Day                  730
Incidence              2
Brand                  6
Quantity              16
Last_Inc_Brand         6
Last_Inc_Quantity      2
Price_1               37
Price_2               30
Price_3               21
Price_4               26
Price_5               44
Promotion_1            2
Promotion_2            2
Promotion_3            2
Promotion_4            2
Promotion_5            2
Sex                    2
Marital status         2
Age                   56
Education              4
Income               499
Occupation             3
Settlement size        3
dtype: int64
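Combining `df.shape` with the unique counts above gives a rough sense of scale: 58,693 rows spread over 500 customers is about 117 recorded visits per customer on average. The sketch below repeats that arithmetic and shows how the per-customer counts could be obtained. The helper `visits_per_customer` is hypothetical, the toy frame is made up, and the code assumes each row is one store visit.

```python
import pandas as pd

# Figures taken from df.shape and df.nunique() above.
n_rows, n_customers = 58693, 500
avg_visits = n_rows / n_customers
print(round(avg_visits, 1))


def visits_per_customer(frame):
    """Count recorded rows (assumed to be store visits) for each customer ID."""
    return frame.groupby("ID").size()


# Made-up toy data, only to show the shape of the result.
toy = pd.DataFrame({"ID": [1, 1, 1, 2, 2], "Day": [1, 11, 12, 3, 40]})
counts = visits_per_customer(toy)
print(counts)
```

On the real table, `visits_per_customer(df).describe()` would show how far individual customers deviate from the average.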
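pd.set_option('display.max_colwidth', None)  # None disables truncation of long cell text
df_legend = (pd.read_excel("purchase data legend.xlsx", skiprows=2, header=1)
             .dropna(how="all", axis=1)   # drop fully empty columns
             .dropna(how="all", axis=0)   # drop fully empty rows
             .fillna(""))
df_legend.style.hide(axis="index")  # pandas >= 1.4; Styler.hide_index() was removed in pandas 2.0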
Variable           Data type    Range          Description
ID                 numerical    Integer        Shows a unique identificator of a customer.
Day                numerical    Integer        Day when the customer has visited the store
Incidence          categorical  {0,1}          Purchase Incidence
                                0              The customer has not purchased an item from the category of interest
                                1              The customer has purchased an item from the category of interest
Brand              categorical  {0,1,2,3,4,5}  Shows which brand the customer has purchased
                                0              No brand was purchased
                                1,2,3,4,5      Brand ID
Quantity           numerical    integer        Number of items bought by the customer from the product category of interest
Last_Inc_Brand     categorical  {0,1,2,3,4,5}  Shows which brand the customer has purchased on their previous store visit
                                0              No brand was purchased
                                1,2,3,4,5      Brand ID
Last_Inc_Quantity  numerical    integer        Number of items bought by the customer from the product category of interest during their previous store visit
Price_1            numerical    real           Price of an item from Brand 1 on a particular day
Price_2            numerical    real           Price of an item from Brand 2 on a particular day
Price_3            numerical    real           Price of an item from Brand 3 on a particular day
Price_4            numerical    real           Price of an item from Brand 4 on a particular day
Price_5            numerical    real           Price of an item from Brand 5 on a particular day
Promotion_1        categorical  {0,1}          Indicator whether Brand 1 was on promotion or not on a particular day
                                0              There is no promotion
                                1              There is promotion
Promotion_2        categorical  {0,1}          Indicator of whether Brand 2 was on promotion or not on a particular day
                                0              There is no promotion
                                1              There is promotion
Promotion_3        categorical  {0,1}          Indicator of whether Brand 3 was on promotion or not on a particular day
                                0              There is no promotion
                                1              There is promotion
Promotion_4        categorical  {0,1}          Indicator of whether Brand 4 was on promotion or not on a particular day
                                0              There is no promotion
                                1              There is promotion
Promotion_5        categorical  {0,1}          Indicator of whether Brand 5 was on promotion or not on a particular day
                                0              There is no promotion
                                1              There is promotion
Sex                categorical  {0,1}          Biological sex (gender) of a customer. In this dataset there are only 2 different options.
                                0              male
                                1              female
Marital status     categorical  {0,1}          Marital status of a customer.
                                0              single
                                1              non-single (divorced / separated / married / widowed)
Age                numerical    Integer        The age of the customer in years, calculated as current year minus the year of birth of the customer at the time of creation of the dataset
                                18             Min value (the lowest age observed in the dataset)
                                75             Max value (the highest age observed in the dataset)
Education          categorical  {0,1,2,3}      Level of education of the customer
                                0              other / unknown
                                1              high school
                                2              university
                                3              graduate school
Income             numerical    real           Self-reported annual income in US dollars of the customer.
                                38247          Min value (the lowest income observed in the dataset)
                                309364         Max value (the highest income observed in the dataset)
Occupation         categorical  {0,1,2}        Category of occupation of the customer.
                                0              unemployed / unskilled
                                1              skilled employee / official
                                2              management / self-employed / highly qualified employee / officer
Settlement size    categorical  {0,1,2}        The size of the city that the customer lives in.
                                0              small city
                                1              mid-sized city
                                2              big city
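The legend implies a relationship between three columns: when Incidence is 0 no brand was purchased (Brand is 0), and a purchase should come with a brand ID from 1 to 5. The sketch below flags rows that break this rule. It also assumes that Quantity is positive exactly when a purchase happened, which the legend suggests but does not state. The helper name `find_inconsistent_rows` and the toy data are hypothetical.

```python
import pandas as pd


def find_inconsistent_rows(frame):
    """Return rows where Incidence, Brand and Quantity disagree.

    Assumed rule: Incidence == 1 exactly when Brand != 0 and Quantity > 0.
    """
    bought = frame["Incidence"] == 1
    has_brand = frame["Brand"] != 0
    has_quantity = frame["Quantity"] > 0
    consistent = (bought == has_brand) & (bought == has_quantity)
    return frame[~consistent]


# Made-up rows: the first two follow the rule, the last two break it.
toy = pd.DataFrame({
    "Incidence": [0, 1, 1, 0],
    "Brand":     [0, 2, 0, 3],
    "Quantity":  [0, 3, 2, 0],
})

bad_rows = find_inconsistent_rows(toy)
print(bad_rows)
```

On the real table an empty result would support the assumed rule. Separately, `df.nunique()` reported only 2 distinct values for Last_Inc_Quantity even though the legend describes it as a number of items, so inspecting `df["Last_Inc_Quantity"].unique()` before using that column may be worthwhile.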
df.isnull().sum()  # No missing values in any column
ID                   0
Day                  0
Incidence            0
Brand                0
Quantity             0
Last_Inc_Brand       0
Last_Inc_Quantity    0
Price_1              0
Price_2              0
Price_3              0
Price_4              0
Price_5              0
Promotion_1          0
Promotion_2          0
Promotion_3          0
Promotion_4          0
Promotion_5          0
Sex                  0
Marital status       0
Age                  0
Education            0
Income               0
Occupation           0
Settlement size      0
dtype: int64