Customer Analytics#

Introduction#

What will we cover in the course?

KYC
Purchase Probability
Brand Probability
Quantity to be purchased

import pickle

import pandas as pd
import seaborn
import sklearn
import tensorflow as tf
import torch

2022-05-15 08:13:50.540344: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory

torch.cuda.is_available()

True

Dataset#

Demographic Data for Segmentation#

df = pd.read_csv("segmentation data.csv")

df.head()

	ID	Sex	Marital status	Age	Education	Income	Occupation	Settlement size
0	100000001	0	0	67	2	124670	1	2
1	100000002	1	1	22	1	150773	1	2
2	100000003	0	0	49	1	89210	0	0
3	100000004	0	0	45	1	171565	1	1
4	100000005	0	0	53	1	149031	1	1

df.nunique()

ID                 2000
Sex                   2
Marital status        2
Age                  58
Education             4
Income             1982
Occupation            3
Settlement size       3
dtype: int64

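pd.set_option('display.max_colwidth', 0)
# Read the legend sheet, then drop fully empty columns and rows.
df_legend=pd.read_excel("segmentation data legend.xlsx", skiprows=2, header=1).dropna(how="all", axis=1).dropna(how="all", axis=0).fillna("")
df_legend.style.hide_index()  # pandas >= 1.4: use .style.hide(axis="index") instead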

Variable	Data type	Range	Description
ID	numerical	Integer	Shows a unique identifier of a customer.
Sex	categorical	{0,1}	Biological sex (gender) of a customer. In this dataset there are only 2 different options.
		0	male
		1	female
Marital status	categorical	{0,1}	Marital status of a customer.
		0	single
		1	non-single (divorced / separated / married / widowed)
Age	numerical	Integer	The age of the customer in years, calculated as current year minus the year of birth of the customer at the time of creation of the dataset
		18	Min value (the lowest age observed in the dataset)
		76	Max value (the highest age observed in the dataset)
Education	categorical	{0,1,2,3}	Level of education of the customer
		0	other / unknown
		1	high school
		2	university
		3	graduate school
Income	numerical	Real	Self-reported annual income in US dollars of the customer.
		35832	Min value (the lowest income observed in the dataset)
		309364	Max value (the highest income observed in the dataset)
Occupation	categorical	{0,1,2}	Category of occupation of the customer.
		0	unemployed / unskilled
		1	skilled employee / official
		2	management / self-employed / highly qualified employee / officer
Settlement size	categorical	{0,1,2}	The size of the city that the customer lives in.
		0	small city
		1	mid-sized city
		2	big city
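The integer codes in the legend can be mapped to readable labels before plotting or profiling customers. The following is a minimal sketch, assuming a small hand-built frame with the same column names as `segmentation data.csv`. The names `LABELS` and `decode_codes` are hypothetical helpers introduced here for illustration; they are not part of the dataset or the course material.

```python
import pandas as pd

# Code-to-label maps copied from the legend above (hypothetical helper dict).
LABELS = {
    "Sex": {0: "male", 1: "female"},
    "Marital status": {0: "single", 1: "non-single"},
    "Education": {
        0: "other / unknown",
        1: "high school",
        2: "university",
        3: "graduate school",
    },
    "Occupation": {
        0: "unemployed / unskilled",
        1: "skilled employee / official",
        2: "management / self-employed / highly qualified employee / officer",
    },
    "Settlement size": {0: "small city", 1: "mid-sized city", 2: "big city"},
}


def decode_codes(frame, labels=LABELS):
    """Return a copy of `frame` with coded columns replaced by categorical labels."""
    out = frame.copy()
    for col, mapping in labels.items():
        if col in out.columns:
            out[col] = out[col].map(mapping).astype("category")
    return out


# Two rows in the same layout as df.head() above.
sample = pd.DataFrame({
    "ID": [100000001, 100000002],
    "Sex": [0, 1],
    "Marital status": [0, 1],
    "Age": [67, 22],
    "Education": [2, 1],
    "Income": [124670, 150773],
    "Occupation": [1, 1],
    "Settlement size": [2, 2],
})

decoded = decode_codes(sample)
print(decoded[["Sex", "Education", "Settlement size"]])
```

Numerical columns such as `Age` and `Income` are left untouched, and the original frame is not modified, so the coded version stays available for modelling.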

Purchase History#

df = pd.read_csv("purchase data.csv")
df.head()

	ID	Day	Incidence	Brand	Quantity	Last_Inc_Brand	Last_Inc_Quantity	Price_1	Price_2	Price_3	...	Promotion_3	Promotion_4	Promotion_5	Sex	Marital status	Age	Education	Income	Occupation	Settlement size
0	200000001	1	0	0	0	0	0	1.59	1.87	2.01	...	0	0	0	0	0	47	1	110866	1	0
1	200000001	11	0	0	0	0	0	1.51	1.89	1.99	...	0	0	0	0	0	47	1	110866	1	0
2	200000001	12	0	0	0	0	0	1.51	1.89	1.99	...	0	0	0	0	0	47	1	110866	1	0
3	200000001	16	0	0	0	0	0	1.52	1.89	1.98	...	0	0	0	0	0	47	1	110866	1	0
4	200000001	18	0	0	0	0	0	1.52	1.89	1.99	...	0	0	0	0	0	47	1	110866	1	0

5 rows × 24 columns

df.shape

(58693, 24)

df.nunique()

ID                   500
Day                  730
Incidence            2  
Brand                6  
Quantity             16 
Last_Inc_Brand       6  
Last_Inc_Quantity    2  
Price_1              37 
Price_2              30 
Price_3              21 
Price_4              26 
Price_5              44 
Promotion_1          2  
Promotion_2          2  
Promotion_3          2  
Promotion_4          2  
Promotion_5          2  
Sex                  2  
Marital status       2  
Age                  56 
Education            4  
Income               499
Occupation           3  
Settlement size      3  
dtype: int64
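The counts above (58,693 rows but only 500 distinct IDs) indicate a long-format table with many rows per customer, roughly 117 on average. A per-customer summary is one way to get a first feel for such data. The sketch below uses a tiny made-up frame, not the real file, and the column names `n_visits`, `n_purchases`, `total_quantity` and `purchase_rate` are illustrative choices.

```python
import pandas as pd

# Toy stand-in for the purchase table: several rows per customer ID.
# The values are invented for illustration only.
visits = pd.DataFrame({
    "ID": [200000001, 200000001, 200000001, 200000002, 200000002],
    "Day": [1, 11, 12, 3, 9],
    "Incidence": [0, 1, 0, 1, 1],
    "Quantity": [0, 2, 0, 1, 3],
})

# Collapse the long table to one row per customer.
per_customer = visits.groupby("ID").agg(
    n_visits=("Day", "size"),           # number of rows for this customer
    n_purchases=("Incidence", "sum"),   # rows with Incidence == 1
    total_quantity=("Quantity", "sum"),
)

# Share of rows on which the customer bought from the category.
per_customer["purchase_rate"] = (
    per_customer["n_purchases"] / per_customer["n_visits"]
)

print(per_customer)
```

The same `groupby` call should work on the full `df`, assuming its columns match those shown in `df.head()`.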

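pd.set_option('display.max_colwidth', 0)
# Read the legend sheet, then drop fully empty columns and rows.
df_legend=pd.read_excel("purchase data legend.xlsx", skiprows=2, header=1).dropna(how="all", axis=1).dropna(how="all", axis=0).fillna("")
df_legend.style.hide_index()  # pandas >= 1.4: use .style.hide(axis="index") instead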

Variable	Data type	Range	Description
ID	numerical	Integer	Shows a unique identifier of a customer.
Day	numerical	Integer	Day when the customer has visited the store
Incidence	categorical	{0,1}	Purchase Incidence
		0	The customer has not purchased an item from the category of interest
		1	The customer has purchased an item from the category of interest
Brand	categorical	{0,1,2,3,4,5}	Shows which brand the customer has purchased
		0	No brand was purchased
		1,2,3,4,5	Brand ID
Quantity	numerical	integer	Number of items bought by the customer from the product category of interest
Last_Inc_Brand	categorical	{0,1,2,3,4,5}	Shows which brand the customer has purchased on their previous store visit
		0	No brand was purchased
		1,2,3,4,5	Brand ID
Last_Inc_Quantity	numerical	integer	Number of items bought by the customer from the product category of interest during their previous store visit
Price_1	numerical	real	Price of an item from Brand 1 on a particular day
Price_2	numerical	real	Price of an item from Brand 2 on a particular day
Price_3	numerical	real	Price of an item from Brand 3 on a particular day
Price_4	numerical	real	Price of an item from Brand 4 on a particular day
Price_5	numerical	real	Price of an item from Brand 5 on a particular day
Promotion_1	categorical	{0,1}	Indicator whether Brand 1 was on promotion or not on a particular day
		0	There is no promotion
		1	There is promotion
Promotion_2	categorical	{0,1}	Indicator of whether Brand 2 was on promotion or not on a particular day
		0	There is no promotion
		1	There is promotion
Promotion_3	categorical	{0,1}	Indicator of whether Brand 3 was on promotion or not on a particular day
		0	There is no promotion
		1	There is promotion
Promotion_4	categorical	{0,1}	Indicator of whether Brand 4 was on promotion or not on a particular day
		0	There is no promotion
		1	There is promotion
Promotion_5	categorical	{0,1}	Indicator of whether Brand 5 was on promotion or not on a particular day
		0	There is no promotion
		1	There is promotion
Sex	categorical	{0,1}	Biological sex (gender) of a customer. In this dataset there are only 2 different options.
		0	male
		1	female
Marital status	categorical	{0,1}	Marital status of a customer.
		0	single
		1	non-single (divorced / separated / married / widowed)
Age	numerical	Integer	The age of the customer in years, calculated as current year minus the year of birth of the customer at the time of creation of the dataset
		18	Min value (the lowest age observed in the dataset)
		75	Max value (the highest age observed in the dataset)
Education	categorical	{0,1,2,3}	Level of education of the customer
		0	other / unknown
		1	high school
		2	university
		3	graduate school
Income	numerical	real	Self-reported annual income in US dollars of the customer.
		38247	Min value (the lowest income observed in the dataset)
		309364	Max value (the highest income observed in the dataset)
Occupation	categorical	{0,1,2}	Category of occupation of the customer.
		0	unemployed / unskilled
		1	skilled employee / official
		2	management / self-employed / highly qualified employee / officer
Settlement size	categorical	{0,1,2}	The size of the city that the customer lives in.
		0	small city
		1	mid-sized city
		2	big city

df.isnull().sum()  # no missing values in any column

ID                   0
Day                  0
Incidence            0
Brand                0
Quantity             0
Last_Inc_Brand       0
Last_Inc_Quantity    0
Price_1              0
Price_2              0
Price_3              0
Price_4              0
Price_5              0
Promotion_1          0
Promotion_2          0
Promotion_3          0
Promotion_4          0
Promotion_5          0
Sex                  0
Marital status       0
Age                  0
Education            0
Income               0
Occupation           0
Settlement size      0
dtype: int64
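A table with no missing values can still contain rows that contradict each other. One further check, sketched below, is whether `Incidence` agrees with `Brand` and `Quantity`. It rests on an assumption drawn from the legend, not on anything verified here: a row with `Incidence` 0 should have `Brand` 0 and `Quantity` 0, and a row with `Incidence` 1 should have a non-zero brand and a positive quantity. The function name `find_inconsistent_rows` and the toy data are hypothetical.

```python
import pandas as pd


def find_inconsistent_rows(frame):
    """Return rows where Incidence disagrees with Brand or Quantity."""
    no_purchase = frame["Incidence"] == 0
    # No purchase recorded, yet a brand or a quantity is present.
    bad_no_purchase = no_purchase & (
        (frame["Brand"] != 0) | (frame["Quantity"] != 0)
    )
    # Purchase recorded, yet brand is 0 or quantity is not positive.
    bad_purchase = ~no_purchase & (
        (frame["Brand"] == 0) | (frame["Quantity"] <= 0)
    )
    return frame[bad_no_purchase | bad_purchase]


# Invented rows: the last two deliberately violate the assumed rule.
toy = pd.DataFrame({
    "Incidence": [0, 1, 0, 1],
    "Brand": [0, 2, 3, 0],
    "Quantity": [0, 1, 0, 2],
})

problems = find_inconsistent_rows(toy)
print(problems)
```

On the real data, an empty result from `find_inconsistent_rows(df)` would support the assumed rule; a non-empty one would point to rows worth inspecting before modelling.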