Customer Analytics#
Introduction#
What will we cover in the course?
KYC (Know Your Customer)
Purchase Probability
Brand Probability
Quantity to be purchased
import tensorflow as tf
import seaborn
import pickle
import sklearn
import torch
import pandas as pd
torch.cuda.is_available()
True
Dataset#
Demographic Data for Segmentation#
df = pd.read_csv("segmentation data.csv")
df.head()
|   | ID | Sex | Marital status | Age | Education | Income | Occupation | Settlement size |
|---|---|---|---|---|---|---|---|---|
| 0 | 100000001 | 0 | 0 | 67 | 2 | 124670 | 1 | 2 |
| 1 | 100000002 | 1 | 1 | 22 | 1 | 150773 | 1 | 2 |
| 2 | 100000003 | 0 | 0 | 49 | 1 | 89210 | 0 | 0 |
| 3 | 100000004 | 0 | 0 | 45 | 1 | 171565 | 1 | 1 |
| 4 | 100000005 | 0 | 0 | 53 | 1 | 149031 | 1 | 1 |
df.nunique()
ID 2000
Sex 2
Marital status 2
Age 58
Education 4
Income 1982
Occupation 3
Settlement size 3
dtype: int64
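The unique-value counts show that every column except ID, Age, and Income holds only 2 to 4 distinct integer codes. A minimal sketch of casting those columns to the pandas `category` dtype, rebuilt from the five rows printed by `df.head()` above. The names `sample` and `coded_cols` are illustrative and not part of the original notebook:

```python
import pandas as pd

# The five rows shown by df.head() above, rebuilt by hand for illustration.
sample = pd.DataFrame({
    "ID": [100000001, 100000002, 100000003, 100000004, 100000005],
    "Sex": [0, 1, 0, 0, 0],
    "Marital status": [0, 1, 0, 0, 0],
    "Age": [67, 22, 49, 45, 53],
    "Education": [2, 1, 1, 1, 1],
    "Income": [124670, 150773, 89210, 171565, 149031],
    "Occupation": [1, 1, 0, 1, 1],
    "Settlement size": [2, 2, 0, 1, 1],
})

# Columns that df.nunique() reports with only 2 to 4 distinct values.
coded_cols = ["Sex", "Marital status", "Education", "Occupation", "Settlement size"]

# Cast the integer codes to the category dtype; ID, Age, and Income stay numeric.
sample = sample.astype({col: "category" for col in coded_cols})
print(sample.dtypes)
```

Whether to treat the codes as categories or as plain integers depends on the model used later; this only makes the distinction explicit in the dtypes.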
pd.set_option('display.max_colwidth', 0)
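# Load the legend, drop empty columns/rows, and blank out remaining NaNs
df_legend = pd.read_excel("segmentation data legend.xlsx", skiprows=2, header=1)
df_legend = df_legend.dropna(how="all", axis=1).dropna(how="all", axis=0).fillna("")
df_legend.style.hide(axis="index")  # replaces Styler.hide_index(), deprecated in pandas 1.4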
| Variable | Data type | Range | Description |
|---|---|---|---|
| ID | numerical | Integer | Shows a unique identificator of a customer. |
| Sex | categorical | {0,1} | Biological sex (gender) of a customer. In this dataset there are only 2 different options. |
|  |  | 0 | male |
|  |  | 1 | female |
| Marital status | categorical | {0,1} | Marital status of a customer. |
|  |  | 0 | single |
|  |  | 1 | non-single (divorced / separated / married / widowed) |
| Age | numerical | Integer | The age of the customer in years, calculated as current year minus the year of birth of the customer at the time of creation of the dataset |
|  |  | 18 | Min value (the lowest age observed in the dataset) |
|  |  | 76 | Max value (the highest age observed in the dataset) |
| Education | categorical | {0,1,2,3} | Level of education of the customer |
|  |  | 0 | other / unknown |
|  |  | 1 | high school |
|  |  | 2 | university |
|  |  | 3 | graduate school |
| Income | numerical | Real | Self-reported annual income in US dollars of the customer. |
|  |  | 35832 | Min value (the lowest income observed in the dataset) |
|  |  | 309364 | Max value (the highest income observed in the dataset) |
| Occupation | categorical | {0,1,2} | Category of occupation of the customer. |
|  |  | 0 | unemployed / unskilled |
|  |  | 1 | skilled employee / official |
|  |  | 2 | management / self-employed / highly qualified employee / officer |
| Settlement size | categorical | {0,1,2} | The size of the city that the customer lives in. |
|  |  | 0 | small city |
|  |  | 1 | mid-sized city |
|  |  | 2 | big city |
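The legend above can be turned into lookup dictionaries so that the integer codes read as labels during exploration. The following is a sketch only: `LABELS` and `decode` are hypothetical names introduced here, the mappings are copied from the legend (with the marital-status label shortened to "non-single"), and the sample uses the first two rows of `df.head()`:

```python
import pandas as pd

# Code-to-label mappings copied from the legend table.
LABELS = {
    "Sex": {0: "male", 1: "female"},
    "Marital status": {0: "single", 1: "non-single"},
    "Education": {0: "other / unknown", 1: "high school",
                  2: "university", 3: "graduate school"},
    "Occupation": {
        0: "unemployed / unskilled",
        1: "skilled employee / official",
        2: "management / self-employed / highly qualified employee / officer",
    },
    "Settlement size": {0: "small city", 1: "mid-sized city", 2: "big city"},
}


def decode(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with integer codes replaced by legend labels."""
    out = frame.copy()
    for col, mapping in LABELS.items():
        if col in out.columns:
            out[col] = out[col].map(mapping)
    return out


# Coded columns from the first two rows of df.head().
sample = pd.DataFrame({
    "Sex": [0, 1],
    "Marital status": [0, 1],
    "Education": [2, 1],
    "Occupation": [1, 1],
    "Settlement size": [2, 2],
})
decoded = decode(sample)
print(decoded)
```

Because `Series.map` returns NaN for any code that is missing from the dictionary, an unexpected value in the data shows up as a gap rather than passing silently.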
Purchase History#
df = pd.read_csv("purchase data.csv")
df.head()
|   | ID | Day | Incidence | Brand | Quantity | Last_Inc_Brand | Last_Inc_Quantity | Price_1 | Price_2 | Price_3 | ... | Promotion_3 | Promotion_4 | Promotion_5 | Sex | Marital status | Age | Education | Income | Occupation | Settlement size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000001 | 1 | 0 | 0 | 0 | 0 | 0 | 1.59 | 1.87 | 2.01 | ... | 0 | 0 | 0 | 0 | 0 | 47 | 1 | 110866 | 1 | 0 |
| 1 | 200000001 | 11 | 0 | 0 | 0 | 0 | 0 | 1.51 | 1.89 | 1.99 | ... | 0 | 0 | 0 | 0 | 0 | 47 | 1 | 110866 | 1 | 0 |
| 2 | 200000001 | 12 | 0 | 0 | 0 | 0 | 0 | 1.51 | 1.89 | 1.99 | ... | 0 | 0 | 0 | 0 | 0 | 47 | 1 | 110866 | 1 | 0 |
| 3 | 200000001 | 16 | 0 | 0 | 0 | 0 | 0 | 1.52 | 1.89 | 1.98 | ... | 0 | 0 | 0 | 0 | 0 | 47 | 1 | 110866 | 1 | 0 |
| 4 | 200000001 | 18 | 0 | 0 | 0 | 0 | 0 | 1.52 | 1.89 | 1.99 | ... | 0 | 0 | 0 | 0 | 0 | 47 | 1 | 110866 | 1 | 0 |
5 rows × 24 columns
df.shape
(58693, 24)
df.nunique()
ID 500
Day 730
Incidence 2
Brand 6
Quantity 16
Last_Inc_Brand 6
Last_Inc_Quantity 2
Price_1 37
Price_2 30
Price_3 21
Price_4 26
Price_5 44
Promotion_1 2
Promotion_2 2
Promotion_3 2
Promotion_4 2
Promotion_5 2
Sex 2
Marital status 2
Age 56
Education 4
Income 499
Occupation 3
Settlement size 3
dtype: int64
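With 500 distinct IDs spread over 58,693 rows, each customer appears on many rows, one per store visit. A small sketch of summarising visits per customer, using only the ID and Day values from the five rows printed by `df.head()`. The names `sample`, `visits`, and the output column names are assumptions made for this illustration:

```python
import pandas as pd

# ID and Day from the five rows shown by df.head() above.
sample = pd.DataFrame({
    "ID": [200000001] * 5,
    "Day": [1, 11, 12, 16, 18],
})

# One row per customer: number of visits plus first and last visit day.
visits = sample.groupby("ID")["Day"].agg(
    n_visits="count",
    first_day="min",
    last_day="max",
)
print(visits)
```

Run against the full purchase table, the same aggregation would yield one row for each of the 500 customers.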
pd.set_option('display.max_colwidth', 0)
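# Load the legend, drop empty columns/rows, and blank out remaining NaNs
df_legend = pd.read_excel("purchase data legend.xlsx", skiprows=2, header=1)
df_legend = df_legend.dropna(how="all", axis=1).dropna(how="all", axis=0).fillna("")
df_legend.style.hide(axis="index")  # replaces Styler.hide_index(), deprecated in pandas 1.4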
| Variable | Data type | Range | Description |
|---|---|---|---|
| ID | numerical | Integer | Shows a unique identificator of a customer. |
| Day | numerical | Integer | Day when the customer has visited the store |
| Incidence | categorical | {0,1} | Purchase Incidence |
|  |  | 0 | The customer has not purchased an item from the category of interest |
|  |  | 1 | The customer has purchased an item from the category of interest |
| Brand | categorical | {0,1,2,3,4,5} | Shows which brand the customer has purchased |
|  |  | 0 | No brand was purchased |
|  |  | 1,2,3,4,5 | Brand ID |
| Quantity | numerical | integer | Number of items bought by the customer from the product category of interest |
| Last_Inc_Brand | categorical | {0,1,2,3,4,5} | Shows which brand the customer has purchased on their previous store visit |
|  |  | 0 | No brand was purchased |
|  |  | 1,2,3,4,5 | Brand ID |
| Last_Inc_Quantity | numerical | integer | Number of items bought by the customer from the product category of interest during their previous store visit |
| Price_1 | numerical | real | Price of an item from Brand 1 on a particular day |
| Price_2 | numerical | real | Price of an item from Brand 2 on a particular day |
| Price_3 | numerical | real | Price of an item from Brand 3 on a particular day |
| Price_4 | numerical | real | Price of an item from Brand 4 on a particular day |
| Price_5 | numerical | real | Price of an item from Brand 5 on a particular day |
| Promotion_1 | categorical | {0,1} | Indicator whether Brand 1 was on promotion or not on a particular day |
|  |  | 0 | There is no promotion |
|  |  | 1 | There is promotion |
| Promotion_2 | categorical | {0,1} | Indicator of whether Brand 2 was on promotion or not on a particular day |
|  |  | 0 | There is no promotion |
|  |  | 1 | There is promotion |
| Promotion_3 | categorical | {0,1} | Indicator of whether Brand 3 was on promotion or not on a particular day |
|  |  | 0 | There is no promotion |
|  |  | 1 | There is promotion |
| Promotion_4 | categorical | {0,1} | Indicator of whether Brand 4 was on promotion or not on a particular day |
|  |  | 0 | There is no promotion |
|  |  | 1 | There is promotion |
| Promotion_5 | categorical | {0,1} | Indicator of whether Brand 5 was on promotion or not on a particular day |
|  |  | 0 | There is no promotion |
|  |  | 1 | There is promotion |
| Sex | categorical | {0,1} | Biological sex (gender) of a customer. In this dataset there are only 2 different options. |
|  |  | 0 | male |
|  |  | 1 | female |
| Marital status | categorical | {0,1} | Marital status of a customer. |
|  |  | 0 | single |
|  |  | 1 | non-single (divorced / separated / married / widowed) |
| Age | numerical | Integer | The age of the customer in years, calculated as current year minus the year of birth of the customer at the time of creation of the dataset |
|  |  | 18 | Min value (the lowest age observed in the dataset) |
|  |  | 75 | Max value (the highest age observed in the dataset) |
| Education | categorical | {0,1,2,3} | Level of education of the customer |
|  |  | 0 | other / unknown |
|  |  | 1 | high school |
|  |  | 2 | university |
|  |  | 3 | graduate school |
| Income | numerical | real | Self-reported annual income in US dollars of the customer. |
|  |  | 38247 | Min value (the lowest income observed in the dataset) |
|  |  | 309364 | Max value (the highest income observed in the dataset) |
| Occupation | categorical | {0,1,2} | Category of occupation of the customer. |
|  |  | 0 | unemployed / unskilled |
|  |  | 1 | skilled employee / official |
|  |  | 2 | management / self-employed / highly qualified employee / officer |
| Settlement size | categorical | {0,1,2} | The size of the city that the customer lives in. |
|  |  | 0 | small city |
|  |  | 1 | mid-sized city |
|  |  | 2 | big city |
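According to the legend, Incidence 0 means no purchase and Brand 0 means no brand was purchased, so the two columns should agree on every row. A hedged sketch of that sanity check follows. `count_inconsistent` is a hypothetical helper, and the additional expectation that Quantity is 0 exactly when there is no purchase is an assumption the legend does not state outright:

```python
import pandas as pd


def count_inconsistent(frame: pd.DataFrame) -> int:
    """Count rows where Incidence disagrees with Brand or Quantity."""
    no_purchase = frame["Incidence"] == 0
    # Legend: Brand 0 means no brand was purchased.
    brand_mismatch = no_purchase != (frame["Brand"] == 0)
    # Assumption: Quantity is 0 exactly when nothing was purchased.
    quantity_mismatch = no_purchase != (frame["Quantity"] == 0)
    return int((brand_mismatch | quantity_mismatch).sum())


# Incidence, Brand, and Quantity from the five rows of df.head() shown earlier.
sample = pd.DataFrame({
    "Incidence": [0, 0, 0, 0, 0],
    "Brand": [0, 0, 0, 0, 0],
    "Quantity": [0, 0, 0, 0, 0],
})
n_bad = count_inconsistent(sample)
print(n_bad)  # 0 for these five rows
```

A non-zero count on the full table would point to rows worth inspecting before modelling purchase probability.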
df.isnull().sum()  # every count is 0, so no values are missing
ID 0
Day 0
Incidence 0
Brand 0
Quantity 0
Last_Inc_Brand 0
Last_Inc_Quantity 0
Price_1 0
Price_2 0
Price_3 0
Price_4 0
Price_5 0
Promotion_1 0
Promotion_2 0
Promotion_3 0
Promotion_4 0
Promotion_5 0
Sex 0
Marital status 0
Age 0
Education 0
Income 0
Occupation 0
Settlement size 0
dtype: int64
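When a table has many columns, it can be easier to list only the columns that contain gaps. A minimal sketch under that idea, where `columns_with_missing` is a hypothetical helper and the sample holds two columns from the five rows of `df.head()` shown earlier:

```python
import pandas as pd


def columns_with_missing(frame: pd.DataFrame) -> pd.Series:
    """Return the missing-value count for each column that has at least one."""
    counts = frame.isnull().sum()
    return counts[counts > 0]


# ID and Price_1 from the five rows printed by df.head(); none are missing.
sample = pd.DataFrame({
    "ID": [200000001] * 5,
    "Price_1": [1.59, 1.51, 1.51, 1.52, 1.52],
})
gaps = columns_with_missing(sample)
print(gaps.empty)  # True when every column is complete
```

For the purchase data, where every count printed above is 0, the helper would return an empty Series.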