Descriptive Analytics:Pasta#

Imports#

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import seaborn as sns 
import plotnine
sns.set()

Read Data#

df = pd.read_csv("W101_PASTAPURCHASE_EDITED.csv"); df.head()
HHID TIME PASTA EXPOS AGE INCOME AREA
0 1 1 0.939444 1 61.710758 25186.798772 3
1 1 2 2.560969 2 61.710758 25186.798772 3
2 1 3 0.901123 0 61.710758 25186.798772 3
3 1 4 1.916530 1 61.710758 25186.798772 3
4 1 5 1.548751 0 61.710758 25186.798772 3
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   HHID    40000 non-null  int64  
 1   TIME    40000 non-null  int64  
 2   PASTA   40000 non-null  float64
 3   EXPOS   40000 non-null  int64  
 4   AGE     40000 non-null  float64
 5   INCOME  40000 non-null  float64
 6   AREA    40000 non-null  int64  
dtypes: float64(3), int64(4)
memory usage: 2.1 MB
df.describe()
HHID TIME PASTA EXPOS AGE INCOME AREA
count 40000.000000 40000.000000 40000.000000 40000.000000 40000.000000 40000.000000 40000.000000
mean 1000.500000 10.500000 1.841545 0.499100 42.251409 27096.126112 2.996000
std 577.357414 5.766353 1.025911 0.702041 13.766646 15928.188291 1.409622
min 1.000000 1.000000 0.000000 0.000000 18.007529 609.107170 1.000000
25% 500.750000 5.750000 1.108936 0.000000 30.505706 15379.591862 2.000000
50% 1000.500000 10.500000 1.811417 0.000000 42.436092 24487.068020 3.000000
75% 1500.250000 15.250000 2.534365 1.000000 54.362395 35272.533787 4.000000
max 2000.000000 20.000000 6.592769 5.000000 64.984126 141066.966883 5.000000
df_agg = df.groupby("AREA")['INCOME'].agg(['min', 'max', 'mean'])
df_agg.sort_values(by='mean')
min max mean
AREA
3 2353.176959 111295.840727 25879.485698
5 1331.260222 141066.966883 26030.910606
2 609.107170 106319.106367 26218.857948
1 2740.776521 85241.329860 28076.599968
4 1241.960729 112983.969631 29260.133137
df.groupby('HHID')['PASTA'].agg(['sum']).sort_values(by='sum', ascending=False)
sum
HHID
1493 55.361931
1511 51.030304
151 50.816654
647 50.771865
916 50.502666
... ...
92 23.926629
267 23.380097
572 23.225390
1921 21.738494
1917 19.256710

2000 rows × 1 columns

df[(df['AREA']==2) & (df['INCOME']> 20000)].groupby('HHID').agg('sum').query("PASTA>30").count()
TIME      218
PASTA     218
EXPOS     218
AGE       218
INCOME    218
AREA      218
dtype: int64
df[['PASTA', 'EXPOS']].corr()
PASTA EXPOS
PASTA 1.000000 0.326617
EXPOS 0.326617 1.000000
df.groupby('HHID')['PASTA'].sum().plot(kind='hist')
<AxesSubplot:ylabel='Frequency'>
../../_images/W101_descriptive_13_1.png
df.groupby('TIME')['PASTA'].sum().reset_index().plot.scatter(y='PASTA',x='TIME')
*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2D array with a single row if you intend to specify the same RGB or RGBA value for all points.
<AxesSubplot:xlabel='TIME', ylabel='PASTA'>
../../_images/W101_descriptive_14_2.png