Introduction to Survival Analysis#

Imports#

import os
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import gaussian_kde
sns.set()

Read Dataset#

datafile = 'lamps.csv'
if not os.path.exists(datafile):
    !wget https://gist.github.com/epogrebnyak/7933e16c0ad215742c4c104be4fbdeb1/raw/c932bc5b6aa6317770c4cbf43eb591511fec08f9/lamps.csv
--2022-05-27 12:49:56--  https://gist.github.com/epogrebnyak/7933e16c0ad215742c4c104be4fbdeb1/raw/c932bc5b6aa6317770c4cbf43eb591511fec08f9/lamps.csv
Resolving gist.github.com (gist.github.com)... 140.82.112.4
Connecting to gist.github.com (gist.github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://gist.githubusercontent.com/epogrebnyak/7933e16c0ad215742c4c104be4fbdeb1/raw/c932bc5b6aa6317770c4cbf43eb591511fec08f9/lamps.csv [following]
--2022-05-27 12:49:56--  https://gist.githubusercontent.com/epogrebnyak/7933e16c0ad215742c4c104be4fbdeb1/raw/c932bc5b6aa6317770c4cbf43eb591511fec08f9/lamps.csv
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 410 [text/plain]
Saving to: ‘lamps.csv’

lamps.csv           100%[===================>]     410  --.-KB/s    in 0s      

2022-05-27 12:49:56 (12.8 MB/s) - ‘lamps.csv’ saved [410/410]
df = pd.read_csv("lamps.csv", index_col=0); df.head()
h f K
i
0 0 0 50
1 840 2 48
2 852 1 47
3 936 1 46
4 960 1 45
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 33 entries, 0 to 32
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   h       33 non-null     int64
 1   f       33 non-null     int64
 2   K       33 non-null     int64
dtypes: int64(3)
memory usage: 1.0 KB

PMF#

pmf = df[['h', 'f']].set_index('h')
pmf.index.name = 't'
pmf.head()
f
t
0 0
840 2
852 1
936 1
960 1
pmf /= pmf.sum(); pmf.head() # PMF : Mapping of possible outcomes from event to probability
f
t
0 0.00
840 0.04
852 0.02
936 0.02
960 0.02
pmf.loc[840]
f    0.04
Name: 840, dtype: float64
pmf.loc[1524]
f    0.06
Name: 1524, dtype: float64
pmf.plot(ylabel='PMF', title='PMF of lightbulb lifetimes', xlabel='Lifetime(hours)') 
<AxesSubplot:title={'center':'PMF of lightbulb lifetimes'}, xlabel='Lifetime(hours)', ylabel='PMF'>
../../_images/01_light_bulb_14_1.png

Note

  • PMF tend to be noisy

  • Most of bulbs fails between 1000-2000 hours

CDF#

cdf = pmf.cumsum(); cdf.head()
f
t
0 0.00
840 0.04
852 0.06
936 0.08
960 0.10
cdf.plot(ylabel='CDF', title='CDF of lightbulb lifetimes', xlabel='Lifetime(hours)') 
<AxesSubplot:title={'center':'CDF of lightbulb lifetimes'}, xlabel='Lifetime(hours)', ylabel='CDF'>
../../_images/01_light_bulb_18_1.png

Note

  • CDF : Quickly compute fraction of lightbulbs that expired on or before a time

  • CDF: Computes percentile

  • CDF: Not as noisy as PMF

  • Sigmoid kind of shape

  • Very few lightbulbs die before 1000 hrs, very few survive after 2000

cdf.loc[840]
f    0.04
Name: 840, dtype: float64
cdf.loc[1524]
f    0.6
Name: 1524, dtype: float64

Survival Function#

Note

  • Complement of CDF : 1-CDF

  • CDF is fraction of the population less than or equal to t

  • survival function if the fraction (strictly) greater than t [ Fraction that exceed a given time t]

surv = 1-cdf; surv.head()
f
t
0 1.00
840 0.96
852 0.94
936 0.92
960 0.90
fig, ax = plt.subplots(1, figsize=(12,12))
cdf.plot.line(ax=ax, label='CDF')
surv.plot.line(ax=ax, label='Surv', title='Survival Function of lightbulbs', ylabel='Probability(lifetime>t)', xlabel='Lifetime(hours)')
<AxesSubplot:title={'center':'Survival Function of lightbulbs'}, xlabel='Lifetime(hours)', ylabel='Probability(lifetime>t)'>
../../_images/01_light_bulb_25_1.png

Hazard Function#

Note

  • Survival - fraction of lightbulbs that live longer than t

  • PMF - instantaneous ( here 12 hours window) fraction of lightbulb that fail at time t

  • PMF+Surv = Fraction that survive until t

  • Hazard rate fraction of ligthbulbs that survive until time t and then fail at time t. PMF(t)/(PMF(t) +SURV(t))

  • How much danger are u in

haz = pmf/(pmf+surv); haz.head()
f
t
0 0.000000
840 0.040000
852 0.020833
936 0.021277
960 0.021739
haz.plot.line( label='Haz', title='Hazard Rate of lightbul', ylabel='Probability of failure at time t (lifetime=t)', xlabel='Lifetime(hours)')
<AxesSubplot:title={'center':'Hazard Rate of lightbul'}, xlabel='Lifetime(hours)', ylabel='Probability of failure at time t (lifetime=t)'>
../../_images/01_light_bulb_29_1.png

Probability that lightbulb will die at this time | Given that it has survived till this point.

Cumulative Hazaard#

chaz = haz.cumsum(); chaz.head()
f
t
0 0.000000
840 0.040000
852 0.060833
936 0.082110
960 0.103849

Note

  • Steep: Lot of Bulbs dying at this period

  • Shallow: Rate of dying is low.

  • Sample on left is more reliable (50 bulbs), on right it’s 1 (single bvlb)-= so not reliable

  • Left reliable, right statistically noisy

Resampling#

kde = gaussian_kde(pmf.index, weights=pmf.values.flatten()) # Smooth version of pmf distribution function
size = df['f'].sum(); size
50
sample = kde.resample(size).flatten(); sample.shape
(50,)
kde
<scipy.stats.kde.gaussian_kde at 0x14b07b062f10>
pd.Series(sample).value_counts(normalize=True).sort_index()
733.096533     0.02
781.108500     0.02
785.740480     0.02
811.857852     0.02
817.881074     0.02
977.072959     0.02
979.158138     0.02
999.560712     0.02
1088.361588    0.02
1100.892566    0.02
1143.952057    0.02
1151.903406    0.02
1177.691043    0.02
1181.301166    0.02
1195.982678    0.02
1213.208901    0.02
1216.308848    0.02
1279.357095    0.02
1296.282198    0.02
1321.757254    0.02
1329.753358    0.02
1349.955486    0.02
1369.659052    0.02
1380.253830    0.02
1386.618438    0.02
1416.297723    0.02
1476.864069    0.02
1501.916118    0.02
1503.496298    0.02
1520.309951    0.02
1537.955521    0.02
1538.875501    0.02
1552.270715    0.02
1577.889025    0.02
1627.869528    0.02
1644.854050    0.02
1678.737029    0.02
1696.595028    0.02
1771.676176    0.02
1773.936031    0.02
1779.294043    0.02
1825.795276    0.02
1830.260867    0.02
1969.120723    0.02
1972.005111    0.02
1980.001136    0.02
2007.786535    0.02
2026.912789    0.02
2112.738078    0.02
2126.211653    0.02
dtype: float64
def make_hazard(sample):
    pmf = pd.Series(sample).value_counts(normalize=True).sort_index()
    cdf = pmf.cumsum()
    surv = 1-cdf
    haz = pmf/(pmf+surv)
    return pmf, cdf, surv, haz
for i in range(100):
    sample = kde.resample(size).flatten()
    _,_, sf, _ = make_hazard(sample)
    ax=sf.plot(color='gray', alpha=0.1)
surv.plot(ax=ax, xlabel='Lifetime (hours)', ylabel='Prob(Lifetime>t)', title='Survival function of lightbulb lifetimes')
<AxesSubplot:title={'center':'Survival function of lightbulb lifetimes'}, xlabel='Lifetime (hours)', ylabel='Prob(Lifetime>t)'>
../../_images/01_light_bulb_42_1.png
for i in range(100):
    sample = kde.resample(size).flatten()
    _,_, _, hf = make_hazard(sample)
    ax=hf.plot(color='gray', alpha=0.1)
haz.plot(ax=ax, xlabel='Lifetime (hours)', ylabel='Prob(Lifetime>t)', title='Survival function of lightbulb lifetimes')
<AxesSubplot:title={'center':'Survival function of lightbulb lifetimes'}, xlabel='Lifetime (hours)', ylabel='Prob(Lifetime>t)'>
../../_images/01_light_bulb_43_1.png