Introduction to Survival Analysis#

Imports#

import os
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import gaussian_kde

sns.set()

Read Dataset#

datafile = 'lamps.csv'
if not os.path.exists(datafile):
    !wget https://gist.github.com/epogrebnyak/7933e16c0ad215742c4c104be4fbdeb1/raw/c932bc5b6aa6317770c4cbf43eb591511fec08f9/lamps.csv

--2022-05-27 12:49:56--  https://gist.github.com/epogrebnyak/7933e16c0ad215742c4c104be4fbdeb1/raw/c932bc5b6aa6317770c4cbf43eb591511fec08f9/lamps.csv
Resolving gist.github.com (gist.github.com)... 140.82.112.4
Connecting to gist.github.com (gist.github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://gist.githubusercontent.com/epogrebnyak/7933e16c0ad215742c4c104be4fbdeb1/raw/c932bc5b6aa6317770c4cbf43eb591511fec08f9/lamps.csv [following]
--2022-05-27 12:49:56--  https://gist.githubusercontent.com/epogrebnyak/7933e16c0ad215742c4c104be4fbdeb1/raw/c932bc5b6aa6317770c4cbf43eb591511fec08f9/lamps.csv
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 410 [text/plain]
Saving to: ‘lamps.csv’

lamps.csv           100%[===================>]     410  --.-KB/s    in 0s      

2022-05-27 12:49:56 (12.8 MB/s) - ‘lamps.csv’ saved [410/410]

df = pd.read_csv("lamps.csv", index_col=0); df.head()

	h	f	K
i
0	0	0	50
1	840	2	48
2	852	1	47
3	936	1	46
4	960	1	45

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33 entries, 0 to 32
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   h       33 non-null     int64
 1   f       33 non-null     int64
 2   K       33 non-null     int64
dtypes: int64(3)
memory usage: 1.0 KB

PMF#

pmf = df[['h', 'f']].set_index('h')
pmf.index.name = 't'

pmf.head()

	f
t
0	0
840	2
852	1
936	1
960	1

pmf /= pmf.sum(); pmf.head() # PMF : Mapping of possible outcomes from event to probability

	f
t
0	0.00
840	0.04
852	0.02
936	0.02
960	0.02

pmf.loc[840]

f    0.04
Name: 840, dtype: float64

pmf.loc[1524]

f    0.06
Name: 1524, dtype: float64

pmf.plot(ylabel='PMF', title='PMF of lightbulb lifetimes', xlabel='Lifetime(hours)') 

<AxesSubplot:title={'center':'PMF of lightbulb lifetimes'}, xlabel='Lifetime(hours)', ylabel='PMF'>

Note

PMF tend to be noisy
Most of bulbs fails between 1000-2000 hours

CDF#

cdf = pmf.cumsum(); cdf.head()

	f
t
0	0.00
840	0.04
852	0.06
936	0.08
960	0.10

cdf.plot(ylabel='CDF', title='CDF of lightbulb lifetimes', xlabel='Lifetime(hours)') 

<AxesSubplot:title={'center':'CDF of lightbulb lifetimes'}, xlabel='Lifetime(hours)', ylabel='CDF'>

Note

CDF : Quickly compute fraction of lightbulbs that expired on or before a time
CDF: Computes percentile
CDF: Not as noisy as PMF
Sigmoid kind of shape
Very few lightbulbs die before 1000 hrs, very few survive after 2000

cdf.loc[840]

f    0.04
Name: 840, dtype: float64

cdf.loc[1524]

f    0.6
Name: 1524, dtype: float64

Survival Function#

Note

Complement of CDF : 1-CDF
CDF is fraction of the population less than or equal to t
survival function if the fraction (strictly) greater than t [ Fraction that exceed a given time t]

surv = 1-cdf; surv.head()

	f
t
0	1.00
840	0.96
852	0.94
936	0.92
960	0.90

fig, ax = plt.subplots(1, figsize=(12,12))
cdf.plot.line(ax=ax, label='CDF')
surv.plot.line(ax=ax, label='Surv', title='Survival Function of lightbulbs', ylabel='Probability(lifetime>t)', xlabel='Lifetime(hours)')

<AxesSubplot:title={'center':'Survival Function of lightbulbs'}, xlabel='Lifetime(hours)', ylabel='Probability(lifetime>t)'>

Hazard Function#

Note

Survival - fraction of lightbulbs that live longer than t
PMF - instantaneous ( here 12 hours window) fraction of lightbulb that fail at time t
PMF+Surv = Fraction that survive until t
Hazard rate fraction of ligthbulbs that survive until time t and then fail at time t. PMF(t)/(PMF(t) +SURV(t))
How much danger are u in

haz = pmf/(pmf+surv); haz.head()

	f
t
0	0.000000
840	0.040000
852	0.020833
936	0.021277
960	0.021739

haz.plot.line( label='Haz', title='Hazard Rate of lightbul', ylabel='Probability of failure at time t (lifetime=t)', xlabel='Lifetime(hours)')

<AxesSubplot:title={'center':'Hazard Rate of lightbul'}, xlabel='Lifetime(hours)', ylabel='Probability of failure at time t (lifetime=t)'>

Probability that lightbulb will die at this time | Given that it has survived till this point.

Cumulative Hazaard#

chaz = haz.cumsum(); chaz.head()

	f
t
0	0.000000
840	0.040000
852	0.060833
936	0.082110
960	0.103849

Note

Steep: Lot of Bulbs dying at this period
Shallow: Rate of dying is low.
Sample on left is more reliable (50 bulbs), on right it’s 1 (single bvlb)-= so not reliable
Left reliable, right statistically noisy

Resampling#

kde = gaussian_kde(pmf.index, weights=pmf.values.flatten()) # Smooth version of pmf distribution function

size = df['f'].sum(); size

sample = kde.resample(size).flatten(); sample.shape

(50,)

kde

<scipy.stats.kde.gaussian_kde at 0x14b07b062f10>

pd.Series(sample).value_counts(normalize=True).sort_index()

733.096533     0.02
781.108500     0.02
785.740480     0.02
811.857852     0.02
817.881074     0.02
977.072959     0.02
979.158138     0.02
999.560712     0.02
1088.361588    0.02
1100.892566    0.02
1143.952057    0.02
1151.903406    0.02
1177.691043    0.02
1181.301166    0.02
1195.982678    0.02
1213.208901    0.02
1216.308848    0.02
1279.357095    0.02
1296.282198    0.02
1321.757254    0.02
1329.753358    0.02
1349.955486    0.02
1369.659052    0.02
1380.253830    0.02
1386.618438    0.02
1416.297723    0.02
1476.864069    0.02
1501.916118    0.02
1503.496298    0.02
1520.309951    0.02
1537.955521    0.02
1538.875501    0.02
1552.270715    0.02
1577.889025    0.02
1627.869528    0.02
1644.854050    0.02
1678.737029    0.02
1696.595028    0.02
1771.676176    0.02
1773.936031    0.02
1779.294043    0.02
1825.795276    0.02
1830.260867    0.02
1969.120723    0.02
1972.005111    0.02
1980.001136    0.02
2007.786535    0.02
2026.912789    0.02
2112.738078    0.02
2126.211653    0.02
dtype: float64

def make_hazard(sample):
    pmf = pd.Series(sample).value_counts(normalize=True).sort_index()
    cdf = pmf.cumsum()
    surv = 1-cdf
    haz = pmf/(pmf+surv)
    return pmf, cdf, surv, haz

for i in range(100):
    sample = kde.resample(size).flatten()
    _,_, sf, _ = make_hazard(sample)
    ax=sf.plot(color='gray', alpha=0.1)
surv.plot(ax=ax, xlabel='Lifetime (hours)', ylabel='Prob(Lifetime>t)', title='Survival function of lightbulb lifetimes')

<AxesSubplot:title={'center':'Survival function of lightbulb lifetimes'}, xlabel='Lifetime (hours)', ylabel='Prob(Lifetime>t)'>

for i in range(100):
    sample = kde.resample(size).flatten()
    _,_, _, hf = make_hazard(sample)
    ax=hf.plot(color='gray', alpha=0.1)
haz.plot(ax=ax, xlabel='Lifetime (hours)', ylabel='Prob(Lifetime>t)', title='Survival function of lightbulb lifetimes')

<AxesSubplot:title={'center':'Survival function of lightbulb lifetimes'}, xlabel='Lifetime (hours)', ylabel='Prob(Lifetime>t)'>

AIBook

Introduction to Survival Analysis

Contents

Introduction to Survival Analysis#

Imports#

Read Dataset#

PMF#

CDF#

Survival Function#

Hazard Function#

Cumulative Hazaard#

Resampling#