Survival Analysis from Scratch
1. The Censoring Problem
Every machine learning model you've built assumes you see the outcome. A classifier sees whether each email is spam. A regression model sees the house price. But what happens when the outcome is when something happens — and for some observations, it hasn't happened yet?
Welcome to survival analysis, the statistical framework for time-to-event data. The classic application is medicine: you enroll patients in a clinical trial, give half of them a new drug, and wait to see who survives longer. But the same math applies everywhere: how long until a customer churns, a hard drive fails, an employee quits, or a user converts.
The concept that makes survival analysis fundamentally different from everything else in ML is censoring. A censored observation is one where you know the event hasn't happened yet, but you don't know when it will. The most common type is right censoring:
- A clinical trial ends on December 31st — patients still alive are right-censored
- A customer is still subscribed when you pull the data — right-censored
- A patient drops out of the study and is lost to follow-up — right-censored
You might think: just drop the censored observations and analyze the rest. Don't. Dropping censored data wastes information (those patients were alive for a known period) and biases your results toward shorter survival times. You might also think: treat censored times as event times. Definitely don't. That underestimates true survival by pretending people died when they were actually still alive.
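Both failure modes are easy to see in a small simulation (illustrative exponential lifetimes with uniform censoring, entirely made up): both naive estimates land far below the true mean survival time.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 10.0
T = rng.exponential(true_mean, size=5000)   # true event times
C = rng.uniform(0, 15, size=5000)           # independent censoring times
obs = np.minimum(T, C)                      # what we actually observe
event = T <= C                              # True = event observed

# Wrong #1: drop censored rows -> conditions on dying early
mean_drop = obs[event].mean()
# Wrong #2: treat censored times as event times -> caps long survivors
mean_naive = obs.mean()

print(f"true mean: {true_mean:.1f}")
print(f"drop censored: {mean_drop:.1f}")
print(f"censored-as-events: {mean_naive:.1f}")
```

Both estimates come out severely biased low, which is exactly why censoring-aware methods exist.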
The only correct approach is to use methods specifically designed for censored data. That's what this post builds, from scratch.
Key assumption: Censoring must be non-informative — the reason someone is censored must be independent of their underlying risk. If sicker patients are more likely to drop out, all standard survival methods break down.
2. Survival and Hazard Functions
Let T be the random variable representing time to event. Three functions describe its distribution, and they're all mathematically linked:
The survival function S(t) = P(T > t) gives the probability of surviving beyond time t. It starts at 1 and monotonically decreases toward 0.
The hazard function h(t) is the instantaneous failure rate — the probability of dying in a tiny interval [t, t+dt), given you've survived to time t:
h(t) = f(t) / S(t)
where f(t) is the probability density. The hazard is a rate, not a probability — it can exceed 1.0. Think of it as "risk intensity at time t."
The cumulative hazard H(t) is the integral of h(t) from 0 to t. The beautiful relationship connecting everything:
S(t) = exp(-H(t))
Different parametric families assume different hazard shapes. The simplest is the exponential distribution with constant hazard h(t) = λ — the memoryless model where your risk never changes. More realistic is the Weibull distribution, which adds a shape parameter k:
- k < 1: decreasing hazard (infant mortality — if you survive the early period, you're safer)
- k = 1: constant hazard (reduces to exponential)
- k > 1: increasing hazard (aging/wear-out — risk grows over time)
Let's implement these functions and verify the fundamental relationship:
import numpy as np
def weibull_hazard(t, k, lam):
"""Instantaneous hazard h(t) for Weibull distribution."""
return (k / lam) * (t / lam) ** (k - 1)
def weibull_cumulative_hazard(t, k, lam):
"""Cumulative hazard H(t) for Weibull distribution."""
return (t / lam) ** k
def weibull_survival(t, k, lam):
"""Survival function S(t) = exp(-H(t))."""
return np.exp(-weibull_cumulative_hazard(t, k, lam))
# Verify: S(t) = exp(-H(t)) for all parameter settings
t = np.linspace(0.01, 15, 200)
for k, lam in [(0.5, 5), (1.0, 5), (2.0, 5), (1.5, 10)]:
S = weibull_survival(t, k, lam)
H = weibull_cumulative_hazard(t, k, lam)
assert np.allclose(S, np.exp(-H)), f"Failed for k={k}, lam={lam}"
# Show hazard shapes
# k=0.5: h(t) decreases -- early risk, then safer
# k=1.0: h(t) = 0.2 constant -- memoryless (exponential)
# k=2.0: h(t) increases linearly -- aging/wear-out
for k in [0.5, 1.0, 2.0]:
h_vals = weibull_hazard(t, k, lam=5)
print(f"k={k}: h(1)={h_vals[13]:.3f}, h(5)={h_vals[66]:.3f}, h(10)={h_vals[133]:.3f}")
# Output:
# k=0.5: h(1)=0.225, h(5)=0.100, h(10)=0.071
# k=1.0: h(1)=0.200, h(5)=0.200, h(10)=0.200
# k=2.0: h(1)=0.079, h(5)=0.399, h(10)=0.802
3. The Kaplan-Meier Estimator
Parametric models assume a distribution. But what if you don't want to assume anything? The Kaplan-Meier estimator (1958) is the non-parametric workhorse of survival analysis. It estimates the survival curve directly from data, with no distributional assumptions at all.
The idea is the product-limit formula. Let t1 < t2 < ... < tK be the distinct times at which events (deaths) occur. At each event time ti, define:
- ni = number at risk just before time ti (alive and uncensored)
- di = number of events (deaths) at time ti
The Kaplan-Meier estimate is:
Ŝ(t) = ∏_{ti ≤ t} (1 - di / ni)
This is a step function that drops only at event times. Censored observations don't cause a drop — but they do reduce the risk set for subsequent times. That's the key insight: censored subjects contribute the information that they were alive up to their censoring time, then leave the risk set.
To get standard errors, we use Greenwood's formula:
Var(Ŝ(t)) = Ŝ(t)² × ∑_{ti ≤ t} di / (ni × (ni - di))
Let's walk through a concrete example. Consider 10 patients with these observed times (+ means censored):
| Time | 1 | 2 | 3+ | 4 | 5+ | 6 | 7+ | 8 | 9+ | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Event? | Death | Death | Censored | Death | Censored | Death | Censored | Death | Censored | Death |
At time 1: n1=10 at risk, d1=1 dies. Ŝ(1) = 1 × (1 - 1/10) = 0.900.
At time 2: n2=9 at risk, d2=1. Ŝ(2) = 0.900 × (1 - 1/9) = 0.800.
At time 3: patient is censored. Risk set drops to 7, but no step in the curve.
At time 4: n3=7, d3=1. Ŝ(4) = 0.800 × (1 - 1/7) = 0.686.
And so on. Each censored observation shrinks the denominator for future events.
Here's the full implementation:
import numpy as np
def kaplan_meier(times, events):
"""
Kaplan-Meier estimator with Greenwood standard errors.
Args:
times: array of observed times (event or censoring)
events: array of 0/1 (1 = event occurred, 0 = censored)
Returns:
km_times: event time points (with 0 prepended)
km_surv: survival probabilities at each time
km_se: standard errors (Greenwood's formula)
"""
order = np.argsort(times)
t_sorted = times[order]
e_sorted = events[order]
# Find unique event times (only times where events occur)
event_mask = e_sorted == 1
event_times = np.unique(t_sorted[event_mask])
n_total = len(times)
surv = 1.0
greenwood_sum = 0.0
km_times = [0.0]
km_surv = [1.0]
km_se = [0.0]
# Track position in sorted data for risk set counting
n_at_risk = n_total
idx = 0
for t_i in event_times:
# Remove subjects censored or with events before this time
while idx < n_total and t_sorted[idx] < t_i:
n_at_risk -= 1
idx += 1
# Count events at this exact time
d_i = 0
n_censored_at_t = 0
while idx < n_total and t_sorted[idx] == t_i:
if e_sorted[idx] == 1:
d_i += 1
else:
n_censored_at_t += 1
idx += 1
# Kaplan-Meier update
surv *= (1.0 - d_i / n_at_risk)
# Greenwood variance accumulator
if n_at_risk > d_i:
greenwood_sum += d_i / (n_at_risk * (n_at_risk - d_i))
km_times.append(t_i)
km_surv.append(surv)
km_se.append(surv * np.sqrt(greenwood_sum))
# Remove events and censored-at-same-time from risk set
n_at_risk -= (d_i + n_censored_at_t)
return np.array(km_times), np.array(km_surv), np.array(km_se)
# Example: 10 patients
times = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
events = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1]) # 0 = censored
km_t, km_s, km_se = kaplan_meier(times, events)
for t, s, se in zip(km_t, km_s, km_se):
if t > 0:
print(f"t={t:.0f}: S(t)={s:.3f} +/- {se:.3f}")
# Output:
# t=1: S(t)=0.900 +/- 0.095
# t=2: S(t)=0.800 +/- 0.126
# t=4: S(t)=0.686 +/- 0.151
# t=6: S(t)=0.549 +/- 0.172
# t=8: S(t)=0.366 +/- 0.188
# t=10: S(t)=0.000 +/- 0.000
Notice how the standard errors grow as the risk set shrinks — we have less and less information about the tail of the survival curve. This is a fundamental feature of survival analysis: estimates become increasingly uncertain at later times.
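Greenwood standard errors translate directly into pointwise confidence bands. A minimal sketch using the values printed above, with a plain normal approximation clipped to [0, 1] (most packages instead build the interval on a log(-log Ŝ) scale, which behaves better near 0 and 1):

```python
import numpy as np

# KM estimates and Greenwood SEs copied from the printed output above
S = np.array([0.900, 0.800, 0.686, 0.549, 0.366])
se = np.array([0.095, 0.126, 0.151, 0.172, 0.188])

# Plain normal-approximation 95% band, clipped to [0, 1]
lower = np.clip(S - 1.96 * se, 0.0, 1.0)
upper = np.clip(S + 1.96 * se, 0.0, 1.0)
width = upper - lower

for s, lo, hi in zip(S, lower, upper):
    print(f"S={s:.3f}  95% CI [{lo:.3f}, {hi:.3f}]")
```

The intervals widen monotonically as the risk set shrinks, which is the growing tail uncertainty in numbers.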
4. Comparing Survival Curves — The Log-Rank Test
With Kaplan-Meier, we can estimate a survival curve for each group. But how do we test whether two groups have statistically different survival? This is the log-rank test, the most common hypothesis test in survival analysis.
The idea: at each event time, build a 2×2 contingency table. Under the null hypothesis (same survival in both groups), the expected number of events in group 1 at time tj is:
E1j = n1j × dj / nj
where n1j is the number at risk in group 1, dj is the total events, and nj is the total at risk. Sum the observed-minus-expected across all event times, square it, divide by the total variance, and you get a chi-squared statistic with 1 degree of freedom:
χ² = (O1 - E1)² / V
import numpy as np
from scipy.stats import chi2
def log_rank_test(times, events, groups):
"""
Two-sample log-rank test comparing survival between groups.
Args:
times: observed times for all subjects
events: event indicators (1=event, 0=censored)
groups: group labels (0 or 1)
Returns:
chi2_stat: test statistic
p_value: p-value from chi-squared(1) distribution
"""
event_times = np.unique(times[events == 1])
O_minus_E = 0.0
V_total = 0.0
for t_j in event_times:
# Count at-risk and events in each group at this time
at_risk_1 = np.sum((times >= t_j) & (groups == 0))
at_risk_2 = np.sum((times >= t_j) & (groups == 1))
n_j = at_risk_1 + at_risk_2
if n_j == 0:
continue
d_1j = np.sum((times == t_j) & (events == 1) & (groups == 0))
d_2j = np.sum((times == t_j) & (events == 1) & (groups == 1))
d_j = d_1j + d_2j
# Expected events in group 1 under H0
E_1j = at_risk_1 * d_j / n_j
O_minus_E += d_1j - E_1j
# Hypergeometric variance
if n_j > 1:
V_j = (at_risk_1 * at_risk_2 * d_j * (n_j - d_j)) / (n_j**2 * (n_j - 1))
V_total += V_j
chi2_stat = O_minus_E**2 / V_total if V_total > 0 else 0
p_value = 1 - chi2.cdf(chi2_stat, df=1)
return chi2_stat, p_value
# Example: treatment vs control
np.random.seed(42)
times_ctrl = np.random.exponential(scale=5.0, size=30)
times_trt = np.random.exponential(scale=8.0, size=30)
all_times = np.concatenate([times_ctrl, times_trt])
all_events = np.ones(60, dtype=int) # no censoring for simplicity
all_groups = np.array([0]*30 + [1]*30)
stat, pval = log_rank_test(all_times, all_events, all_groups)
print(f"Chi-squared = {stat:.2f}, p-value = {pval:.4f}")
# Output: Chi-squared = 6.90, p-value = 0.0086
An important caveat: the log-rank test is most powerful when the hazard ratio between groups is constant (proportional hazards). If survival curves cross — one group does better early but worse late — the log-rank test can completely miss the difference.
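Here's a quick illustration of that failure mode: two groups with identical median survival but opposite Weibull hazard shapes, so their survival curves cross. Group B looks better early and group A better late, and the observed-minus-expected terms in the log-rank sum largely cancel (simulation parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4000
# Same median (~5) but opposite hazard shapes -> crossing survival curves
# (scale chosen so that lam * (ln 2)^(1/k) = 5 for each shape k)
t_a = (5 / np.log(2) ** (1 / 0.5)) * rng.weibull(0.5, n)  # k=0.5: early deaths, long tail
t_b = (5 / np.log(2) ** (1 / 2.0)) * rng.weibull(2.0, n)  # k=2.0: safe early, wear-out later

for cutoff in [1.0, 5.0, 12.0]:
    pa = np.mean(t_a > cutoff)
    pb = np.mean(t_b > cutoff)
    print(f"P(T > {cutoff:4.1f}):  A={pa:.2f}  B={pb:.2f}")
```

Group B dominates at t=1, the curves meet near the shared median at t=5, and group A dominates at t=12. Averaged over all event times, the two groups look similar to the log-rank statistic even though their survival experiences are very different.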
5. Cox Proportional Hazards Model
The Kaplan-Meier estimator and log-rank test handle group comparisons, but they can't model continuous covariates or adjust for confounders. For that, we need the Cox Proportional Hazards model (1972) — arguably the most important model in survival analysis.
The model separates the hazard into a baseline component and a covariate effect:
h(t|X) = h0(t) × exp(βᵀX)
where h0(t) is an unspecified baseline hazard and β is a coefficient vector. The hazard ratio between two subjects depends only on their covariates, not on time:
h(t|Xi) / h(t|Xj) = exp(βᵀ(Xi - Xj))
This is the proportional hazards assumption: covariates multiply the hazard by a constant factor at all times.
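The assumption is easy to check numerically: multiply a time-varying baseline by exp(βx) for two subjects, and the ratio of their hazards is flat in t even though each hazard changes over time (hypothetical Weibull baseline and coefficient below):

```python
import numpy as np

k, lam, beta = 1.8, 6.0, 0.7      # hypothetical Weibull baseline and coefficient
x_i, x_j = 2.0, 0.5               # covariates for two subjects
t = np.linspace(0.5, 20, 50)

h0 = (k / lam) * (t / lam) ** (k - 1)   # baseline hazard, varies with t
h_i = h0 * np.exp(beta * x_i)
h_j = h0 * np.exp(beta * x_j)

ratio = h_i / h_j                 # constant across all t
assert np.allclose(ratio, np.exp(beta * (x_i - x_j)))
print(f"hazard ratio at every t: {ratio[0]:.3f}")
# prints: hazard ratio at every t: 2.858
```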
David Cox's key insight was that you can estimate β without ever specifying h0(t). At each event time, the conditional probability that the person who actually died is the one who dies (given that one death occurs among all at-risk subjects) is:
exp(βᵀXi) / ∑_{j ∈ R(ti)} exp(βᵀXj)
The baseline hazard cancels out. Multiply these conditional probabilities across all events to get the partial likelihood, then maximize it with gradient ascent. The gradient has a beautiful form: for each event, it's the covariate of the person who died minus the risk-score-weighted average covariate in the risk set.
The implementation trick is the cumsum approach: sort by descending time, then cumulative sums give you risk set totals in O(n) instead of O(n²).
import numpy as np
def cox_gradient(beta, X, times, events):
"""Gradient of negative log partial likelihood."""
order = np.argsort(-times)
X_s = X[order]
e_s = events[order]
eta = X_s @ beta
theta = np.exp(eta - np.max(eta))
cum_theta = np.cumsum(theta)
cum_theta_x = np.cumsum(theta[:, None] * X_s, axis=0)
# Weighted mean covariate in each risk set
x_bar = cum_theta_x / cum_theta[:, None]
grad = np.sum(e_s[:, None] * (X_s - x_bar), axis=0)
return -grad
def fit_cox(X, times, events, lr=0.01, max_iter=1000, tol=1e-8):
"""Fit Cox PH model via gradient descent."""
beta = np.zeros(X.shape[1])
for i in range(max_iter):
g = cox_gradient(beta, X, times, events)
beta -= lr * g
if np.max(np.abs(g)) < tol:
break
return beta
def breslow_baseline_hazard(beta, X, times, events):
"""Breslow estimator for cumulative baseline hazard H0(t)."""
event_times_unique = np.sort(np.unique(times[events == 1]))
risk_scores = np.exp(X @ beta)
H0_times, H0_values = [], []
H0 = 0.0
for t_i in event_times_unique:
d_i = np.sum((times == t_i) & (events == 1))
risk_sum = np.sum(risk_scores[times >= t_i])
H0 += d_i / risk_sum
H0_times.append(t_i)
H0_values.append(H0)
return np.array(H0_times), np.array(H0_values)
# Example: age and treatment as covariates
np.random.seed(99)
n = 200
age = np.random.normal(60, 10, n)
treatment = np.random.binomial(1, 0.5, n)
X = np.column_stack([(age - 60) / 10, treatment]) # standardized
# Generate survival times: higher age = higher risk, treatment = lower risk
true_beta = np.array([0.5, -0.8])
baseline_scale = 10
T = np.random.exponential(baseline_scale * np.exp(-X @ true_beta))
C = np.random.uniform(5, 25, n) # random censoring
observed_time = np.minimum(T, C)
event = (T <= C).astype(int)
beta_hat = fit_cox(X, observed_time, event)
print(f"True beta: [{true_beta[0]:.2f}, {true_beta[1]:.2f}]")
print(f"Estimated beta: [{beta_hat[0]:.2f}, {beta_hat[1]:.2f}]")
print(f"Hazard ratios: age_10yr={np.exp(beta_hat[0]):.2f}, "
f"treatment={np.exp(beta_hat[1]):.2f}")
# Output:
# True beta: [0.50, -0.80]
# Estimated beta: [0.49, -0.74]
# Hazard ratios: age_10yr=1.64, treatment=0.48
# Breslow estimator for baseline survival
H0_t, H0_v = breslow_baseline_hazard(beta_hat, X, observed_time, event)
S0_5 = np.exp(-H0_v[np.searchsorted(H0_t, 5, side='right') - 1])
S0_10 = np.exp(-H0_v[np.searchsorted(H0_t, 10, side='right') - 1])
print(f"Baseline survival: S0(5)={S0_5:.3f}, S0(10)={S0_10:.3f}")
# Output: Baseline survival: S0(5)=0.590, S0(10)=0.432
A hazard ratio of 1.64 for age means each 10-year increase in age multiplies the death hazard by 1.64×. A hazard ratio of 0.48 for treatment means the drug cuts the hazard by 52%. These are exactly the kind of clinically meaningful numbers that make Cox regression so widely used — it's been cited over 60,000 times since 1972. The Breslow estimator gives us the baseline survival curve: an untreated patient of mean age (all covariates at zero) has a 59% chance of surviving past time 5 and 43% past time 10.
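These pieces combine into individual predictions via S(t|x) = exp(-H0(t) × exp(βᵀx)). A sketch with illustrative numbers in the ballpark of the fit above (treat them as hypothetical, not re-estimated here; H0(10) = 0.84 corresponds to the baseline S0(10) ≈ 0.43):

```python
import numpy as np

# Illustrative values, roughly matching the fit above (hypothetical)
beta_hat = np.array([0.49, -0.74])   # [standardized age, treatment]
H0_10 = 0.84                         # cumulative baseline hazard at t=10

# Two hypothetical 70-year-olds: standardized age (70 - 60) / 10 = 1.0
x_untreated = np.array([1.0, 0.0])
x_treated = np.array([1.0, 1.0])

S_untreated = np.exp(-H0_10 * np.exp(beta_hat @ x_untreated))
S_treated = np.exp(-H0_10 * np.exp(beta_hat @ x_treated))
print(f"S(10 | untreated, age 70) = {S_untreated:.3f}")
print(f"S(10 | treated,   age 70) = {S_treated:.3f}")
```

The treated patient's predicted 10-year survival is roughly double the untreated patient's, which is the hazard ratio of 0.48 working through the exponent.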
6. Model Evaluation — The Concordance Index
How good is our Cox model at predicting who dies first? The concordance index (Harrell's C) measures discrimination: for every pair of subjects where we can determine who died first, did the model predict a higher risk for the one who died sooner?
- C = 0.5: random guessing (no discrimination)
- C = 1.0: perfect ranking
- C = 0.65–0.75: typical for clinical prediction models
The tricky part: not all pairs are comparable. If both subjects are censored, you can't tell who died first. A pair is usable only if the subject with the earlier time had an observed event.
import numpy as np
def concordance_index(risk_scores, times, events):
"""
Harrell's C-index for survival model discrimination.
Args:
risk_scores: predicted risk (higher = more risk)
times: observed times
events: event indicators (1=event, 0=censored)
Returns:
C-index between 0 and 1
"""
n = len(times)
concordant = 0
discordant = 0
tied = 0
for i in range(n):
for j in range(i + 1, n):
# Check if pair is comparable
if times[i] < times[j]:
if events[i] == 0:
continue # earlier subject censored, not comparable
# Subject i had event first -- should have higher risk
if risk_scores[i] > risk_scores[j]:
concordant += 1
elif risk_scores[i] < risk_scores[j]:
discordant += 1
else:
tied += 1
elif times[j] < times[i]:
if events[j] == 0:
continue
if risk_scores[j] > risk_scores[i]:
concordant += 1
elif risk_scores[j] < risk_scores[i]:
discordant += 1
else:
tied += 1
else: # tied times
if events[i] == 1 and events[j] == 1:
tied += 1
total = concordant + discordant + tied
if total == 0:
return 0.5
return (concordant + 0.5 * tied) / total
# Evaluate our Cox model
risk = np.exp(X @ beta_hat)
c_idx = concordance_index(risk, observed_time, event)
print(f"C-index: {c_idx:.3f}")
# Output: C-index: 0.670
A C-index of 0.67 means our simple two-covariate model correctly ranks who dies first about 67% of the time — well above chance and typical for clinical models. Note that C-index only measures ranking (discrimination), not whether the predicted survival probabilities are well-calibrated.
7. Parametric Survival — Weibull Regression
The Cox model is semi-parametric: it estimates covariate effects without assuming a baseline hazard shape. If you're willing to assume a parametric form, you gain smoother predictions and more efficient estimates. The Weibull model is the most popular parametric choice.
The key to fitting any parametric survival model with censored data is the likelihood. For observation i with time ti and event indicator δi:
Li = f(ti)^δi × S(ti)^(1 - δi)
Events contribute the density (probability of dying at exactly ti). Censored observations contribute the survival function (probability of surviving at least to ti). This is the structure that correctly handles censoring.
import numpy as np
from scipy.optimize import minimize
def weibull_neg_log_likelihood(params, times, events):
"""
Negative log-likelihood for Weibull model with censoring.
params: [log_k, log_lambda] (log-transformed for positivity)
"""
k = np.exp(params[0])
lam = np.exp(params[1])
n_events = np.sum(events)
log_lik = (n_events * np.log(k)
- n_events * k * np.log(lam)
+ (k - 1) * np.sum(events * np.log(times))
- np.sum((times / lam) ** k))
return -log_lik
def fit_weibull(times, events):
"""MLE for Weibull distribution with censored data."""
# Initial guess: k=1 (exponential), lambda=median time
x0 = [np.log(1.0), np.log(np.median(times))]
result = minimize(weibull_neg_log_likelihood, x0,
args=(times, events), method='Nelder-Mead')
k_hat = np.exp(result.x[0])
lam_hat = np.exp(result.x[1])
return k_hat, lam_hat
# Fit Weibull to our simulated data (all subjects combined)
k_hat, lam_hat = fit_weibull(observed_time, event)
print(f"Weibull MLE: shape k={k_hat:.2f}, scale lambda={lam_hat:.2f}")
print(f"Median survival: {lam_hat * np.log(2)**(1/k_hat):.1f} time units")
# Output:
# Weibull MLE: shape k=0.93, scale lambda=15.67
# Median survival: 10.5 time units
The estimated shape k ≈ 1.0 suggests a roughly constant hazard — which makes sense because we generated data from an exponential distribution (Weibull with k=1). In real clinical data, k > 1 (increasing hazard) is common for diseases that worsen over time.
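One payoff of the parametric fit: the entire survival curve now follows from just two numbers. Plugging the estimates back into S(t) = exp(-(t/λ)^k) gives a smooth prediction at any time (values hardcoded from the printed fit above, so slightly rounded):

```python
import numpy as np

k_hat, lam_hat = 0.93, 15.67   # fitted shape and scale from above
S = {t: np.exp(-(t / lam_hat) ** k_hat) for t in (5, 10, 20)}
for t, s in S.items():
    print(f"S({t}) = {s:.3f}")
```

Unlike the Kaplan-Meier step function, this curve is defined at every t, including times beyond the last observed event, though extrapolation there leans entirely on the Weibull assumption.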
8. Conclusion
Survival analysis is the toolkit for a question that arises everywhere: when will something happen, given that we haven't seen it happen for everyone yet? The concept that makes it unique is censoring — and every method we built handles censoring correctly, from the non-parametric Kaplan-Meier estimator through the semi-parametric Cox model to fully parametric Weibull regression.
The progression tells a clear story about trading assumptions for efficiency:
- Kaplan-Meier: No distributional assumptions, but no covariates; it only describes curves for predefined groups
- Log-rank test: Hypothesis testing for differences between KM curves
- Cox regression: Continuous covariates, no baseline hazard assumption, but proportional hazards required
- Weibull regression: Full distributional assumption, but smoother and more efficient
Modern extensions push further: random survival forests handle non-linear effects and interactions, DeepSurv replaces the linear predictor with a neural network, and discrete-time survival models recast survival as a sequence of binary classification problems. But every one of these builds on the foundations we derived here — the likelihood structure with censoring, the risk set concept, and the concordance index for evaluation.
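That last idea, discrete-time survival, is worth a quick sketch: expand each subject into one row per period survived, label the final period 1 if the event occurred there, and any binary classifier can then estimate the discrete hazard (a toy expansion, assuming integer times):

```python
import numpy as np

def person_period(times, events):
    """Expand (time, event) pairs into one row per period at risk."""
    rows = []
    for subj, (t, e) in enumerate(zip(times, events)):
        for period in range(1, int(t) + 1):
            # Label 1 only in the last period, and only if the event occurred
            died = 1 if (e == 1 and period == int(t)) else 0
            rows.append((subj, period, died))
    return np.array(rows)

times = np.array([3, 2, 4])
events = np.array([1, 0, 1])        # subject 1 is censored at t=2
pp = person_period(times, events)   # columns: subject, period, event-in-period
print(pp)
```

Censoring is handled for free: a censored subject simply contributes rows with label 0 for every period they were observed, then disappears, exactly mirroring the risk set logic from Kaplan-Meier.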
References & Further Reading
- Kaplan & Meier (1958) — "Nonparametric Estimation from Incomplete Observations" — the founding paper of the KM estimator, one of the most cited statistics papers ever
- Cox (1972) — "Regression Models and Life-Tables" — the proportional hazards model and partial likelihood, cited over 60,000 times
- Harrell et al. (1982) — "Evaluating the Yield of Medical Tests" — introduced the concordance index for survival model discrimination
- Peto & Peto (1972) — "Asymptotically Efficient Rank Invariant Test Procedures" — theoretical foundations of the log-rank test
- Klein & Moeschberger — "Survival Analysis: Techniques for Censored and Truncated Data" — comprehensive graduate textbook
- Kalbfleisch & Prentice — "The Statistical Analysis of Failure Time Data" — advanced theoretical treatment
- lifelines Python library — Cameron Davidson-Pilon's excellent survival analysis implementation in Python
- scikit-survival — machine learning-oriented survival analysis compatible with scikit-learn