
Causal Inference from Scratch: Why Correlation Really Isn’t Causation

1. The Fundamental Problem — Why Correlation Misleads

Ice cream sales and drowning deaths are famously correlated. Nobody thinks ice cream causes drowning — the lurking variable is obvious (summer heat drives both). But in machine learning, the lurking variables are rarely so obvious. Your model discovers that users who visit the premium upsell page have 3× higher lifetime value. Ship a notification pushing everyone to that page? Not so fast. The recommendation engine was already targeting high-value users with that page. The correlation is real. The causal effect might be zero.

This is the gap between seeing and doing. Observing that X and Y move together tells you nothing about what happens when you intervene on X. Machine learning excels at finding associations — patterns in data that predict outcomes. But prediction and causation are fundamentally different tasks. A model that perfectly predicts hospital readmission might learn that “prescribed palliative care” correlates with death, but pulling the palliative care prescription doesn’t save anyone. The prescription is a consequence of terminal illness, not a cause of death.

The most dramatic illustration of this gap is Simpson’s Paradox, where a statistical relationship reverses direction when you account for a confounding variable. Consider a famous real-world example from Charig et al. (1986) — a study of kidney stone treatments.

Subgroup        Treatment A        Treatment B        Winner
Small stones    93.1% (81/87)      86.7% (234/270)    A wins
Large stones    73.0% (192/263)    68.8% (55/80)      A wins
Aggregate       78.0% (273/350)    82.6% (289/350)    B wins!

Treatment A wins in every subgroup but loses in the aggregate. How? The confounder is stone size. Doctors assigned the more aggressive Treatment A to harder cases (large stones), so A’s patient pool was disproportionately difficult. The aggregate comparison mixes together stone size with treatment, producing a misleading reversal.

Judea Pearl formalized this intuition into the Ladder of Causation — three rungs of increasingly powerful reasoning:

  1. Association (seeing): P(Y | X) — “How does seeing X change my belief about Y?” This is what machine learning does.
  2. Intervention (doing): P(Y | do(X)) — “What happens to Y if I set X to a value?” This is what causal inference answers.
  3. Counterfactuals (imagining): “Would Y have been different if X had been different, given what actually happened?” This is individual-level causal reasoning.

Standard ML operates on rung 1. To make decisions — which treatment to prescribe, which feature to ship, which intervention to fund — you need rung 2. Let’s see Simpson’s Paradox in code to make it concrete.

import numpy as np

# Kidney stone treatment data (Charig et al., 1986)
# Rows: [successes, total] for Treatment A and B
small_stones = {'A': (81, 87), 'B': (234, 270)}
large_stones = {'A': (192, 263), 'B': (55, 80)}

def success_rate(successes, total):
    return successes / total * 100

# Within each subgroup, Treatment A wins
print("=== Stratified by Stone Size ===")
for name, data in [("Small stones", small_stones), ("Large stones", large_stones)]:
    rate_a = success_rate(*data['A'])
    rate_b = success_rate(*data['B'])
    print(f"{name}: A={rate_a:.1f}% vs B={rate_b:.1f}%  (A wins by {rate_a - rate_b:+.1f}%)")

# But in aggregate, Treatment B wins!
agg_a = (81 + 192, 87 + 263)   # total A
agg_b = (234 + 55, 270 + 80)   # total B
print(f"\n=== Aggregate (ignoring stone size) ===")
print(f"A={success_rate(*agg_a):.1f}% vs B={success_rate(*agg_b):.1f}%  (B wins!)")

# Why? Treatment assignment is confounded by stone size
print(f"\n=== The Confounder ===")
print(f"A treats {263}/{87+263} = {263/350*100:.0f}% large stones (harder cases)")
print(f"B treats {80}/{270+80} = {80/350*100:.0f}% large stones")
print("Doctors assigned the more aggressive treatment to harder cases.")
print("Stone size confounds the treatment-outcome relationship.")

# Output:
# === Stratified by Stone Size ===
# Small stones: A=93.1% vs B=86.7%  (A wins by +6.4%)
# Large stones: A=73.0% vs B=68.8%  (A wins by +4.2%)
#
# === Aggregate (ignoring stone size) ===
# A=78.0% vs B=82.6%  (B wins!)

The takeaway: if you had used aggregate data to choose a treatment, you’d have picked the worse one. The right answer requires understanding why the data looks the way it does — which variable is a confounder and must be controlled for. Statistics alone can’t tell you this. You need a causal model.

2. Potential Outcomes and the Rubin Causal Model

Before we can fix the problem, we need a precise language for talking about causation. The potential outcomes framework, developed by Donald Rubin (building on Jerzy Neyman’s earlier work), gives us exactly that.

The core idea is deceptively simple. For each individual i and a binary treatment T, there are two potential outcomes:

  - Yi(1): the outcome individual i would experience if treated (T=1)
  - Yi(0): the outcome individual i would experience if untreated (T=0)

The individual treatment effect is the difference: τi = Yi(1) − Yi(0). Did the treatment help this specific person? That’s what τi answers.

Here’s the fundamental problem of causal inference, sometimes called “the problem that has no statistical solution” (Holland, 1986): we only ever observe one potential outcome. A patient either takes the drug or doesn’t. A user either sees the new feature or doesn’t. We never get to see the same individual in both states at the same time. The unobserved potential outcome is called the counterfactual.

Since individual effects are unobservable, we aim for the Average Treatment Effect (ATE) across a population:

ATE = E[Y(1) − Y(0)]

This is what we want. What we can actually compute from data is the naive comparison — the difference in observed means between the treated and untreated groups:

Naive estimate = E[Y | T=1] − E[Y | T=0]

Are these the same thing? Only if treatment is randomly assigned. Otherwise, we can decompose the naive estimate into two pieces:

E[Y | T=1] − E[Y | T=0] = ATE + Selection Bias

where Selection Bias = E[Y(0) | T=1] − E[Y(0) | T=0]: the baseline difference between the kinds of people who choose treatment versus those who don’t, even without treatment. If healthier people are more likely to exercise, then comparing exercisers to non-exercisers captures both the benefit of exercise and the fact that exercisers were healthier to begin with. (Strictly, the effect term in this decomposition is the ATT, the average treatment effect on the treated; it coincides with the ATE when the effect is the same for everyone, as in the simulation below.)

Let’s see this play out in a simulation. We’ll create a job training program where less experienced workers (who earn less) are more likely to enroll — a classic case of negative selection bias.

import numpy as np

np.random.seed(42)
n = 500

# X = years of experience (confounder)
experience = np.random.exponential(scale=5, size=n)

# Potential outcomes (God mode -- we see both)
y0 = 30 + 2 * experience + np.random.normal(0, 3, n)   # salary without training
y1 = y0 + 5                                              # salary with training (true ATE = $5K)

# Selection: less experienced workers MORE likely to seek training
prob_treat = 1 / (1 + np.exp(0.5 * (experience - 4)))
treatment = (np.random.random(n) < prob_treat).astype(int)

# Observed outcomes (fundamental problem: we only see one)
y_observed = treatment * y1 + (1 - treatment) * y0

# Naive estimate: compare treated vs untreated means
naive_ate = y_observed[treatment == 1].mean() - y_observed[treatment == 0].mean()

# True ATE (only possible in simulation)
true_ate = (y1 - y0).mean()

# Selection bias decomposition
# E[Y(0)|T=1] - E[Y(0)|T=0]: baseline difference between groups
selection_bias = y0[treatment == 1].mean() - y0[treatment == 0].mean()

print(f"True ATE:        ${true_ate:.2f}K")
print(f"Naive estimate:  ${naive_ate:.2f}K")
print(f"Selection bias:  ${selection_bias:.2f}K")
print(f"Check: naive ≈ ATE + bias: {true_ate:.2f} + ({selection_bias:.2f}) = {true_ate + selection_bias:.2f}")
print(f"\nTreated group avg experience:   {experience[treatment == 1].mean():.1f} years")
print(f"Untreated group avg experience: {experience[treatment == 0].mean():.1f} years")
print("Less experienced workers self-selected into training,")
print("dragging down the treated group's observed outcomes.")

# Output:
# True ATE:        $5.00K
# Naive estimate:  $-2.14K   (WRONG SIGN!)
# Selection bias:  $-7.14K

Look at that result. The training program genuinely helps — it boosts salary by $5K. But the naive estimate says training hurts by $2K. The wrong sign! Why? Because less experienced workers (who earn less regardless of training) self-selected into the program. Their lower baseline salaries dragged down the treated group’s average, creating $7K of negative selection bias that overwhelmed the real $5K benefit.

This is exactly what happens in the real world. Companies see that users who contact customer support have higher churn — but it’s not because support causes churn. Users with problems (who would churn anyway) are more likely to contact support. The selection bias is baked into the observational data, and no amount of clever feature engineering on rung 1 of Pearl’s ladder can remove it.

Randomized experiments (A/B tests) solve this by making treatment assignment independent of potential outcomes, driving selection bias to zero in expectation. But randomization isn’t always possible — you can’t randomly assign people to smoke for 30 years, or randomly assign cities to adopt a policy. For those cases, we need the tools in the sections ahead. And the first tool is understanding which variables create selection bias — the confounders.

3. Confounders and the Backdoor Criterion

We’ve seen that confounders can flip the sign of a causal estimate. But how do we systematically identify confounders? And once we’ve found them, how do we know which ones to control for? Not all variables that correlate with treatment and outcome should be conditioned on. Getting this wrong can create bias where none existed.

The answer comes from causal graphs — directed acyclic graphs (DAGs) where arrows represent direct causal effects. There are three fundamental structures you need to know, and they determine everything about what to condition on.

The Fork (Confounder)

Structure: X ← Z → Y. The variable Z causes both X and Y, creating a spurious association between them. In the kidney stone example, stone size (Z) causes both treatment choice (X) and recovery (Y). If you observe the data without conditioning on Z, X and Y appear associated — but it’s entirely because of Z. Action: condition on Z to block the spurious path and reveal the true causal effect of X on Y.

The Chain (Mediator)

Structure: X → Z → Y. The variable Z is the mechanism through which X affects Y. For example, a drug (X) changes blood chemistry (Z), which affects recovery (Y). Here X and Y are associated because of a genuine causal pathway flowing through Z. Action: do NOT condition on Z. If you do, you block the very causal pathway you’re trying to measure. You’d conclude the drug has no effect, which is wrong — it works precisely through that mediator.

The Collider

Structure: X → Z ← Y. Two independent causes both affect Z. The classic example: talent (X) and attractiveness (Y) both influence getting into Hollywood (Z). Talent and attractiveness are independent in the general population. But among Hollywood actors (conditioning on the collider Z), they become negatively associated — less talented actors who made it tend to be more attractive, and vice versa. This is the “explain-away” effect. Action: do NOT condition on Z. Conditioning on a collider opens a spurious path that didn’t exist before.
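The explain-away effect is easy to verify numerically. A minimal simulation sketch (the admission threshold of 1.5 is an arbitrary assumption): talent and looks are uncorrelated in the population, yet clearly negatively correlated among those admitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Talent and looks: independent in the general population
talent = rng.normal(size=n)
looks = rng.normal(size=n)

# Collider: admission depends on the sum (threshold 1.5 is an assumed cutoff)
hollywood = (talent + looks) > 1.5

corr_all = np.corrcoef(talent, looks)[0, 1]
corr_sel = np.corrcoef(talent[hollywood], looks[hollywood])[0, 1]
print(f"corr(talent, looks), everyone:         {corr_all:+.3f}")  # ~ 0
print(f"corr(talent, looks), Hollywood actors: {corr_sel:+.3f}")  # strongly negative
```

Conditioning on the collider manufactured a correlation between two variables that have no causal connection at all.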

Here’s the critical insight: forks and chains behave identically under observation (both create association between X and Y), but they require opposite actions. You must condition on forks and must not condition on chains. Statistics alone cannot distinguish these two structures — the same observed correlation pattern could arise from either. You need the causal graph, which comes from domain knowledge, not from data.
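To see why, here is a small sketch (all coefficients are illustrative assumptions): a fork and a chain tuned to produce nearly identical observational correlations, which nonetheless respond in opposite ways to an intervention on X.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Fork: Z -> X, Z -> Y.  X does NOT cause Y.
z = rng.normal(size=n)
x_fork = z + rng.normal(0, 0.5, n)
y_fork = z + rng.normal(0, 0.5, n)

# Chain: X -> Z -> Y.  X DOES cause Y, via Z.
x_chain = rng.normal(size=n)
y_chain = (x_chain + rng.normal(0, 0.5, n)) + rng.normal(0, 0.5, n)

# Observationally the two structures look alike: strong X-Y association
obs_fork = np.corrcoef(x_fork, y_fork)[0, 1]
obs_chain = np.corrcoef(x_chain, y_chain)[0, 1]
print(f"Observed corr(X, Y), fork:  {obs_fork:.2f}")   # ~ 0.8
print(f"Observed corr(X, Y), chain: {obs_chain:.2f}")  # ~ 0.8

# Intervene: set X at random, independent of everything upstream
x_do = rng.normal(size=n)
y_fork_do = z + rng.normal(0, 0.5, n)                                # fork: Y ignores do(X)
y_chain_do = (x_do + rng.normal(0, 0.5, n)) + rng.normal(0, 0.5, n)  # chain: Y tracks do(X)
do_fork = np.corrcoef(x_do, y_fork_do)[0, 1]
do_chain = np.corrcoef(x_do, y_chain_do)[0, 1]
print(f"corr under do(X), fork:  {do_fork:+.2f}")   # ~ 0: no causal effect
print(f"corr under do(X), chain: {do_chain:+.2f}")  # still strong: real effect
```

The observational rows are indistinguishable; only the interventional rows reveal which structure you are in.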

Pearl’s Backdoor Criterion formalizes this into a clear algorithm. A set of variables Z satisfies the backdoor criterion for estimating the causal effect of X on Y if:

  1. No variable in Z is a descendant of X (you don’t condition on mediators or colliders downstream of treatment).
  2. Z blocks every “backdoor path” — every path from X to Y that has an arrow pointing into X (these are the confounding paths).

When the backdoor criterion is satisfied, the causal effect is identifiable from observational data via the adjustment formula: P(Y | do(X)) = Σz P(Y | X, Z=z) P(Z=z). In other words, condition on Z, compute the effect within each stratum, and average over the population distribution of Z. This is exactly what we did with the kidney stones — we stratified by stone size and looked at the effect within each stratum.
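Applying the adjustment formula to the kidney stone numbers from Section 1 makes it concrete: compute each treatment's success rate within each stone-size stratum, then average using the pooled stone-size distribution as P(Z).

```python
# Kidney stone data (Charig et al., 1986): (successes, total) per stratum
data = {
    'A': {'small': (81, 87),   'large': (192, 263)},
    'B': {'small': (234, 270), 'large': (55, 80)},
}

# P(Z): marginal distribution of stone size, pooled over both treatment arms
n_small = 87 + 270
n_large = 263 + 80
p_z = {'small': n_small / (n_small + n_large),
       'large': n_large / (n_small + n_large)}

# Adjustment formula: P(Y | do(X)) = sum over z of P(Y | X, z) * P(z)
adjusted = {}
for treat in ('A', 'B'):
    adjusted[treat] = sum(
        (succ / total) * p_z[z]
        for z, (succ, total) in data[treat].items()
    )
    print(f"P(recovery | do({treat})) = {adjusted[treat]:.1%}")

# Adjustment restores A's advantage: A ≈ 83.3%, B ≈ 77.9%
```

The adjusted comparison agrees with the stratified analysis: once stone size is held fixed and then averaged out, Treatment A comes out ahead.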

Let’s implement a simple causal DAG class that can check the backdoor criterion.

class CausalDAG:
    """A directed acyclic graph for causal reasoning."""
    def __init__(self):
        self.edges = {}  # parent -> list of children
        self.nodes = set()

    def add_edge(self, parent, child):
        self.nodes.update([parent, child])
        self.edges.setdefault(parent, []).append(child)

    def parents(self, node):
        return [p for p, children in self.edges.items() if node in children]

    def descendants(self, node):
        desc = set()
        queue = self.edges.get(node, [])
        while queue:
            current = queue.pop(0)
            if current not in desc:
                desc.add(current)
                queue.extend(self.edges.get(current, []))
        return desc

    def is_backdoor(self, treatment, outcome, conditioning_set):
        """Check if conditioning_set satisfies the backdoor criterion."""
        # Rule 1: No node in conditioning set is a descendant of treatment
        treatment_desc = self.descendants(treatment)
        for node in conditioning_set:
            if node in treatment_desc:
                return False, "Conditioning on a descendant of treatment"

        # Rule 2: conditioning_set blocks all backdoor paths
        # (Simplified: check that all parents of treatment are blocked)
        backdoor_nodes = self.parents(treatment)
        for bd in backdoor_nodes:
            if bd not in conditioning_set:
                return False, f"Unblocked backdoor through '{bd}'"
        return True, "Backdoor criterion satisfied"

# Example 1: Fork (Confounder) -- Stone Size -> Treatment, Stone Size -> Outcome
dag = CausalDAG()
dag.add_edge("StoneSize", "Treatment")
dag.add_edge("StoneSize", "Recovery")
dag.add_edge("Treatment", "Recovery")

valid, msg = dag.is_backdoor("Treatment", "Recovery", {"StoneSize"})
print(f"Fork -- condition on StoneSize: {valid} ({msg})")

valid, msg = dag.is_backdoor("Treatment", "Recovery", set())
print(f"Fork -- condition on nothing:   {valid} ({msg})")

# Example 2: Chain (Mediator) -- Treatment -> Dosage -> Recovery
dag2 = CausalDAG()
dag2.add_edge("Treatment", "Dosage")
dag2.add_edge("Dosage", "Recovery")
print(f"\nChain -- no confounders, empty set works:")
valid, msg = dag2.is_backdoor("Treatment", "Recovery", set())
print(f"  Condition on nothing: {valid} ({msg})")
print("  (Do NOT condition on Dosage -- it's a mediator, not a confounder)")

# Example 3: Collider -- Talent -> Hollywood, Looks -> Hollywood
dag3 = CausalDAG()
dag3.add_edge("Talent", "Hollywood")
dag3.add_edge("Looks", "Hollywood")
print(f"\nCollider -- Talent and Looks are independent:")
print(f"  But conditioning on Hollywood creates a spurious association!")
print(f"  (Among Hollywood actors, less talented ones tend to be better looking)")

Notice how the code ties directly back to Simpson’s Paradox from Section 1. The kidney stone scenario is exactly the fork pattern: StoneSize is the confounder that causes both Treatment assignment and Recovery. The backdoor criterion tells us we must condition on StoneSize to identify the causal effect — which is precisely what stratification did. When we looked at the data within each stone-size subgroup, we were blocking the backdoor path through the confounder, and the true effect (Treatment A is better) emerged.

The beauty of the backdoor criterion is that it’s mechanical. Given a causal graph, you don’t need to reason informally about “controlling for stuff.” You draw your DAG, identify the backdoor paths, and the criterion tells you exactly which variables to condition on. The hard part — and the part no algorithm can do for you — is getting the DAG right. That requires domain knowledge: understanding which variables cause which.

Now that we have the conceptual tools (potential outcomes tell us what we want to estimate, causal graphs tell us what to condition on), we’re ready for the practical methods. The next section introduces propensity score matching — a technique that uses the backdoor criterion to extract causal estimates from messy observational data, even when there are many confounders to juggle simultaneously.

4. Propensity Score Matching — Quasi-Experiments from Observational Data

Sometimes you can’t run an A/B test. Ethics prevent you from randomly withholding a medical treatment. Cost prevents you from randomizing a factory retooling across plants. Logistics prevent you from randomly assigning cities to a new policy. But you still have observational data — treated and untreated groups that chose their own treatment status. Can you extract a causal estimate from this mess?

Propensity score matching (PSM), introduced by Rosenbaum and Rubin (1983), offers an elegant answer. The propensity score is the probability of receiving treatment given observed covariates: e(X) = P(T=1 | X). The key theorem: if you match treated and untreated individuals who have the same propensity score, the treatment assignment is approximately random within those matched pairs — just as if you had run an experiment. Instead of matching on dozens of confounders simultaneously (the curse of dimensionality), you collapse them all into a single number.

Three assumptions underpin the method, and they deserve scrutiny:

  1. Unconfoundedness (a.k.a. “ignorability”): conditional on observed covariates X, treatment assignment is independent of potential outcomes. This is untestable — you can never prove there isn’t a lurking unobserved confounder. It’s a leap of faith justified by domain knowledge.
  2. Overlap (common support): for every combination of covariates, there must be some probability of both treatment and non-treatment. Formally, 0 < P(T=1 | X) < 1. If certain types of people always get treated, there’s no untreated match for them.
  3. SUTVA (Stable Unit Treatment Value Assumption): one individual’s treatment doesn’t affect another’s outcome. No interference, no spillovers. This fails in social networks (if your friend gets vaccinated, you benefit too) but holds in many practical settings.

The recipe is straightforward: fit a logistic regression to predict treatment from confounders, compute propensity scores, match each treated unit to the nearest untreated unit by propensity score, and estimate the Average Treatment Effect on the Treated (ATT) from matched pairs. Let’s implement it from scratch.

import numpy as np

np.random.seed(42)
n = 1000

# Confounders
experience = np.random.exponential(5, n)
education = np.random.normal(14, 2, n)  # years of schooling

# Treatment assignment (self-selection based on confounders)
logit = -1.5 + 0.1 * experience + 0.15 * education
prob_treat = 1 / (1 + np.exp(-logit))
treatment = (np.random.random(n) < prob_treat).astype(int)

# Potential outcomes
y0 = 25 + 1.8 * experience + 2.5 * education + np.random.normal(0, 4, n)
y1 = y0 + 4  # true ATE = $4K
y_obs = treatment * y1 + (1 - treatment) * y0

# Step 1: Estimate propensity scores via logistic regression
X = np.column_stack([np.ones(n), experience, education])
beta = np.zeros(3)
lr = 0.01
for _ in range(1000):  # gradient ascent
    z = X @ beta
    p = 1 / (1 + np.exp(-np.clip(z, -20, 20)))
    gradient = X.T @ (treatment - p) / n
    beta += lr * gradient

propensity = 1 / (1 + np.exp(-np.clip(X @ beta, -20, 20)))

# Step 2: Nearest-neighbor matching on propensity score
treated_idx = np.where(treatment == 1)[0]
control_idx = np.where(treatment == 0)[0]

matched_treated = []
matched_control = []
for t in treated_idx:
    distances = np.abs(propensity[t] - propensity[control_idx])
    best_match = control_idx[np.argmin(distances)]
    matched_treated.append(y_obs[t])
    matched_control.append(y_obs[best_match])

# Step 3: Estimate ATT from matched pairs
naive_ate = y_obs[treatment == 1].mean() - y_obs[treatment == 0].mean()
matched_att = np.mean(np.array(matched_treated) - np.array(matched_control))
true_ate = 4.0

print(f"True ATE:         ${true_ate:.2f}K")
print(f"Naive estimate:   ${naive_ate:.2f}K  (biased by self-selection)")
print(f"PSM estimate:     ${matched_att:.2f}K  (bias reduced!)")
print(f"\nBias reduction:   {abs(naive_ate - true_ate) - abs(matched_att - true_ate):.2f}K closer to truth")
print(f"Treated group:    {len(treated_idx)} workers")
print(f"Matched pairs:    {len(matched_treated)}")

The propensity score matching estimate lands dramatically closer to the true $4K effect than the naive comparison. By matching treated workers to untreated workers with similar propensity scores, we’ve created an approximate experiment from observational data. The logistic regression learned that experience and education drive treatment selection, and the matching balanced these confounders across groups.
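One diagnostic the recipe above leaves implicit is a covariate balance check: after matching, each confounder's standardized mean difference (SMD) between treated and matched controls should shrink toward zero (a common rule of thumb is |SMD| < 0.1). A minimal sketch with hypothetical before/after draws:

```python
import numpy as np

def smd(x_treated, x_control):
    """Standardized mean difference: mean gap in pooled-SD units."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

rng = np.random.default_rng(0)

# Hypothetical confounder (years of experience) before matching:
# the treated pool skews far more experienced than the control pool
exp_treated = rng.normal(8, 3, 4000)
exp_control = rng.normal(5, 3, 6000)
smd_before = smd(exp_treated, exp_control)

# After matching, the retained controls resemble the treated group
exp_matched = rng.normal(8, 3, 4000)
smd_after = smd(exp_treated, exp_matched)

print(f"SMD before matching: {smd_before:+.2f}  (badly imbalanced)")
print(f"SMD after matching:  {smd_after:+.2f}  (near zero: balanced)")
```

If a confounder's SMD stays large after matching, the propensity model is missing something and the causal estimate shouldn't be trusted.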

But PSM has a fundamental limitation: it can only handle observed confounders. If there’s an unmeasured variable — say, motivation — that affects both who seeks training and how much they earn, PSM can’t fix the bias. The unconfoundedness assumption is doing all the heavy lifting, and it’s untestable. You can never be sure you’ve measured everything that matters.

What if we could exploit a natural event — a policy change, a product launch, a geographic boundary — that creates treated and control groups without the selection problem? That’s the idea behind difference-in-differences.

5. Difference-in-Differences — Natural Experiments

Sometimes nature hands you a quasi-experiment. A new regulation hits some states but not others. A company rolls out a feature to certain cities before the rest. A minimum wage increase affects one side of a state border. These “natural experiments” create treated and control groups without anyone deliberately randomizing, and difference-in-differences (DiD) is the workhorse method for analyzing them.

The idea is elegant. You observe two groups (treated and control) across two time periods (before and after the intervention). The DiD estimator is:

DiD = (Ytreat,after − Ytreat,before) − (Ycontrol,after − Ycontrol,before)

The first difference (within the treated group) removes all time-invariant confounders — things like city size, demographics, or baseline revenue that are fixed. But it still includes any common time trends (the economy grew, seasonality shifted). The second difference — subtracting the control group’s change — removes those common trends. What’s left is the treatment effect.
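The same estimator can also be computed as a single regression, Y = β0 + β1·Treated + β2·Post + β3·(Treated × Post), where the interaction coefficient β3 is the DiD estimate. A quick sketch on illustrative simulated data (the group gap, trend, and $5K effect are assumed numbers, not the city example in this section):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Illustrative two-group, two-period panel (all numbers assumed)
treated = rng.integers(0, 2, n)   # group indicator
post = rng.integers(0, 2, n)      # period indicator
true_effect = 5.0
y = (
    50.0                          # control-group baseline
    + 10.0 * treated              # fixed group gap (time-invariant confounding)
    + 3.0 * post                  # common time trend
    + true_effect * treated * post
    + rng.normal(0, 2, n)
)

# OLS on y ~ 1 + treated + post + treated*post; the interaction is the DiD
X = np.column_stack([np.ones(n), treated, post, treated * post])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"Interaction coefficient (DiD estimate): {beta[3]:.2f}")  # ~ 5.0
```

The regression form is convenient in practice because it extends naturally to covariates, many periods, and clustered standard errors.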

The key assumption is parallel trends: absent treatment, both groups would have followed the same trajectory over time. You don’t need them to have the same levels (treated cities can be richer than control cities), just the same trend. This is untestable post-treatment, but you can check it pre-treatment — if the two groups moved in parallel before the intervention, it’s plausible they would have continued to do so.

Let’s see it in action. Imagine a company that rolls out a new ML recommendation model to 25 cities but not the other 25, and we measure monthly revenue.

import numpy as np

np.random.seed(42)
n_cities = 50
n_periods = 8  # 4 pre-treatment + 4 post-treatment
treatment_start = 4

# City-level panel data
treated_cities = np.arange(n_cities // 2)
control_cities = np.arange(n_cities // 2, n_cities)

# Baseline revenue differs between groups (treated cities are bigger)
baseline = np.zeros(n_cities)
baseline[treated_cities] = 100  # treated cities start higher
baseline[control_cities] = 80   # control cities start lower

# Common time trend (both groups grow similarly)
time_trend = np.arange(n_periods) * 2.0  # $2K/period growth

# True treatment effect: $8K revenue boost after rollout
true_effect = 8.0

# Generate panel data
revenue = np.zeros((n_cities, n_periods))
for t in range(n_periods):
    noise = np.random.normal(0, 3, n_cities)
    revenue[:, t] = baseline + time_trend[t] + noise
    if t >= treatment_start:
        revenue[treated_cities, t] += true_effect

# DiD estimation
treat_before = revenue[treated_cities, :treatment_start].mean()
treat_after = revenue[treated_cities, treatment_start:].mean()
ctrl_before = revenue[control_cities, :treatment_start].mean()
ctrl_after = revenue[control_cities, treatment_start:].mean()

did_estimate = (treat_after - treat_before) - (ctrl_after - ctrl_before)

print(f"=== Difference-in-Differences ===")
print(f"Treated cities:  before={treat_before:.1f}  after={treat_after:.1f}  change={treat_after-treat_before:+.1f}")
print(f"Control cities:  before={ctrl_before:.1f}  after={ctrl_after:.1f}  change={ctrl_after-ctrl_before:+.1f}")
print(f"\nFirst diff (treated):  {treat_after - treat_before:+.1f}")
print(f"First diff (control):  {ctrl_after - ctrl_before:+.1f}")
print(f"DiD estimate:          {did_estimate:+.1f}")
print(f"True effect:           {true_effect:+.1f}")

# Naive comparison (ignoring pre-treatment differences)
naive = treat_after - ctrl_after
print(f"\nNaive post-only:       {naive:+.1f}  (inflated by baseline gap!)")
print("DiD removes the baseline difference between groups.")

# Check parallel trends assumption (pre-treatment periods)
pre_trends_treat = [revenue[treated_cities, t].mean() for t in range(treatment_start)]
pre_trends_ctrl = [revenue[control_cities, t].mean() for t in range(treatment_start)]
gaps = [pre_trends_treat[t] - pre_trends_ctrl[t] for t in range(treatment_start)]
print(f"\nParallel trends check (pre-treatment gaps):")
print(f"  Gaps: {['%.1f' % g for g in gaps]}")
print(f"  Stable gap ≈ {np.mean(gaps):.1f} confirms parallel trends assumption")

The naive post-treatment comparison gives roughly +$28K — a wildly inflated number because it includes the $20K baseline gap between larger treated cities and smaller control cities. DiD correctly strips that away: the treated cities grew by about $16K (time trend + treatment effect) while the control cities grew by about $8K (time trend only), giving a DiD estimate of approximately $8K — right on target.

The parallel trends check confirms the key assumption: the gap between treated and control cities stayed stable around $20K throughout the pre-treatment periods. Both groups were growing at the same rate before the intervention, making it plausible that any divergence afterward is due to the new recommendation model rather than some other city-level change.

DiD is powerful because it handles time-invariant unobserved confounders — anything fixed about a city (population, geography, local culture) cancels out in the differencing. But it can’t handle time-varying confounders (what if treated cities also got a new marketing campaign at the same time?) and it relies on the untestable parallel trends assumption holding post-treatment. For situations where even time-varying unobserved confounders are a concern, we need something stronger.

6. Instrumental Variables — When Confounders Are Unobservable

Propensity score matching requires observing all confounders. Difference-in-differences requires parallel trends. But what if there’s an unmeasured confounder that no amount of conditioning can handle? This is where instrumental variables (IV) come in — perhaps the most conceptually elegant (and most frequently misused) tool in the causal inference arsenal.

The classic problem: estimating the return to education on earnings. People with more education earn more, but is that because education causes higher earnings, or because innate ability (unobserved) drives both? More able people stay in school longer and would earn more regardless. OLS regression can’t separate the two effects because ability is unmeasured.

An instrument Z is a variable that:

  1. Relevance: Z correlates with the treatment T. This is testable — the first-stage F-statistic should exceed 10 (Staiger and Stock, 1997).
  2. Exclusion restriction: Z affects the outcome Y only through T, not directly. This is untestable and must be argued on substantive grounds.
  3. Independence: Z is uncorrelated with the unobserved confounders. In our example, Z must be unrelated to ability.

Card (1995) used proximity to college as an instrument for education. Growing up near a college makes attending cheaper, increasing the probability of attending (relevance). Geographic proximity is plausibly unrelated to innate ability (independence). And proximity affects earnings only through its effect on education, not directly (exclusion). These are debatable but defensible.

The estimation method is Two-Stage Least Squares (2SLS): in Stage 1, regress treatment on the instrument to get predicted treatment; in Stage 2, regress the outcome on predicted treatment. By using only the variation in education that comes from the instrument (geographic proximity), we isolate variation that’s uncorrelated with ability.

One important subtlety: IV estimates the Local Average Treatment Effect (LATE) — the causal effect for compliers (people whose education changed because of proximity), not the ATE for the whole population. This is a narrower but still valuable estimate.
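Because the instrument in this example is binary, the IV estimate also has a transparent closed form, the Wald estimator: the jump in mean outcome across instrument values divided by the jump in mean treatment. A sketch using the same data-generating story as this section's simulation (true return $3K, proximity shifting education by 1.5 years):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

ability = rng.normal(0, 1, n)                       # unobserved confounder
near_college = (rng.random(n) < 0.4).astype(float)  # binary instrument
education = 12 + 2 * ability + 1.5 * near_college + rng.normal(0, 1, n)
earnings = 20 + 3.0 * education + 5 * ability + rng.normal(0, 5, n)

# Wald estimator: jump in outcome / jump in treatment across instrument values
wald = (
    earnings[near_college == 1].mean() - earnings[near_college == 0].mean()
) / (
    education[near_college == 1].mean() - education[near_college == 0].mean()
)
print(f"Wald IV estimate: {wald:.2f}")  # ~ 3.0, the true causal return
```

With one binary instrument and one treatment, 2SLS and the Wald ratio give the same answer; the ratio form makes the LATE interpretation explicit.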

import numpy as np

np.random.seed(42)
n = 2000

# Unobserved confounder: innate ability
ability = np.random.normal(0, 1, n)

# Instrument: proximity to college (0 = far, 1 = near)
# Independent of ability (geographic accident)
near_college = (np.random.random(n) < 0.4).astype(float)

# Education (affected by ability AND proximity)
education = 12 + 2 * ability + 1.5 * near_college + np.random.normal(0, 1, n)

# Earnings (affected by education AND ability directly)
true_return = 3.0  # true causal return to education: $3K per year
earnings = 20 + true_return * education + 5 * ability + np.random.normal(0, 5, n)

# OLS regression (biased: ability confounds education-earnings)
X_ols = np.column_stack([np.ones(n), education])
beta_ols = np.linalg.lstsq(X_ols, earnings, rcond=None)[0]

# 2SLS: Instrumental Variables using college proximity
# Stage 1: Regress education on instrument
X_iv = np.column_stack([np.ones(n), near_college])
gamma = np.linalg.lstsq(X_iv, education, rcond=None)[0]
education_hat = X_iv @ gamma  # predicted education from instrument

# Stage 2: Regress earnings on predicted education
X_2sls = np.column_stack([np.ones(n), education_hat])
beta_2sls = np.linalg.lstsq(X_2sls, earnings, rcond=None)[0]

# First-stage F-statistic (instrument relevance check)
ss_resid = np.sum((education - X_iv @ gamma) ** 2)
ss_total = np.sum((education - education.mean()) ** 2)
r_squared = 1 - ss_resid / ss_total
f_stat = (r_squared / 1) / ((1 - r_squared) / (n - 2))

print(f"True causal return to education: ${true_return:.2f}K/year")
print(f"OLS estimate:                    ${beta_ols[1]:.2f}K/year  (biased upward)")
print(f"2SLS IV estimate:                ${beta_2sls[1]:.2f}K/year  (consistent)")
print(f"\nOLS bias: ability is a confounder (smarter people get more education")
print(f"AND earn more, inflating the apparent return to education)")
print(f"\nFirst-stage F-statistic: {f_stat:.1f}  ({'Strong' if f_stat > 10 else 'Weak'} instrument)")
print(f"Rule of thumb: F > 10 indicates a strong instrument")

OLS estimates the return to education at roughly $5K per year — substantially above the true $3K. The upward bias comes from ability: smarter people get more education and earn more independently, so OLS conflates the two effects. The 2SLS estimate using college proximity as an instrument lands much closer to the truth, because it uses only the “exogenous” variation in education — the part driven by geographic accident rather than by ability.

The first-stage F-statistic confirms we have a strong instrument (well above the Staiger-Stock threshold of 10). A weak instrument would amplify any small violation of the exclusion restriction, producing wildly unstable estimates — the so-called “weak instrument problem.”

The fundamental limitation of IV is that good instruments are rare and the exclusion restriction is untestable. You must argue, not test, that proximity to college affects earnings only through education. If proximity also affects earnings through, say, access to urban job markets, the exclusion restriction fails and the IV estimate is biased. This is why IV results are often debated vigorously in empirical research — the instrument’s validity rests on an untestable assumption.

We’ve built the full toolkit — from understanding what can go wrong (Simpson’s paradox) through formal frameworks (potential outcomes, DAGs) to practical estimation methods (PSM, DiD, IV). Now let’s make two of these concepts tangible with interactive visualizations.

Try It: Simpson’s Paradox Explorer

Watch how a confounder can reverse the apparent direction of an effect. Within each subgroup the treatment helps — but in aggregate it appears to hurt.

Try It: Propensity Score Matching

See how matching on propensity scores pairs treated and untreated units to reduce selection bias. Click Match to watch the algorithm find nearest neighbors.

References & Further Reading

  - Charig, C. R., Webb, D. R., Payne, S. R., & Wickham, J. E. (1986). Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy. British Medical Journal, 292, 879–882.
  - Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396), 945–960.
  - Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.
  - Card, D. (1995). Using geographic variation in college proximity to estimate the return to schooling. In Aspects of Labour Market Behaviour: Essays in Honour of John Vanderkamp. University of Toronto Press.
  - Staiger, D., & Stock, J. H. (1997). Instrumental variables regression with weak instruments. Econometrica, 65(3), 557–586.
  - Pearl, J., & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books.

Conclusion

We started with a paradox — the same data telling opposite stories depending on how you slice it — and ended with a toolkit for extracting causal answers from messy observational data. Along the way we built the key abstractions from scratch: potential outcomes that define what a causal effect is, directed acyclic graphs that formalize which variables confound which, propensity scores that turn observational data into quasi-experiments, difference-in-differences that exploit natural experiments, and instrumental variables that handle unobserved confounders.

The deepest lesson is that every one of these methods rests on untestable assumptions. PSM assumes no unobserved confounders. DiD assumes parallel trends. IV assumes the exclusion restriction. No statistical test can confirm these assumptions — they require domain knowledge, the kind of substantive expertise about how the world works that no algorithm can replace. The DAG is where statistics meets domain knowledge: you draw the arrows based on what you know about the system, and the math tells you what you can identify. Get the arrows wrong and no amount of clever estimation will save you. Get them right and you can answer questions that pure prediction never could: not just what will happen, but what would happen if we intervened.