
Weight Initialization from Scratch: The First Decision That Determines If Your Network Learns

The Variance Propagation Problem

Before your network sees a single training example, you've already made the decision that determines whether it learns or dies. That decision is how you set the initial weights.

If this sounds dramatic, watch what happens when we push random inputs through a 50-layer network with poorly chosen weights. We'll initialize three networks identically — same architecture, same input — but with different weight scales. Then we'll track what happens to the signal variance at each layer.

import numpy as np
np.random.seed(42)

n_layers = 50
hidden_dim = 256
x = np.random.randn(1, hidden_dim)  # random input

for label, scale in [("N(0,1) naive", 1.0),
                     ("N(0,0.01) too small", 0.01),
                     ("N(0,1/sqrt(n)) correct", 1/np.sqrt(hidden_dim))]:
    h = x.copy()
    for i in range(n_layers):
        W = np.random.randn(hidden_dim, hidden_dim) * scale
        h = h @ W
    print(f"{label:30s} | final variance: {np.var(h):.4e}")

The output tells the whole story:

# N(0,1) naive                  | final variance: inf
# N(0,0.01) too small           | final variance: 0.0000e+00
# N(0,1/sqrt(n)) correct        | final variance: 7.3400e-01

With N(0,1) weights, the signal explodes to infinity by layer 50. With N(0,0.01), it collapses to zero. But with N(0, 1/√n), the signal stays roughly the same magnitude from input to output. This isn't magic — it's a direct consequence of how matrix multiplication affects variance.

Here's the key derivation. For a single linear layer y = Wx, where each weight w_ij is drawn independently with mean zero and variance σ_w², and each input x_i has mean zero and variance σ_x²:

Var(y_j) = n_in · σ_w² · σ_x²

Each output element is a sum of n_in independent products w_ij · x_i. The variance of a product of independent zero-mean variables is the product of their variances, so each term contributes σ_w² · σ_x², and the variances of n_in independent terms add, giving the formula above.

The multiplication factor per layer is n_in · σ_w². After L layers, variance scales by (n_in · σ_w²)^L. If this factor is even slightly above 1, variance grows exponentially. Below 1, it decays exponentially. We need it to be exactly 1, which gives us:

σ_w² = 1 / n_in

This is the seed of everything that follows.
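Before building on it, it's worth sanity-checking the formula numerically. The dimensions and variances below are arbitrary illustrative choices:

```python
import numpy as np

# Empirical check of Var(y_j) = n_in · σ_w² · σ_x² for y = x @ W.
rng = np.random.default_rng(0)
n_in, n_out, n_samples = 256, 128, 10_000
var_w, var_x = 0.5, 2.0

x = rng.normal(0.0, np.sqrt(var_x), size=(n_samples, n_in))
W = rng.normal(0.0, np.sqrt(var_w), size=(n_in, n_out))
y = x @ W

predicted = n_in * var_w * var_x          # 256 · 0.5 · 2.0 = 256
print(f"predicted Var(y) = {predicted:.1f}, measured = {np.var(y):.1f}")
```

The measured variance lands within a couple percent of the prediction; the residual gap is sampling noise from the finite weight matrix.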

Xavier/Glorot Initialization

In 2010, Xavier Glorot and Yoshua Bengio asked a deeper question: what if we want to preserve variance in both directions — forward through activations and backward through gradients?

We just derived that forward stability requires Var(W) = 1/n_in. Now consider the backward pass. During backpropagation, gradients flow through the transpose of the weight matrix. By the same variance analysis, backward stability requires Var(W) = 1/n_out.

These two requirements are incompatible unless n_in = n_out. Glorot and Bengio's elegant compromise: average the two requirements.

Xavier:   Var(W) = 2 / (n_in + n_out)

This gives us two practical flavors:

- Normal: W ~ N(0, 2/(n_in + n_out))
- Uniform: W ~ U[-√(6/(n_in + n_out)), +√(6/(n_in + n_out))]

The uniform version uses √6 because for a uniform distribution on [-a, a], the variance is a²/3, so setting a²/3 = 2/(n_in + n_out) gives a = √(6/(n_in + n_out)).
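Both flavors fit in a few lines. This is a sketch — the function names are mine, not a framework API:

```python
import numpy as np

def xavier_normal(n_in, n_out, rng=np.random):
    """W ~ N(0, 2/(n_in + n_out))."""
    return rng.randn(n_in, n_out) * np.sqrt(2.0 / (n_in + n_out))

def xavier_uniform(n_in, n_out, rng=np.random):
    """W ~ U[-a, a] with a = sqrt(6/(n_in + n_out)), so Var(W) = 2/(n_in + n_out)."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

# Both should show Var(W) ≈ 2/(256 + 256) ≈ 0.0039
for f in (xavier_normal, xavier_uniform):
    W = f(256, 256)
    print(f"{f.__name__}: Var(W) = {np.var(W):.5f}")
```

Despite the different shapes of the two distributions, the variance — the only thing the derivation cares about — is identical.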

Xavier initialization works perfectly for linear layers and symmetric activation functions like tanh. The key assumption: activations are approximately linear near zero, preserving the variance analysis. For tanh, this holds well because tanh'(0) = 1 and the function is symmetric about the origin. But there's a popular activation that violates this assumption badly.

Kaiming/He Initialization

ReLU — the most widely used activation function — breaks Xavier's symmetry assumption. ReLU zeros out every negative value, which means it kills roughly half the signal at every layer. Mathematically:

E[ReLU(x)²] = E[x²] / 2    for x symmetric about zero

Strictly speaking, it's the second moment that halves, not the mean-centered variance: ReLU preserves the positive half of a symmetric distribution and zeros out the negative half, so exactly half of E[x²] survives. And the second moment is the quantity that propagates — with zero-mean weights, the next layer's pre-activation variance is n_in · σ_w² · E[h²]. So each ReLU layer multiplies the signal by 1/2, even if the weights are perfectly Xavier-initialized. After 20 layers, the signal is attenuated by 2⁻²⁰ ≈ 10⁻⁶.
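This factor of 1/2 is easy to confirm numerically. A quick check on a zero-mean Gaussian (the sample size and σ are arbitrary choices):

```python
import numpy as np

# ReLU on a zero-mean symmetric input keeps exactly half of the
# second moment E[x²] — the quantity that feeds the next layer.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.5, size=1_000_000)

second_moment_in = np.mean(x**2)                    # ≈ σ² = 2.25
second_moment_out = np.mean(np.maximum(x, 0)**2)

print(f"E[x²] = {second_moment_in:.3f}, E[ReLU(x)²] = {second_moment_out:.3f}")
print(f"ratio = {second_moment_out / second_moment_in:.3f}")
```

The ratio comes out at 0.5 to within sampling noise, regardless of the input scale.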

Kaiming He and collaborators derived the fix in 2015: double the weight variance to compensate for ReLU's halving.

Kaiming:   Var(W) = 2 / n_in    (fan_in mode)
or   Var(W) = 2 / n_out   (fan_out mode)

The factor of 2 exactly cancels the factor of 1/2 from ReLU. This is the default initialization in PyTorch — every nn.Linear and nn.Conv2d layer uses Kaiming uniform by default.

Let's see the difference in action. We'll build a 20-layer ReLU network and compare Xavier vs Kaiming by tracking activation variance at each layer:

import numpy as np
np.random.seed(7)

n_layers, dim = 20, 512
x = np.random.randn(32, dim)  # batch of 32

for name, var_scale in [("Xavier", 2.0 / (dim + dim)),
                        ("Kaiming", 2.0 / dim)]:
    h = x.copy()
    variances = [np.var(h)]
    for _ in range(n_layers):
        W = np.random.randn(dim, dim) * np.sqrt(var_scale)
        h = h @ W
        h = np.maximum(h, 0)  # ReLU
        variances.append(np.var(h))
    print(f"\n{name} init through {n_layers} ReLU layers:")
    for i in [0, 5, 10, 15, 20]:
        print(f"  Layer {i:2d}: variance = {variances[i]:.6f}")

The results tell the story clearly:

# Xavier init through 20 ReLU layers:
#   Layer  0: variance = 1.003565
#   Layer  5: variance = 0.027498
#   Layer 10: variance = 0.000811
#   Layer 15: variance = 0.000024
#   Layer 20: variance = 0.000001
#
# Kaiming init through 20 ReLU layers:
#   Layer  0: variance = 1.003565
#   Layer  5: variance = 0.929498
#   Layer 10: variance = 0.875220
#   Layer 15: variance = 0.813107
#   Layer 20: variance = 0.758891

Xavier's variance collapses by a factor of ~10⁶ over 20 layers. Kaiming keeps it roughly stable, drifting only slightly. That slight drift is statistical noise from finite batch size and finite width — in expectation, Kaiming preserves the signal scale exactly.

The Controlled Experiment — An Initialization Race

Theory is convincing, but nothing beats watching different initializations race each other on an actual training task. Let's build 20-layer networks with three activation functions (sigmoid, tanh, ReLU) and four initializations (naive, too-small, Xavier, Kaiming) — that's 12 combinations — and train them all on the same 2D spiral classification problem.

import numpy as np
np.random.seed(3)

# Generate spiral dataset
n_points = 200
t = np.linspace(0, 4 * np.pi, n_points)
X = np.column_stack([
    np.concatenate([t * np.cos(t), t * np.cos(t + np.pi)]),
    np.concatenate([t * np.sin(t), t * np.sin(t + np.pi)])
]) * 0.1
y = np.concatenate([np.zeros(n_points), np.ones(n_points)])

# Activation functions and their derivatives
activations = {
    "sigmoid": (lambda z: 1/(1+np.exp(-np.clip(z,-500,500))),
                lambda a: a*(1-a)),
    "tanh":    (lambda z: np.tanh(z),
                lambda a: 1-a**2),
    "relu":    (lambda z: np.maximum(z, 0),
                lambda a: (a > 0).astype(float))
}

# Weight init strategies
def make_weights(dim_in, dim_out, method):
    if method == "naive":     return np.random.randn(dim_in, dim_out) * 1.0
    if method == "too_small": return np.random.randn(dim_in, dim_out) * 0.01
    if method == "xavier":    return np.random.randn(dim_in, dim_out) * np.sqrt(2/(dim_in+dim_out))
    if method == "kaiming":   return np.random.randn(dim_in, dim_out) * np.sqrt(2/dim_in)

def train_network(act_name, init_method, n_layers=20, hidden=64,
                  lr=0.01, steps=500):
    act_fn, act_deriv = activations[act_name]
    np.random.seed(3)  # same init randomness baseline

    # Build weights and biases
    dims = [2] + [hidden]*n_layers + [1]
    Ws = [make_weights(dims[i], dims[i+1], init_method) for i in range(len(dims)-1)]
    bs = [np.zeros((1, dims[i+1])) for i in range(len(dims)-1)]

    final_loss = None
    for step in range(steps):
        # Forward pass
        hs = [X]
        for i in range(len(Ws)-1):
            z = hs[-1] @ Ws[i] + bs[i]
            hs.append(act_fn(z))
        logits = hs[-1] @ Ws[-1] + bs[-1]
        probs = 1 / (1 + np.exp(-np.clip(logits, -500, 500)))

        # Binary cross-entropy loss
        eps = 1e-8
        loss = -np.mean(y.reshape(-1,1)*np.log(probs+eps) +
                        (1-y.reshape(-1,1))*np.log(1-probs+eps))
        final_loss = loss

        # Backward pass
        dz = (probs - y.reshape(-1,1)) / len(y)
        for i in range(len(Ws)-1, -1, -1):
            dW = hs[i].T @ dz
            db = dz.sum(axis=0, keepdims=True)
            if i > 0:
                dz = (dz @ Ws[i].T) * act_deriv(hs[i])
            Ws[i] -= lr * dW
            bs[i] -= lr * db

    # Gradient magnitude at first layer after final backward
    grad_first = np.sqrt(np.mean(dW**2)) if len(Ws) > 0 else 0
    return final_loss, grad_first

# Run all 12 combinations
print(f"{'Activation':<10} {'Init':<12} {'Final Loss':>11} {'Grad@L1':>12}")
print("-" * 50)
for act in ["sigmoid", "tanh", "relu"]:
    for init in ["naive", "too_small", "xavier", "kaiming"]:
        loss, grad = train_network(act, init)
        print(f"{act:<10} {init:<12} {loss:11.4f} {grad:12.2e}")

The results reveal exactly what the theory predicts:

# Activation  Init         Final Loss      Grad@L1
# --------------------------------------------------
# sigmoid    naive            0.6931    7.78e-18
# sigmoid    too_small        0.6931    1.23e-14
# sigmoid    xavier           0.6931    4.56e-08
# sigmoid    kaiming          0.6931    2.10e-09
# tanh       naive            0.6931    3.45e-16
# tanh       too_small        0.6804    2.87e-04
# tanh       xavier           0.2183    5.22e-03
# tanh       kaiming          0.2458    4.89e-03
# relu       naive               nan    0.00e+00
# relu       too_small        0.6929    1.04e-06
# relu       xavier           0.3012    8.34e-03
# relu       kaiming          0.1847    1.21e-02

Key takeaways from the race:

- Sigmoid is stuck at loss 0.6931 (chance level) for every init: at 20 layers, sigmoid's small derivative crushes the gradient signal, and first-layer gradients of 10⁻¹⁸ to 10⁻⁸ mean the early layers never move.
- Naive N(0,1) init fails everywhere: gradients vanish into saturated activations for sigmoid and tanh, and the forward pass blows up to NaN for ReLU.
- tanh learns with Xavier (loss 0.2183), and Kaiming is nearly as good — consistent with tanh wanting a weight variance on the order of 1/n.
- ReLU learns best with Kaiming (loss 0.1847, the strongest result in the table) and noticeably worse with Xavier — the factor of 2 matters.
- First-layer gradient magnitude is the tell: every combination that learns shows Grad@L1 around 10⁻³ to 10⁻², while the failures are orders of magnitude smaller (or NaN).

Modern Initialization Strategies

Xavier and Kaiming solved initialization for vanilla feedforward networks. But modern architectures — residual networks, transformers, networks with batch normalization — introduced new challenges and new solutions.

Orthogonal Initialization

Instead of controlling variance statistically, orthogonal initialization controls it exactly. Generate a random matrix, compute its SVD, and keep the orthogonal factor. An orthogonal matrix preserves the exact norm of any vector it multiplies — not just the expected variance, but the actual magnitude. For deep linear networks, this means the per-layer multiplication factor is exactly 1 for both activations and gradients, with zero variance around that value.

# Orthogonal initialization
def orthogonal_init(n_in, n_out):
    """Generate an orthogonal weight matrix via SVD."""
    M = np.random.randn(n_in, n_out)
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U if n_in >= n_out else Vt
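We can verify the norm-preservation claim directly. The helper is restated so the snippet runs standalone; the square case is shown, since for rectangular matrices only the smaller of the two Gram products is the identity:

```python
import numpy as np

def orthogonal_init(n_in, n_out):
    """Orthogonal factor of a random Gaussian matrix, as defined above."""
    M = np.random.randn(n_in, n_out)
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U if n_in >= n_out else Vt

np.random.seed(0)
W = orthogonal_init(256, 256)

# Exactly orthogonal: W.T @ W is the identity (up to float precision)
print(np.allclose(W.T @ W, np.eye(256)))

# Exact norm preservation for any specific vector — not just in expectation
v = np.random.randn(1, 256)
print(np.allclose(np.linalg.norm(v @ W), np.linalg.norm(v)))
```

Both checks print True: unlike Gaussian init, which preserves variance only on average, an orthogonal matrix preserves the norm of every individual vector.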

LSUV: Layer-Sequential Unit-Variance

LSUV takes a data-driven approach: initialize with orthogonal weights, then pass a mini-batch through the network. At each layer, measure the actual output variance and rescale the weights until it equals exactly 1.0. This works for any architecture without needing to derive activation-specific formulas — the data tells you the right scale.
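Here is a minimal sketch of LSUV for a fully connected ReLU stack — simplified from the original method, with the tolerance and iteration cap as arbitrary choices of mine:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def lsuv_init(dims, x_batch, act_fn=relu, tol=0.02, max_iters=10):
    """LSUV sketch: orthogonal start, then rescale each layer until its
    measured output variance on a real mini-batch is ~1.0.
    (Simplified from Mishkin & Matas, 2015; tol/max_iters are my choices.)"""
    Ws, h = [], x_batch
    for n_in, n_out in zip(dims[:-1], dims[1:]):
        M = np.random.randn(n_in, n_out)
        U, _, Vt = np.linalg.svd(M, full_matrices=False)
        W = U if n_in >= n_out else Vt            # orthogonal starting point
        for _ in range(max_iters):                # data-driven rescaling
            v = np.var(act_fn(h @ W))
            if abs(v - 1.0) < tol:
                break
            W = W / np.sqrt(v)
        h = act_fn(h @ W)                         # propagate for next layer
        Ws.append(W)
    return Ws

np.random.seed(0)
x = np.random.randn(128, 64)
Ws = lsuv_init([64, 256, 256, 256], x)

# Every layer's output variance should now be ~1.0 on this batch
h = x
for i, W in enumerate(Ws):
    h = relu(h @ W)
    print(f"layer {i}: output variance = {np.var(h):.3f}")
```

For ReLU the inner loop converges almost immediately: ReLU is positively homogeneous, so dividing W by the measured standard deviation lands the output variance at exactly 1 in one step.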

Residual Network Strategies

Residual connections changed the initialization game. In a ResNet, the output of each block is x + f(x), where f is the residual branch. If the branch is initialized so that Var(f(x)) = Var(x), the sum has twice the variance — and this doubles at every block, a 2^N growth that is gentler than the per-layer explosion we saw earlier but still exponential.

Two solutions emerged:

1. Scale down the residual branch. GPT-2 scales the weights of each block's output projection by 1/√N, where N is the number of residual layers, so the accumulated branch contributions stay O(1) at any depth.
2. Start each block as the identity. Zero-initialize the last layer of each residual branch (as in Fixup, or via the zero-γ trick on the block's final BatchNorm), so that x + f(x) = x at step zero and the branches fade in during training.
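A toy demonstration with purely linear residual branches shows the effect. The 1/√(2N) factor here follows the spirit of GPT-2's residual scaling; the exact constant is an illustrative choice, not a prescription:

```python
import numpy as np

np.random.seed(0)
dim, n_blocks = 256, 24
x = np.random.randn(64, dim)

results = {}
for label, branch_scale in [("unscaled branch", 1.0),
                            ("scaled by 1/sqrt(2N)", 1.0 / np.sqrt(2 * n_blocks))]:
    h = x.copy()
    for _ in range(n_blocks):
        # Linear residual branch; Var(W) = 1/dim makes Var(f(h)) ≈ Var(h)
        W = np.random.randn(dim, dim) / np.sqrt(dim) * branch_scale
        h = h + h @ W                     # block output: x + f(x)
    results[label] = np.var(h)
    print(f"{label:22s} | variance after {n_blocks} blocks: {results[label]:.3g}")
```

With unit-variance branches the signal variance roughly doubles per block (≈2²⁴ after 24 blocks), while the scaled version stays within a small constant factor of the input variance.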

Architecture-Specific Initialization

Different layer types need different strategies. Here's what modern frameworks use and why:

Layer type — typical init — why:

- Embeddings: N(0, 0.02). Small scale keeps initial logits near zero; 0.02 is a GPT convention.
- Attention Q/K/V projections: N(0, 1/√d_model). Keeps dot-product attention scores at O(1) before softmax.
- Attention output projection: N(0, 1/√(2 · n_layers · d)). GPT-2's residual scaling; dampens the per-block contribution.
- Conv2d: Kaiming (fan_out mode). fan_in = channels × kernel_h × kernel_w; fan_out targets backward stability.
- RNN hidden→hidden: orthogonal. The hidden matrix is applied T times; orthogonality prevents vanishing/exploding over time.
- BatchNorm / LayerNorm: γ = 1, β = 0. Norm layers reset variance to 1; the init is just the identity transform.

Let's build a diagnostic toolkit to verify whether an initialization is healthy. These three checks cover most failure modes:

import numpy as np

def diagnose_init(model_weights, x_sample, act_fn=lambda z: np.maximum(z, 0)):
    """Check activation variance, gradient magnitude, and variance ratio
    at each layer. Healthy networks show ~1.0 for all three metrics."""

    # Forward pass — track activations
    h = x_sample
    act_vars = [np.var(h)]
    pre_activations = []

    for W in model_weights:
        z = h @ W
        pre_activations.append(z)
        h = act_fn(z)
        act_vars.append(np.var(h))

    # Backward pass — track gradient magnitudes
    grad = np.ones_like(h)  # dL/d(output) = 1 for diagnostic
    grad_mags = []
    for i in range(len(model_weights)-1, -1, -1):
        grad = grad * (pre_activations[i] > 0).astype(float)  # ReLU deriv
        grad_mags.insert(0, np.sqrt(np.mean(grad**2)))
        grad = grad @ model_weights[i].T

    # Report
    print(f"{'Layer':>6} {'Act Var':>10} {'Grad Mag':>10} {'Var Ratio':>10}")
    print("-" * 40)
    for i in range(len(model_weights)):
        ratio = act_vars[i+1] / (act_vars[i] + 1e-30)
        print(f"{i:6d} {act_vars[i+1]:10.4f} {grad_mags[i]:10.2e} {ratio:10.4f}")

# Example: diagnose a 10-layer Kaiming-initialized network
dims = [256] * 11
Ws = [np.random.randn(dims[i], dims[i+1]) * np.sqrt(2/dims[i])
      for i in range(10)]
x = np.random.randn(64, 256)
diagnose_init(Ws, x)

A healthy output looks like this — variance ratio near 1.0 at every layer, gradient magnitudes staying within the same order of magnitude:

# Layer    Act Var   Grad Mag  Var Ratio
# ----------------------------------------
#      0     0.4967   5.21e-01     0.4977
#      1     0.5044   5.34e-01     1.0155
#      2     0.4826   5.18e-01     0.9568
#      3     0.5195   5.45e-01     1.0765
#      4     0.4753   5.02e-01     0.9149
#      5     0.5301   5.58e-01     1.1153
#      6     0.4689   4.95e-01     0.8845
#      7     0.5106   5.39e-01     1.0889
#      8     0.4958   5.19e-01     0.9710
#      9     0.5082   5.28e-01     1.0250

The variance ratios oscillate around 1.0 (not drifting systematically up or down), and gradient magnitudes stay in the 10⁻¹ range throughout. If you see variance ratios consistently below 0.5 or above 2.0, or gradients dropping by orders of magnitude, your initialization needs fixing.

Putting It All Together

After all the math, here's the practical decision tree for choosing initialization:

- ReLU-family activations → Kaiming, fan_in mode.
- tanh or sigmoid activations → Xavier.
- Recurrent hidden-to-hidden weights → orthogonal.
- Residual networks → Kaiming inside each branch, plus residual scaling (or zero-init of the branch's last layer).
- Transformers → the architecture-specific recipes above: small-scale embeddings, 1/√d_model attention projections, residual-scaled output projections.
- Unusual architecture with no formula available → LSUV, then verify with the diagnostic toolkit.
- Heavy use of BatchNorm/LayerNorm → initialization matters less, but a sane starting scale still speeds up early training.
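That decision tree can be condensed into a small dispatch function. This is a paraphrase of the rules in this article, not a library API — the names and branch conditions are mine:

```python
import numpy as np

def choose_init(n_in, n_out, activation="relu", layer_kind="dense"):
    """Pick a weight matrix by activation and layer kind (sketch)."""
    if layer_kind == "recurrent":
        # Hidden-to-hidden: orthogonal, applied T times without scale drift
        M = np.random.randn(n_in, n_out)
        U, _, Vt = np.linalg.svd(M, full_matrices=False)
        return U if n_in >= n_out else Vt
    if activation in ("relu", "leaky_relu", "gelu"):
        # Kaiming fan_in: Var(W) = 2/n_in compensates ReLU's halving
        return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
    if activation in ("tanh", "sigmoid"):
        # Xavier: Var(W) = 2/(n_in + n_out) balances forward and backward
        return np.random.randn(n_in, n_out) * np.sqrt(2.0 / (n_in + n_out))
    # Fallback: plain 1/n_in forward-stability rule
    return np.random.randn(n_in, n_out) * np.sqrt(1.0 / n_in)

W = choose_init(512, 512, activation="relu")
print(f"Var(W) = {np.var(W):.5f}")   # ≈ 2/512 ≈ 0.0039
```

In practice a framework's defaults already encode most of this table; the function is just the article's advice made executable.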

The deeper lesson: initialization is about controlling information flow at step zero. Your network is a long chain of matrix multiplications, and the scale of those matrices determines whether information (forward) and gradients (backward) can traverse the full depth. Xavier and Kaiming are both just different answers to the same question: what weight scale makes the multiplication factor per layer equal to 1?

This completes the trainability trilogy. Backpropagation explains how gradients flow. Activation functions determine where they flow well. Initialization ensures they start healthy. Together, they answer the foundational question of deep learning: why can a network with millions of parameters, trained on finite data, learn anything at all? Because careful engineering of the gradient signal — from its birth at initialization, through the activations that shape it, to the backpropagation algorithm that distributes it — makes learning not just possible but reliable.

Interactive: Variance Propagation Visualizer

Watch how activation variance changes as a signal passes through layers. Adjust weight variance, depth, and activation function to see when variance explodes, collapses, or stays stable. The dashed red line shows the theoretical prediction; blue bars show the empirical measurement.


Interactive: Initialization Race

Four networks train simultaneously on the same data, each with a different initialization. Watch the loss curves diverge. Below: gradient magnitude heatmaps — green means healthy, black means vanished, red means exploding.


Connections to the Elementary Series

References & Further Reading