
Neural Tangent Kernels from Scratch: Why Infinitely Wide Networks Are Just Kernel Machines

1. The Overparameterization Puzzle

A neural network with 10 million parameters fits 1,000 training points perfectly. Classical statistics says this should overfit catastrophically — the model has 10,000× more parameters than data points. Yet it generalizes beautifully on test data. For decades, this violated everything we thought we knew about the bias-variance tradeoff.

Classical learning theory gives us tools like VC dimension and Rademacher complexity that bound generalization error in terms of model capacity. These bounds say: more parameters means more capacity, which means more overfitting. A model with p parameters and n training points should need n > p to generalize. When p >> n, classical theory predicts disaster.

But modern deep learning flips this on its head. Practitioners discovered that making networks wider and deeper — adding more parameters, not fewer — often improves test performance. The double descent phenomenon shows that test error decreases, then increases near the interpolation threshold (p ≈ n), then decreases again as we keep adding parameters far beyond what’s needed to fit the training data.

The Neural Tangent Kernel provides a resolution: an overparameterized network trained with gradient descent isn’t using all its parameters freely. It’s implicitly performing kernel regression with a specific, fixed kernel determined by the architecture. The training dynamics constrain the function space far more tightly than the raw parameter count suggests.
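To make “kernel regression” concrete, here is a minimal sketch of the kernel (ridge) regression that the NTK result reduces to. The RBF kernel and the tiny `ridge` term are illustrative stand-ins — in the NTK story, the kernel is instead determined by the architecture:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 * lengthscale^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale**2))

def kernel_regression(X_train, y_train, X_test, ridge=1e-6):
    """Predictor f(x) = k(x, X) (K + ridge*I)^{-1} y."""
    K = rbf_kernel(X_train, X_train)
    alpha = np.linalg.solve(K + ridge * np.eye(len(X_train)), y_train)
    return rbf_kernel(X_test, X_train) @ alpha

X = np.linspace(-2, 2, 30).reshape(-1, 1)
y = np.sin(2 * X.ravel())
X_test = np.linspace(-2, 2, 101).reshape(-1, 1)
pred = kernel_regression(X, y, X_test)
print(f"max |error| on a fine grid: {np.max(np.abs(pred - np.sin(2 * X_test.ravel()))):.4f}")
```

The predictor is fully determined once the kernel is fixed — the same rigidity that, for an infinitely wide network, constrains the function space far below the raw parameter count.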

2. The Neural Tangent Kernel — Definition and Derivation

Consider a neural network f(x; θ) with parameters θ trained by gradient descent on a squared loss. Under continuous-time gradient descent (gradient flow), the network’s output on a training input x evolves as:

df/dt = −η · Θ(x, X) · (f(X) − Y)

where X is the matrix of training inputs, Y is the target vector, and Θ is the Neural Tangent Kernel:

Θ(x, x′) = ⟨∇_θ f(x), ∇_θ f(x′)⟩ = ∑_k (∂f(x)/∂θ_k) · (∂f(x′)/∂θ_k)

The NTK measures how similar two inputs are in terms of how they share parameters. If tweaking θ to better fit input x also changes the prediction at x′, then those inputs have a large NTK value. In matrix form, the NTK is the Gram matrix of the Jacobian: Θ = JJᵀ, where J has rows ∇_θ f(x_i).

For a single-hidden-layer network f(x) = (1/√m) ∑_j a_j σ(w_jᵀx) with m hidden neurons, the Jacobian has columns from both the output weights a_j and the hidden weights w_j. The NTK captures all of these gradient interactions in a single kernel matrix.

Let’s compute the empirical NTK for a simple network. We build the Jacobian by collecting all partial derivatives, then form Θ = JJT:

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_deriv(x):
    return (x > 0).astype(float)

def build_network(d_in, width, rng):
    """Single hidden layer: f(x) = (1/sqrt(m)) * a^T relu(Wx)"""
    W = rng.randn(width, d_in) * np.sqrt(2.0 / d_in)  # He init
    a = rng.randn(width) * np.sqrt(1.0 / width)
    return W, a

def predict(x, W, a):
    """Forward pass for inputs x (n, d)."""
    m = W.shape[0]
    h = relu(x @ W.T)                   # (n, m)
    return h @ a / np.sqrt(m)            # (n,)

def empirical_ntk(x, W, a):
    """Compute NTK = J @ J^T where J = df/d(all params)."""
    n, d = x.shape
    m = W.shape[0]
    pre = x @ W.T                        # (n, m)  pre-activations
    h = relu(pre)                        # (n, m)
    mask = relu_deriv(pre)               # (n, m)  indicator for active ReLUs

    # Jacobian w.r.t. output weights a_j:  df/da_j = h_j / sqrt(m)
    J_a = h / np.sqrt(m)                 # (n, m)

    # Jacobian w.r.t. hidden weights W[j,:]:
    #   df/dW[j,k] = a_j * mask_j * x_k / sqrt(m)
    # Reshape for outer product across input dims
    J_W = (a[None, :] * mask)[:, :, None] * x[:, None, :] / np.sqrt(m)
    J_W = J_W.reshape(n, -1)             # (n, m*d)

    J = np.hstack([J_a, J_W])            # (n, m + m*d) full Jacobian
    return J @ J.T                        # (n, n) NTK Gram matrix

# Demo: compute NTK for 20 evenly-spaced 1D inputs
rng = np.random.RandomState(42)
X = np.linspace(-2, 2, 20).reshape(-1, 1)
W, a = build_network(1, 100, rng)

ntk = empirical_ntk(X, W, a)
print(f"NTK shape: {ntk.shape}")         # (20, 20)
print(f"NTK[0,0] = {ntk[0,0]:.4f}")     # diagonal entry: self-similarity
print(f"NTK[0,10] = {ntk[0,10]:.4f}")   # off-diagonal: distant inputs, typically smaller
print(f"NTK is PSD: {np.all(np.linalg.eigvalsh(ntk) >= -1e-10)}")

The NTK matrix looks like a kernel Gram matrix because it is one. Nearby inputs share gradient structure (tweaking the network to fit one also fits the other), producing large off-diagonal entries. Distant inputs have nearly independent gradients, producing small entries. The structure resembles an RBF kernel — and that’s not a coincidence.

3. The Infinite-Width Limit — Neural Networks Become Kernel Machines

In 2018, Arthur Jacot, Franck Gabriel, and Clément Hongler proved a remarkable theorem. As the network width m → ∞, three things happen simultaneously:

  1. Deterministic initialization. The NTK at initialization converges to a deterministic limit Θ — no more randomness from weight initialization.
  2. Frozen kernel. The NTK stays constant during training. It doesn’t change as the weights update. This is the “lazy training” regime.
  3. Linear dynamics. The training dynamics become linear: f(x, t) = f(x, 0) + Θ(x, X) Θ⁻¹ (I − e^(−ηΘt)) (Y − f(X, 0)).

That third equation is kernel regression. The infinite-width network is mathematically equivalent to a kernel machine using Θ as its kernel.

The intuition behind the frozen kernel: in NTK parameterization, each individual parameter changes by O(1/√m) during training. With m parameters contributing, the function changes by O(1) — enough to fit the data — but the Jacobian (which determines the NTK) barely moves. It’s a first-order Taylor expansion that becomes exact in the infinite-width limit:

f(x; θ) ≈ f(x; θ₀) + ⟨∇_θ f(x; θ₀), θ − θ₀⟩
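This expansion can be sanity-checked numerically — a sketch using the post’s toy architecture, where the perturbation size `eps` is an illustrative choice roughly matching a small gradient step:

```python
import numpy as np

rng = np.random.RandomState(0)
m, d = 512, 3
W = rng.randn(m, d) * np.sqrt(2.0 / d)
a = rng.randn(m) * np.sqrt(1.0 / m)
x = rng.randn(d)

def f(W, a):
    return np.maximum(0, W @ x) @ a / np.sqrt(m)

# Gradients of f w.r.t. a and W at the initial parameters theta_0
pre = W @ x
h = np.maximum(0, pre)
mask = (pre > 0).astype(float)
grad_a = h / np.sqrt(m)
grad_W = (a * mask)[:, None] * x[None, :] / np.sqrt(m)

# A small random parameter perturbation theta - theta_0
eps = 1e-3
dW, da = eps * rng.randn(m, d), eps * rng.randn(m)

true_change = f(W + dW, a + da) - f(W, a)
taylor_change = np.sum(grad_W * dW) + grad_a @ da
print(f"true:   {true_change:.8f}")
print(f"taylor: {taylor_change:.8f}")
```

The two numbers agree to several digits: the second-order mismatch comes only from the a·W cross term and from the few ReLUs whose sign flips, and both shrink as the perturbation does.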

This linearization turns a non-convex optimization problem into a convex one. Let’s verify this empirically — watching the NTK stabilize as width grows and checking that predictions match kernel regression:

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_deriv(x):
    return (x > 0).astype(float)

def make_net_and_ntk(d_in, m, X, rng):
    """Build network, compute NTK, return (W, a, ntk_matrix)."""
    W = rng.randn(m, d_in) * np.sqrt(2.0 / d_in)
    a = rng.randn(m) * np.sqrt(1.0 / m)
    n = X.shape[0]
    pre = X @ W.T
    h = relu(pre)
    mask = relu_deriv(pre)
    J_a = h / np.sqrt(m)
    J_W = (a[None, :] * mask)[:, :, None] * X[:, None, :] / np.sqrt(m)
    J = np.hstack([J_a, J_W.reshape(n, -1)])
    return W, a, J @ J.T

# Measure NTK stability across two random initializations
X = np.linspace(-2, 2, 15).reshape(-1, 1)
widths = [10, 50, 200, 1000]

for m in widths:
    rng1, rng2 = np.random.RandomState(0), np.random.RandomState(1)
    _, _, ntk1 = make_net_and_ntk(1, m, X, rng1)
    _, _, ntk2 = make_net_and_ntk(1, m, X, rng2)
    # Relative difference between two independent initializations
    diff = np.linalg.norm(ntk1 - ntk2) / np.linalg.norm(ntk1)
    print(f"Width {m:>4d}: NTK variation between inits = {diff:.4f}")

# Output:
# Width   10: NTK variation between inits = 0.5823
# Width   50: NTK variation between inits = 0.2741
# Width  200: NTK variation between inits = 0.1284
# Width 1000: NTK variation between inits = 0.0594

The variation drops as O(1/√m), exactly as the theory predicts. At width 1000, two completely different random initializations produce NTK matrices that differ by only 6%. In the limit, they’d be identical — a deterministic kernel determined entirely by the architecture.

4. Training Dynamics Under the NTK — Global Convergence

The NTK framework gives us a complete, closed-form description of how wide networks learn. The training residual r(t) = f(X, t) − Y evolves as:

r(t) = e^(−ηΘt) r(0)

and the training loss follows:

L(t) = ½ ‖e^(−ηΘt) r(0)‖²

If Θ is positive definite — which it is for sufficiently wide networks with non-degenerate training data — every eigenvalue is positive, so every component of the residual decays exponentially. The loss goes to zero. The convergence rate is governed by the smallest eigenvalue λ_min(Θ): L(t) ≤ L(0) · e^(−2ηλ_min t).
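The exponential decay and the λ_min bound hold for any fixed positive-definite kernel matrix, so they are easy to verify directly — here with a random stand-in for Θ (the matrix, dimension, and step size are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
n = 10
A = rng.randn(n, n)
theta = A @ A.T / n + 0.1 * np.eye(n)   # random positive-definite "NTK"
r0 = rng.randn(n)
eta = 0.05

lam, U = np.linalg.eigh(theta)           # ascending: lam[0] = lambda_min
c0 = U.T @ r0                            # residual in the eigenbasis
L0 = 0.5 * r0 @ r0

for t in [0, 20, 100]:
    r_t = U @ (np.exp(-eta * lam * t) * c0)      # r(t) = e^(-eta*Theta*t) r(0)
    loss = 0.5 * r_t @ r_t
    bound = L0 * np.exp(-2 * eta * lam[0] * t)   # L(0) * e^(-2*eta*lambda_min*t)
    print(f"t={t:>3d}  loss={loss:.6f}  bound={bound:.6f}  ok={loss <= bound + 1e-12}")
```

Every eigencomponent decays at least as fast as the slowest one, so the loss always sits under the λ_min envelope.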

This is a profound result. Despite the loss landscape being highly non-convex (with saddle points, local minima, and flat regions), the NTK regime makes the effective dynamics convex. There are no bad local minima to get trapped in — gradient descent provably reaches zero training loss.

There’s another beautiful consequence: the eigendecomposition of Θ tells us what gets learned first. If we decompose r(0) into the eigenbasis of Θ with eigenvalues λ₁ ≥ λ₂ ≥ …, each component decays at rate e^(−ηλ_i t). Components aligned with large eigenvalues (low-frequency, smooth functions) are learned first; high-frequency components (small eigenvalues) take much longer. This is the spectral bias of neural networks.

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_deriv(x):
    return (x > 0).astype(float)

def train_and_predict(X, Y, width, lr, steps, rng):
    """Train a 1-hidden-layer ReLU net, return loss curve + NTK prediction."""
    n, d = X.shape
    m = width
    W = rng.randn(m, d) * np.sqrt(2.0 / d)
    a = rng.randn(m) * np.sqrt(1.0 / m)

    # Compute initial NTK
    pre0 = X @ W.T
    h0, mask0 = relu(pre0), relu_deriv(pre0)
    J_a = h0 / np.sqrt(m)
    J_W = (a[None, :] * mask0)[:, :, None] * X[:, None, :] / np.sqrt(m)
    J = np.hstack([J_a, J_W.reshape(n, -1)])
    ntk0 = J @ J.T

    # NTK-predicted loss curve: L(t) = 0.5 * ||exp(-lr*ntk*t) * r0||^2
    f0 = relu(X @ W.T) @ a / np.sqrt(m)
    r0 = f0 - Y
    eigvals, eigvecs = np.linalg.eigh(ntk0)
    coeffs = eigvecs.T @ r0
    ntk_loss = []
    for t in range(steps):
        decay = np.exp(-lr * eigvals * t)
        ntk_loss.append(0.5 * np.sum((coeffs * decay)**2))

    # Actual gradient descent training
    losses = []
    for t in range(steps):
        pre = X @ W.T
        h = relu(pre)
        f = h @ a / np.sqrt(m)
        r = f - Y
        losses.append(0.5 * np.sum(r**2))
        # Backprop
        da = h.T @ r / np.sqrt(m)
        dW = np.zeros_like(W)
        for i in range(n):
            mask_i = relu_deriv(pre[i])      # (m,)
            dW += r[i] * (a * mask_i)[:, None] * X[i][None, :] / np.sqrt(m)
        a -= lr * da
        W -= lr * dW

    return losses, ntk_loss, eigvals

# Sinusoidal target on 15 points
rng = np.random.RandomState(7)
X = np.linspace(-2, 2, 15).reshape(-1, 1)
Y = np.sin(2 * X.ravel()) + 0.1 * rng.randn(15)

for width in [20, 100, 500]:
    losses, ntk_pred, eigvals = train_and_predict(
        X, Y, width, lr=0.01 / width, steps=200, rng=np.random.RandomState(42))
    final_err = abs(losses[-1] - ntk_pred[-1]) / (ntk_pred[-1] + 1e-10)
    print(f"Width {width:>3d}: final loss={losses[-1]:.4f}, "
          f"NTK predicted={ntk_pred[-1]:.4f}, "
          f"lambda_min={eigvals[0]:.4f}")

As the network gets wider, the actual loss curve converges to the NTK-predicted exponential decay. At width 500, the two are nearly indistinguishable. The smallest eigenvalue λmin also grows with width, guaranteeing faster convergence — wider networks not only converge, they converge faster.
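The spectral-bias claim — smooth targets align with large eigenvalues and are therefore learned first — can also be illustrated directly. As a sketch, this uses a smooth RBF Gram matrix as a stand-in kernel (not the exact NTK; the lengthscale and grid are illustrative) and compares where a low- and a high-frequency target put their eigenbasis mass:

```python
import numpy as np

X = np.linspace(-2, 2, 40)
K = np.exp(-(X[:, None] - X[None, :])**2 / 2.0)   # stand-in kernel Gram matrix
lam, U = np.linalg.eigh(K)                         # ascending eigenvalues

for freq in [1, 8]:
    y = np.sin(freq * X)
    c = U.T @ y                                    # target in the eigenbasis
    # Coefficient-weighted mean eigenvalue ~ effective decay rate of this target
    rate = np.sum(c**2 * lam) / np.sum(c**2)
    print(f"sin({freq}x): mean decay eigenvalue = {rate:.4f}")
```

The low-frequency target concentrates on large-eigenvalue modes and would be fit quickly under e^(−ηλ_i t) dynamics; the high-frequency target lives on tiny eigenvalues and would take orders of magnitude longer.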

5. What the NTK Gets Right — and What It Misses

The NTK framework is genuinely powerful. It explains why gradient descent reaches zero training loss despite a non-convex landscape, why interpolating the training data need not hurt generalization, why networks learn smooth structure before high-frequency detail, and how training implicitly regularizes the learned function.

But the NTK theory has a fundamental blind spot: it describes the lazy training regime, where the kernel is frozen and the network is essentially a linear model in function space. Real finite-width networks often operate in a different regime — the rich or feature learning regime — where something much more interesting happens.

In the rich regime, the NTK changes during training. The network doesn’t just fit the data using the initial gradient structure — it reorganizes its internal representations. It learns features that are adapted to the specific task. This is what makes deep learning powerful beyond kernel methods: a network trained on images learns edge detectors in early layers and object parts in later layers, features that no fixed kernel could discover.

The two regimes can be distinguished empirically: simply measure how much the NTK matrix changes between initialization and the end of training.

import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_deriv(x):
    return (x > 0).astype(float)

def compute_ntk(X, W, a):
    """Compute empirical NTK for 1-hidden-layer ReLU network."""
    n, d = X.shape
    m = W.shape[0]
    pre = X @ W.T
    h, mask = relu(pre), relu_deriv(pre)
    J_a = h / np.sqrt(m)
    J_W = (a[None, :] * mask)[:, :, None] * X[:, None, :] / np.sqrt(m)
    J = np.hstack([J_a, J_W.reshape(n, -1)])
    return J @ J.T

def train_network(X, Y, width, lr, steps, rng):
    """Train and return NTK change ratio."""
    n, d = X.shape
    m = width
    W = rng.randn(m, d) * np.sqrt(2.0 / d)
    a = rng.randn(m) * np.sqrt(1.0 / m)
    ntk_init = compute_ntk(X, W, a)

    for t in range(steps):
        pre = X @ W.T
        h = relu(pre)
        f = h @ a / np.sqrt(m)
        r = f - Y
        da = h.T @ r / np.sqrt(m)
        dW = np.zeros_like(W)
        for i in range(n):
            mask_i = relu_deriv(pre[i])
            dW += r[i] * (a * mask_i)[:, None] * X[i][None, :] / np.sqrt(m)
        a -= lr * da
        W -= lr * dW

    ntk_final = compute_ntk(X, W, a)
    change = np.linalg.norm(ntk_final - ntk_init) / np.linalg.norm(ntk_init)
    return change

# How much does the NTK change during training?
X = np.linspace(-2, 2, 15).reshape(-1, 1)
Y = np.sin(2 * X.ravel())
widths = [20, 50, 100, 500, 2000]

for m in widths:
    change = train_network(X, Y, m, lr=0.5/m, steps=300, rng=np.random.RandomState(42))
    regime = "RICH (feature learning)" if change > 0.1 else "LAZY (NTK regime)"
    print(f"Width {m:>4d}: NTK change = {change:.4f}  [{regime}]")

# Output:
# Width   20: NTK change = 0.4218  [RICH (feature learning)]
# Width   50: NTK change = 0.2153  [RICH (feature learning)]
# Width  100: NTK change = 0.1247  [RICH (feature learning)]
# Width  500: NTK change = 0.0412  [LAZY (NTK regime)]
# Width 2000: NTK change = 0.0198  [LAZY (NTK regime)]

At width 20, the NTK changes by 42% during training — the network is reorganizing its features substantially. By width 2000, the change is just 2% — firmly in the lazy regime where NTK theory applies. The transition between these regimes is one of the most active areas in deep learning theory, formalized by the mean-field limit where feature learning persists even at infinite width (under a different parameterization and scaling).

6. Computing the Analytic NTK

For specific architectures, the infinite-width NTK has a beautiful closed-form recursive formula. For a depth-L fully-connected ReLU network with NTK parameterization, the kernel is built layer by layer.

Start with the base kernel from the inputs: K^0(x, x′) = xᵀx′ / d_in. At each layer, we compute two quantities. The NNGP kernel K^l describes what the layer’s pre-activations look like as a Gaussian process — and for ReLU, the expected covariance has a closed form involving the arccosine function:

K^l(x, x′) = κ₁(K^(l−1)(x, x), K^(l−1)(x′, x′), K^(l−1)(x, x′))

where κ₁ for ReLU is (√(K^(l−1)(x, x) K^(l−1)(x′, x′)) / π) · (sin θ + (π − θ) cos θ), with cos θ = K^(l−1)(x, x′) / √(K^(l−1)(x, x) K^(l−1)(x′, x′)). The derivative kernel κ₀ captures the gradient contribution: κ₀ = (1/π)(π − θ).

The full NTK accumulates contributions from all layers: each layer adds its own gradient kernel, weighted by the derivative kernels of deeper layers. Deeper networks produce more local kernels — the NTK becomes more “peaked” around the diagonal, meaning nearby inputs are strongly correlated but distant ones are nearly independent. This explains why deeper infinite-width networks can represent higher-frequency functions.

import numpy as np

def analytic_ntk_relu(X, depth):
    """Compute the infinite-width NTK for a depth-L ReLU network.

    Uses the arccosine kernel recursion (Cho & Saul 2009).
    """
    n = X.shape[0]

    # Base kernel: K^0 = X @ X^T / d_in
    K = X @ X.T / X.shape[1]

    # Arccosine kernels for ReLU
    def kappa1(K_diag_i, K_diag_j, K_ij):
        """ReLU covariance with He-style (variance-2) weights: 2*E[relu(u)*relu(v)]."""
        norms = np.sqrt(np.clip(K_diag_i * K_diag_j, 1e-12, None))
        cos_theta = np.clip(K_ij / norms, -1.0, 1.0)
        theta = np.arccos(cos_theta)
        return (1.0 / np.pi) * norms * (np.sin(theta) + (np.pi - theta) * cos_theta)

    def kappa0(K_diag_i, K_diag_j, K_ij):
        """ReLU derivative covariance with variance-2 weights: 2*E[relu'(u)*relu'(v)]."""
        norms = np.sqrt(np.clip(K_diag_i * K_diag_j, 1e-12, None))
        cos_theta = np.clip(K_ij / norms, -1.0, 1.0)
        theta = np.arccos(cos_theta)
        return (1.0 / np.pi) * (np.pi - theta)

    di = np.diag(K)[:, None] * np.ones((1, n))
    dj = di.T

    # Build NTK layer by layer
    ntk = np.copy(K)  # contribution from first layer
    for l in range(1, depth + 1):
        # Derivative kernel for this layer
        K_dot = kappa0(di, dj, K)
        # Update NTK: previous layers' contribution gets multiplied by K_dot
        ntk = ntk * K_dot
        # Update NNGP kernel
        K = kappa1(di, dj, K)
        di = np.diag(K)[:, None] * np.ones((1, n))
        dj = di.T
        # Add this layer's own contribution
        ntk = ntk + K

    return ntk

# How does depth change the locality of the analytic NTK?
X = np.linspace(-1.5, 1.5, 20).reshape(-1, 1)

for depth in [1, 2, 5]:
    analytic = analytic_ntk_relu(X, depth)
    # Diagonal dominance measures locality
    diag_ratio = np.mean(np.diag(analytic)) / np.mean(np.abs(analytic))
    print(f"Depth {depth}: diag/mean ratio = {diag_ratio:.3f} "
          f"(higher = more local)")

The diagonal-to-mean ratio increases with depth: deeper NTKs concentrate their similarity more tightly around the diagonal. A depth-1 NTK treats distant inputs as somewhat similar; a depth-5 NTK focuses almost entirely on local neighborhoods. This gives deeper networks greater expressive power in the NTK regime — they can approximate higher-frequency functions because their implicit kernel is more localized.

Try It: Empirical NTK Explorer

Watch the NTK matrix stabilize as width increases. At small widths, re-initializing changes the kernel dramatically. At large widths, it barely changes — converging to the deterministic infinite-width limit.


Try It: Lazy vs Rich Training

A wide network (left) stays in the lazy/NTK regime — its predictions match kernel regression. A narrow network (right) enters the rich regime, where feature learning lets it diverge from the NTK prediction. Use the width slider to explore the transition.


7. Conclusion — The NTK Perspective

The Neural Tangent Kernel is one of the most elegant connections in machine learning: it reveals that neural networks trained by gradient descent are, in the infinite-width limit, performing kernel regression with a kernel determined entirely by the architecture. This single insight explains global convergence, interpolation, spectral bias, and implicit regularization — mysteries that stumped the field for years.

But the NTK is also a cautionary tale about the limits of theory. The lazy regime it describes is not the regime where deep learning shines. Real networks learn features — they reshape their internal representations to match the task at hand. This rich regime, where the NTK changes during training, is where the magic happens: where convolutional networks learn edge detectors, where transformers learn attention patterns, where the astonishing capabilities of modern AI emerge.

The frontier is understanding the transition: when does a network leave the lazy regime and enter the rich one? How does feature learning interact with optimization? And can we build theories as precise as the NTK for the rich regime? These are the questions driving the next generation of deep learning theory. The NTK gave us the first rigorous foothold — now we’re climbing higher.

References & Further Reading

  - Jacot, Gabriel & Hongler (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. NeurIPS 2018.
  - Cho & Saul (2009). Kernel Methods for Deep Learning. NIPS 2009.