
Kernel Methods from Scratch: The Trick That Lets Linear Models Learn Nonlinear Patterns

1. The Feature Map Problem — When Linear Isn't Enough

Your data isn't linearly separable. The classic fix: map it to a higher-dimensional space where it is. A 2D circle becomes a 3D paraboloid that a hyperplane can slice cleanly. But what if the right feature space is 1,000-dimensional? Or infinite-dimensional? Computing those features explicitly would take forever — or be literally impossible. The kernel trick says: you don't have to.

Consider an XOR-like dataset in 2D. Four clusters sit at the corners of a square, with labels determined by which quadrant they occupy: top-right and bottom-left are class +1, top-left and bottom-right are class −1. No straight line can separate these classes. But if we add a third feature — the product x1·x2 of the two coordinates — the data lifts into 3D, where a flat plane cleanly divides the classes.

The problem? This gets expensive fast. A polynomial feature map of degree d on n input features produces C(n+d, d) output features. Degree 2 on 10 features gives 66 dimensions. Degree 5 gives 3,003. Degree 10 gives 184,756. And the RBF kernel — the most popular kernel in practice — implicitly maps to an infinite-dimensional space. You can't compute infinite features. Or can you?

import numpy as np
from itertools import combinations_with_replacement
from collections import Counter
from math import comb, factorial

# Generate XOR-like data: label = sign(x1 * x2)
rng = np.random.RandomState(42)
X = rng.randn(200, 2)
y = np.sign(X[:, 0] * X[:, 1])  # +1 in Q1/Q3, -1 in Q2/Q4

# Explicit degree-2 polynomial feature map (degrees 0 through 2):
# [x1, x2] -> [1, sqrt(2)*x1, sqrt(2)*x2, x1^2, sqrt(2)*x1*x2, x2^2]
def poly_features(X, degree=2):
    n, d = X.shape
    features = []
    for deg in range(degree + 1):
        for combo in combinations_with_replacement(range(d), deg):
            col = np.ones(n)
            for idx in combo:
                col *= X[:, idx]
            # Multinomial coefficient for proper inner product
            counts = Counter(combo)
            coeff = np.sqrt(factorial(deg) /
                    np.prod([factorial(c) for c in counts.values()]))
            features.append(col * coeff)
    return np.column_stack(features)

phi = poly_features(X, degree=2)  # 200 x 6

# Linear classifier in lifted space: ridge regression
lam = 0.01
w = np.linalg.solve(phi.T @ phi + lam * np.eye(phi.shape[1]), phi.T @ y)
preds = np.sign(phi @ w)
print(f"Accuracy in lifted space: {np.mean(preds == y):.1%}")  # ~100%

# Dimensionality explosion: 10 input features at increasing degrees
for deg in [2, 3, 5, 10]:
    print(f"  Degree {deg}: {comb(10 + deg, deg):,} dimensions")
print(f"  Degree 10 on 50 features: {comb(50 + 10, 10):,} dimensions")

That last line is the punchline. Fifty input features at degree 10 map to over 75 billion dimensions. No computer is fitting a linear model in a space that large. Yet the kernel trick will let us operate in that space — and even infinite-dimensional ones — in time linear in the input dimension per kernel evaluation.

2. The Kernel Trick — Dot Products Without Feature Maps

Here's the key insight that makes kernel methods work: many ML algorithms only access data through pairwise dot products ⟨φ(xi), φ(xj)⟩. Support vector machines optimize over dot products in the dual formulation. Ridge regression can be rewritten in terms of the Gram matrix. PCA operates on the covariance matrix XᵀX, which is built from dot products.

If we can compute the dot product in feature space directly — without ever computing the features — we get the benefits of the high-dimensional space for free. A kernel function does exactly this:

k(x, y) = ⟨φ(x), φ(y)⟩

The polynomial kernel k(x, y) = (xᵀy + 1)^d computes the dot product in the degree-d polynomial feature space. Let's prove it for degree 2 with 2D inputs. Expanding (xᵀy + 1)² algebraically:

(x1y1 + x2y2 + 1)² = x1²y1² + x2²y2² + 2x1x2y1y2 + 2x1y1 + 2x2y2 + 1

This is exactly φ(x)ᵀφ(y) where φ(x) = [x1², x2², √2·x1x2, √2·x1, √2·x2, 1]. The kernel computes a 6-dimensional dot product using a single scalar computation: one dot product and one square. The RBF kernel k(x, y) = exp(−‖x−y‖²/2σ²) is even more remarkable: its Taylor expansion has infinitely many terms, so it implicitly maps to an infinite-dimensional feature space. Yet evaluating it takes time linear in the input dimension.

import numpy as np
import time

# Two random vectors
rng = np.random.RandomState(0)
x = rng.randn(500)
y = rng.randn(500)

# Method 1: Explicit degree-2 features, then dot product
from itertools import combinations_with_replacement
from collections import Counter
from math import factorial

def explicit_poly_dot(x, y, degree=2):
    """Compute dot product in polynomial feature space explicitly."""
    features_x, features_y = [], []
    d = len(x)
    for deg in range(degree + 1):
        for combo in combinations_with_replacement(range(d), deg):
            val_x = val_y = 1.0
            for idx in combo:
                val_x *= x[idx]
                val_y *= y[idx]
            counts = Counter(combo)
            coeff = factorial(deg)
            for c in counts.values():
                coeff //= factorial(c)
            features_x.append(val_x * np.sqrt(coeff))
            features_y.append(val_y * np.sqrt(coeff))
    return np.dot(features_x, features_y)

# Method 2: Polynomial kernel (one line!)
def poly_kernel(x, y, degree=2, c=1.0):
    return (np.dot(x, y) + c) ** degree

# Verify they match
explicit = explicit_poly_dot(x, y, degree=2)
kernel = poly_kernel(x, y, degree=2)
print(f"Explicit:  {explicit:.6f}")
print(f"Kernel:    {kernel:.6f}")
print(f"Match:     {abs(explicit - kernel) < 1e-6}")

# Timing comparison
t0 = time.time()
for _ in range(10):
    explicit_poly_dot(x, y, degree=2)
t_explicit = (time.time() - t0) / 10

t0 = time.time()
for _ in range(10000):
    poly_kernel(x, y, degree=2)
t_kernel = (time.time() - t0) / 10000

print(f"\nExplicit: {t_explicit*1000:.2f} ms | Kernel: {t_kernel*1000:.4f} ms")
print(f"Kernel is {t_explicit/t_kernel:.0f}x faster")

The kernel function produces the identical result thousands of times faster. This is the kernel trick in action: we get exact equivalence with the high-dimensional feature space at a fraction of the computational cost. And for the RBF kernel, we get a result that would be impossible to compute explicitly — you can't enumerate infinite features.

3. Mercer's Theorem — When Is a Function a Valid Kernel?

Not every function of two arguments is a valid kernel. If k(x, y) doesn't correspond to a dot product in some feature space, then algorithms built on top of it lose their theoretical guarantees — optimization problems may become non-convex, and solutions may not exist.

Mercer's theorem gives us a practical test. Build the Gram matrix K where Kij = k(xi, xj) for any finite set of data points. If K is positive semi-definite (PSD) for every possible set of points, then k is a valid kernel. PSD means all eigenvalues of K are non-negative — equivalently, zTKz ≥ 0 for any vector z.

The intuition: PSD means the "distances" implied by the kernel are geometrically consistent. In a real feature space, squared distances are always non-negative. If your Gram matrix has negative eigenvalues, it's describing a geometry that can't exist in any real vector space — like a triangle whose longest side is longer than the sum of the other two.

The representer theorem is the practical payoff: for any kernel-based learning problem with a regularizer, the optimal solution is a linear combination of kernel evaluations at training points. This means we never need to work with the (possibly infinite-dimensional) feature vectors explicitly — the solution lives in a space of dimension equal to the number of training points.
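A quick numerical check of that claim, using the linear kernel so the explicit feature map (the identity) is available for comparison; the data and variable names here are made up for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(30, 4)   # 30 training points, 4 features
y = rng.randn(30)
lam = 0.5

# Primal ridge: weights live in feature space
w = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# Dual (kernelized) ridge with the linear kernel K = X X^T:
# the solution is a combination of kernel evaluations at training points
K = X @ X.T
alpha = np.linalg.solve(K + lam * np.eye(30), y)

# Both give identical predictions on new data
X_new = rng.randn(5, 4)
primal_preds = X_new @ w
dual_preds = (X_new @ X.T) @ alpha   # sum_i alpha_i * k(x_i, x*)
print(np.allclose(primal_preds, dual_preds))  # True
```

The equality follows from the matrix identity (XᵀX + λI)⁻¹Xᵀ = Xᵀ(XXᵀ + λI)⁻¹: the primal weight vector is always a linear combination of the training points, exactly as the theorem promises.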

import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(50, 2)

def gram_matrix(X, kernel_fn):
    n = X.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel_fn(X[i], X[j])
    return K

# Valid kernels — all eigenvalues >= 0
def k_linear(x, y): return np.dot(x, y)
def k_poly(x, y): return (np.dot(x, y) + 1) ** 3
def k_rbf(x, y): return np.exp(-np.sum((x - y)**2) / 2.0)

# Invalid "kernel" — not PSD for all point sets
def k_bad(x, y): return np.sin(np.dot(x, y))

for name, kfn in [("Linear", k_linear), ("Poly(3)", k_poly),
                   ("RBF", k_rbf), ("sin(x^T y)", k_bad)]:
    K = gram_matrix(X, kfn)
    eigvals = np.linalg.eigvalsh(K)
    min_eig = eigvals.min()
    neg_count = np.sum(eigvals < -1e-10)
    status = "VALID (PSD)" if neg_count == 0 else f"INVALID ({neg_count} negative eigenvalues)"
    print(f"{name:<10} min eigenvalue: {min_eig:+.4f}  {status}")

The valid kernels (linear, polynomial, RBF) always produce Gram matrices with non-negative eigenvalues. The sin function sometimes produces negative eigenvalues — it doesn't correspond to any dot product in any feature space, so using it in a kernel algorithm could produce nonsensical results.

4. A Kernel Zoo — Designing Kernels for Different Data

Choosing the right kernel is like choosing the right lens to look at your data. Each kernel embodies different assumptions about what "similarity" means, and matching the kernel to your data's structure is often more important than tuning the algorithm itself.

import numpy as np

def k_linear(x, y):
    return np.dot(x, y)

def k_poly(x, y, degree=3, c=1.0):
    return (np.dot(x, y) + c) ** degree

def k_rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y)**2) / (2 * sigma**2))

def k_laplacian(x, y, sigma=1.0):
    return np.exp(-np.sum(np.abs(x - y)) / sigma)

def k_matern32(x, y, sigma=1.0):
    r = np.sqrt(np.sum((x - y)**2)) / sigma
    return (1 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r)

def k_cosine(x, y):
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    if nx < 1e-12 or ny < 1e-12:
        return 0.0
    return np.dot(x, y) / (nx * ny)

# Visualize: fix reference point, compute kernel over a grid
x0 = np.array([1.0, 0.0])
grid = np.linspace(-3, 3, 100)
xx, yy = np.meshgrid(grid, grid)

kernels = {"Linear": k_linear, "Poly(3)": k_poly, "RBF": k_rbf,
           "Laplacian": k_laplacian, "Matern 3/2": k_matern32,
           "Cosine": k_cosine}

for name, kfn in kernels.items():
    heatmap = np.zeros_like(xx)
    for i in range(100):
        for j in range(100):
            heatmap[i, j] = kfn(x0, np.array([xx[i, j], yy[i, j]]))
    print(f"{name:<12} range: [{heatmap.min():.3f}, {heatmap.max():.3f}]")

Each kernel's heatmap reveals its "similarity lens." The linear kernel produces a gradient (direction-sensitive). The RBF creates a bell curve centered on the reference point (distance-sensitive, isotropic). The Laplacian creates a sharper, tent-like peak. The polynomial kernel grows without bound with the magnitude of the input and can change sign, making it sensitive to both direction and scale. Choosing the right kernel means choosing the right notion of similarity for your problem.

5. Kernel Composition — Building Complex Kernels from Simple Ones

One of the most powerful properties of kernels is that they're closed under composition. If k1 and k2 are valid kernels, then so are:

the sum k1(x, y) + k2(x, y)
the product k1(x, y) · k2(x, y)
any positive scaling c · k1(x, y), with c > 0
f(x) · k1(x, y) · f(y) for any real-valued function f
the exponential exp(k1(x, y))

This kernel algebra lets you build custom similarity measures for complex data. Working with spatio-temporal data? Use kspatial + ktemporal. Have both text and numeric features? Multiply a string kernel on text with an RBF on numbers. Each combination is provably a valid kernel — no need to verify PSD from scratch.
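A quick empirical check of the sum and product rules (the `gram` helper and random data are illustrative):

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(40, 3)

def gram(X, kfn):
    return np.array([[kfn(a, b) for b in X] for a in X])

K1 = gram(X, lambda a, b: np.exp(-np.sum((a - b)**2) / 2.0))  # RBF
K2 = gram(X, lambda a, b: np.dot(a, b))                        # linear

# The Gram matrix of the sum kernel is K1 + K2; for the product
# kernel it's the elementwise (Hadamard) product K1 * K2.
for name, K in [("sum", K1 + K2), ("product", K1 * K2)]:
    min_eig = np.linalg.eigvalsh(K).min()
    print(f"{name:<8} min eigenvalue: {min_eig:+.2e}")  # non-negative up to roundoff
```

The product case is the Schur product theorem in action: the elementwise product of two PSD matrices is PSD.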

Automatic Relevance Determination (ARD) is a powerful application: instead of a single bandwidth σ, use a per-feature bandwidth σi. The kernel k(x, y) = exp(−Σi (xi−yi)²/2σi²) automatically downweights irrelevant features by learning large σi values for them. This is kernel-based feature selection.
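A minimal sketch of the ARD idea with hand-picked bandwidths (in practice the σi are learned, e.g. by maximizing a Gaussian process marginal likelihood; the values and names below are chosen by hand for illustration):

```python
import numpy as np

def k_ard(x, y, sigmas):
    """RBF with a per-feature bandwidth: a large sigma_i makes
    feature i nearly irrelevant to the similarity."""
    return np.exp(-np.sum((x - y)**2 / (2 * np.array(sigmas)**2)))

x = np.array([1.0, 2.0])
x_bump0 = np.array([2.0, 2.0])   # move one unit along feature 0
x_bump1 = np.array([1.0, 3.0])   # move one unit along feature 1

# Feature 1 gets a huge bandwidth -> effectively ignored
sigmas = [1.0, 100.0]
print(k_ard(x, x_bump0, sigmas))  # drops to exp(-0.5) ~ 0.61
print(k_ard(x, x_bump1, sigmas))  # barely changes: ~ 0.99995
```

The same unit-length perturbation changes the similarity drastically along the relevant feature and almost not at all along the downweighted one.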

import numpy as np

rng = np.random.RandomState(42)

# Dataset: class depends on BOTH a radial pattern (x1, x2)
# and a linear trend (x3)
n = 200
X = rng.randn(n, 3)
# Radial boundary on first two dims + linear on third
y = np.sign((X[:, 0]**2 + X[:, 1]**2 - 1.0) + 0.8 * X[:, 2])

def gram(X, kernel_fn):
    n = X.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kernel_fn(X[i], X[j])
    return K

def kernel_ridge_accuracy(K, y, lam=0.1):
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    preds = np.sign(K @ alpha)
    return np.mean(preds == y)

# RBF on spatial features only
K_spatial = gram(X[:, :2], lambda a, b: np.exp(-np.sum((a-b)**2) / 2.0))
# Linear on trend feature only
K_linear = gram(X[:, 2:], lambda a, b: np.dot(a, b))
# Composed: sum of both kernels
K_composed = K_spatial + K_linear

print(f"RBF (spatial only):  {kernel_ridge_accuracy(K_spatial, y):.1%}")
print(f"Linear (trend only): {kernel_ridge_accuracy(K_linear, y):.1%}")
print(f"Composed (sum):      {kernel_ridge_accuracy(K_composed, y):.1%}")

# Kernel alignment: how well does K match the ideal Gram matrix?
def kernel_alignment(K, y):
    y_outer = np.outer(y, y)  # ideal: K_ij = yi * yj
    num = np.sum(K * y_outer)
    denom = np.sqrt(np.sum(K * K) * np.sum(y_outer * y_outer))
    return num / denom

for name, K in [("Spatial", K_spatial), ("Linear", K_linear),
                ("Composed", K_composed)]:
    print(f"{name:<12} alignment: {kernel_alignment(K, y):.3f}")

The composed kernel outperforms either component alone because the classification boundary depends on both spatial structure (a circle in x1, x2) and a linear trend (in x3). Neither the RBF kernel nor the linear kernel can capture both patterns, but their sum can. The kernel alignment score confirms this: the composed kernel's Gram matrix most closely resembles the ideal label-based similarity matrix.

6. Kernelized Algorithms — Ridge, PCA, and K-Means Without Features

The kernel trick isn't specific to SVMs. Any algorithm that accesses data only through dot products can be "kernelized" by replacing ⟨xi, xj⟩ with k(xi, xj). Here are three examples that demonstrate the generality of the idea.

Kernel Ridge Regression

Standard ridge regression finds weights w = (XᵀX + λI)⁻¹Xᵀy. The kernelized version works with the Gram matrix instead: α = (K + λI)⁻¹y, and predictions for a new point x* are f(x*) = Σi αi k(xi, x*). This is O(n³) in training samples rather than O(d³) in features — better when the feature space is huge (or infinite).

Kernel PCA

Standard PCA eigendecomposes the covariance matrix XᵀX. Kernel PCA eigendecomposes the centered Gram matrix K̃ = HKH, where H = I − 𝟙𝟙ᵀ/n is the centering matrix. The eigenvectors of this centered Gram matrix give the coordinates of each data point in the kernel principal component space — a nonlinear dimensionality reduction that can unwind spirals and separate concentric circles.

Kernel K-Means

Standard k-means uses ‖xi − μj‖², which requires explicit centroids. Kernel k-means rewrites the distance using only kernel evaluations: ‖φ(xi) − μj‖² = k(xi, xi) − (2/|Cj|) Σx∈Cj k(xi, x) + (1/|Cj|²) Σx,x′∈Cj k(x, x′). No explicit centroids, no explicit features — just a matrix of kernel values.
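A sanity check of this identity using the linear kernel, where the centroid can also be computed explicitly for comparison (the data and cluster assignment are arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
K = X @ X.T                       # linear kernel: phi(x) = x
assign = rng.randint(2, size=50)  # arbitrary 2-cluster assignment

i, j = 7, 1                       # a point and a cluster to check
mask = assign == j
mu = X[mask].mean(axis=0)         # explicit centroid (possible here only
                                  # because the linear feature map is the identity)
explicit = np.sum((X[i] - mu)**2)

# Same distance from kernel values alone, via the formula above
kernelized = (K[i, i]
              - 2 * K[i, mask].mean()
              + K[np.ix_(mask, mask)].mean())
print(np.allclose(explicit, kernelized))  # True
```

The middle term is the mean of row i over the cluster (x_i·μ_j) and the last term is the mean over all within-cluster pairs (μ_j·μ_j), so the expansion of ‖x_i − μ_j‖² is recovered exactly.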

import numpy as np

rng = np.random.RandomState(42)

# --- Kernel Ridge Regression on nonlinear data ---
X_reg = np.sort(rng.uniform(-3, 3, 60))
y_reg = np.sin(X_reg) + 0.2 * rng.randn(60)

def rbf(x, y, sigma=0.5):
    return np.exp(-((x - y) ** 2) / (2 * sigma ** 2))

K = np.array([[rbf(xi, xj) for xj in X_reg] for xi in X_reg])
alpha = np.linalg.solve(K + 0.01 * np.eye(60), y_reg)

# Predict on a fine grid
X_test = np.linspace(-3, 3, 200)
y_pred = np.array([sum(a * rbf(xt, xi) for a, xi in zip(alpha, X_reg))
                    for xt in X_test])
mse_kernel = np.mean((np.sin(X_test) - y_pred) ** 2)

# Compare: linear ridge
w_lin = np.linalg.solve(X_reg[:, None].T @ X_reg[:, None]
        + 0.01 * np.eye(1), X_reg[:, None].T @ y_reg)
y_lin = X_test * w_lin[0]
mse_linear = np.mean((np.sin(X_test) - y_lin) ** 2)

print(f"Kernel ridge MSE:  {mse_kernel:.4f}")
print(f"Linear ridge MSE:  {mse_linear:.4f}")

# --- Kernel PCA on concentric circles ---
n = 200
theta = rng.uniform(0, 2 * np.pi, n)
r_inner, r_outer = 1.0, 3.0
X_circles = np.vstack([
    np.column_stack([r_inner * np.cos(theta[:100]), r_inner * np.sin(theta[:100])]),
    np.column_stack([r_outer * np.cos(theta[100:]), r_outer * np.sin(theta[100:])])
])
X_circles += 0.2 * rng.randn(n, 2)
labels = np.array([0]*100 + [1]*100)

# Gram matrix
K_c = np.array([[np.exp(-np.sum((X_circles[i]-X_circles[j])**2)/2.0)
                 for j in range(n)] for i in range(n)])
# Center: K_tilde = HKH
H = np.eye(n) - np.ones((n, n)) / n
K_tilde = H @ K_c @ H
eigvals, eigvecs = np.linalg.eigh(K_tilde)
# Take top 2 components (largest eigenvalues)
idx = np.argsort(eigvals)[::-1][:2]
Z = eigvecs[:, idx] * np.sqrt(np.abs(eigvals[idx]))

# Check if first kernel PC separates the classes
threshold = np.median(Z[:, 0])
kpca_preds = (Z[:, 0] > threshold).astype(int)
kpca_acc = max(np.mean(kpca_preds == labels), np.mean(kpca_preds != labels))
print(f"\nKernel PCA: 1st component separates circles with {kpca_acc:.1%} accuracy")

Kernel ridge regression fits the sinusoidal data almost perfectly while linear ridge draws a straight line through the curves. Kernel PCA separates concentric circles using just the first component — something linear PCA fundamentally cannot do, since projection onto any line will mix the two circles. The kernel implicitly maps the circles into a higher-dimensional space where they become linearly separable.

Try It: Kernel Feature Space Visualizer

Kernel PCA embedding shows how the kernel lifts data into a separable space.

Try It: Gram Matrix Explorer

Hover over points to see their similarity row. All eigenvalues are non-negative (PSD).

7. References & Further Reading