
PCA from Scratch

The Curse of Dimensionality

Karl Pearson published the most important technique in multivariate statistics in the Philosophical Magazine — a physics journal — in 1901. Harold Hotelling named it "principal components" 32 years later — in an educational psychology journal. The most foundational unsupervised learning method has never belonged to a single field, because every field needs it.

Here's the fundamental problem. As the number of dimensions in your data grows, something counterintuitive happens: distances become meaningless. In 1,000-dimensional space, the ratio between the distances from any query point to its nearest and farthest neighbors converges toward 1. Every point is roughly the same distance from every other point. Your k-nearest-neighbors classifier can't find meaningful neighbors. Your clustering algorithm sees one giant, undifferentiated blob.
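A quick sketch makes this concrete — sample random points, then watch the nearest-to-farthest distance ratio climb toward 1 as the dimension grows (the exact numbers depend on the random draw, but the trend doesn't):

```python
import numpy as np

rng = np.random.default_rng(0)

for d in [2, 10, 100, 1000]:
    points = rng.standard_normal((500, d))   # 500 random points in d dimensions
    query = rng.standard_normal(d)           # a random query point
    dists = np.linalg.norm(points - query, axis=1)
    print(f"d = {d:4d}   nearest/farthest distance ratio: {dists.min() / dists.max():.3f}")
```

In low dimensions the nearest point is dramatically closer than the farthest; in 1,000 dimensions the two distances are nearly equal.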

This is the curse of dimensionality, and it sabotages nearly every machine learning algorithm. Models overfit because the feature space is exponentially sparse — there are too many possible configurations and too few data points to learn the real pattern.

PCA asks a disarmingly simple question: can we find a lower-dimensional subspace that preserves most of the variance in our data?

The answer is usually yes, and dramatically so. MNIST digit images have 784 pixel dimensions, but roughly 150 principal components capture 95% of the total variance. Gene expression datasets routinely have 20,000+ features, yet the first 50 components often separate cell types perfectly. Real-world data almost never fills all available dimensions — it lies on or near a lower-dimensional manifold.

If you've read K-Means Clustering from Scratch, you've seen how clustering operates in feature space. PCA reveals the subspace where that clustering becomes most effective. And if you've explored Embeddings from Scratch, you've seen how neural networks learn compressed representations — PCA is the classical, linear version of that same idea.

Covariance and the Directions of Maximum Variance

The core idea behind PCA is beautifully geometric. Imagine a cloud of 2D data points forming a tilted ellipse — like a football thrown at a 45-degree angle. The points vary a lot along the long axis of the ellipse, and only a little along the short axis.

PCA finds exactly these axes. The first principal component (PC1) points along the direction of maximum variance. The second principal component (PC2) points along the direction of maximum remaining variance, subject to being perpendicular to PC1. And so on for higher dimensions.

How does it find these directions? Through the covariance matrix. Here's the recipe:

  1. Center the data — subtract the mean of each feature so the cloud is centered at the origin
  2. Compute the covariance matrix — C = (1/n) XᵀX, where X is the centered data matrix
  3. Eigendecompose C — find its eigenvectors and eigenvalues
  4. Sort by eigenvalue — largest eigenvalue = direction that captures the most variance
  5. Project — multiply data by the top-k eigenvectors to get k-dimensional representation

The key insight: the eigenvectors of the covariance matrix are the principal components, and the eigenvalues tell you how much variance each component captures. The explained variance ratio for any component is simply its eigenvalue divided by the sum of all eigenvalues.

Let's implement this from scratch:

import numpy as np

def pca_from_scratch(X, n_components):
    """
    Principal Component Analysis from scratch.

    Parameters:
        X: ndarray of shape (n_samples, n_features)
        n_components: number of principal components to keep

    Returns:
        X_projected: data in the reduced space (n_samples, n_components)
        components: the principal component directions (n_components, n_features)
        explained_var_ratio: fraction of variance each component explains
    """
    n_samples = X.shape[0]

    # Step 1: center the data (subtract feature means)
    mean = X.mean(axis=0)
    X_centered = X - mean

    # Step 2: compute the covariance matrix
    # Using (1/n) instead of (1/(n-1)) for population covariance
    cov_matrix = (X_centered.T @ X_centered) / n_samples

    # Step 3: eigendecomposition
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

    # eigh returns eigenvalues in ascending order — flip to descending
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Step 4: select the top-k eigenvectors (principal components)
    components = eigenvectors[:, :n_components].T  # shape: (n_components, n_features)

    # Step 5: project data onto the new subspace
    X_projected = X_centered @ components.T

    # Compute explained variance ratios
    total_var = eigenvalues.sum()
    explained_var_ratio = eigenvalues[:n_components] / total_var

    return X_projected, components, explained_var_ratio


# Demo: 2D elliptical data
np.random.seed(42)
n_points = 200

# Create correlated 2D data (tilted ellipse)
angle = np.pi / 4  # 45-degree tilt
stretch = np.array([[3, 0], [0, 0.5]])  # long along x, short along y
rotation = np.array([
    [np.cos(angle), -np.sin(angle)],
    [np.sin(angle),  np.cos(angle)]
])
transform = rotation @ stretch

raw = np.random.randn(n_points, 2)
X = raw @ transform.T + np.array([5, 5])  # shift away from origin

X_proj, components, var_ratios = pca_from_scratch(X, n_components=2)

print("Principal components (directions):")
print(components)
print(f"\nExplained variance ratios: {var_ratios}")
print(f"PC1 captures {var_ratios[0]:.1%} of the variance")
print(f"PC2 captures {var_ratios[1]:.1%} of the variance")

Output:

Principal components (directions):
[[ 0.707  0.707]
 [-0.707  0.707]]

Explained variance ratios: [0.9726 0.0274]
PC1 captures 97.3% of the variance
PC2 captures 2.7% of the variance

The first principal component points along [0.707, 0.707] — that's the 45-degree diagonal, exactly the long axis of our tilted ellipse. It captures 97.3% of the total variance. The second component is perpendicular to it and captures only 2.7%. If we projected this data down to just one dimension (PC1), we'd lose almost nothing.

Caveat: PCA assumes that the directions of maximum variance are the most informative directions. This works beautifully for roughly elliptical (Gaussian-like) data. But for data with curved or nonlinear structure — like a Swiss roll or concentric rings — the direction of maximum variance may not be meaningful at all. We'll return to this when we look at nonlinear structure below.

SVD — The Numerically Stable Way

The eigendecomposition approach works, but production implementations of PCA almost always use Singular Value Decomposition (SVD) instead. Here's why.

Computing the covariance matrix XᵀX squares the condition number of the data. If your features span different scales (one feature ranges 0–1, another 0–1,000,000), the covariance matrix can become ill-conditioned, and eigendecomposition may return garbage due to floating-point precision loss.
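You can see the squaring directly — the feature scales below are an arbitrary illustration, chosen to make one column a million times smaller than the other:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two features on wildly different scales
X = rng.standard_normal((1000, 2)) * np.array([1.0, 1e-6])
X -= X.mean(axis=0)

cond_X = np.linalg.cond(X)
cond_cov = np.linalg.cond(X.T @ X)
print(f"cond(X)   = {cond_X:.2e}")
print(f"cond(XᵀX) = {cond_cov:.2e}")   # ≈ cond(X) squared
```

A condition number around 10⁶ is perfectly workable in double precision; its square, around 10¹², starts eating meaningfully into the ~16 digits available.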

SVD sidesteps this entirely by decomposing the centered data matrix directly:

X = U Σ Vᵀ
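Substituting the SVD into the covariance matrix shows why the two routes agree:

C = (1/n) XᵀX = (1/n) (U Σ Vᵀ)ᵀ (U Σ Vᵀ) = V (Σ²/n) Vᵀ

The columns of V are the eigenvectors of C, and the eigenvalues are λᵢ = σᵢ²/n — recovered from the singular values without ever forming the covariance matrix.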

This is exactly what scikit-learn's PCA uses under the hood. Let's implement it:

def pca_via_svd(X, n_components):
    """PCA using SVD — numerically stable, no covariance matrix needed."""
    n_samples = X.shape[0]

    # Center the data
    mean = X.mean(axis=0)
    X_centered = X - mean

    # SVD of centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

    # V's rows (Vt) are the principal components
    components = Vt[:n_components]

    # Project data
    X_projected = X_centered @ components.T

    # Explained variance from singular values
    explained_var = (S ** 2) / n_samples
    total_var = explained_var.sum()
    explained_var_ratio = explained_var[:n_components] / total_var

    return X_projected, components, explained_var_ratio


# Verify: both methods give the same result
X_proj_svd, comp_svd, var_svd = pca_via_svd(X, n_components=2)
print("SVD components match eigendecomposition:",
      np.allclose(np.abs(components), np.abs(comp_svd)))
print(f"SVD variance ratios: {var_svd}")
# Note: eigenvectors may differ in sign (both v and -v are valid eigenvectors)

Output:

SVD components match eigendecomposition: True
SVD variance ratios: [0.9726 0.0274]

Both approaches yield the same principal components and variance ratios. The sign of each eigenvector may flip (both v and −v are valid eigenvectors), but the subspace they define is identical.

SVD has another advantage: it works even when n < d (fewer samples than features). In genomics, you might have 100 patient samples with 20,000 gene measurements each. The covariance matrix would be 20,000 × 20,000 — enormous and rank-deficient. SVD handles this gracefully because it decomposes the data matrix directly. This is also the connection to latent semantic analysis in NLP, where SVD on term-document matrices discovers latent topics.
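A tiny sketch of the n < d case — the shapes here (100 samples, 2,000 features) are arbitrary, picked just to keep the demo fast:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2000))     # far fewer samples than features
X_centered = X - X.mean(axis=0)

# SVD never forms the 2000 x 2000 covariance matrix
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
print(S.shape)                           # (100,) — one singular value per sample
print(f"Effective rank: {np.sum(S > 1e-8)}")
```

The effective rank comes out to 99, not 100: centering removes one degree of freedom, so at most n − 1 components carry variance — which is exactly why the 20,000 × 20,000 covariance matrix in the genomics example would be rank-deficient.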

Choosing the Number of Components

PCA gives you a ranked list of components, each capturing less variance than the one before. But how many should you keep? This is PCA's version of the same question we faced in K-Means: choosing K. And just like K-Means, there's no single right answer — but there are good heuristics.

The two standard tools are the scree plot and the cumulative explained variance curve:

import numpy as np

def pca_analysis(X):
    """Compute all principal components and their explained variance."""
    n_samples = X.shape[0]
    X_centered = X - X.mean(axis=0)

    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_var = (S ** 2) / n_samples
    total_var = explained_var.sum()
    explained_var_ratio = explained_var / total_var
    cumulative_var = np.cumsum(explained_var_ratio)

    return explained_var_ratio, cumulative_var


# Example: 50-dimensional data where most variance is in few directions
np.random.seed(42)
n_samples = 500

# Create data with exponentially decaying variance across dimensions
true_dims = 50
variances = np.exp(-0.3 * np.arange(true_dims))
X_high = np.random.randn(n_samples, true_dims) * np.sqrt(variances)

var_ratios, cumulative = pca_analysis(X_high)

print("Top 10 explained variance ratios:")
for i in range(10):
    bar = "#" * int(var_ratios[i] * 200)
    print(f"  PC{i+1:2d}: {var_ratios[i]:.4f}  {bar}")

print(f"\nCumulative variance:")
for threshold in [0.80, 0.90, 0.95, 0.99]:
    n_needed = np.searchsorted(cumulative, threshold) + 1
    print(f"  {threshold:.0%} variance captured by {n_needed} components (of {true_dims})")

Output:

Top 10 explained variance ratios:
  PC 1: 0.2354  ###############################################
  PC 2: 0.1746  ##################################
  PC 3: 0.1285  #########################
  PC 4: 0.0949  ##################
  PC 5: 0.0708  ##############
  PC 6: 0.0536  ##########
  PC 7: 0.0388  #######
  PC 8: 0.0280  #####
  PC 9: 0.0215  ####
  PC10: 0.0157  ###

Cumulative variance:
  80% variance captured by 5 components (of 50)
  90% variance captured by 8 components (of 50)
  95% variance captured by 11 components (of 50)
  99% variance captured by 19 components (of 50)

The scree plot shows each component's individual contribution. You look for an "elbow" — the point where the curve bends sharply and subsequent components add very little. Here, the first 5 components are clearly the heavy hitters.

The cumulative variance curve answers a more practical question: how many components do I need to preserve X% of the information? In this example, we can reduce 50 dimensions down to 11 and still retain 95% of the variance — a 78% reduction in dimensionality with only 5% information loss.

One honest caveat: like choosing K in K-Means, the "right" number of components often depends on your downstream task. If you're feeding PCA-reduced data into a classifier, try several values of k and see which gives the best validation accuracy. The 95% variance threshold is a reasonable starting point, not a law of nature.
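Here's a hedged sketch of that task-driven selection, sweeping k with a toy nearest-centroid classifier on synthetic labeled data — every name and number below is an illustrative assumption, not part of the pipeline above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data: class means differ in 5 dimensions, 25 are pure noise
n_per_class = 200
means = rng.standard_normal((2, 5)) * 3.0
X = rng.standard_normal((2 * n_per_class, 30))
X[:n_per_class, :5] += means[0]
X[n_per_class:, :5] += means[1]
y = np.array([0] * n_per_class + [1] * n_per_class)

# Shuffle, then split into train / validation
idx = rng.permutation(len(y))
X, y = X[idx], y[idx]
X_train, X_val = X[:300], X[300:]
y_train, y_val = y[:300], y[300:]

# Fit PCA on the training split only
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)

for k in [1, 2, 5, 10, 30]:
    P = Vt[:k]                               # top-k principal directions
    Z_train = (X_train - mean) @ P.T
    Z_val = (X_val - mean) @ P.T
    # Nearest-centroid classification in the reduced space
    centroids = np.array([Z_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(Z_val[:, None, :] - centroids[None, :, :], axis=2)
    acc = (dists.argmin(axis=1) == y_val).mean()
    print(f"k = {k:2d}   validation accuracy: {acc:.2f}")
```

Note the split: the principal directions are fit on the training data only, then applied to the validation set — fitting PCA on the combined data would leak information into the evaluation.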

When PCA Fails — Nonlinear Structure

PCA finds the best linear projection. For data that varies primarily along straight axes, this is excellent. But the real world is full of curves.

Consider the classic Swiss roll dataset: a 2D manifold curled up in 3D space, like a jelly roll cake. The intrinsic dimensionality is 2 — you could unroll it flat and every point would have a meaningful (x, y) coordinate. But PCA, limited to linear projections, smashes the roll flat along its directions of maximum variance, hopelessly tangling points that are far apart on the manifold.

Three nonlinear alternatives tackle this problem:

Kernel PCA applies the kernel trick — the same idea behind kernel SVMs. Instead of computing PCA on the raw data, we implicitly map points into a higher-dimensional space via a kernel function (RBF, polynomial, etc.) and perform PCA there. The result: PCA that can capture nonlinear structure, without ever explicitly computing the high-dimensional mapping.

t-SNE (van der Maaten & Hinton, 2008) takes a completely different approach. It constructs probability distributions over pairs of points in both the high-dimensional and low-dimensional spaces, then minimizes the KL divergence between them. t-SNE excels at preserving local neighborhoods — points that are close in high-dimensional space stay close in the embedding.

UMAP (McInnes et al., 2018) is similar in spirit to t-SNE but faster and often better at preserving global structure. It builds a topological representation of the high-dimensional data and optimizes a low-dimensional layout to match.

Let's compare all three alongside PCA on a dataset with nonlinear structure:

import numpy as np

def make_concentric_rings(n_per_ring=200, noise=0.1, seed=42):
    """Generate two concentric rings in 2D — a nonlinear structure."""
    rng = np.random.RandomState(seed)
    angles_inner = rng.uniform(0, 2 * np.pi, n_per_ring)
    angles_outer = rng.uniform(0, 2 * np.pi, n_per_ring)

    r_inner, r_outer = 1.0, 3.0
    inner = np.column_stack([
        r_inner * np.cos(angles_inner) + rng.randn(n_per_ring) * noise,
        r_inner * np.sin(angles_inner) + rng.randn(n_per_ring) * noise
    ])
    outer = np.column_stack([
        r_outer * np.cos(angles_outer) + rng.randn(n_per_ring) * noise,
        r_outer * np.sin(angles_outer) + rng.randn(n_per_ring) * noise
    ])

    X = np.vstack([inner, outer])
    labels = np.array([0] * n_per_ring + [1] * n_per_ring)
    return X, labels


def kernel_pca(X, n_components=2, kernel='rbf', gamma=1.0):
    """Kernel PCA using the RBF (Gaussian) kernel."""
    n = X.shape[0]

    # Compute kernel matrix
    if kernel == 'rbf':
        sq_dists = np.sum((X[:, None] - X[None, :]) ** 2, axis=2)
        K = np.exp(-gamma * sq_dists)
    else:
        K = X @ X.T  # linear kernel fallback

    # Center the kernel matrix
    one_n = np.ones((n, n)) / n
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Eigendecomposition
    eigenvalues, eigenvectors = np.linalg.eigh(K_centered)
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Normalize eigenvectors by sqrt of eigenvalues
    X_kpca = eigenvectors[:, :n_components]
    for i in range(n_components):
        if eigenvalues[i] > 0:
            X_kpca[:, i] *= np.sqrt(eigenvalues[i])

    return X_kpca


X_rings, labels = make_concentric_rings()

# Standard PCA
X_pca, _, _ = pca_via_svd(X_rings, n_components=2)

# Kernel PCA with RBF kernel
X_kpca = kernel_pca(X_rings, n_components=2, gamma=0.5)

print("Concentric rings: PCA sees overlapping projections along both axes.")
print("Kernel PCA separates them by mapping to a higher-dimensional space.")
print(f"Data shape: {X_rings.shape}, Labels: {np.unique(labels)}")
print(f"PCA projection range: [{X_pca[:, 0].min():.2f}, {X_pca[:, 0].max():.2f}]")
print(f"Kernel PCA PC1 range: [{X_kpca[:, 0].min():.2f}, {X_kpca[:, 0].max():.2f}]")

Output:

Concentric rings: PCA sees overlapping projections along both axes.
Kernel PCA separates them by mapping to a higher-dimensional space.
Data shape: (400, 2), Labels: [0 1]
PCA projection range: [-3.15, 3.15]
Kernel PCA PC1 range: [-5.82, 8.41]

With concentric rings, PCA's linear projection along PC1 interleaves the inner and outer ring points — they overlap completely. Kernel PCA, by implicitly mapping to a higher-dimensional space where the rings become linearly separable, can pull them apart.

A fascinating cautionary note: Eran Elhaik (2022) showed that PCA can "hallucinate" clusters in completely random data. When you project high-dimensional noise onto its top two principal components, the result can show apparent structure that doesn't exist. This means you should never conclude that clusters are real based solely on a PCA plot — always validate with additional methods.

For a deeper dive into nonlinear dimensionality reduction, see Autoencoders from Scratch — a neural autoencoder with a bottleneck layer performs what is essentially nonlinear PCA.

PCA in Practice — Eigenfaces and Beyond

One of PCA's most famous applications is eigenfaces, introduced by Turk and Pentland in 1991. The idea is elegant: treat each face image as a single point in a very high-dimensional space (one dimension per pixel), then use PCA to find the principal "directions" of variation across all faces.

The resulting principal components — the eigenfaces — form a basis for the space of faces. The first eigenface typically captures lighting variation. The next few capture broad structural differences: face shape, expression, presence of glasses. You can represent any face as a weighted combination of eigenfaces, and you only need surprisingly few weights to get a recognizable reconstruction.

import numpy as np

def eigenfaces_demo(n_people=10, img_size=16, n_images_per_person=20, seed=42):
    """
    Simulate eigenface decomposition with synthetic face-like data.
    Each 'face' is a flattened image vector with person-specific patterns.
    """
    rng = np.random.RandomState(seed)
    n_pixels = img_size * img_size  # 256 dimensions

    # Generate synthetic 'face templates' — each person has a distinct pattern
    templates = []
    for p in range(n_people):
        template = np.zeros((img_size, img_size))
        # Unique face structure per person (random smooth gradients)
        for freq in range(1, 4):
            phase_x = rng.uniform(0, 2 * np.pi)
            phase_y = rng.uniform(0, 2 * np.pi)
            amp = rng.uniform(0.5, 2.0) / freq
            xs = np.linspace(0, freq * np.pi, img_size)
            ys = np.linspace(0, freq * np.pi, img_size)
            template += amp * np.outer(np.sin(xs + phase_x), np.cos(ys + phase_y))
        templates.append(template.flatten())

    # Generate images: template + noise (lighting, expression variation)
    faces = []
    person_labels = []
    for p in range(n_people):
        for _ in range(n_images_per_person):
            noise = rng.randn(n_pixels) * 0.3
            lighting = rng.uniform(0.7, 1.3)
            face = templates[p] * lighting + noise
            faces.append(face)
            person_labels.append(p)

    X_faces = np.array(faces)
    labels = np.array(person_labels)

    # Apply PCA
    X_proj, components, var_ratios = pca_via_svd(X_faces, n_components=50)

    # How many eigenfaces for 95% variance?
    cumulative = np.cumsum(var_ratios)
    n_for_95 = np.searchsorted(cumulative, 0.95) + 1

    # Reconstruction error with different numbers of components
    mean_face = X_faces.mean(axis=0)
    X_centered = X_faces - mean_face

    print(f"Face dataset: {X_faces.shape[0]} images, {n_pixels} pixels each")
    print(f"Top 5 eigenface variance ratios: {var_ratios[:5].round(4)}")
    print(f"Components for 95% variance: {n_for_95} (of {n_pixels} pixels)")
    print(f"\nReconstruction error (MSE) by number of eigenfaces:")
    for k in [5, 10, 20, 50]:
        proj_k = X_centered @ components[:k].T
        reconstructed = proj_k @ components[:k] + mean_face
        mse = np.mean((X_faces - reconstructed) ** 2)
        print(f"  {k:3d} eigenfaces: MSE = {mse:.4f}")

    return X_proj, labels


X_proj_faces, face_labels = eigenfaces_demo()

Output:

Face dataset: 200 images, 256 pixels each
Top 5 eigenface variance ratios: [0.1832  0.1147  0.0891  0.0654  0.0512]
Components for 95% variance: 32 (of 256 pixels)

Reconstruction error (MSE) by number of eigenfaces:
    5 eigenfaces: MSE = 0.1823
   10 eigenfaces: MSE = 0.1194
   20 eigenfaces: MSE = 0.0684
   50 eigenfaces: MSE = 0.0301

From 256 pixel dimensions, just 32 eigenfaces capture 95% of the variance across all faces. Face recognition becomes a comparison in 32-dimensional space instead of 256-dimensional space — far faster and more robust to noise. The first few eigenfaces capture the big structural variations (lighting, face shape), while later eigenfaces capture increasingly fine details (subtle expression differences, individual features).

Beyond face recognition, PCA serves as a workhorse preprocessing step in modern machine learning: decorrelating and whitening features before training, denoising data by discarding low-variance components, compressing high-dimensional inputs for storage, and projecting data to 2D or 3D for visualization.

Conclusion

PCA is one of those rare algorithms that is simultaneously simple, deep, and universally useful. At its core, it's just eigendecomposition of a covariance matrix — five steps you can implement in 20 lines of NumPy. But that simplicity belies a profound geometric insight: real-world data is almost always lower-dimensional than it appears, and finding the right subspace can transform an impossible problem into a tractable one.

We've seen PCA from two angles — the classic eigendecomposition and the numerically stable SVD approach that production systems actually use. We've learned how to choose the number of components (scree plots, cumulative variance, and ultimately domain knowledge). And we've been honest about PCA's limitations: it finds only linear structure, and it can hallucinate patterns in random data.

The interactive demos below let you develop geometric intuition for what PCA actually does. Drag the correlation slider and watch how the principal components rotate. Project points onto PC1 and see information preserved or lost. Compare PCA against nonlinear methods on datasets where straight lines aren't enough.

PCA was the first unsupervised dimensionality reduction method — published before anyone had a name for the field. More than 120 years later, it's still the first tool most data scientists reach for. That's not nostalgia. That's a testament to how powerful a good linear projection can be.

Try It: PCA Explorer

Adjust the sliders to change the data distribution. Watch how the principal component arrows respond. Hit "Project to PC1" to see the 2D data collapse onto a single dimension.

PC1 (red arrow) points along maximum variance. PC2 (blue arrow) is perpendicular. Arrow length is proportional to the standard deviation along that direction.

Try It: Dimensionality Reduction Comparison

See how PCA, t-SNE, and UMAP project the same dataset. Colors track which points are grouped — notice where each method succeeds or struggles.

PCA preserves global variance structure. t-SNE (perplexity=30) preserves local neighborhoods. UMAP (n_neighbors=15) balances local and global.

References & Further Reading