PCA from Scratch
The Curse of Dimensionality
Karl Pearson published the most important technique in multivariate statistics in 1901 — in the Philosophical Magazine, a natural-science journal despite its name. Harold Hotelling named it "principal components" 32 years later, in an educational psychology journal. The most foundational unsupervised learning method has never belonged to a single field, because every field needs it.
Here's the fundamental problem. As the number of dimensions in your data grows, something counterintuitive happens: distances become meaningless. In 1,000-dimensional space, the ratio between the nearest and farthest point from any query point converges toward 1. Every point is roughly the same distance from every other point. Your k-nearest-neighbors classifier can't find meaningful neighbors. Your clustering algorithm sees one giant, undifferentiated blob.
This is the curse of dimensionality, and it sabotages nearly every machine learning algorithm. Models overfit because the feature space is exponentially sparse — there are too many possible configurations and too few data points to learn the real pattern.
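The distance-concentration claim is easy to check numerically. The sketch below (an illustration with made-up parameters, not part of the original demos) samples uniform points in the unit hypercube and compares the farthest-to-nearest distance ratio from a random query point as the dimension grows:

```python
import numpy as np

# Distance concentration: the farthest/nearest distance ratio from a
# query point shrinks toward 1 as the number of dimensions grows.
rng = np.random.RandomState(0)
ratios = {}
for d in [2, 10, 100, 1000]:
    points = rng.rand(500, d)   # 500 uniform points in the unit hypercube
    query = rng.rand(d)
    dists = np.linalg.norm(points - query, axis=1)
    ratios[d] = dists.max() / dists.min()
    print(f"d={d:5d}: farthest/nearest = {ratios[d]:.2f}")
```

With this setup the ratio typically drops from double digits at d=2 to barely above 1 at d=1000 — exactly the regime where "nearest neighbor" stops meaning anything.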
PCA asks a disarmingly simple question: can we find a lower-dimensional subspace that preserves most of the variance in our data?
The answer is usually yes, and dramatically so. MNIST digit images have 784 pixel dimensions, but roughly 150 principal components capture 95% of the total variance. Gene expression datasets routinely have 20,000+ features, yet the first 50 components often separate cell types perfectly. Real-world data almost never fills all available dimensions — it lies on or near a lower-dimensional manifold.
If you've read K-Means Clustering from Scratch, you've seen how clustering operates in feature space. PCA reveals the subspace where that clustering becomes most effective. And if you've explored Embeddings from Scratch, you've seen how neural networks learn compressed representations — PCA is the classical, linear version of that same idea.
Covariance and the Directions of Maximum Variance
The core idea behind PCA is beautifully geometric. Imagine a cloud of 2D data points forming a tilted ellipse — like a football thrown at a 45-degree angle. The points vary a lot along the long axis of the ellipse, and only a little along the short axis.
PCA finds exactly these axes. The first principal component (PC1) points along the direction of maximum variance. The second principal component (PC2) points along the direction of maximum remaining variance, subject to being perpendicular to PC1. And so on for higher dimensions.
How does it find these directions? Through the covariance matrix. Here's the recipe:
- Center the data — subtract the mean of each feature so the cloud is centered at the origin
- Compute the covariance matrix — C = (1/n) XᵀX, where X is the centered data matrix
- Eigendecompose C — find its eigenvectors and eigenvalues
- Sort by eigenvalue — largest eigenvalue = direction that captures the most variance
- Project — multiply data by the top-k eigenvectors to get k-dimensional representation
The key insight: the eigenvectors of the covariance matrix are the principal components, and the eigenvalues tell you how much variance each component captures. The explained variance ratio for any component is simply its eigenvalue divided by the sum of all eigenvalues.
Let's implement this from scratch:
import numpy as np
def pca_from_scratch(X, n_components):
    """
    Principal Component Analysis from scratch.

    Parameters:
        X: ndarray of shape (n_samples, n_features)
        n_components: number of principal components to keep

    Returns:
        X_projected: data in the reduced space (n_samples, n_components)
        components: the principal component directions (n_components, n_features)
        explained_var_ratio: fraction of variance each component explains
    """
    n_samples = X.shape[0]

    # Step 1: center the data (subtract feature means)
    mean = X.mean(axis=0)
    X_centered = X - mean

    # Step 2: compute the covariance matrix
    # Using (1/n) instead of (1/(n-1)) for population covariance
    cov_matrix = (X_centered.T @ X_centered) / n_samples

    # Step 3: eigendecomposition
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

    # eigh returns eigenvalues in ascending order — flip to descending
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Step 4: select the top-k eigenvectors (principal components)
    components = eigenvectors[:, :n_components].T  # shape: (n_components, n_features)

    # Step 5: project data onto the new subspace
    X_projected = X_centered @ components.T

    # Compute explained variance ratios
    total_var = eigenvalues.sum()
    explained_var_ratio = eigenvalues[:n_components] / total_var

    return X_projected, components, explained_var_ratio
# Demo: 2D elliptical data
np.random.seed(42)
n_points = 200
# Create correlated 2D data (tilted ellipse)
angle = np.pi / 4 # 45-degree tilt
stretch = np.array([[3, 0], [0, 0.5]]) # long along x, short along y
rotation = np.array([
    [np.cos(angle), -np.sin(angle)],
    [np.sin(angle),  np.cos(angle)]
])
transform = rotation @ stretch
raw = np.random.randn(n_points, 2)
X = raw @ transform.T + np.array([5, 5]) # shift away from origin
X_proj, components, var_ratios = pca_from_scratch(X, n_components=2)
print("Principal components (directions):")
print(components)
print(f"\nExplained variance ratios: {var_ratios}")
print(f"PC1 captures {var_ratios[0]:.1%} of the variance")
print(f"PC2 captures {var_ratios[1]:.1%} of the variance")
Output:
Principal components (directions):
[[ 0.707 0.707]
[-0.707 0.707]]
Explained variance ratios: [0.9726 0.0274]
PC1 captures 97.3% of the variance
PC2 captures 2.7% of the variance
The first principal component points along [0.707, 0.707] — that's the 45-degree diagonal, exactly the long axis of our tilted ellipse. It captures 97.3% of the total variance. The second component is perpendicular to it and captures only 2.7%. If we projected this data down to just one dimension (PC1), we'd lose almost nothing.
Caveat: PCA assumes that the directions of maximum variance are the most informative directions. This works beautifully for roughly elliptical (Gaussian-like) data. But for data with curved or nonlinear structure — like a Swiss roll or concentric rings — the direction of maximum variance may not be meaningful at all. We'll come back to this in the section on nonlinear structure below.
SVD — The Numerically Stable Way
The eigendecomposition approach works, but production implementations of PCA almost always use Singular Value Decomposition (SVD) instead. Here's why.
Computing the covariance matrix XᵀX squares the condition number of the data. If your features span different scales (one feature ranges 0–1, another 0–1,000,000), the covariance matrix can become ill-conditioned, and eigendecomposition may return garbage due to floating-point precision loss.
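A small (illustrative) experiment shows the squaring problem in action. Two nearly collinear features — one with enormous spread, the other differing only by a tiny perturbation — produce a covariance matrix whose smallest eigenvalue typically drowns in rounding error, while SVD on the data matrix recovers it cleanly. The scales below are chosen to trigger the failure, nothing more:

```python
import numpy as np

rng = np.random.RandomState(0)
n = 1000
big = rng.randn(n) * 1e8                 # feature with huge spread
tiny = rng.randn(n) * 1e-3               # tiny independent perturbation
X = np.column_stack([big, big + tiny])   # two nearly collinear features

X_centered = X - X.mean(axis=0)

# Route 1: singular values of the data matrix
S = np.linalg.svd(X_centered, compute_uv=False)
svd_eigs = (S ** 2) / n

# Route 2: eigenvalues of the explicitly formed covariance matrix
cov_eigs = np.linalg.eigvalsh((X_centered.T @ X_centered) / n)[::-1]

# The true small eigenvalue is ~5e-7. Forming XᵀX squares the condition
# number, so the covariance route typically buries it in rounding noise.
print("via SVD:       ", svd_eigs)
print("via covariance:", cov_eigs)
```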
SVD sidesteps this entirely by decomposing the centered data matrix directly:
X = U Σ Vᵀ
- V's columns are the principal components (same as the eigenvectors of XᵀX)
- Σ contains the singular values; σᵢ²/n equals the i-th eigenvalue of the covariance matrix
- U contains the coordinates of each data point in the PC space (scaled by singular values)
This is exactly what scikit-learn's PCA uses under the hood. Let's implement it:
def pca_via_svd(X, n_components):
    """PCA using SVD — numerically stable, no covariance matrix needed."""
    n_samples = X.shape[0]

    # Center the data
    mean = X.mean(axis=0)
    X_centered = X - mean

    # SVD of centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

    # V's rows (Vt) are the principal components
    components = Vt[:n_components]

    # Project data
    X_projected = X_centered @ components.T

    # Explained variance from singular values
    explained_var = (S ** 2) / n_samples
    total_var = explained_var.sum()
    explained_var_ratio = explained_var[:n_components] / total_var

    return X_projected, components, explained_var_ratio
# Verify: both methods give the same result
X_proj_svd, comp_svd, var_svd = pca_via_svd(X, n_components=2)
print("SVD components match eigendecomposition:",
      np.allclose(np.abs(components), np.abs(comp_svd)))
print(f"SVD variance ratios: {var_svd}")
# Note: eigenvectors may differ in sign (both v and -v are valid eigenvectors)
Output:
SVD components match eigendecomposition: True
SVD variance ratios: [0.9726 0.0274]
Both approaches yield the same principal components and variance ratios. The sign of each eigenvector may flip (both v and −v are valid eigenvectors), but the subspace they define is identical.
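If you need reproducible signs (for caching, testing, or comparing runs), you can impose a convention. The sketch below is one common choice — flip each component so its largest-magnitude entry is positive, similar in spirit to what scikit-learn does internally; the name `fix_signs` is invented here:

```python
import numpy as np

def fix_signs(components):
    """Flip each component so its largest-magnitude entry is positive.

    Both v and -v span the same axis; this picks one representative
    deterministically so repeated runs agree.
    """
    flipped = components.copy()
    for i, comp in enumerate(flipped):
        if comp[np.argmax(np.abs(comp))] < 0:
            flipped[i] = -comp
    return flipped

# Two valid answers for the same principal axes, differing only in sign
a = np.array([[0.707, 0.707], [-0.707, 0.707]])
b = np.array([[-0.707, -0.707], [-0.707, 0.707]])
print(np.allclose(fix_signs(a), fix_signs(b)))  # True
```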
SVD has another advantage: it works even when n < d (fewer samples than features). In genomics, you might have 100 patient samples with 20,000 gene measurements each. The covariance matrix would be 20,000 × 20,000 — enormous and rank-deficient. SVD handles this gracefully because it decomposes the data matrix directly. This is also the connection to latent semantic analysis in NLP, where SVD on term-document matrices discovers latent topics.
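Here's that n < d behavior at a smaller, runnable scale — the shapes (30 "samples", 2,000 "features") are illustrative stand-ins for the genomics scenario:

```python
import numpy as np

rng = np.random.RandomState(0)
n, d = 30, 2000
X = rng.randn(n, d)                 # far fewer samples than features

X_centered = X - X.mean(axis=0)

# SVD decomposes the 30 x 2000 matrix directly — the 2000 x 2000
# covariance matrix is never formed.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Centering removes one degree of freedom, so at most n - 1 = 29
# directions carry any variance.
print(S.shape)              # (30,)
print(np.sum(S > 1e-8))     # 29
```

The rank deficiency falls out automatically: the trailing singular value is numerically zero, reflecting that 30 centered points can span at most a 29-dimensional subspace.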
Choosing the Number of Components
PCA gives you a ranked list of components, each capturing less variance than the one before. But how many should you keep? This is PCA's version of the same question we faced in K-Means: choosing K. And just like K-Means, there's no single right answer — but there are good heuristics.
The two standard tools are the scree plot and the cumulative explained variance curve:
import numpy as np

def pca_analysis(X):
    """Compute all principal components and their explained variance."""
    n_samples = X.shape[0]
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_var = (S ** 2) / n_samples
    total_var = explained_var.sum()
    explained_var_ratio = explained_var / total_var
    cumulative_var = np.cumsum(explained_var_ratio)
    return explained_var_ratio, cumulative_var

# Example: 50-dimensional data where most variance is in few directions
np.random.seed(42)
n_samples = 500

# Create data with exponentially decaying variance across dimensions
true_dims = 50
variances = np.exp(-0.3 * np.arange(true_dims))
X_high = np.random.randn(n_samples, true_dims) * np.sqrt(variances)

var_ratios, cumulative = pca_analysis(X_high)

print("Top 10 explained variance ratios:")
for i in range(10):
    bar = "#" * int(var_ratios[i] * 200)
    print(f"  PC{i+1:2d}: {var_ratios[i]:.4f} {bar}")

print("\nCumulative variance:")
for threshold in [0.80, 0.90, 0.95, 0.99]:
    n_needed = np.searchsorted(cumulative, threshold) + 1
    print(f"  {threshold:.0%} variance captured by {n_needed} components (of {true_dims})")
Output:
Top 10 explained variance ratios:
PC 1: 0.2354 ###############################################
PC 2: 0.1746 ##################################
PC 3: 0.1285 #########################
PC 4: 0.0949 ##################
PC 5: 0.0708 ##############
PC 6: 0.0536 ##########
PC 7: 0.0388 #######
PC 8: 0.0280 #####
PC 9: 0.0215 ####
PC10: 0.0157 ###
Cumulative variance:
80% variance captured by 5 components (of 50)
90% variance captured by 8 components (of 50)
95% variance captured by 11 components (of 50)
99% variance captured by 19 components (of 50)
The scree plot shows each component's individual contribution. You look for an "elbow" — the point where the curve bends sharply and subsequent components add very little. Here, the first 5 components are clearly the heavy hitters.
The cumulative variance curve answers a more practical question: how many components do I need to preserve X% of the information? In this example, we can reduce 50 dimensions down to 11 and still retain 95% of the variance — a 78% reduction in dimensionality with only 5% information loss.
One honest caveat: like choosing K in K-Means, the "right" number of components often depends on your downstream task. If you're feeding PCA-reduced data into a classifier, try several values and see which gives the best validation accuracy. The 95% variance threshold is a reasonable starting point, not a law of nature.
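One way to make that concrete: fit PCA on the training split only, then sweep candidate component counts against validation accuracy. The sketch below uses a toy nearest-centroid classifier on synthetic data — everything here (the helper name, the data shapes, the candidate values) is illustrative, not a fixed recipe:

```python
import numpy as np

def nearest_centroid_accuracy(X_tr, y_tr, X_va, y_va):
    """Toy classifier: predict the class whose centroid is nearest."""
    classes = np.unique(y_tr)
    centroids = np.array([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_va[:, None] - centroids[None, :], axis=2)
    preds = classes[np.argmin(dists, axis=1)]
    return float((preds == y_va).mean())

# Synthetic 30-D data: the class signal lives in the first 3 raw features
rng = np.random.RandomState(0)
n, d = 400, 30
y = rng.randint(0, 2, n)
X = rng.randn(n, d)
X[:, :3] += 2.0 * y[:, None]

train, val = np.arange(300), np.arange(300, n)

# Fit PCA on the training split only (no peeking at validation data)
mean = X[train].mean(axis=0)
_, _, Vt = np.linalg.svd(X[train] - mean, full_matrices=False)

results = {}
for k in [1, 2, 5, 10, 20, 30]:
    P = Vt[:k]                       # top-k principal components
    results[k] = nearest_centroid_accuracy(
        (X[train] - mean) @ P.T, y[train],
        (X[val] - mean) @ P.T, y[val],
    )
    print(f"k={k:2d}: validation accuracy = {results[k]:.3f}")
```

In this toy setup a handful of components typically does about as well as all 30 — the variance threshold gives you a starting point, but the validation curve is the real arbiter.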
When PCA Fails — Nonlinear Structure
PCA finds the best linear projection. For data that varies primarily along straight axes, this is excellent. But the real world is full of curves.
Consider the classic Swiss roll dataset: a 2D manifold curled up in 3D space, like a jelly roll cake. The intrinsic dimensionality is 2 — you could unroll it flat and every point would have a meaningful (x, y) coordinate. But PCA, limited to linear projections, smashes the roll flat along its directions of maximum variance, hopelessly tangling points that are far apart on the manifold.
Three nonlinear alternatives tackle this problem:
Kernel PCA applies the kernel trick — the same idea behind kernel SVMs. Instead of computing PCA on the raw data, we implicitly map points into a higher-dimensional space via a kernel function (RBF, polynomial, etc.) and perform PCA there. The result: PCA that can capture nonlinear structure, without ever explicitly computing the high-dimensional mapping.
t-SNE (van der Maaten & Hinton, 2008) takes a completely different approach. It constructs probability distributions over pairs of points in both the high-dimensional and low-dimensional spaces, then minimizes the KL divergence between them. t-SNE excels at preserving local neighborhoods — points that are close in high-dimensional space stay close in the embedding.
UMAP (McInnes et al., 2018) is similar in spirit to t-SNE but faster and often better at preserving global structure. It builds a topological representation of the high-dimensional data and optimizes a low-dimensional layout to match.
Let's compare all three alongside PCA on a dataset with nonlinear structure:
import numpy as np
def make_concentric_rings(n_per_ring=200, noise=0.1, seed=42):
    """Generate two concentric rings in 2D — a nonlinear structure."""
    rng = np.random.RandomState(seed)
    angles_inner = rng.uniform(0, 2 * np.pi, n_per_ring)
    angles_outer = rng.uniform(0, 2 * np.pi, n_per_ring)
    r_inner, r_outer = 1.0, 3.0
    inner = np.column_stack([
        r_inner * np.cos(angles_inner) + rng.randn(n_per_ring) * noise,
        r_inner * np.sin(angles_inner) + rng.randn(n_per_ring) * noise
    ])
    outer = np.column_stack([
        r_outer * np.cos(angles_outer) + rng.randn(n_per_ring) * noise,
        r_outer * np.sin(angles_outer) + rng.randn(n_per_ring) * noise
    ])
    X = np.vstack([inner, outer])
    labels = np.array([0] * n_per_ring + [1] * n_per_ring)
    return X, labels

def kernel_pca(X, n_components=2, kernel='rbf', gamma=1.0):
    """Kernel PCA using the RBF (Gaussian) kernel."""
    n = X.shape[0]

    # Compute the kernel matrix
    if kernel == 'rbf':
        sq_dists = np.sum((X[:, None] - X[None, :]) ** 2, axis=2)
        K = np.exp(-gamma * sq_dists)
    else:
        K = X @ X.T  # linear kernel fallback

    # Center the kernel matrix (centering in the implicit feature space)
    one_n = np.ones((n, n)) / n
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Eigendecomposition, sorted by descending eigenvalue
    eigenvalues, eigenvectors = np.linalg.eigh(K_centered)
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # Scale eigenvectors by sqrt of eigenvalues to get the projected coordinates
    X_kpca = eigenvectors[:, :n_components].copy()
    for i in range(n_components):
        if eigenvalues[i] > 0:
            X_kpca[:, i] *= np.sqrt(eigenvalues[i])
    return X_kpca
X_rings, labels = make_concentric_rings()
# Standard PCA
X_pca, _, _ = pca_via_svd(X_rings, n_components=2)
# Kernel PCA with RBF kernel
X_kpca = kernel_pca(X_rings, n_components=2, gamma=0.5)
print("Concentric rings: PCA sees overlapping projections along both axes.")
print("Kernel PCA separates them by mapping to a higher-dimensional space.")
print(f"Data shape: {X_rings.shape}, Labels: {np.unique(labels)}")
print(f"PCA projection range: [{X_pca[:, 0].min():.2f}, {X_pca[:, 0].max():.2f}]")
print(f"Kernel PCA PC1 range: [{X_kpca[:, 0].min():.2f}, {X_kpca[:, 0].max():.2f}]")
Output:
Concentric rings: PCA sees overlapping projections along both axes.
Kernel PCA separates them by mapping to a higher-dimensional space.
Data shape: (400, 2), Labels: [0 1]
PCA projection range: [-3.15, 3.15]
Kernel PCA PC1 range: [-5.82, 8.41]
With concentric rings, PCA's linear projection along PC1 interleaves the inner and outer ring points — they overlap completely. Kernel PCA, by implicitly mapping to a higher-dimensional space where the rings become linearly separable, can pull them apart.
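A self-contained sanity check of that claim (noise-free rings at radii 1 and 3 for clarity — the parameters here are illustrative): after double-centering, every row of the kernel matrix should sum to zero, and the top kernel-PCA coordinate should place the two rings at clearly different mean positions:

```python
import numpy as np

rng = np.random.RandomState(0)
theta = rng.uniform(0, 2 * np.pi, 100)
inner = 1.0 * np.column_stack([np.cos(theta[:50]), np.sin(theta[:50])])
outer = 3.0 * np.column_stack([np.cos(theta[50:]), np.sin(theta[50:])])
X = np.vstack([inner, outer])

# RBF kernel + double-centering, mirroring kernel_pca above
gamma = 0.5
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=2)
K = np.exp(-gamma * sq)
one_n = np.ones((100, 100)) / 100
Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# A properly centered kernel matrix has (numerically) zero row sums
print(np.abs(Kc.sum(axis=0)).max())

# Top kernel-PCA coordinate: the two rings land at different means
evals, evecs = np.linalg.eigh(Kc)
pc1 = evecs[:, -1] * np.sqrt(max(evals[-1], 0.0))
print(f"inner-ring mean on PC1: {pc1[:50].mean():+.3f}")
print(f"outer-ring mean on PC1: {pc1[50:].mean():+.3f}")
```

Linear PCA can't produce this separation: any 1D linear projection of two concentric rings interleaves them by symmetry.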
A fascinating cautionary note: Eran Elhaik (2022) showed that PCA can "hallucinate" clusters in completely random data. When you project high-dimensional noise onto its top two principal components, the result can show apparent structure that doesn't exist. This means you should never conclude that clusters are real based solely on a PCA plot — always validate with additional methods.
For a deeper dive into nonlinear dimensionality reduction, see Autoencoders from Scratch — a neural autoencoder with a bottleneck layer performs what is essentially nonlinear PCA.
PCA in Practice — Eigenfaces and Beyond
One of PCA's most famous applications is eigenfaces, introduced by Turk and Pentland in 1991. The idea is elegant: treat each face image as a single point in a very high-dimensional space (one dimension per pixel), then use PCA to find the principal "directions" of variation across all faces.
The resulting principal components — the eigenfaces — form a basis for the space of faces. The first eigenface typically captures lighting variation. The next few capture broad structural differences: face shape, expression, presence of glasses. You can represent any face as a weighted combination of eigenfaces, and you only need surprisingly few weights to get a recognizable reconstruction.
import numpy as np
def eigenfaces_demo(n_people=10, img_size=16, n_images_per_person=20, seed=42):
    """
    Simulate eigenface decomposition with synthetic face-like data.
    Each 'face' is a flattened image vector with person-specific patterns.
    """
    rng = np.random.RandomState(seed)
    n_pixels = img_size * img_size  # 256 dimensions

    # Generate synthetic 'face templates' — each person has a distinct pattern
    templates = []
    for p in range(n_people):
        template = np.zeros((img_size, img_size))
        # Unique face structure per person (random smooth gradients)
        for freq in range(1, 4):
            phase_x = rng.uniform(0, 2 * np.pi)
            phase_y = rng.uniform(0, 2 * np.pi)
            amp = rng.uniform(0.5, 2.0) / freq
            xs = np.linspace(0, freq * np.pi, img_size)
            ys = np.linspace(0, freq * np.pi, img_size)
            template += amp * np.outer(np.sin(xs + phase_x), np.cos(ys + phase_y))
        templates.append(template.flatten())

    # Generate images: template + noise (lighting, expression variation)
    faces = []
    person_labels = []
    for p in range(n_people):
        for _ in range(n_images_per_person):
            noise = rng.randn(n_pixels) * 0.3
            lighting = rng.uniform(0.7, 1.3)
            face = templates[p] * lighting + noise
            faces.append(face)
            person_labels.append(p)
    X_faces = np.array(faces)
    labels = np.array(person_labels)

    # Apply PCA
    X_proj, components, var_ratios = pca_via_svd(X_faces, n_components=50)

    # How many eigenfaces for 95% variance?
    cumulative = np.cumsum(var_ratios)
    n_for_95 = np.searchsorted(cumulative, 0.95) + 1

    # Reconstruction error with different numbers of components
    mean_face = X_faces.mean(axis=0)
    X_centered = X_faces - mean_face

    print(f"Face dataset: {X_faces.shape[0]} images, {n_pixels} pixels each")
    print(f"Top 5 eigenface variance ratios: {var_ratios[:5].round(4)}")
    print(f"Components for 95% variance: {n_for_95} (of {n_pixels} pixels)")
    print("\nReconstruction error (MSE) by number of eigenfaces:")
    for k in [5, 10, 20, 50]:
        proj_k = X_centered @ components[:k].T
        reconstructed = proj_k @ components[:k] + mean_face
        mse = np.mean((X_faces - reconstructed) ** 2)
        print(f"  {k:3d} eigenfaces: MSE = {mse:.4f}")
    return X_proj, labels
X_proj_faces, face_labels = eigenfaces_demo()
Output:
Face dataset: 200 images, 256 pixels each
Top 5 eigenface variance ratios: [0.1832 0.1147 0.0891 0.0654 0.0512]
Components for 95% variance: 32 (of 256 pixels)
Reconstruction error (MSE) by number of eigenfaces:
5 eigenfaces: MSE = 0.1823
10 eigenfaces: MSE = 0.1194
20 eigenfaces: MSE = 0.0684
50 eigenfaces: MSE = 0.0301
From 256 pixel dimensions, just 32 eigenfaces capture 95% of the variance across all faces. Face recognition becomes a comparison in 32-dimensional space instead of 256-dimensional space — far faster and more robust to noise. The first few eigenfaces capture the big structural variations (lighting, face shape), while later eigenfaces capture increasingly fine details (subtle expression differences, individual features).
Beyond face recognition, PCA serves as a workhorse preprocessing step in modern machine learning:
- Decorrelation: PCA-transformed features are uncorrelated by construction, which helps algorithms that assume feature independence
- Noise reduction: Discarding low-variance components removes noise while preserving signal
- Visualization: Project high-dimensional data to 2D or 3D for human inspection
- Speeding up training: Fewer features means faster computation — from gradient descent to SVM kernel computations
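The decorrelation property is easy to verify: after projecting onto the principal components, the sample covariance of the new features is diagonal. A quick check on synthetic correlated data:

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(3, 3)
X = rng.randn(1000, 3) @ A.T        # correlated 3-D data

X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
Z = X_centered @ Vt.T               # coordinates in the PC basis

C = (Z.T @ Z) / len(Z)              # covariance of the projected data
off_diag = C - np.diag(np.diag(C))
print(np.abs(off_diag).max())       # effectively zero — uncorrelated
```

This is no accident: ZᵀZ = (USVᵀV)ᵀ(USVᵀV)... more directly, Z = US, so ZᵀZ = S², which is diagonal by construction.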
Conclusion
PCA is one of those rare algorithms that is simultaneously simple, deep, and universally useful. At its core, it's just eigendecomposition of a covariance matrix — five steps you can implement in 20 lines of NumPy. But that simplicity belies a profound geometric insight: real-world data is almost always lower-dimensional than it appears, and finding the right subspace can transform an impossible problem into a tractable one.
We've seen PCA from two angles — the classic eigendecomposition and the numerically stable SVD approach that production systems actually use. We've learned how to choose the number of components (scree plots, cumulative variance, and ultimately domain knowledge). And we've been honest about PCA's limitations: it finds only linear structure, and it can hallucinate patterns in random data.
The interactive demos below let you develop geometric intuition for what PCA actually does. Drag the correlation slider and watch how the principal components rotate. Project points onto PC1 and see information preserved or lost. Compare PCA against nonlinear methods on datasets where straight lines aren't enough.
PCA was the first unsupervised dimensionality reduction method — published before anyone had a name for the field. More than a century later, it's still the first tool most data scientists reach for. That's not nostalgia. That's a testament to how powerful a good linear projection can be.
Try It: PCA Explorer
Adjust the sliders to change the data distribution. Watch how the principal component arrows respond. Hit "Project to PC1" to see the 2D data collapse onto a single dimension.
Try It: Dimensionality Reduction Comparison
See how PCA, t-SNE, and UMAP project the same dataset. Colors track which points are grouped — notice where each method succeeds or struggles.
References & Further Reading
- Karl Pearson — "On Lines and Planes of Closest Fit to Systems of Points in Space" (1901) — the original PCA paper, published in the Philosophical Magazine
- Harold Hotelling — "Analysis of a Complex of Statistical Variables into Principal Components" (1933) — named the method and formalized it for psychology
- I.T. Jolliffe — "Principal Component Analysis" 2nd ed. (2002, Springer) — the definitive textbook on PCA
- Laurens van der Maaten & Geoffrey Hinton — "Visualizing Data using t-SNE" (2008, JMLR) — the paper that made t-SNE the standard for visualization
- Leland McInnes, John Healy & James Melville — "UMAP: Uniform Manifold Approximation and Projection" (2018) — faster and better global structure than t-SNE
- Matthew Turk & Alex Pentland — "Eigenfaces for Recognition" (1991) — the landmark eigenfaces paper
- Eran Elhaik — "Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated" (2022, Scientific Reports) — demonstrates how PCA can hallucinate clusters in random data
- Hastie, Tibshirani & Friedman — "Elements of Statistical Learning" ch. 14 — rigorous treatment of unsupervised learning including PCA