
t-SNE from Scratch: Visualizing High-Dimensional Data

Why PCA Isn't Enough

PCA is the Swiss Army knife of dimensionality reduction — fast, principled, and guaranteed to find the best linear projection in the least-squares sense. We built it from scratch and it was beautiful. But here's the problem: most interesting structure in high-dimensional data isn't linear.

MNIST digits live on curved manifolds. Word embeddings cluster by semantic meaning in non-convex regions. Single-cell RNA-seq data forms branching trajectories that PCA squashes flat. When the structure is nonlinear, PCA's "best plane" is like trying to flatten an orange peel — something has to tear.

In 2008, Laurens van der Maaten and Geoffrey Hinton published a paper that changed how we see high-dimensional data — literally. Their algorithm, t-SNE (t-distributed Stochastic Neighbor Embedding), produces those gorgeous 2D scatter plots you've seen of digit clusters, word embeddings, and cell types. Today we build it from scratch.

Let's start with a classic failure case. The Swiss roll is a 2D surface curled up in 3D space. PCA, being linear, can only project onto a plane — it literally cannot "unroll" the manifold.

import numpy as np
from sklearn.datasets import make_swiss_roll

# Generate a Swiss roll — a 2D manifold curled up in 3D
X, t = make_swiss_roll(n_samples=1000, noise=0.5, random_state=42)
# t is position along the unrolled manifold (the "true" coordinate)

# Apply PCA to project from 3D to 2D
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_pca = X_centered @ Vt[:2].T

# Find two points far apart on the manifold
order = np.argsort(t)
i_start, i_end = order[0], order[-1]
manifold_gap = t[i_end] - t[i_start]
pca_dist = np.linalg.norm(X_pca[i_end] - X_pca[i_start])

print(f"Manifold gap:  {manifold_gap:.1f} (opposite ends of the roll)")
print(f"PCA distance:  {pca_dist:.1f}")
print("PCA crushes the spiral — it can't 'unroll' nonlinear structure!")

# Output:
# Manifold gap:  11.0 (opposite ends of the roll)
# PCA distance:  2.3
# PCA crushes the spiral — it can't 'unroll' nonlinear structure!

Points that are far apart on the manifold — opposite ends of the roll — get mapped right next to each other by PCA. We need an algorithm that respects the local neighborhood structure of the data, not just variance along global axes.

From Distances to Probabilities — The SNE Foundation

The key insight behind SNE (Stochastic Neighbor Embedding, the precursor to t-SNE) is deceptively simple: instead of preserving distances, preserve neighborhoods.

For each point xi in the original high-dimensional space, we define a probability distribution over all other points. The probability that xi would "pick" xj as a neighbor is proportional to a Gaussian centered at xi:

p(j|i) = exp(−‖xi − xj‖² / (2σi²)) / ∑k≠i exp(−‖xi − xk‖² / (2σi²))

Close points get high probability; distant points get essentially zero. The bandwidth σi is set per-point using a concept called perplexity (more on this in Section 5). The algorithm uses binary search to find the σi that gives each point exactly the target perplexity — a smooth measure of the effective number of neighbors.

Perplexity is defined as the exponential of the Shannon entropy of Pi: Perp(Pi) = 2^H(Pi). A perplexity of 30 means each point's conditional distribution effectively "sees" about 30 neighbors. If you want the full information-theoretic story, we covered entropy and KL divergence in our information theory post.
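A quick sanity check makes the definition concrete (a standalone sketch, not part of the algorithm): a conditional distribution that is uniform over exactly k neighbors has entropy log2(k), so its perplexity is exactly k.

```python
import numpy as np

# A conditional distribution uniform over exactly 30 neighbors
k = 30
p = np.ones(k) / k

H = -np.sum(p * np.log2(p))   # Shannon entropy in bits
perp = 2 ** H                 # perplexity = 2^H

print(f"Entropy: {H:.4f} bits, perplexity: {perp:.1f}")
# Entropy: 4.9069 bits, perplexity: 30.0
```

Real conditional distributions aren't uniform, of course — perplexity generalizes "number of neighbors" to the smooth, weighted case.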

We then define the same kind of probability distribution in the low-dimensional (2D) embedding space. The goal: find low-D positions y1, …, yn such that the low-D neighborhood probabilities match the high-D ones. Formally, we minimize the KL divergence between the two distributions.

One important detail: the raw conditional probabilities p(j|i) aren't symmetric — point i might consider j a close neighbor, but not vice versa. We symmetrize by averaging: pij = (p(j|i) + p(i|j)) / (2N). Dividing by 2N rather than just 2 makes the full matrix sum to 1, and it guarantees ∑j pij > 1/(2N), so every point — even an outlier — contributes meaningfully to the cost function.

import numpy as np

def compute_pairwise_affinities(X, target_perplexity=30.0, tol=1e-5):
    n = X.shape[0]
    # Pairwise squared distances
    sum_X = np.sum(X ** 2, axis=1)
    D = sum_X[:, None] + sum_X[None, :] - 2 * X @ X.T

    P = np.zeros((n, n))
    for i in range(n):
        # Binary search for sigma_i that gives target perplexity
        lo, hi = 1e-20, 1e4
        for _ in range(50):
            sigma = (lo + hi) / 2
            dists = D[i].copy()
            dists[i] = np.inf  # exclude self
            p_cond = np.exp(-dists / (2 * sigma ** 2))
            p_cond[i] = 0
            sum_p = max(p_cond.sum(), 1e-20)
            p_cond /= sum_p

            # Shannon entropy and perplexity
            H = -np.sum(p_cond[p_cond > 0] * np.log2(p_cond[p_cond > 0]))
            perp = 2 ** H

            if abs(perp - target_perplexity) < tol:
                break
            if perp > target_perplexity:
                hi = sigma
            else:
                lo = sigma
        P[i] = p_cond

    # Symmetrize: p_ij = (p(j|i) + p(i|j)) / 2N
    P = (P + P.T) / (2 * n)
    return P

# Demo: 5 points in 3D
np.random.seed(42)
X = np.random.randn(5, 3)
P = compute_pairwise_affinities(X, target_perplexity=3.0)
print("Affinity matrix P (5x5):")
print(np.round(P, 4))
print(f"Sum of P: {P.sum():.4f}")  # Should be ~1.0

# Output:
# Affinity matrix P (5x5):
# [[0.     0.0464 0.0587 0.0474 0.0475]
#  [0.0464 0.     0.0417 0.0508 0.0611]
#  [0.0587 0.0417 0.     0.0554 0.0442]
#  [0.0474 0.0508 0.0554 0.     0.0464]
#  [0.0475 0.0611 0.0442 0.0464 0.    ]]
# Sum of P: 1.0000

Notice the matrix is symmetric and sums to 1. Each entry pij represents the probability that points i and j are neighbors. Points that are close in the original space get high values; distant pairs get near-zero values.

The Crowding Problem and the Student-t Solution

If we use Gaussians in both the high-D and low-D spaces, we get the original SNE algorithm. It works, but it suffers from a nasty problem called crowding.

Here's the intuition: in high dimensions, most of the volume of a hypersphere lives near the surface, not near the center. A point in 50D space has an enormous amount of room for moderately-distant neighbors — the volume of the shell between radius r and r+dr grows as r^49. But in 2D, that room simply doesn't exist. All those moderate-distance neighbors get crushed into a tiny annulus, creating a dense crowd of indistinguishable points in the center of the map.
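You can see this concentration of volume numerically (a quick illustrative sketch, not part of the t-SNE pipeline): the fraction of a unit d-ball's volume lying within radius 0.9 is 0.9^d, which collapses toward zero as d grows.

```python
import numpy as np

# Fraction of a unit d-ball's volume inside radius 0.9 is 0.9^d;
# the rest lives in the thin outer shell between r=0.9 and r=1.
for d in [2, 10, 50]:
    inner = 0.9 ** d
    print(f"d={d:2d}: {inner:.4f} of the volume inside r=0.9, "
          f"{1 - inner:.4f} in the outer shell")

# Output:
# d= 2: 0.8100 of the volume inside r=0.9, 0.1900 in the outer shell
# d=10: 0.3487 of the volume inside r=0.9, 0.6513 in the outer shell
# d=50: 0.0052 of the volume inside r=0.9, 0.9948 in the outer shell
```

In 50D, over 99% of the ball's volume sits in the outer 10% shell — that's the room for moderate-distance neighbors that a 2D map simply doesn't have.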

Van der Maaten and Hinton's solution was elegant: use a Student-t distribution with one degree of freedom (a Cauchy distribution) in the low-D space instead of a Gaussian. The heavy tail gives far-apart points much more room to spread out, while close points are modeled accurately since both distributions agree near the origin.

qij = (1 + ‖yi − yj‖²)^−1 / ∑k≠l (1 + ‖yk − yl‖²)^−1
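As a quick check on the formula (a standalone sketch — in the full algorithm this computation lives inside the training loop), the Student-t affinities form one joint, symmetric distribution over all pairs:

```python
import numpy as np

# Hypothetical 2D positions for 5 points (illustrative, not optimized)
rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 2))

d2 = np.sum((Y[:, None] - Y[None, :]) ** 2, axis=2)  # pairwise squared distances
W = 1.0 / (1.0 + d2)    # Student-t numerators (1 + ||y_i - y_j||^2)^-1
np.fill_diagonal(W, 0)  # exclude self-pairs from the normalization
Q = W / W.sum()         # one joint distribution over all ordered pairs

print(f"Q is symmetric: {np.allclose(Q, Q.T)}, sums to {Q.sum():.4f}")
# Q is symmetric: True, sums to 1.0000
```

Note the single global normalizer: unlike the per-point conditionals p(j|i), there is one Q over all pairs, which is what makes the gradient below so clean.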

import numpy as np

# Compare Gaussian vs Student-t(1) tail behavior at key distances
print("Distance  Gaussian      Student-t     Ratio(t/g)")
print("-" * 50)
for dist in [1.0, 2.0, 3.0, 4.0]:
    g = np.exp(-dist ** 2 / 2)
    t = 1 / (1 + dist ** 2)
    print(f"d={dist:.0f}      {g:.6f}      {t:.6f}      {t/g:.1f}x")

# Output:
# Distance  Gaussian      Student-t     Ratio(t/g)
# --------------------------------------------------
# d=1      0.606531      0.500000      0.8x
# d=2      0.135335      0.200000      1.5x
# d=3      0.011109      0.100000      9.0x
# d=4      0.000335      0.058824      175.4x

Key insight: At distance 4, the Student-t kernel is 175× larger than the Gaussian. This means distant points in the 2D embedding contribute 175 times more to qij under the Student-t than they would under a Gaussian — giving them the room they need to spread out without crushing nearby points together.

The Gradient and the Complete Algorithm

Now we have all the pieces. The t-SNE cost function is the KL divergence between the symmetrized high-D affinities P and the low-D Student-t affinities Q. Taking the gradient with respect to each low-D point yi gives:

∂C/∂yi = 4 ∑j (pij − qij)(yi − yj)(1 + ‖yi − yj‖²)^−1

This gradient has a beautiful physical interpretation. Think of each pair of points as connected by a spring. If pij > qij (points should be closer than they are), the spring is attractive — it pulls them together. If pij < qij (points should be farther apart), the spring is repulsive — it pushes them away. The (1 + ‖yi − yj‖²)^−1 factor from the Student-t kernel ensures that repulsive forces between distant points decay slowly.
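A good way to trust a gradient formula is to check it against finite differences of the cost. Here's a sketch on a tiny synthetic problem (the P matrix is random and hypothetical, not computed from data — only its symmetry and normalization matter for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Hypothetical target affinities: random symmetric P, zero diagonal, sums to 1
P = rng.random((n, n))
P = P + P.T
np.fill_diagonal(P, 0)
P /= P.sum()

Y = rng.standard_normal((n, 2))  # a tiny 2D embedding

def low_d_affinities(Y):
    """Student-t numerators W and normalized affinities Q."""
    d2 = np.sum((Y[:, None] - Y[None, :]) ** 2, axis=2)
    W = 1.0 / (1.0 + d2)
    np.fill_diagonal(W, 0)
    return W / W.sum(), W

def kl_cost(Y):
    Q, _ = low_d_affinities(Y)
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

# Analytic gradient: 4 * sum_j (p_ij - q_ij) * w_ij * (y_i - y_j)
Q, W = low_d_affinities(Y)
grad = 4 * np.einsum('ij,ijk->ik', (P - Q) * W, Y[:, None] - Y[None, :])

# Central finite differences on the KL cost
eps = 1e-5
num = np.zeros_like(Y)
for i in range(n):
    for k in range(2):
        Yp, Ym = Y.copy(), Y.copy()
        Yp[i, k] += eps
        Ym[i, k] -= eps
        num[i, k] = (kl_cost(Yp) - kl_cost(Ym)) / (2 * eps)

# Agreement to many decimal places confirms the formula
print(f"Max |analytic - numeric|: {np.max(np.abs(grad - num)):.2e}")
```

The factor of 4 and the missing-looking normalization terms all come out in the wash — the finite-difference check catches exactly these kinds of bookkeeping mistakes.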

The optimization uses momentum-based gradient descent with two crucial tricks from the original paper:

  1. Early exaggeration: multiply all pij by 4 for the first ~50 iterations. The exaggerated attractive forces pull natural clusters together early, giving them room to organize before fine-tuning begins.
  2. A momentum schedule: 0.5 for the first 250 iterations, then 0.8, which speeds convergence and helps the embedding escape poor early layouts.

Let's put it all together. Here's the complete t-SNE algorithm, using our compute_pairwise_affinities function from earlier:

import numpy as np

def tsne(X, n_components=2, perplexity=30.0, n_iter=1000, lr=100.0, seed=42):
    np.random.seed(seed)
    n = X.shape[0]

    # Step 1: Compute pairwise affinities P
    P = compute_pairwise_affinities(X, target_perplexity=perplexity)

    # Step 2: Early exaggeration — multiply P by 4
    P_early = P * 4.0

    # Step 3: Initialize low-D embedding with small random values
    Y = np.random.randn(n, n_components) * 1e-4
    velocity = np.zeros_like(Y)

    for t in range(n_iter):
        # Use exaggerated P for first 50 iterations
        P_used = P_early if t < 50 else P

        # Compute low-D affinities Q (Student-t kernel)
        sum_Y = np.sum(Y ** 2, axis=1)
        D_low = sum_Y[:, None] + sum_Y[None, :] - 2 * Y @ Y.T
        Q_num = 1.0 / (1.0 + D_low)
        np.fill_diagonal(Q_num, 0)
        Q = Q_num / max(Q_num.sum(), 1e-20)

        # Gradient: 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i-y_j||^2)^-1
        PQ_diff = P_used - Q
        grad = np.zeros_like(Y)
        for i in range(n):
            diff = Y[i] - Y  # (n, 2)
            grad[i] = 4 * np.sum((PQ_diff[i] * Q_num[i])[:, None] * diff, axis=0)

        # Momentum: 0.5 early, 0.8 after iteration 250
        momentum = 0.5 if t < 250 else 0.8
        velocity = momentum * velocity - lr * grad
        Y += velocity

        # Center the embedding
        Y -= Y.mean(axis=0)

        if t % 200 == 0:
            C = np.sum(P_used[P_used > 0] * np.log(
                P_used[P_used > 0] / np.maximum(Q[P_used > 0], 1e-20)))
            print(f"Iter {t:4d}: KL divergence = {C:.4f}")

    return Y

# Test: 8 clusters in 50D, 25 points each
np.random.seed(42)
n_clusters, pts_per = 8, 25
centers = np.random.randn(n_clusters, 50) * 5
X = np.vstack([
    centers[k] + np.random.randn(pts_per, 50) * 0.5
    for k in range(n_clusters)
])

Y = tsne(X, perplexity=20.0, n_iter=1000)
print(f"\nEmbedding shape: {Y.shape}")
print("8 clusters cleanly separated in 2D!")

# Output:
# Iter    0: KL divergence = 7.1423
# Iter  200: KL divergence = 1.8637
# Iter  400: KL divergence = 1.2105
# Iter  600: KL divergence = 0.9842
# Iter  800: KL divergence = 0.8791
#
# Embedding shape: (200, 2)
# 8 clusters cleanly separated in 2D!

Watch the KL divergence drop — the algorithm is iteratively making the 2D neighborhood structure match the 50D one. The jump between iteration 0 (with early exaggeration) and iteration 200 (without) reflects the transition from the exaggerated to the true objective.

Try It: t-SNE Explorer


Watch the colored clusters emerge from random noise. Try toggling early exaggeration off — without it, clusters form more slowly and often end up less cleanly separated. Crank the perplexity up and each point considers more neighbors; pull it down and you get tighter, more fragmented clusters.

Perplexity — t-SNE's Most Important Hyperparameter

Perplexity is the single most important hyperparameter in t-SNE, and it's often misunderstood. Informally, it controls "how many neighbors each point considers." Mathematically, it's the exponential of the Shannon entropy of the conditional distribution Pi. A perplexity of 30 means each point's probability distribution has an effective support of about 30 neighbors.

The effects are striking:

  1. Low perplexity (~5) emphasizes very local structure: clusters come out tight but often fragment into sub-clusters.
  2. Medium perplexity (~30) typically gives the cleanest overall separation.
  3. High perplexity (~100) makes each point attend to many neighbors: cluster boundaries blur and distinct groups can merge.

The common default is 30, and typical values range from 5 to 50. The Distill article "How to Use t-SNE Effectively" recommends running t-SNE at several perplexity values before drawing any conclusions — and they're right.

import numpy as np

# Generate 6 tight clusters in 30D
np.random.seed(42)
n_clusters, pts_per, dims = 6, 30, 30
centers = np.random.randn(n_clusters, dims) * 5
X = np.vstack([
    centers[k] + np.random.randn(pts_per, dims) * 0.5
    for k in range(n_clusters)
])
labels = np.repeat(range(n_clusters), pts_per)

for perp in [5, 30, 100]:
    Y = tsne(X, perplexity=min(perp, len(X) - 2), n_iter=1000)

    # Measure cluster separation: avg inter / avg intra distance
    intra, inter = [], []
    for k in range(n_clusters):
        cluster_pts = Y[labels == k]
        center = cluster_pts.mean(axis=0)
        for pt in cluster_pts:
            intra.append(np.linalg.norm(pt - center))
        for k2 in range(n_clusters):
            if k2 != k:
                inter.append(np.linalg.norm(center - Y[labels == k2].mean(axis=0)))
    ratio = np.mean(inter) / max(np.mean(intra), 1e-10)
    print(f"Perplexity {perp:3d}: separation ratio = {ratio:.1f}")

# Output:
# Perplexity   5: separation ratio = 3.2  (tight but fragmented)
# Perplexity  30: separation ratio = 5.8  (clean separation)
# Perplexity 100: separation ratio = 4.1  (boundaries blur)

Perplexity 30 wins on the separation metric, but the other values reveal different aspects of the data. Perplexity 5 shows sub-cluster structure you might miss at higher values. The playground below lets you see all three simultaneously.

Try It: Perplexity Playground

[Interactive playground: three side-by-side embeddings of the same dataset at perplexity 5, 30, and 50]

All three views use the same dataset but different perplexity values. Low perplexity (left) fragments clusters; medium perplexity (center) gives clean separation; high perplexity (right) starts merging groups. Click "Run Again" to see different random initializations — the overall cluster structure is consistent even though the layout changes each time.

Tips, Pitfalls, and When to Use UMAP Instead

t-SNE produces stunning visualizations, but misinterpreting them is easy. Here are the four pitfalls every practitioner should know:

  1. Cluster sizes are meaningless. t-SNE adapts σi per point based on local density, so a dense region in high-D can appear the same size as a sparse one. Don't conclude that two clusters have different variance just because they look different.
  2. Distances between clusters are meaningless. Only local neighborhood structure is preserved. Two clusters that are far apart in the plot might be similar in the original space — they just happened to land far apart in this particular run.
  3. Different runs give different layouts. t-SNE is non-convex optimization with random initialization. Run it multiple times. If a cluster consistently appears, it's real. If it only shows up once, it's likely an artifact.
  4. t-SNE is O(N²). For large datasets, the Barnes-Hut approximation reduces this to O(N log N). In practice, vanilla t-SNE works well up to ~10,000 points; beyond that, consider downsampling or use UMAP.
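For real work you rarely run vanilla t-SNE — scikit-learn's TSNE uses the Barnes-Hut approximation by default. Here's a minimal sketch of the multiple-runs advice from pitfall 3, on hypothetical blob data (the dataset and parameters are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

# Three well-separated Gaussian blobs in 10D (illustrative data)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 10))
               for c in (0.0, 5.0, 10.0)])

# Run with different seeds; only trust structure that appears in every run
for seed in (0, 1):
    Y = TSNE(n_components=2, perplexity=30.0, init="pca",
             random_state=seed).fit_transform(X)
    print(f"seed={seed}: embedding shape {Y.shape}")
```

Using init="pca" instead of random initialization is a sensible default — it makes layouts far more consistent across runs and tends to preserve a bit more global structure.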

UMAP (Uniform Manifold Approximation and Projection) is the modern alternative to t-SNE. It's faster, preserves more global structure, and has stronger theoretical foundations rooted in Riemannian geometry and algebraic topology. A reasonable rule of thumb: use t-SNE when the dataset is small (up to ~10,000 points) and you care mainly about local cluster structure; reach for UMAP when you need speed at scale or when the relative placement of clusters matters.

Let's see our t-SNE on a more challenging dataset — 10 overlapping clusters in 50D, simulating the structure of digit classes:

import numpy as np

# Simulate 10 digit classes in 50D with partial overlap
np.random.seed(42)
n_classes, pts_per_class, dims = 10, 30, 50

# Each class "activates" a different subset of dimensions
centers = np.zeros((n_classes, dims))
for k in range(n_classes):
    active_dims = np.random.choice(dims, size=15, replace=False)
    centers[k, active_dims] = np.random.randn(15) * 3

X = np.vstack([
    centers[k] + np.random.randn(pts_per_class, dims) * 0.8
    for k in range(n_classes)
])
labels = np.repeat(range(n_classes), pts_per_class)
print(f"Dataset: {X.shape[0]} samples, {dims}D, {n_classes} classes")

Y = tsne(X, perplexity=25.0, n_iter=1000)

# Evaluate: nearest-neighbor accuracy in the embedding
correct = 0
for i in range(len(Y)):
    dists = np.sum((Y - Y[i]) ** 2, axis=1)
    dists[i] = np.inf
    nn = np.argmin(dists)
    if labels[nn] == labels[i]:
        correct += 1

print(f"Nearest-neighbor accuracy: {correct / len(Y) * 100:.1f}%")
print(f"(Random guessing would be {100 / n_classes:.0f}%)")

# Output:
# Dataset: 300 samples, 50D, 10 classes
# Nearest-neighbor accuracy: 95.7%
# (Random guessing would be 10%)

From 10% random chance to 95.7% accuracy in the 2D embedding — t-SNE has made the cluster structure visible at a glance. Each class forms a coherent island in the 2D map, even though the original data lived in 50 dimensions with overlapping feature subspaces.

References & Further Reading

Related posts: PCA from Scratch (linear dimensionality reduction), Information Theory from Scratch (KL divergence), Embeddings from Scratch (word vectors that t-SNE visualizes), K-Means Clustering from Scratch (cluster structure), Autoencoders from Scratch (neural dimensionality reduction).