Autoencoders & VAEs from Scratch: How Neural Networks Learn to Compress and Imagine
The Compression Instinct
The diffusion post ended with a tantalizing hint: “Stable Diffusion doesn’t diffuse in pixel space at all. It first compresses images to a small latent space using a pre-trained autoencoder (VAE), does the diffusion there, then decodes the result back to pixels.” We handwaved over the most critical piece — where does that latent space come from?
The answer is an autoencoder: a neural network that learns to compress data through a bottleneck and reconstruct it on the other side. And its probabilistic sibling, the Variational Autoencoder (VAE), turns that bottleneck into a smooth, sampleable space where new data can be generated — not just reconstructed.
This distinction matters. A regular autoencoder is like a zip file — it compresses faithfully but can’t create new content. A VAE is like an artist who studied thousands of faces: it doesn’t just remember them, it understands the space of faces well enough to draw ones it’s never seen. That leap from compression to imagination is the subject of this post.
We’ll build both from scratch in Python and NumPy — vanilla autoencoders first, then VAEs — training on tiny 8×8 digit images so you can see every gradient flow and every latent dimension. By the end, you’ll understand exactly how Stable Diffusion’s VAE creates the 48× compressed latent space where diffusion happens.
The Simplest Autoencoder: Compress and Reconstruct
An autoencoder has two halves joined by a bottleneck. The encoder takes input x (say, a 64-pixel image flattened to a vector of length 64) and compresses it to a small latent vector z of length, say, 2. The decoder takes that tiny z and tries to reconstruct the original x. The entire network trains end-to-end by minimizing the reconstruction error: how different is the output from the input?
z = encoder(x)
x̂ = decoder(z)
Loss = ‖x − x̂‖² (mean squared error)
The bottleneck is the key. If the latent dimension equals the input dimension, the network can just learn the identity function — pass everything through unchanged. But when you force 64 dimensions through a 2-dimensional bottleneck, the encoder must decide what information matters. It learns a compressed representation, and the decoder learns to reconstruct from that summary alone.
This is conceptually similar to PCA (principal component analysis), which also projects data to a lower-dimensional subspace that preserves maximum variance. A linear autoencoder with MSE loss will discover exactly the same subspace as PCA. But with nonlinear activations (ReLU, tanh), autoencoders can learn curved manifolds that PCA’s flat projections miss entirely.
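To make the PCA comparison concrete, here is a small standalone sketch (toy data, not the post's digit set) that computes a PCA reconstruction via SVD, the flat-subspace baseline a linear autoencoder with MSE loss would converge to:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 points that live (up to small noise) in a 2-D subspace of R^8
basis = rng.standard_normal((2, 8))
data = rng.standard_normal((200, 2)) @ basis + 0.01 * rng.standard_normal((200, 8))

# PCA via SVD on centered data: rows of Vt are the principal directions
mean = data.mean(axis=0)
_, _, Vt = np.linalg.svd(data - mean, full_matrices=False)

def pca_reconstruct(X, k):
    """Project onto the top-k principal directions and map back."""
    W = Vt[:k]                                   # (k, 8)
    return (X - mean) @ W.T @ W + mean

err_k1 = np.mean((data - pca_reconstruct(data, 1)) ** 2)
err_k2 = np.mean((data - pca_reconstruct(data, 2)) ** 2)
# err_k2 is tiny because the data really is ~2-D; err_k1 is much larger
```

A linear autoencoder with a 2-D bottleneck would recover the same subspace as `pca_reconstruct(·, 2)`, though not necessarily the same orthonormal basis within it.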
import numpy as np
class Autoencoder:
"""Vanilla autoencoder: 64 → 32 → 2 → 32 → 64."""
def __init__(self, input_dim=64, hidden_dim=32, latent_dim=2):
scale = np.sqrt(2.0 / input_dim) # He initialization
# Encoder weights
self.W_enc1 = np.random.randn(input_dim, hidden_dim) * scale
self.b_enc1 = np.zeros(hidden_dim)
self.W_enc2 = np.random.randn(hidden_dim, latent_dim) * np.sqrt(2.0 / hidden_dim)
self.b_enc2 = np.zeros(latent_dim)
# Decoder weights
self.W_dec1 = np.random.randn(latent_dim, hidden_dim) * np.sqrt(2.0 / latent_dim)
self.b_dec1 = np.zeros(hidden_dim)
self.W_dec2 = np.random.randn(hidden_dim, input_dim) * np.sqrt(2.0 / hidden_dim)
self.b_dec2 = np.zeros(input_dim)
def encode(self, x):
"""x → hidden → latent."""
h = np.maximum(0, x @ self.W_enc1 + self.b_enc1) # ReLU
z = h @ self.W_enc2 + self.b_enc2 # linear (no activation)
return z, h # return h for backprop
def decode(self, z):
"""latent → hidden → reconstruction."""
h = np.maximum(0, z @ self.W_dec1 + self.b_dec1) # ReLU
x_hat = np.clip(h @ self.W_dec2 + self.b_dec2, 0, 1) # sigmoid-like clamp
return x_hat, h
def forward(self, x):
"""Full forward pass: encode then decode."""
z, enc_h = self.encode(x)
x_hat, dec_h = self.decode(z)
return x_hat, z, enc_h, dec_h
The architecture is symmetric: input → 32 hidden units → 2 latent dimensions → 32 hidden units → output. ReLU activations add nonlinearity, and the latent layer is linear (no activation) so the latent space isn’t artificially bounded.
Notice there’s no magic here. It’s just two small neural networks (the encoder and decoder) connected through a narrow bottleneck. The magic emerges from training.
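Before training, it is worth sanity-checking the shapes. This standalone sketch (random weights, no learning) traces one image through the same 64 → 32 → 2 → 32 → 64 pipeline the class implements:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(64)                                   # one flattened 8x8 image in [0, 1]

W_enc1 = rng.standard_normal((64, 32)) * np.sqrt(2 / 64)
W_enc2 = rng.standard_normal((32, 2)) * np.sqrt(2 / 32)
W_dec1 = rng.standard_normal((2, 32)) * np.sqrt(2 / 2)
W_dec2 = rng.standard_normal((32, 64)) * np.sqrt(2 / 32)

enc_h = np.maximum(0, x @ W_enc1)                    # encoder hidden: (32,)
z = enc_h @ W_enc2                                   # latent code:    (2,)
dec_h = np.maximum(0, z @ W_dec1)                    # decoder hidden: (32,)
x_hat = np.clip(dec_h @ W_dec2, 0, 1)                # reconstruction: (64,)
```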
Training: Learning to Reconstruct
Training is straightforward: feed in images, compute the reconstruction, measure the MSE loss, and backpropagate gradients through the entire encoder-decoder pipeline. We train on 8×8 grayscale digit images — the same tiny format used in the CNN and ViT posts, keeping everything CPU-friendly.
def train_autoencoder(model, data, epochs=200, lr=0.005):
"""Train with MSE loss and simple SGD + momentum."""
N = len(data)
velocity = {} # momentum buffers for each parameter
for name in ['W_enc1','b_enc1','W_enc2','b_enc2',
'W_dec1','b_dec1','W_dec2','b_dec2']:
velocity[name] = np.zeros_like(getattr(model, name))
for epoch in range(epochs):
# Shuffle training data
perm = np.random.permutation(N)
total_loss = 0.0
for i in perm:
x = data[i] # shape (64,) — one 8×8 image flattened
# Forward pass
x_hat, z, enc_h, dec_h = model.forward(x)
# MSE loss: (1/d) * sum((x - x_hat)^2)
diff = x_hat - x
loss = np.mean(diff ** 2)
total_loss += loss
# Backward pass (chain rule through decoder then encoder)
d_x_hat = 2.0 * diff / len(x)
# Decoder layer 2: x_hat = clip(dec_h @ W_dec2 + b_dec2)
# Gradient of clip: 1 where 0 < output < 1, else 0
mask_out = ((x_hat > 0) & (x_hat < 1)).astype(float)
d_pre_clip = d_x_hat * mask_out
d_W_dec2 = np.outer(dec_h, d_pre_clip)
d_b_dec2 = d_pre_clip
d_dec_h = d_pre_clip @ model.W_dec2.T
# Decoder layer 1: dec_h = relu(z @ W_dec1 + b_dec1)
d_dec_h *= (dec_h > 0).astype(float) # ReLU gradient
d_W_dec1 = np.outer(z, d_dec_h)
d_b_dec1 = d_dec_h
d_z = d_dec_h @ model.W_dec1.T
# Encoder layer 2: z = enc_h @ W_enc2 + b_enc2
d_W_enc2 = np.outer(enc_h, d_z)
d_b_enc2 = d_z
d_enc_h = d_z @ model.W_enc2.T
# Encoder layer 1: enc_h = relu(x @ W_enc1 + b_enc1)
d_enc_h *= (enc_h > 0).astype(float)
d_W_enc1 = np.outer(x, d_enc_h)
d_b_enc1 = d_enc_h
# SGD update with momentum (0.9)
grads = {'W_enc1': d_W_enc1, 'b_enc1': d_b_enc1,
'W_enc2': d_W_enc2, 'b_enc2': d_b_enc2,
'W_dec1': d_W_dec1, 'b_dec1': d_b_dec1,
'W_dec2': d_W_dec2, 'b_dec2': d_b_dec2}
for name, grad in grads.items():
velocity[name] = 0.9 * velocity[name] - lr * grad
param = getattr(model, name)
setattr(model, name, param + velocity[name])
if epoch % 50 == 0:
print(f"Epoch {epoch:3d} Loss: {total_loss/N:.6f}")
# ae = Autoencoder(input_dim=64, latent_dim=2)
# train_autoencoder(ae, digit_images)
# → Epoch 0 Loss: 0.081432
# → Epoch 50 Loss: 0.014271
# → Epoch 100 Loss: 0.009856
# → Epoch 150 Loss: 0.008103
The loss drops steadily. After 200 epochs, the autoencoder reconstructs digits with reasonable fidelity — not perfect (we did compress 64 dimensions to 2), but recognizable. The strokes, curves, and rough shapes survive the bottleneck. What doesn’t survive is fine detail: individual pixel variations get smoothed out because the 2D bottleneck simply can’t encode them.
This tradeoff between compression and fidelity is governed by the bottleneck dimension. With 2 latent dims, we lose a lot but gain the ability to visualize the latent space. With 16 dims, reconstructions would be nearly perfect, but we’d lose the 2D visualization. Production autoencoders (like Stable Diffusion’s) use much higher dimensions — but the principle is identical.
What Lives in the Latent Space?
Here’s where it gets interesting. After training, we can encode every image in our dataset and plot the 2D latent vectors. What emerges?
Clusters. Each digit type (0, 1, 2, …, 9) forms its own cluster in the latent space. The network was never told which digit is which — it received no labels — yet it discovered that different digit shapes should map to different regions. This is unsupervised representation learning: the compression objective alone forces the network to find meaningful structure.
But look between the clusters. Those gaps are the problem. If you pick a point in one of those empty regions and try to decode it, you get garbage — blurry noise, half-formed shapes, artifacts. The decoder was never trained on inputs from those regions because no encoded data point ever landed there.
A vanilla autoencoder is like a map with labeled cities but empty wilderness between them. You can navigate to any city (reconstruct any training example), but try to visit the wilderness (sample a random point) and you’re lost.
This is the fundamental limitation: vanilla autoencoders are excellent at compression but useless for generation. You can’t sample new data by picking random latent points because most of the latent space is meaningless void. The decoder only produces sensible output for the specific points where training data was encoded.
We need a way to make the entire latent space meaningful — to fill in the gaps, smooth the boundaries, and turn the latent space from a sparse scattering of islands into a continuous landscape. Enter the Variational Autoencoder.
From Points to Distributions: The Variational Leap
The fix is both elegant and counterintuitive: add noise to the encoding process. Instead of mapping each input x to a single point z, the encoder outputs the parameters of a probability distribution — a mean μ and a log-variance log(σ²) — and then samples z from that distribution.
Sample: z = μ + σ ⊙ ε, where ε ∼ N(0, I)
(the “reparameterization trick”)
Why log-variance instead of variance directly? Numerical stability. The log-variance can be any real number (−∞ to +∞), while variance must be positive. The network outputs whatever it wants, and we exponentiate to get σ² = exp(log(σ²)).
The sampling z = μ + σ · ε is the famous reparameterization trick from Kingma and Welling’s 2013 paper. It’s what makes VAEs trainable. Naively, you can’t backpropagate through a random sampling operation — gradients don’t flow through randomness. But by rewriting the sampling as a deterministic function of μ, σ, and external noise ε, the randomness becomes an input rather than an operation. Gradients flow through μ and σ as usual, and ε is just a constant noise vector sampled separately.
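The claim that gradients flow cleanly through μ and σ can be verified numerically. A minimal sketch: fix ε once, define a toy objective on z = μ + σ·ε, and compare the chain-rule gradients to finite differences (the objective f here is illustrative, not part of the VAE below):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.standard_normal()            # noise drawn once, treated as a constant input

def f(mu, log_var):
    """Toy objective on a reparameterized sample z = mu + sigma * eps."""
    z = mu + np.exp(0.5 * log_var) * eps
    return z ** 2

mu, log_var = 0.7, -0.3
sigma = np.exp(0.5 * log_var)
z = mu + sigma * eps

# Chain rule: dz/dmu = 1, dz/dlog_var = 0.5 * sigma * eps
g_mu = 2 * z
g_lv = 2 * z * 0.5 * sigma * eps

# Finite-difference check
h = 1e-6
num_mu = (f(mu + h, log_var) - f(mu - h, log_var)) / (2 * h)
num_lv = (f(mu, log_var + h) - f(mu, log_var - h)) / (2 * h)
```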
class VAE:
"""Variational Autoencoder: 64 → 32 → (μ, logvar) → z → 32 → 64."""
def __init__(self, input_dim=64, hidden_dim=32, latent_dim=2):
scale = np.sqrt(2.0 / input_dim)
# Encoder: shared hidden layer, then two heads
self.W_enc1 = np.random.randn(input_dim, hidden_dim) * scale
self.b_enc1 = np.zeros(hidden_dim)
s2 = np.sqrt(2.0 / hidden_dim)
self.W_mu = np.random.randn(hidden_dim, latent_dim) * s2
self.b_mu = np.zeros(latent_dim)
self.W_logvar = np.random.randn(hidden_dim, latent_dim) * s2
self.b_logvar = np.zeros(latent_dim)
# Decoder (identical to vanilla AE)
self.W_dec1 = np.random.randn(latent_dim, hidden_dim) * np.sqrt(2.0 / latent_dim)
self.b_dec1 = np.zeros(hidden_dim)
self.W_dec2 = np.random.randn(hidden_dim, input_dim) * s2
self.b_dec2 = np.zeros(input_dim)
def encode(self, x):
"""x → hidden → (μ, log_var)."""
h = np.maximum(0, x @ self.W_enc1 + self.b_enc1)
mu = h @ self.W_mu + self.b_mu
log_var = h @ self.W_logvar + self.b_logvar
return mu, log_var, h
def reparameterize(self, mu, log_var):
"""Sample z = μ + σ * ε, where ε ~ N(0,1)."""
std = np.exp(0.5 * log_var) # σ = exp(log_var / 2)
eps = np.random.randn(*mu.shape) # ε ~ N(0,1)
z = mu + std * eps
return z, eps
def decode(self, z):
"""z → hidden → reconstruction."""
h = np.maximum(0, z @ self.W_dec1 + self.b_dec1)
x_hat = 1.0 / (1.0 + np.exp(-(h @ self.W_dec2 + self.b_dec2))) # sigmoid
return x_hat, h
def forward(self, x):
"""Encode → sample → decode."""
mu, log_var, enc_h = self.encode(x)
z, eps = self.reparameterize(mu, log_var)
x_hat, dec_h = self.decode(z)
return x_hat, mu, log_var, z, eps, enc_h, dec_h
Compare this to the vanilla autoencoder. The encoder now has two output heads (W_mu and W_logvar) instead of one. The decoder is identical. The only structural difference is that sampling step in the middle. And the decoder uses sigmoid instead of clipping, since we’ll switch to binary cross-entropy loss which expects probabilities.
But this one change transforms everything. By making encoding stochastic, we force the decoder to handle a cloud of possible z values for each input, not just a single point. Nearby points in the latent space must produce similar reconstructions — because the noise means the decoder sees them anyway. The gaps between clusters start to fill in.
The ELBO: Reconstruction Meets Regularization
A VAE’s loss function has two terms that pull in opposite directions, and the tension between them creates the entire magic:
Loss = −Σᵢ [xᵢ log(x̂ᵢ) + (1−xᵢ) log(1−x̂ᵢ)] + ½ Σⱼ (μⱼ² + σⱼ² − log σⱼ² − 1)
The reconstruction loss (binary cross-entropy, since our pixel values are between 0 and 1) measures how faithfully the decoder rebuilds the input. This is the same objective as the vanilla autoencoder — it pulls the network toward perfect reconstruction.
The KL divergence measures how far the encoder’s output distribution q(z|x) deviates from a standard Gaussian prior p(z) = N(0, I). For a diagonal Gaussian against N(0, I), it has a beautiful closed form:

KL(q ‖ p) = ½ Σⱼ (μⱼ² + σⱼ² − log σⱼ² − 1)
This term penalizes the encoder for deviating from a standard Gaussian. If the encoder tries to push all “3” digits to μ = (5, 5) with tiny variance, the KL loss says “too far from the origin, too narrow — spread out!” If the encoder sets σ very large, the KL loss says “too wide — tighten up toward unit variance!”
The result is a tug-of-war. The reconstruction term wants precision — tight clusters, distinct means, small variances. The KL term wants regularity — everything centered near the origin with unit variance. The compromise produces a latent space that is both meaningful (similar inputs map nearby) and continuous (the space between clusters is smoothly filled).
def vae_loss(x, x_hat, mu, log_var):
"""Compute the VAE loss: reconstruction + KL divergence."""
# Reconstruction loss (binary cross-entropy)
eps = 1e-8 # numerical stability
bce = -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
# KL divergence: -0.5 * sum(1 + log(σ²) - μ² - σ²)
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
return bce + kl, bce, kl
That’s the entire loss function in a few lines. The negative of this loss is the ELBO (Evidence Lower Bound), a lower bound on the log-likelihood log p(x) of the data. Minimizing our loss therefore maximizes the ELBO, which means the VAE is doing principled probabilistic inference, not just ad-hoc compression.
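The closed-form KL term can be cross-checked against a Monte Carlo estimate of E_q[log q(z) − log p(z)], a quick sanity test of the formula used in vae_loss (one latent dimension, arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, log_var = 1.5, 0.4                 # parameters of q(z|x) for one latent dim
sigma = np.exp(0.5 * log_var)

# Closed form: KL(N(mu, sigma^2) || N(0, 1))
kl_closed = 0.5 * (mu**2 + sigma**2 - log_var - 1)

# Monte Carlo: average log q(z) - log p(z) over samples z ~ q
z = mu + sigma * rng.standard_normal(500_000)
log_q = -0.5 * (np.log(2 * np.pi) + log_var + (z - mu) ** 2 / sigma**2)
log_p = -0.5 * (np.log(2 * np.pi) + z**2)
kl_mc = (log_q - log_p).mean()
```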
Training the VAE: Watching Generation Emerge
The training loop is similar to the vanilla autoencoder, but we now backpropagate through the reparameterization trick and include the KL term in our gradients:
def train_vae(model, data, epochs=300, lr=0.003):
"""Train VAE with BCE + KL loss."""
N = len(data)
velocity = {}
param_names = ['W_enc1','b_enc1','W_mu','b_mu','W_logvar','b_logvar',
'W_dec1','b_dec1','W_dec2','b_dec2']
for name in param_names:
velocity[name] = np.zeros_like(getattr(model, name))
for epoch in range(epochs):
perm = np.random.permutation(N)
total_loss, total_bce, total_kl = 0., 0., 0.
for i in perm:
x = data[i]
x_hat, mu, log_var, z, eps, enc_h, dec_h = model.forward(x)
loss, bce, kl = vae_loss(x, x_hat, mu, log_var)
total_loss += loss; total_bce += bce; total_kl += kl
# ── Backward pass ──
e = 1e-8
# Gradient of BCE w.r.t. x_hat
d_x_hat = -(x / (x_hat + e) - (1 - x) / (1 - x_hat + e))
# Decoder layer 2 (sigmoid output)
d_pre_sig = d_x_hat * x_hat * (1 - x_hat) # sigmoid derivative
d_W_dec2 = np.outer(dec_h, d_pre_sig)
d_b_dec2 = d_pre_sig
d_dec_h = d_pre_sig @ model.W_dec2.T
# Decoder layer 1 (ReLU)
d_dec_h *= (dec_h > 0).astype(float)
d_W_dec1 = np.outer(z, d_dec_h)
d_b_dec1 = d_dec_h
d_z = d_dec_h @ model.W_dec1.T
# Reparameterization: z = mu + std * eps
std = np.exp(0.5 * log_var)
d_mu = d_z.copy()
d_log_var_from_z = d_z * eps * 0.5 * std # chain rule through exp
# KL gradients
d_mu += mu # d(KL)/d(mu) = mu
d_log_var_kl = 0.5 * (np.exp(log_var) - 1) # d(KL)/d(logvar)
d_log_var = d_log_var_from_z + d_log_var_kl
# Encoder head: mu = enc_h @ W_mu + b_mu
d_W_mu = np.outer(enc_h, d_mu)
d_b_mu = d_mu
d_enc_h_from_mu = d_mu @ model.W_mu.T
# Encoder head: log_var = enc_h @ W_logvar + b_logvar
d_W_logvar = np.outer(enc_h, d_log_var)
d_b_logvar = d_log_var
d_enc_h_from_lv = d_log_var @ model.W_logvar.T
# Encoder layer 1 (ReLU)
d_enc_h = (d_enc_h_from_mu + d_enc_h_from_lv)
d_enc_h *= (enc_h > 0).astype(float)
d_W_enc1 = np.outer(x, d_enc_h)
d_b_enc1 = d_enc_h
# Update all parameters
grads = {'W_enc1': d_W_enc1, 'b_enc1': d_b_enc1,
'W_mu': d_W_mu, 'b_mu': d_b_mu,
'W_logvar': d_W_logvar, 'b_logvar': d_b_logvar,
'W_dec1': d_W_dec1, 'b_dec1': d_b_dec1,
'W_dec2': d_W_dec2, 'b_dec2': d_b_dec2}
for name, grad in grads.items():
grad_clipped = np.clip(grad, -1.0, 1.0) # gradient clipping
velocity[name] = 0.9 * velocity[name] - lr * grad_clipped
setattr(model, name, getattr(model, name) + velocity[name])
if epoch % 50 == 0:
print(f"Epoch {epoch:3d} Loss: {total_loss/N:.1f} "
f"BCE: {total_bce/N:.1f} KL: {total_kl/N:.1f}")
# vae = VAE(input_dim=64, latent_dim=2)
# train_vae(vae, digit_images)
# → Epoch 0 Loss: 48.2 BCE: 44.8 KL: 3.4
# → Epoch 50 Loss: 31.6 BCE: 28.9 KL: 2.7
# → Epoch 100 Loss: 28.4 BCE: 24.8 KL: 3.6
# → Epoch 150 Loss: 26.1 BCE: 22.0 KL: 4.1
# → Epoch 200 Loss: 25.3 BCE: 20.9 KL: 4.4
# → Epoch 250 Loss: 24.8 BCE: 20.2 KL: 4.6
Watch the training dynamics. The reconstruction loss (BCE) drops steadily as the decoder improves. The KL divergence dips at first, then climbs as the encoder spreads its representations away from the origin to carve out distinct regions for each digit class, and finally stabilizes as the reconstruction and regularization objectives reach an equilibrium.
Now comes the moment of truth. Encode all the training images and plot them in the 2D latent space. Compare the vanilla autoencoder’s scattered clusters with the VAE’s smooth distribution:
- Vanilla AE: Tight, separated clusters with empty gaps. Points between clusters decode to noise.
- VAE: Overlapping, Gaussian-shaped blobs centered near the origin. Points everywhere decode to something recognizable. The transitions between digit types are smooth — a point halfway between the “3” region and the “8” region produces something that looks like a blend of both.
The VAE has filled in the wilderness. The entire latent space is now a continuous landscape of digit shapes.
Generation: Sampling New Data from Thin Air
This continuity has a powerful consequence: we can now generate new data. Sample a random point z ∼ N(0, 1), pass it through the decoder, and out comes a plausible digit image that the network never saw during training.
This works because the KL term pushed the encoder’s distribution toward N(0, I). When we sample from the same N(0, I) at generation time, we’re sampling from roughly the same distribution the decoder was trained on. There are no dead zones, no gaps, no out-of-distribution surprises.
We can also interpolate. Take two encoded points z1 (a “3”) and z2 (a “7”), and decode a series of points along the line between them: z = (1−t) · z1 + t · z2 for t from 0 to 1. The output smoothly morphs from a 3 to a 7, passing through intermediate shapes that look plausible at every step. No abrupt jumps, no garbage frames.
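The interpolation itself is a few lines of NumPy. A sketch, assuming z1 and z2 are 2-D latent vectors returned by the encoder (the example values below are made up):

```python
import numpy as np

def interpolate(z1, z2, steps=8):
    """Points along the straight line from z1 to z2, endpoints included."""
    ts = np.linspace(0.0, 1.0, steps)
    return np.stack([(1 - t) * z1 + t * z2 for t in ts])

z1 = np.array([-1.2, 0.4])             # e.g. an encoded "3"
z2 = np.array([0.9, -0.7])             # e.g. an encoded "7"
path = interpolate(z1, z2)             # shape (8, 2); decode each row to morph 3 -> 7
```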
This is the fundamental difference. A vanilla autoencoder memorizes a codebook of specific points. A VAE learns the manifold — the underlying shape of the data in latent space — so it can decode any point in the space, not just the ones it was trained on.
β-VAE: Turning the Regularization Knob
What happens if we scale the KL term? Higgins et al. (2017) introduced the β-VAE, which simply multiplies the KL divergence by a coefficient β:

Loss = reconstruction + β · KL
This one knob controls the entire character of the latent space:
- β < 1: Less regularization. The latent space becomes more like a vanilla autoencoder — sharper reconstructions, but gaps reappear between clusters and generation quality degrades.
- β = 1: Standard VAE. The balance Kingma and Welling intended.
- β > 1: Stronger regularization. Reconstructions become blurrier as the bottleneck is squeezed harder, but something remarkable happens: individual latent dimensions start to control interpretable factors. One dimension might control digit thickness, another controls tilt, a third controls loop size. This is disentanglement — the network discovers independent factors of variation.
def vae_loss_beta(x, x_hat, mu, log_var, beta=1.0):
"""VAE loss with configurable β for the KL term."""
eps = 1e-8
bce = -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
return bce + beta * kl, bce, kl
# β = 0.5: sharp reconstructions, poor generation, clustered latent space
# β = 1.0: standard VAE balance
# β = 4.0: blurry reconstructions, smooth generation, disentangled dims
# A practical trick: KL warmup
# Start with β = 0, increase to 1.0 over the first 50 epochs.
# This prevents "posterior collapse" — where the encoder learns to
# ignore the latent space entirely (setting σ → ∞ and μ → 0),
# making z pure noise and forcing the decoder to memorize everything.
The KL warmup trick mentioned in the code is crucial in practice. If you start training with full KL weight, the regularization can overwhelm the reconstruction signal before the encoder learns anything useful. The encoder “collapses” to always outputting the prior N(0, I) regardless of input, and the decoder becomes a one-size-fits-all generator that ignores z entirely. This is posterior collapse, the most common VAE training failure. Gradually ramping β from 0 to 1 over the first few epochs gives the autoencoder a head start on learning useful representations before the regularization kicks in.
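A linear warmup schedule for β takes only a few lines. A sketch (the 50-epoch ramp matches the comment in the code above; the helper name is our own):

```python
def beta_schedule(epoch, warmup_epochs=50, beta_max=1.0):
    """Linearly ramp beta from 0 to beta_max over warmup_epochs, then hold."""
    return beta_max * min(1.0, epoch / warmup_epochs)

# Inside the training loop you would call:
#   loss, bce, kl = vae_loss_beta(x, x_hat, mu, log_var, beta=beta_schedule(epoch))
```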
One more variant worth knowing: VQ-VAE (van den Oord et al., 2017) replaces the continuous Gaussian latent with a discrete codebook. The encoder output is snapped to the nearest entry in a learned dictionary of vectors. This produces sharper reconstructions than standard VAEs (no blurriness from sampling noise) and was the architecture behind DALL·E 1’s image tokenizer and modern audio codecs like EnCodec. VQ-VAE turns images into sequences of discrete tokens — which can then be modeled by a transformer.
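The core VQ-VAE operation, snapping each encoder output to its nearest codebook entry, is simple to sketch (the straight-through gradient trick that makes this trainable is omitted; codebook values here are illustrative):

```python
import numpy as np

def vq_quantize(z, codebook):
    """Replace each latent vector with its nearest codebook entry (L2 distance).

    z: (N, d) encoder outputs; codebook: (K, d) learned dictionary.
    Returns the quantized vectors and their discrete token ids.
    """
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)   # (N, K)
    idx = d2.argmin(axis=1)                                           # token ids
    return codebook[idx], idx

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
z = np.array([[0.1, -0.2], [0.8, 1.1]])
z_q, tokens = vq_quantize(z, codebook)    # tokens: [0, 1]
```

Those integer token ids are exactly what a transformer can then model as a sequence.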
Why This Matters: The Latent Diffusion Connection
Everything we’ve built today comes together in one of the most influential architectures in modern AI: Stable Diffusion. Here’s how the pipeline works:
- VAE encoder compresses a 512×512×3 image to a 64×64×4 latent tensor. That’s a 48× compression.
- Diffusion model operates entirely in this latent space — adding noise, predicting noise, denoising. All 1,000 timesteps of the forward and reverse process happen on the small latent, not the huge pixel grid.
- VAE decoder inflates the denoised 64×64×4 latent back to 512×512×3 pixels.
Why not just run diffusion directly on pixels? Because self-attention (which the diffusion model uses internally) is O(n²) in the number of tokens. For a 512×512 image, that’s 262,144 pixels; the attention matrix would have roughly 69 billion entries. After VAE compression, the diffusion model works with 64×64 = 4,096 tokens. Same self-attention, but 64² = 4,096× cheaper. This is why Stable Diffusion runs on consumer GPUs while pixel-space diffusion would require a datacenter.
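The arithmetic behind those claims checks out in a few lines:

```python
# Pixel space: 512x512 RGB image; latent space: 64x64x4 tensor
compression = (512 * 512 * 3) / (64 * 64 * 4)             # 48x fewer numbers
pixel_tokens = 512 * 512                                  # 262,144 attention tokens
latent_tokens = 64 * 64                                   # 4,096 attention tokens
attention_entries = pixel_tokens ** 2                     # ~69 billion in pixel space
attention_savings = (pixel_tokens / latent_tokens) ** 2   # O(n^2) cost ratio
```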
The VAE is trained separately and then frozen. The diffusion model never touches the VAE weights. It only learns to navigate the compressed landscape the VAE created. This separation is elegant: the VAE handles perceptual compression (turning pixels into semantically meaningful features), and the diffusion model handles generation (learning the distribution of those features).
The next time you generate an image with Stable Diffusion or DALL·E, remember: the very first step is an autoencoder. The compression instinct we built today is what makes the entire generative pipeline tractable.
Connections to the Series
Autoencoders and VAEs connect to the series in surprising depth — they touch on representation learning, loss design, training stability, and the full generative pipeline:
- Embeddings: The autoencoder bottleneck is an embedding. Just as Word2Vec learns a dense vector for each token that captures semantic meaning, the encoder learns a dense vector for each image that captures visual meaning. Both are learned compression of high-dimensional inputs into low-dimensional representations.
- Loss Functions: The VAE introduces the first compound loss in this series — reconstruction plus KL divergence. Two fundamentally different objectives balanced against each other. The KL term is a regularizer with a principled probabilistic interpretation, unlike generic weight decay.
- Optimizers: In production VAEs, Adam with KL warmup prevents posterior collapse (our from-scratch loop used plain SGD with momentum). Gradually increasing β from 0 to 1 is effectively a learning rate schedule for the regularization term, the same idea as learning rate warmup in transformers.
- Diffusion Models: Stable Diffusion’s entire architecture is VAE encoder → diffusion → VAE decoder. The 48× compression from VAE makes diffusion tractable on consumer hardware. Without this post, the diffusion post is incomplete.
- CNNs: In production, VAE encoders and decoders are convolutional networks. Our CNN building blocks become the encoder’s feature extractors and the decoder’s transposed convolutions. The U-Net architecture that diffusion replaced is itself a form of encoder-decoder network with skip connections.
- Feed-Forward Networks: Our dense encoder and decoder layers use the same expand-contract pattern as transformer FFN blocks — just reversed in the decoder (contract-expand). The bottleneck is the extreme case of the FFN’s dimension change.
- Normalization: BatchNorm in the encoder and decoder stabilizes training, especially important when the KL term creates competing gradient signals. Group normalization is preferred in modern VAEs (same as in diffusion U-Nets).
- Knowledge Distillation: Both are forms of learned compression. Autoencoders compress data into a bottleneck; distillation compresses knowledge into a smaller model. The information-theoretic framing is the same: minimize the gap between source and compressed representation.
- Quantization: VQ-VAE quantizes the latent space (continuous → discrete), just as model quantization maps continuous weights to discrete levels. Both trade precision for efficiency, and both use learned codebooks to minimize the quality loss.
- Softmax & Temperature: The generation temperature slider in VAE sampling parallels the temperature parameter in language model sampling. Lower temperature = safer, more average outputs. Higher temperature = more diverse, noisier outputs. The quality-diversity tradeoff appears everywhere.
- Micrograd: Backprop through the reparameterization trick is the chain rule applied cleverly. Gradients can’t flow through random sampling, but they can flow through μ + σ · ε because ε is treated as a constant. Same chain rule, creative application.
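The temperature slider mentioned in the Softmax & Temperature connection is just a scale on the prior sample. A minimal sketch (the helper name is our own):

```python
import numpy as np

def sample_latents(n, latent_dim=2, temperature=1.0, seed=0):
    """Draw z ~ N(0, T^2 I). T < 1 biases toward the mode (safer, more average
    decodes); T > 1 explores the tails (more diverse, noisier decodes)."""
    rng = np.random.default_rng(seed)
    return temperature * rng.standard_normal((n, latent_dim))

cool = sample_latents(10_000, temperature=0.5)
hot = sample_latents(10_000, temperature=1.5)
```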
Try It: Explore the Latent Space
Panel 1: Latent Space Explorer — AE vs VAE
Click anywhere on the latent space (left) to decode that point into an 8×8 image (right). Toggle between Autoencoder and VAE to see the difference: AE has gaps that decode to garbage; VAE is smooth everywhere.
Panel 2: The β Tradeoff — Reconstruction vs Regularization
Drag β to see the tradeoff: low β gives sharp reconstructions but a gappy latent space; high β gives smooth generation but blurrier output.
Panel 3: Generation Gallery — Sampling New Digits
Sample random points from N(0,1) and decode them. Adjust temperature to control diversity.
References & Further Reading
- Kingma & Welling — “Auto-Encoding Variational Bayes” (2013) — The original VAE paper. Introduced the reparameterization trick and the ELBO framework that made deep generative models trainable.
- Doersch — “Tutorial on Variational Autoencoders” (2016) — One of the best pedagogical overviews. Clear notation and step-by-step derivations.
- Higgins et al. — “β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework” (2017) — Showed that scaling the KL term produces disentangled representations.
- van den Oord et al. — “Neural Discrete Representation Learning” (VQ-VAE, 2017) — Replaced the continuous Gaussian latent with a discrete codebook. The basis for DALL·E 1’s image tokenizer.
- Rombach et al. — “High-Resolution Image Synthesis with Latent Diffusion Models” (2022) — The Stable Diffusion paper. Showed that a pre-trained VAE + diffusion in latent space achieves state-of-the-art image generation at tractable cost.
- Bank, Koenigstein & Giryes — “Autoencoders” (2021) — Comprehensive survey covering vanilla, variational, denoising, sparse, and contractive autoencoders.
- Hinton & Salakhutdinov — “Reducing the Dimensionality of Data with Neural Networks” (2006) — Early deep autoencoder work that showed multi-layer encoders outperform PCA for dimensionality reduction.
- Rezende, Mohamed & Wierstra — “Stochastic Backpropagation and Approximate Inference in Deep Generative Models” (2014) — Concurrent work on variational inference with deep networks, developing similar ideas to Kingma & Welling.