Generative Adversarial Networks from Scratch: How Neural Networks Learned to Create by Competing
The Art of Counterfeiting
Imagine two people locked in an escalating contest. One is a counterfeiter, getting better at forging paintings every day. The other is a detective, getting better at spotting fakes. The counterfeiter never sees real paintings — they only learn from the detective’s verdicts: “fake” or “real.” The detective never learns to paint — they only learn to judge. And yet, through this adversarial dance, the counterfeiter eventually produces work indistinguishable from the masters.
This is the core idea behind Generative Adversarial Networks (GANs), introduced by Ian Goodfellow and colleagues in 2014. Where the autoencoders post built generative models that learn to compress and reconstruct, and the diffusion post built models that learn to denoise, GANs take a completely different path: they learn to compete. Two neural networks — a Generator and a Discriminator — are trained simultaneously in a minimax game, and from their rivalry, creation emerges.
GANs were the first deep learning method to produce photorealistic images. StyleGAN’s “This Person Does Not Exist” faces in 2019 made international headlines. For five years, GANs were synonymous with AI-generated imagery. Then diffusion models arrived and largely displaced them — but the adversarial training idea lives on in RLHF reward models, adversarial diffusion distillation, and the foundational intuition behind modern generative AI.
In this post, we’ll build GANs from scratch in Python and NumPy. We’ll start with the simplest possible GAN — a generator learning to match a 1D Gaussian — then scale to 2D point distributions, tackle the notorious training instabilities head-on, and implement the Wasserstein GAN that fixed them. By the end, you’ll understand exactly why GANs were revolutionary and exactly why the world moved on.
The Generator: Creating Something from Nothing
The Generator is a neural network that transforms random noise into data. It takes a sample z drawn from a simple distribution (typically a standard Gaussian, z ~ N(0, 1)) and passes it through learned layers to produce an output in data space. The generator never sees real data — it only receives gradient signals from the discriminator telling it whether its outputs looked convincing.
Think of it this way: the generator is a student taking an exam in a subject they’ve never studied. They can’t see the textbook (real data), can’t see other students’ answers (real samples), and can’t even see the questions (input features). All they get is a grade on each attempt. Through thousands of iterations of submitting answers and receiving grades, they learn to produce responses that earn high marks — without ever directly observing what a “correct” answer looks like.
For our from-scratch implementation, the generator is a simple two-layer MLP. The input noise dimension, hidden dimension, and output dimension are all configurable. We use ReLU activations in the hidden layer and no activation on the output (so the generator can produce any real-valued output).
import numpy as np
class Generator:
    """Maps random noise z to data space through a 2-layer MLP."""

    def __init__(self, noise_dim=1, hidden_dim=64, output_dim=1):
        # He initialization for ReLU networks
        self.W1 = np.random.randn(noise_dim, hidden_dim) * np.sqrt(2.0 / noise_dim)
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, output_dim) * np.sqrt(2.0 / hidden_dim)
        self.b2 = np.zeros(output_dim)

    def forward(self, z):
        """z -> hidden (ReLU) -> output (linear)."""
        self.z = z                                     # (batch, noise_dim)
        self.h = np.maximum(0, z @ self.W1 + self.b1)  # (batch, hidden_dim)
        self.out = self.h @ self.W2 + self.b2          # (batch, output_dim)
        return self.out

    def backward(self, grad_out, lr=0.001):
        """Backprop through generator and update weights."""
        # Gradient through output layer
        grad_W2 = self.h.T @ grad_out   # (hidden, output)
        grad_b2 = grad_out.sum(axis=0)
        grad_h = grad_out @ self.W2.T   # (batch, hidden)
        # Gradient through ReLU
        grad_h = grad_h * (self.h > 0)  # mask dead neurons
        # Gradient through first layer
        grad_W1 = self.z.T @ grad_h     # (noise, hidden)
        grad_b1 = grad_h.sum(axis=0)
        # SGD update (mean gradient over the batch)
        self.W2 -= lr * grad_W2 / len(grad_out)
        self.b2 -= lr * grad_b2 / len(grad_out)
        self.W1 -= lr * grad_W1 / len(grad_out)
        self.b1 -= lr * grad_b1 / len(grad_out)
Notice what’s not here: no loss function. The generator doesn’t compute its own loss. It receives gradients from the discriminator — the “grade” on its output — and backpropagates those through its own layers. The generator’s entire learning signal comes from someone else’s evaluation of its work.
The Discriminator: Spotting the Fakes
The Discriminator is a binary classifier. It takes a data sample — either real (from the training set) or fake (from the generator) — and outputs a probability that the sample is real. A sigmoid activation squashes the output to [0, 1], where 1 means “definitely real” and 0 means “definitely fake.”
The discriminator sees both sides of the coin. During each training step, it processes a batch of real samples (which it should classify as real) and a batch of generated samples (which it should classify as fake). Its loss is standard binary cross-entropy — the same loss you’d use for any binary classification task, the same loss we derived in the loss functions post.
class Discriminator:
    """Classifies samples as real (1) or fake (0) via a 2-layer MLP."""

    def __init__(self, input_dim=1, hidden_dim=64):
        self.W1 = np.random.randn(input_dim, hidden_dim) * np.sqrt(2.0 / input_dim)
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, 1) * np.sqrt(2.0 / hidden_dim)
        self.b2 = np.zeros(1)

    def forward(self, x):
        """x -> hidden (ReLU) -> probability (sigmoid)."""
        self.x = x
        self.h = np.maximum(0, x @ self.W1 + self.b1)  # (batch, hidden)
        logit = self.h @ self.W2 + self.b2             # (batch, 1)
        self.prob = 1.0 / (1.0 + np.exp(-np.clip(logit, -20, 20)))  # sigmoid
        return self.prob

    def backward(self, grad_prob, lr=0.001):
        """Backprop through discriminator and update weights."""
        # Gradient through sigmoid: dsigmoid/dlogit = prob * (1 - prob)
        grad_logit = grad_prob * self.prob * (1.0 - self.prob)
        # Output layer
        grad_W2 = self.h.T @ grad_logit
        grad_b2 = grad_logit.sum(axis=0)
        grad_h = grad_logit @ self.W2.T
        # ReLU
        grad_h = grad_h * (self.h > 0)
        # First layer
        grad_W1 = self.x.T @ grad_h
        grad_b1 = grad_h.sum(axis=0)
        # SGD update
        self.W2 -= lr * grad_W2 / len(grad_prob)
        self.b2 -= lr * grad_b2 / len(grad_prob)
        self.W1 -= lr * grad_W1 / len(grad_prob)
        self.b1 -= lr * grad_b1 / len(grad_prob)
        # Return gradient w.r.t. input (needed for generator training)
        grad_x = grad_h @ self.W1.T
        return grad_x
The critical detail is the last line: return grad_x. When the discriminator processes a generated sample, the gradient with respect to the input tells the generator how to adjust its output to look more real. This gradient flows backward from the discriminator through the generated sample into the generator — it’s the only communication channel between the two networks.
The Adversarial Game: A Minimax Dance
Now for the beautiful part. The GAN training objective is a minimax game — the first time in this series where two networks are optimized against each other rather than toward a shared goal:

$$\min_G \max_D \; V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
Read it in two halves. The discriminator wants to maximize this expression: make D(x) close to 1 for real data (so log D(x) is close to 0, its maximum) and D(G(z)) close to 0 for fake data (so log(1 - D(G(z))) is also close to 0). The generator wants to minimize it: make D(G(z)) close to 1, which drives log(1 - D(G(z))) toward -∞.
Goodfellow proved two remarkable theorems about this game. First, for any fixed generator, the optimal discriminator has a closed-form solution:

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$
The optimal discriminator outputs the density ratio — the probability that a sample came from the real distribution rather than the generated one. When the generator perfectly matches the real data (pg = pdata), the optimal discriminator outputs 1/2 everywhere. Maximum confusion.
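We can check this formula numerically with two known densities (a small sketch; `gaussian_pdf` is a hypothetical helper, not part of the GAN code above):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), evaluated pointwise."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-5, 5, 11)

# Generator far from the data: D* is confident almost everywhere
p_data = gaussian_pdf(x, 3.0, 0.5)  # real data near 3
p_g = gaussian_pdf(x, 0.0, 1.0)     # generator stuck near 0
d_star = p_data / (p_data + p_g)
print(d_star[0], d_star[-1])  # near 0 on generator turf, near 1 on data turf

# Generator matches the data exactly: D* = 1/2 everywhere
d_star_matched = p_data / (p_data + p_data)
print(d_star_matched)  # all 0.5 — maximum confusion
```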
Second, substituting this optimal discriminator back into the objective reveals what the generator is actually minimizing:

$$C(G) = -\log 4 + 2 \cdot \mathrm{JSD}(p_{\text{data}} \parallel p_g)$$
The generator is minimizing the Jensen-Shannon Divergence between the real and generated distributions. JSD is symmetric and bounded between 0 and log(2) — it equals zero only when the two distributions are identical. So the global minimum of the GAN game is pg = pdata, with value −log(4).
But there’s a practical problem. Early in training, when the generator is terrible, D(G(z)) is near zero — the discriminator easily spots fakes. The gradient of log(1 - D(G(z))) near D(G(z)) = 0 is weak (magnitude ≈ 1), giving the generator a tepid learning signal. The fix is the non-saturating trick: instead of minimizing log(1 - D(G(z))), the generator maximizes log(D(G(z))). The gradient of log(x) near x = 0 is steep — 1/x grows without bound — giving strong gradients exactly when the generator needs them most. Same equilibrium, much faster convergence.
def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy loss for the discriminator.
    d_real: D(x) for real samples, d_fake: D(G(z)) for generated samples.
    D wants d_real -> 1 and d_fake -> 0."""
    eps = 1e-8  # numerical stability
    loss_real = -np.mean(np.log(d_real + eps))      # E[-log D(x)]
    loss_fake = -np.mean(np.log(1 - d_fake + eps))  # E[-log(1 - D(G(z)))]
    # Gradients w.r.t. D's output probabilities
    grad_real = -1.0 / (d_real + eps)     # d/dp [-log(p)]
    grad_fake = 1.0 / (1 - d_fake + eps)  # d/dp [-log(1-p)]
    return loss_real + loss_fake, grad_real, grad_fake

def generator_loss_nonsaturating(d_fake):
    """Non-saturating generator loss: maximize log D(G(z)).
    Provides strong gradients even when D easily spots fakes."""
    eps = 1e-8
    loss = -np.mean(np.log(d_fake + eps))  # E[-log D(G(z))]
    grad = -1.0 / (d_fake + eps)           # steep near 0!
    return loss, grad
Compare the two generator gradients. The original loss gives magnitude 1/(1 - D(G(z))), which is about 1 when D(G(z)) is near 0 (weak signal). The non-saturating gradient -1/D(G(z)) blows up in magnitude as D(G(z)) approaches 0 (strong signal). This single trick is used in virtually all modern GAN implementations.
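To make the contrast concrete, here is a quick numeric check of both gradient formulas, using the same eps convention as the loss functions above:

```python
import numpy as np

d_fake = np.array([0.01, 0.1, 0.5, 0.9])  # D(G(z)): from "easily caught" to "fooling D"
eps = 1e-8

# Original (saturating) generator loss: minimize log(1 - D(G(z)))
grad_saturating = 1.0 / (1.0 - d_fake + eps)

# Non-saturating loss: minimize -log(D(G(z)))
grad_nonsaturating = -1.0 / (d_fake + eps)

for d, gs, gn in zip(d_fake, grad_saturating, grad_nonsaturating):
    print(f"D(G(z))={d:.2f}  saturating={gs:8.2f}  non-saturating={gn:10.2f}")
# At D(G(z))=0.01 the saturating gradient is ~1.01 (weak),
# while the non-saturating gradient is ~-100 (strong) — exactly when G needs help.
```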
Training a 1D GAN: Learning a Gaussian
Let’s train the simplest possible GAN: a generator that learns to produce samples from a target Gaussian distribution N(3, 0.5) — mean 3, standard deviation 0.5. The generator takes noise z ~ N(0, 1) and must learn to shift and scale it to match the target.
This is almost trivially simple — a linear generator could learn the exact mapping G(z) = 0.5z + 3 — but it’s the perfect testbed for understanding training dynamics. We can visualize the generated distribution evolving epoch by epoch, watch the discriminator’s confidence rise and fall, and see convergence (or failure) in real time.
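Before training anything, we can sanity-check that claim: applying the exact linear map to standard Gaussian noise reproduces the target statistics. This snippet is a check on the task, not part of the GAN:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((100_000, 1))  # generator input noise, z ~ N(0, 1)

# The exact solution a linear generator could learn: G(z) = 0.5*z + 3
samples = 0.5 * z + 3.0
print(samples.mean(), samples.std())  # ≈ 3.0 and ≈ 0.5
```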
# Training a 1D GAN: Generator learns to match N(3, 0.5)
G = Generator(noise_dim=1, hidden_dim=64, output_dim=1)
D = Discriminator(input_dim=1, hidden_dim=64)

target_mean, target_std = 3.0, 0.5
batch_size = 256
d_lr, g_lr = 0.002, 0.001  # D gets a higher learning rate

for epoch in range(2001):
    # --- Train Discriminator ---
    real_data = np.random.randn(batch_size, 1) * target_std + target_mean
    noise = np.random.randn(batch_size, 1)
    fake_data = G.forward(noise)

    # Real half (the zeros are placeholders: only the real-term
    # loss and gradient from this call are used)
    d_real = D.forward(real_data)
    d_loss_r, grad_r, _ = discriminator_loss(d_real, np.zeros_like(d_real))
    D.backward(grad_r, lr=d_lr)

    # Fake half (the ones are placeholders for the real argument)
    d_fake = D.forward(fake_data)
    d_loss_f, _, grad_f = discriminator_loss(np.ones_like(d_fake), d_fake)
    D.backward(grad_f, lr=d_lr)

    # --- Train Generator (non-saturating) ---
    noise = np.random.randn(batch_size, 1)
    fake_data = G.forward(noise)
    d_fake = D.forward(fake_data)
    g_loss, g_grad = generator_loss_nonsaturating(d_fake)
    # Backprop: gradient flows D -> fake_data -> G
    grad_fake_data = D.backward(g_grad, lr=0)  # don't update D here
    G.backward(grad_fake_data, lr=g_lr)

    if epoch % 500 == 0:
        samples = G.forward(np.random.randn(1000, 1))
        print(f"Epoch {epoch}: G mean={samples.mean():.3f}, "
              f"G std={samples.std():.3f}, D loss={d_loss_r+d_loss_f:.3f}")

# Epoch 0:    G mean=0.142, G std=0.087, D loss=1.421
# Epoch 500:  G mean=2.187, G std=0.391, D loss=1.312
# Epoch 1000: G mean=2.841, G std=0.478, D loss=1.378
# Epoch 1500: G mean=2.967, G std=0.503, D loss=1.386
# Epoch 2000: G mean=2.994, G std=0.498, D loss=1.386 ← converged!
Watch the convergence: the generator’s mean creeps from 0.14 toward 3.0, and its standard deviation from 0.09 toward 0.5. The discriminator loss settles near log(4) ≈ 1.386 — exactly the theoretical optimum where D(x) = 0.5 everywhere (the loss as coded is −log D(x) − log(1−D(G(z))), which equals 2·log(2) when D = 0.5). The discriminator has given up: it literally cannot tell real from fake.
A subtle but important detail in the training loop: when computing the generator’s gradient, we pass lr=0 to the discriminator’s backward pass. This means we compute the gradient flowing through D without updating D’s weights. The gradient flows through the discriminator like water through a pipe — the pipe shapes the flow but is not itself changed.
Mode Collapse: The GAN’s Achilles Heel
The 1D Gaussian is a single-mode distribution — one peak, one target. But real data is multi-modal. Handwritten digits have 10 modes (0-9). Human faces have millions of distinct configurations. What happens when we ask a GAN to learn a mixture of distributions?
Here’s where GANs reveal their most notorious failure: mode collapse. The generator discovers one output that reliably fools the discriminator and produces only that output, ignoring the full diversity of the data. Imagine our art forger learning to paint a single perfect Monet — every painting they produce is the same Monet, because the detective keeps passing it. The forger has no incentive to learn Picasso or Van Gogh when their Monet works every time.
Mode collapse comes in two flavors. Complete collapse means the generator produces a single output regardless of input noise — it learns a constant function, mapping all of z-space to one point. Partial collapse means the generator covers some modes but not others, perhaps producing 3 of 8 distinct clusters.
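A simple way to quantify the two flavors is to assign each generated sample to its nearest mode center and count how many modes receive meaningful mass. This is a diagnostic sketch, not a standard metric; the 10%-of-a-uniform-share threshold is an arbitrary choice:

```python
import numpy as np

def count_covered_modes(samples, centers, min_share=0.1):
    """Assign each sample to its nearest center; a mode counts as
    'covered' if it gets at least min_share of a uniform share."""
    # (n_samples, n_centers) pairwise distances
    dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    counts = np.bincount(nearest, minlength=len(centers))
    threshold = min_share * len(samples) / len(centers)
    return int((counts >= threshold).sum())

centers = np.array([[0.0, 2.0], [0.0, -2.0], [2.0, 0.0], [-2.0, 0.0]])

# Complete collapse: every sample lands on one mode
collapsed = np.tile([0.0, 2.0], (1000, 1)) + np.random.randn(1000, 2) * 0.05
print(count_covered_modes(collapsed, centers))  # 1

# Healthy generator: samples spread over all modes
healthy = centers[np.random.randint(0, 4, 1000)] + np.random.randn(1000, 2) * 0.05
print(count_covered_modes(healthy, centers))    # 4
```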
Why does this happen? The minimax game doesn’t guarantee convergence to the Nash equilibrium. In practice, training alternates between updating D and G via gradient descent. Goodfellow’s proof that the global optimum is pg = pdata assumes we can optimize D to completion at each step, which is impractical. With finite gradient steps, the two networks can oscillate indefinitely: G finds a mode that fools D, D catches on, G jumps to a different mode, D catches on again, and the cycle repeats without ever settling.
There’s a deep asymmetry in the GAN loss that makes this worse. The generator is rewarded for producing any output that fools the discriminator. There’s no term in the loss encouraging diversity. Compare this to VAEs, where the KL divergence term forces the encoder to spread out across the latent space — VAE samples are blurrier but always diverse. GANs trade diversity for sharpness, and mode collapse is the price.
Mode collapse produces individual samples of stunning quality. The tragedy is that they’re all the same sample.
Wasserstein GAN: A Better Distance Metric
The root cause of GAN training instability goes deeper than mode collapse — it’s baked into the distance metric the standard GAN loss implicitly minimizes. Jensen-Shannon divergence has a fatal flaw: when two distributions don’t overlap (which is almost always the case in high dimensions, where real data lives on a thin manifold), JSD equals a constant log(2). A constant has zero gradient. The discriminator achieves perfect accuracy, the generator receives no learning signal, and training stalls.
The Wasserstein distance (a.k.a. Earth Mover’s distance) fixes this. Instead of measuring probability overlap, it measures the minimum cost of transporting one distribution to match the other — like the minimum amount of earth you’d need to move to reshape one pile into another:

$$W(p_r, p_g) = \inf_{\gamma \in \Pi(p_r, p_g)} \mathbb{E}_{(x, y) \sim \gamma}\big[\lVert x - y \rVert\big]$$
Crucially, Wasserstein distance provides smooth, meaningful gradients even when the two distributions don’t overlap at all. If two point masses are 10 units apart, W = 10. Move them 1 unit closer, W = 9. The gradient is constant and always useful — no cliffs, no plateaus, no vanishing.
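Both behaviors are easy to see numerically on non-overlapping 1D distributions. This sketch computes JSD on discretized histograms and uses the closed-form 1D Wasserstein-1 for equal-size samples (sorted matching); both helpers are illustrative, not from the post's GAN code:

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1_1d(a, b):
    """1D Wasserstein-1 between equal-size samples: match in sorted order."""
    return np.abs(np.sort(a) - np.sort(b)).mean()

bins = np.linspace(-15, 15, 61)
real = np.random.randn(10_000) * 0.5  # data near 0
for shift in [10.0, 5.0, 2.0]:
    fake = np.random.randn(10_000) * 0.5 + shift
    p, _ = np.histogram(real, bins); p = p / p.sum()
    q, _ = np.histogram(fake, bins); q = q / q.sum()
    print(f"shift={shift}: JSD={jsd(p, q):.4f} (log 2 = {np.log(2):.4f}), "
          f"W1={w1_1d(real, fake):.2f}")
# JSD stays pinned at ~log 2 as long as the distributions don't overlap;
# W1 tracks the actual distance (≈ shift) the whole way down.
```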
We can’t compute the infimum directly, but the Kantorovich-Rubinstein duality gives us a tractable alternative:

$$W(p_r, p_g) = \sup_{\lVert f \rVert_L \le 1} \; \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]$$

where the supremum runs over all 1-Lipschitz functions f. A 1-Lipschitz function is one where |f(a) - f(b)| ≤ |a - b| — it can’t change faster than the input changes.
The WGAN replaces the discriminator (renamed “critic”) with a neural network learning this Lipschitz function. No sigmoid on the output — the critic outputs an unbounded score. No logarithm in the loss — just the raw difference of means. And to enforce the Lipschitz constraint, we clip the critic’s weights to a small range [-c, c] after each update.
class WassersteinCritic:
    """WGAN critic: outputs unbounded score, no sigmoid."""

    def __init__(self, input_dim=2, hidden_dim=64):
        self.W1 = np.random.randn(input_dim, hidden_dim) * np.sqrt(2.0 / input_dim)
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, 1) * np.sqrt(2.0 / hidden_dim)
        self.b2 = np.zeros(1)

    def forward(self, x):
        self.x = x
        self.h = np.maximum(0, x @ self.W1 + self.b1)
        self.score = self.h @ self.W2 + self.b2  # no sigmoid!
        return self.score

    def backward(self, grad_score, lr=0.0001):
        grad_W2 = self.h.T @ grad_score
        grad_b2 = grad_score.sum(axis=0)
        grad_h = grad_score @ self.W2.T
        grad_h = grad_h * (self.h > 0)
        grad_W1 = self.x.T @ grad_h
        grad_b1 = grad_h.sum(axis=0)
        self.W2 -= lr * grad_W2 / len(grad_score)
        self.b2 -= lr * grad_b2 / len(grad_score)
        self.W1 -= lr * grad_W1 / len(grad_score)
        self.b1 -= lr * grad_b1 / len(grad_score)
        return grad_h @ self.W1.T  # gradient to input

    def clip_weights(self, clip_value=0.01):
        """Enforce Lipschitz constraint by clamping weights."""
        self.W1 = np.clip(self.W1, -clip_value, clip_value)
        self.W2 = np.clip(self.W2, -clip_value, clip_value)

def wasserstein_loss(critic_real, critic_fake):
    """WGAN loss: maximize E[f(real)] - E[f(fake)] for critic.
    No logs, no sigmoid — just the raw mean difference."""
    w_distance = np.mean(critic_real) - np.mean(critic_fake)
    # Critic gradient: ascend on real scores, descend on fake scores
    grad_real = np.ones_like(critic_real)   # +1
    grad_fake = -np.ones_like(critic_fake)  # -1
    return w_distance, grad_real, grad_fake
The WGAN training loop differs from vanilla GAN in three ways: (1) the critic trains for 5 steps per generator step (it needs to approximate the supremum well), (2) weight clipping after every critic update, and (3) the loss function has no logarithm — just the difference of mean scores. The Wasserstein distance estimate decreases monotonically as samples improve, which is a luxury standard GANs never had: you can actually watch the loss and know if training is working.
Weight clipping has known issues — the critic tends to learn overly simple functions with weights piling up at +c and -c. The improved version, WGAN-GP (Gulrajani et al., 2017), replaces clipping with a gradient penalty that directly penalizes the critic’s gradient norm for deviating from 1. This works better in practice, but weight clipping captures the core idea and is simpler to implement from scratch.
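A quick experiment makes the pathology visible: clipping every weight of a 2-64-1 ReLU network to [-0.01, 0.01] caps the function's slope far below the Lipschitz bound of 1 that the duality actually asks for. This is a sketch with random (untrained) weights; the product-of-spectral-norms bound is crude but sufficient to show the effect:

```python
import numpy as np

rng = np.random.default_rng(0)
c = 0.01  # clip value from the WGAN paper

# Random critic weights, clipped exactly as in clip_weights()
W1 = np.clip(rng.standard_normal((2, 64)), -c, c)
W2 = np.clip(rng.standard_normal((64, 1)), -c, c)

def critic(x):
    return np.maximum(0, x @ W1) @ W2

# Empirical Lipschitz ratio over random point pairs
a = rng.standard_normal((10_000, 2)) * 3
b = rng.standard_normal((10_000, 2)) * 3
ratios = np.abs(critic(a) - critic(b)).ravel() / np.linalg.norm(a - b, axis=1)
print(ratios.max())  # far below 1: the clipped critic is nearly flat

# Crude upper bound: product of the clipped layers' spectral norms
bound = np.linalg.norm(W1, 2) * np.linalg.norm(W2, 2)
print(bound)
```

The duality wants the *steepest* 1-Lipschitz function; clipping delivers one that is orders of magnitude flatter, which is exactly the weak-gradient problem WGAN-GP's gradient penalty addresses.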
Training a 2D GAN: Point Distributions
Now let’s see where the Wasserstein distance really shines. We’ll train a GAN on a harder target: 8 Gaussians arranged in a circle. This is the classic mode collapse stress test — the generator must learn to produce points from all 8 clusters, not just the easiest one.
The target data looks like 8 tight clusters equally spaced on a circle of radius 2. Each cluster is a small Gaussian with standard deviation 0.05. A perfect generator would produce samples distributed evenly across all 8 modes.
def sample_8_gaussians(n, radius=2.0, std=0.05):
    """Sample from 8 Gaussians arranged in a circle."""
    centers = []
    for i in range(8):
        angle = 2 * np.pi * i / 8
        centers.append([radius * np.cos(angle), radius * np.sin(angle)])
    centers = np.array(centers)
    # Pick random centers, add Gaussian noise
    indices = np.random.randint(0, 8, size=n)
    samples = centers[indices] + np.random.randn(n, 2) * std
    return samples, indices

def train_wgan_2d(epochs=3000, batch_size=256, n_critic=5):
    """Train WGAN on 8-Gaussians target.
    Returns the trained generator and the training history."""
    G = Generator(noise_dim=2, hidden_dim=128, output_dim=2)
    C = WassersteinCritic(input_dim=2, hidden_dim=128)
    history = []
    for epoch in range(epochs):
        # --- Train Critic for n_critic steps ---
        for _ in range(n_critic):
            real, _ = sample_8_gaussians(batch_size)
            noise = np.random.randn(batch_size, 2)
            fake = G.forward(noise)
            c_real = C.forward(real)
            c_fake = C.forward(fake)
            w_dist, grad_r, grad_f = wasserstein_loss(c_real, c_fake)
            # Fake activations are current — process fake first
            C.backward(-grad_f, lr=0.0002)  # negate: we maximize
            C.forward(real)                 # restore real activations
            C.backward(-grad_r, lr=0.0002)
            C.clip_weights(0.01)
        # --- Train Generator ---
        noise = np.random.randn(batch_size, 2)
        fake = G.forward(noise)
        c_fake = C.forward(fake)
        # G wants to maximize critic score → minimize -f(G(z))
        grad_g = -np.ones_like(c_fake)
        grad_input = C.backward(grad_g, lr=0)
        G.backward(grad_input, lr=0.0001)
        if epoch % 100 == 0:
            history.append((epoch, w_dist, fake.copy()))
    return G, history
The difference between vanilla GAN and WGAN on this task is dramatic. A vanilla GAN typically collapses to 1–3 of the 8 modes, producing high-quality samples that cluster tightly around a few centers while completely ignoring the rest. The WGAN, with its smooth gradients, learns to cover all 8 modes — perhaps not perfectly uniformly, but with meaningful mass on each one.
The Wasserstein distance as a training metric is transformative. With standard GANs, the discriminator loss oscillates unpredictably and tells you nothing about sample quality. With WGANs, the estimated Wasserstein distance correlates with visual quality: if the number goes down, the samples are getting better. “If the loss is decreasing, your model is improving” sounds obvious, but it was a genuine luxury that standard GANs never provided.
Conditional GANs and DCGAN: Scaling Up
Everything so far generates samples from the overall data distribution — you get random digits, random faces, random clusters. Conditional GANs (Mirza & Osindero, 2014) add control: condition both the generator and discriminator on auxiliary information y (like a class label), and you can say “generate a 7” instead of “generate a random digit.”
The implementation is simple: concatenate a one-hot encoded label with the generator’s noise input and with the discriminator’s data input. The generator becomes G(z, y) → x, and the discriminator becomes D(x, y) → [0, 1]. The discriminator now answers a more specific question: “is this a real sample of class y?”
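The conditioning mechanics can be sketched in a few lines — the only change from the unconditional generator is a concatenation before the first layer. The helper names and weight shapes here are illustrative, reusing the 2-layer MLP shape of the Generator class above:

```python
import numpy as np

def one_hot(labels, num_classes):
    """Convert integer labels to one-hot rows."""
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

noise_dim, num_classes, hidden_dim, output_dim = 4, 10, 64, 2
rng = np.random.default_rng(0)

# Conditional generator weights: the first layer sees noise AND label
W1 = rng.standard_normal((noise_dim + num_classes, hidden_dim)) * 0.1
b1 = np.zeros(hidden_dim)
W2 = rng.standard_normal((hidden_dim, output_dim)) * 0.1
b2 = np.zeros(output_dim)

def conditional_generate(z, labels):
    """G(z, y): concatenate noise with one-hot label, then forward."""
    zy = np.concatenate([z, one_hot(labels, num_classes)], axis=1)
    h = np.maximum(0, zy @ W1 + b1)
    return h @ W2 + b2

z = rng.standard_normal((8, noise_dim))
x = conditional_generate(z, np.array([7] * 8))  # "generate a 7", eight times
print(x.shape)  # (8, 2)
```

The discriminator side is symmetric: concatenate the same one-hot label with the data input before its first layer.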
For images, the architecture breakthrough was DCGAN (Radford et al., 2015) — a specific recipe of design choices that made deep convolutional GANs reliably trainable for the first time. The core ideas build directly on the convnets post:
- Replace pooling with strided convolutions in D (learned downsampling) and transposed convolutions in G (learned upsampling)
- BatchNorm in both networks, except G’s output layer and D’s input layer
- ReLU in the generator, LeakyReLU in the discriminator (leak = 0.2)
- Tanh activation on the generator output (normalizes images to [-1, 1])
- No fully connected hidden layers — all-convolutional for spatial coherence
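Transposed convolution — the generator's learned upsampling from the recipe above — is easy to demystify in 1D: each input value "stamps" a scaled copy of the kernel into the output, stride positions apart. A minimal single-channel sketch, no padding:

```python
import numpy as np

def transposed_conv1d(x, kernel, stride=2):
    """1D transposed convolution: each x[j] adds x[j] * kernel
    starting at output position j * stride. Output length is
    (len(x) - 1) * stride + len(kernel)."""
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for j, xv in enumerate(x):
        out[j * stride : j * stride + len(kernel)] += xv * kernel
    return out

x = np.array([1.0, 2.0, 3.0])
kernel = np.array([1.0, 1.0, 1.0])
y = transposed_conv1d(x, kernel, stride=2)
print(y)  # [1. 1. 3. 2. 5. 3. 3.] — 3 inputs upsampled to 7 outputs
```

In DCGAN the kernel is learned, so the network learns *how* to upsample rather than using fixed nearest-neighbor or bilinear interpolation.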
DCGAN also revealed a magical property of the learned latent space: vector arithmetic. Just as word embeddings support “king − man + woman = queen” (as we explored in the embeddings post), DCGAN’s latent space supports “man with glasses − man without glasses + woman without glasses = woman with glasses.” The generator learns a structured latent space where directions correspond to semantic attributes — without being told what those attributes are.
VAEs vs. GANs vs. Diffusion: Three Paradigms
We’ve now built all three major generative model families in this series. Let’s put them side by side — three philosophies of creation, three sets of trade-offs:
| | VAE | GAN | Diffusion |
|---|---|---|---|
| Core idea | Compress & reconstruct | Compete & create | Add noise & denoise |
| Training | Stable (single loss) | Unstable (adversarial) | Stable (simple MSE) |
| Sample quality | Blurry | Sharp | Sharpest |
| Diversity | High | Low (mode collapse) | High |
| Generation speed | Fast (1 pass) | Fast (1 pass) | Slow (20–1000 steps) |
| Likelihood | Explicit (ELBO) | Implicit (none) | Explicit (ELBO) |
| Loss function | Recon + KL | Minimax game | Denoising MSE |
Why diffusion “won” for most generation tasks: Training stability matters enormously at scale. Diffusion models use a simple MSE loss on noise prediction — no adversarial dynamics, no oscillation, no mode collapse. They capture the full data distribution by denoising from random noise, naturally avoiding the diversity problem. And they scale predictably with compute, while GAN quality is fragile and hard to predict.
Where GANs remain relevant: Speed. A GAN generates in a single forward pass. Diffusion needs 20–1000 sequential denoising steps. For real-time applications — video super-resolution, game texture generation, interactive tools — GANs are still competitive. The newest hybrid approaches (Adversarial Diffusion Distillation in SDXL Turbo, Consistency Models) use GAN discriminators to distill slow diffusion models into fast one-step generators — combining the best of both worlds.
And the intellectual legacy is immense. The adversarial training paradigm appears everywhere: RLHF’s reward model is essentially a discriminator that judges language quality (as we saw in the RLHF post). Contrastive learning uses discriminative objectives. Domain adaptation uses adversarial feature alignment. GANs didn’t just generate images — they introduced a way of thinking about learning through competition that permeates modern AI.
Try It Yourself
Interactive GAN Demos
Panel 1: 1D GAN Training Arena
Watch a GAN learn to match a target distribution in real time. The non-saturating loss drives convergence. Blue = target, red = generated, green dashed = D(x).
Panel 2: Mode Collapse Detector
8 Gaussians in a circle. Standard GAN often covers fewer modes with more scatter; WGAN converges more reliably. Toggle loss to compare.
Panel 3: Latent Space Walker
A pre-trained generator maps 2D noise to a target pattern. Drag the slider to interpolate between two latent points and watch the decoded output morph smoothly.
Where GANs Fit in the Series
This post completes the generative model trilogy and connects to nearly every post in the elementary series:
- Autoencoders & VAEs — VAEs learn explicit distributions via reconstruction + KL regularization; GANs learn implicit distributions via competition. VAEs guarantee diversity (KL forces latent coverage) at the cost of blurriness. GANs guarantee sharpness (the discriminator demands it) at the cost of diversity. Combining them yields VAE-GAN: VAE structure for the latent space, GAN discriminator for sharp outputs.
- Diffusion Models — The successor paradigm. Understanding GANs explains exactly what diffusion had to beat (mode collapse, training instability) and what it sacrificed (single-pass speed). Modern hybrid approaches use GAN losses to speed up diffusion (Adversarial Diffusion Distillation).
- Loss Functions — The GAN loss is fundamentally different from every other loss in the series: a minimax game rather than a single objective. The Wasserstein loss goes further, replacing log-probability with Earth Mover’s distance. This is the first post where two networks optimize against each other.
- Optimizers — GAN training requires careful optimizer tuning. Different learning rates for G and D, Adam with β1=0.0 (instead of the usual 0.9) for WGAN-GP, and the critical balance between critic capacity and generator expressiveness.
- Normalization — BatchNorm in DCGAN stabilizes both networks. Spectral normalization constrains the discriminator’s Lipschitz constant, directly connecting to the theory behind Wasserstein distance.
- ConvNets — DCGAN uses the same convolutional building blocks, adding transposed convolutions for the generator’s learned upsampling. The discriminator is essentially a CNN image classifier.
- RLHF — The reward model in RLHF plays the same structural role as the discriminator in GANs: both learn to judge quality and guide a generative process through adversarial-style feedback.
- Knowledge Distillation — Adversarial Diffusion Distillation uses GAN discriminator losses to compress a multi-step diffusion model into a fast single-step generator, marrying the two paradigms.
- Micrograd — The adversarial training loop is backpropagation extended to a two-player game: two computation graphs, two sets of gradients, optimized in alternation with one player frozen.
- Embeddings — DCGAN’s latent space supports vector arithmetic just like word embeddings: “man with glasses − man + woman = woman with glasses.” The generator learns structured representations without supervision.
References & Further Reading
- Ian Goodfellow et al. — “Generative Adversarial Nets” (NeurIPS 2014) — the original paper that started it all
- Lilian Weng — “From GAN to WGAN” — comprehensive mathematical walkthrough of GAN theory and Wasserstein distance
- Martin Arjovsky, Soumith Chintala, Léon Bottou — “Wasserstein GAN” (2017) — Earth Mover’s distance for stable GAN training
- Alec Radford, Luke Metz, Soumith Chintala — “Unsupervised Representation Learning with DCGANs” (2015) — the architectural recipe that made image GANs work
- Mehdi Mirza, Simon Osindero — “Conditional Generative Adversarial Nets” (2014) — class-conditioned generation
- Ishaan Gulrajani et al. — “Improved Training of Wasserstein GANs” (2017) — gradient penalty as a better Lipschitz constraint
- Tero Karras et al. — “Progressive Growing of GANs” (2017) — high-resolution face generation via progressive training
- Google Developers — “Common GAN Training Problems” — practical guide to mode collapse and training instability