
Adversarial Examples from Scratch: How Invisible Perturbations Fool Neural Networks

The Surprise — When AI Sees What Isn’t There

Add noise to an image that’s invisible to the human eye, and a state-of-the-art neural network will classify a panda as a gibbon with 99.3% confidence. This isn’t a bug in one particular model — it works on every architecture, every dataset, every training procedure ever tested. It’s a fundamental property of how neural networks process high-dimensional data.

In 2014, Christian Szegedy and colleagues at Google published a paper titled “Intriguing properties of neural networks” that shook the deep learning world. They demonstrated that for virtually any correctly classified input, you can find a tiny perturbation — imperceptible to humans — that causes the model to misclassify with high confidence. These adversarial examples aren’t rare edge cases: they can be found near essentially every data point, along specific directions that gradient-based attacks locate easily.

The natural reaction is to blame overfitting, or insufficient training data, or some architectural flaw. But adversarial examples persist in models with perfect test accuracy. They transfer across architectures — an adversarial image crafted for one network often fools a completely different one. And here’s the real kicker: the cause isn’t nonlinearity or complexity. It’s linearity.

In this post, we’ll build adversarial attacks from scratch, understand why they work, learn to defend against them, and discover what they reveal about what neural networks actually learn. Let’s start by creating our first adversarial example.

Your First Adversarial Attack

We’ll train a simple two-layer MLP on a 2D classification task — two concentric circles of points — and then use the gradient of the loss with respect to the input (not the weights) to craft a perturbation that flips the model’s prediction. This is the core mechanic behind every gradient-based adversarial attack.

import numpy as np

np.random.seed(42)
n = 200
t = np.random.uniform(0, 2 * np.pi, n)
r0 = 1.0 + np.random.randn(n // 2) * 0.15   # inner ring → class 0
r1 = 2.5 + np.random.randn(n // 2) * 0.15   # outer ring → class 1
X = np.vstack([np.c_[r0 * np.cos(t[:n//2]), r0 * np.sin(t[:n//2])],
               np.c_[r1 * np.cos(t[n//2:]), r1 * np.sin(t[n//2:])]])
y = np.array([0] * (n // 2) + [1] * (n // 2))

relu = lambda z: np.maximum(0, z)
sigmoid = lambda z: 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# Train a 2 → 16 → 1 MLP
np.random.seed(7)
W1 = np.random.randn(2, 16) * 0.5
b1 = np.zeros(16)
W2 = np.random.randn(16, 1) * 0.5
b2 = np.zeros(1)

for _ in range(2000):
    h = relu(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2).ravel()
    err = (p - y).reshape(-1, 1)
    dW2 = h.T @ err / n;  db2 = err.mean()
    dh = err @ W2.T;      dh[h == 0] = 0
    dW1 = X.T @ dh / n;   db1 = dh.mean(axis=0)
    W1 -= 0.5 * dW1;  b1 -= 0.5 * db1
    W2 -= 0.5 * dW2;  b2 -= 0.5 * db2

# Pick a class-1 point (outer ring)
x_test = np.array([[2.3, 0.5]])
h_t = relu(x_test @ W1 + b1)
conf = sigmoid(h_t @ W2 + b2).item()
print(f"Original: class 1, confidence = {conf:.3f}")

# Compute gradient of loss w.r.t. INPUT via manual backprop
z1 = x_test @ W1 + b1
h1 = relu(z1)
z2 = h1 @ W2 + b2
p = sigmoid(z2).item()

dp = -1.0 / (p + 1e-10)           # dL/dp for L = -log(p)
dz2 = dp * p * (1 - p)            # sigmoid backward
dh1 = dz2 * W2.T                  # linear backward
dz1 = dh1 * (z1 > 0)              # ReLU backward
dx = dz1 @ W1.T                   # input gradient

# FGSM: perturb in the direction that INCREASES loss
epsilon = 0.5
x_adv = x_test + epsilon * np.sign(dx)

h_a = relu(x_adv @ W1 + b1)
conf_adv = sigmoid(h_a @ W2 + b2).item()
pred_adv = 1 if conf_adv > 0.5 else 0
print(f"Adversarial: class {pred_adv}, confidence = {max(conf_adv, 1-conf_adv):.3f}")
print(f"Perturbation: ({x_test[0,0]:.1f}, {x_test[0,1]:.1f}) → ({x_adv[0,0]:.2f}, {x_adv[0,1]:.2f})")
print(f"L_inf norm of perturbation: {np.abs(x_adv - x_test).max():.2f}")

The key line is dx = dz1 @ W1.T — this is the gradient of the loss with respect to the input, computed by running backpropagation all the way back through the network. Training uses this same gradient machinery to update weights. FGSM uses it to update the input.

The perturbation moved our point by just 0.5 units per dimension, but that was enough to push it across the decision boundary. In 2D, this perturbation is visible. In 784 dimensions (MNIST) or 150,528 dimensions (ImageNet), a perturbation of ε = 0.01 per dimension is completely invisible to humans — but devastating to models. Why?

The Linearity Hypothesis — Why This Happens

In 2015, Ian Goodfellow, Jonathon Shlens, and Christian Szegedy published the paper that explained everything. Their key insight was counterintuitive: adversarial vulnerability doesn’t come from neural networks being too complex or too nonlinear. It comes from them being too linear.

Consider a simple linear model: y = wᵀx. If we add a perturbation δ = ε · sign(w), the output changes by:

wᵀδ = ε · Σ_i |w_i| = ε · ||w||_1

Each component of δ is tiny — just ±ε — so the perturbation is imperceptible in any single dimension. But the dot product sums across all dimensions. In a space with d = 784 dimensions (an MNIST image) where the average |w_i| ≈ 0.1, the output shifts by 0.1 × 784 × ε = 78.4ε. Even at ε = 0.01, the output shifts by 0.784 — easily enough to flip a classification.

This is the linearity hypothesis: the vulnerability grows linearly with the number of input dimensions. And neural networks, despite their nonlinear activation functions, are “sufficiently linear” for this to apply. ReLU networks are literally piecewise linear. Even smooth activations like sigmoid and tanh operate in their linear regime for most inputs (the steep part of the S-curve).
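You can check this local linearity directly. The sketch below (an arbitrary untrained ReLU network of my own, separate from the classifiers in this post) compares the network's actual output change against the purely linear prediction from its gradient; until some ReLU switches state, the two agree to floating-point precision:

```python
import numpy as np

np.random.seed(0)
# An arbitrary (untrained) 10 -> 20 -> 1 ReLU network
W1 = np.random.randn(10, 20)
b1 = np.random.randn(20)
w2 = np.random.randn(20)

def f(x):
    return np.maximum(0, x @ W1 + b1) @ w2

x = np.random.randn(10)
active = (x @ W1 + b1) > 0           # which ReLUs are on at x
grad = W1 @ (active * w2)            # exact gradient of f at x

# Within the current activation pattern, f is exactly linear:
d = np.random.randn(10) * 1e-4       # small enough that no ReLU flips
actual = f(x + d) - f(x)
linear = grad @ d
print(f"actual change:     {actual:.12f}")
print(f"linear prediction: {linear:.12f}")
```

Increase the scale of d until a ReLU flips and the two numbers start to diverge: that is the boundary of one linear region.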

Here’s the math in action:

import numpy as np

np.random.seed(42)
dimensions = [10, 50, 100, 500, 1000, 5000]

print(f"{'Dim':>6s} | {'||w||_1':>8s} | {'shift @ 0.01':>12s} | {'shift @ 0.05':>12s} | {'shift @ 0.10':>12s}")
print("-" * 65)
for d in dimensions:
    w = np.random.randn(d) * 0.1          # avg |w_i| ~ 0.08
    l1 = np.sum(np.abs(w))

    shifts = [eps * l1 for eps in [0.01, 0.05, 0.10]]
    print(f"d={d:>4d} | {l1:>8.1f} | {shifts[0]:>12.1f} | {shifts[1]:>12.1f} | {shifts[2]:>12.1f}")

print("\nThe output shift = epsilon * ||w||_1, which grows linearly with dimension.")
print("At d=5000 and epsilon=0.01, the shift is ~4.0, easily enough to flip a classification.")

# Compare adversarial perturbation vs random noise
d = 1000
w = np.random.randn(d) * 0.1
adv_delta = 0.01 * np.sign(w)              # adversarial: aligned with w
rand_delta = 0.01 * np.sign(np.random.randn(d))  # random direction
print(f"\nd=1000: adversarial shift = {np.abs(w @ adv_delta):.2f}")
print(f"d=1000: random noise shift = {np.abs(w @ rand_delta):.2f}")
print("Adversarial perturbations are special: they align with the weights.")

This explains two mysteries at once. First, why adversarial perturbations are so much more effective than random noise of the same magnitude: they’re aligned with the model’s weight vector, so every dimension contributes positively to the output shift. Random noise partially cancels out. Second, why adversarial examples transfer across architectures: different networks trained on the same data learn similar linear approximations, so the same adversarial direction works against all of them.
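Transfer is easy to demonstrate on synthetic data. The sketch below is a toy setup of my own (two-blob Gaussian data and two small MLPs with different widths and seeds, not the models from this post): FGSM examples crafted against model A also degrade model B, which never shared weights or gradients with A:

```python
import numpy as np

np.random.seed(0)
d, n = 20, 400
mu = np.random.randn(d) * 0.4                       # class means at +/- mu
X = np.r_[np.random.randn(n // 2, d) + mu,
          np.random.randn(n // 2, d) - mu]
y = np.array([1] * (n // 2) + [0] * (n // 2))

sigmoid = lambda z: 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def train_mlp(hidden, seed, lr=0.5, epochs=600):
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((d, hidden)) * 0.3; b1 = np.zeros(hidden)
    W2 = rng.standard_normal((hidden, 1)) * 0.3; b2 = np.zeros(1)
    for _ in range(epochs):
        h = np.maximum(0, X @ W1 + b1)
        p = sigmoid(h @ W2 + b2).ravel()
        g = (p - y).reshape(-1, 1) / n              # dL/dz2 for cross-entropy
        dh = g @ W2.T; dh[h == 0] = 0
        W2 -= lr * h.T @ g;  b2 -= lr * g.sum(0)
        W1 -= lr * X.T @ dh; b1 -= lr * dh.sum(0)
    return W1, b1, W2, b2

def predict(model, Xq):
    W1, b1, W2, b2 = model
    return (sigmoid(np.maximum(0, Xq @ W1 + b1) @ W2 + b2).ravel() > 0.5).astype(int)

def input_grad(model, Xq, yq):
    W1, b1, W2, b2 = model
    h = np.maximum(0, Xq @ W1 + b1)
    p = sigmoid(h @ W2 + b2).ravel()
    g = (p - yq).reshape(-1, 1)
    dh = g @ W2.T; dh[h == 0] = 0
    return dh @ W1.T                                # gradient of loss w.r.t. input

A = train_mlp(hidden=32, seed=1)
B = train_mlp(hidden=8, seed=2)                     # different width, different init

X_adv = X + 0.5 * np.sign(input_grad(A, X, y))      # crafted against A only
clean_A, adv_A = (predict(A, X) == y).mean(), (predict(A, X_adv) == y).mean()
clean_B, adv_B = (predict(B, X) == y).mean(), (predict(B, X_adv) == y).mean()
print(f"A (attacked directly): clean {clean_A:.2f} -> adversarial {adv_A:.2f}")
print(f"B (transfer victim):   clean {clean_B:.2f} -> adversarial {adv_B:.2f}")
```

Both models learn roughly the same linear separating direction, so a perturbation aligned with A's weights is also aligned with B's.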

FGSM — The Fast Gradient Sign Method

The linearity hypothesis leads directly to the simplest and most elegant adversarial attack: the Fast Gradient Sign Method (FGSM). Given an input x with true label y, a model with parameters θ, and a loss function L, the adversarial perturbation is:

δ = ε · sign(∇_x L(θ, x, y))

That’s it. One forward pass to compute the loss. One backward pass to get the gradient with respect to the input. Take the sign, scale by ε. The adversarial example is x + δ.

Notice the deep connection to training. During training, we compute ∇_θ L and ask: “how should I change the weights to reduce loss?” FGSM computes ∇_x L and asks: “how should I change the input to increase loss?” Same gradient. Same backpropagation algorithm. Different target.

The sign() operation is what makes this an L_∞ attack: every input dimension gets perturbed by exactly ±ε, maximizing the total output shift as predicted by the linearity hypothesis. The ε parameter is the perturbation budget — the maximum change allowed in any single dimension.

Let’s implement FGSM on a proper multi-class classifier and see how attack success scales with ε:

import numpy as np

np.random.seed(42)
n_per_class, d = 100, 64  # 8x8 "images"
X, y = [], []
for i in range(n_per_class):
    img = np.random.randn(d) * 0.1
    img[24:40] += 1.0               # rows 3-4 bright (horizontal bar)
    X.append(img); y.append(0)

    img = np.random.randn(d) * 0.1
    img[3::8] += 1.0                # column 3 bright (vertical bar)
    X.append(img); y.append(1)

    img = np.random.randn(d) * 0.1
    for k in range(8): img[k*8 + k] += 1.0  # diagonal
    X.append(img); y.append(2)

X, y = np.array(X), np.array(y)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Train 64 → 32 → 3 MLP
np.random.seed(7)
W1 = np.random.randn(64, 32) * np.sqrt(2 / 64)
b1 = np.zeros(32)
W2 = np.random.randn(32, 3) * np.sqrt(2 / 32)
b2 = np.zeros(3)

for _ in range(500):
    h = np.maximum(0, X @ W1 + b1)
    probs = softmax(h @ W2 + b2)
    one_hot = np.zeros_like(probs)
    one_hot[np.arange(len(y)), y] = 1
    dz2 = (probs - one_hot) / len(X)
    W2 -= 1.0 * h.T @ dz2;   b2 -= 1.0 * dz2.sum(axis=0)
    dh = dz2 @ W2.T;          dh[h == 0] = 0
    W1 -= 1.0 * X.T @ dh;     b1 -= 1.0 * dh.sum(axis=0)

def fgsm(x, label, eps):
    z1 = x.reshape(1, -1) @ W1 + b1
    h = np.maximum(0, z1)
    probs = softmax(h @ W2 + b2)
    one_hot = np.zeros((1, 3));  one_hot[0, label] = 1
    dz2 = probs - one_hot
    dh = dz2 @ W2.T;  dh[h == 0] = 0
    dx = (dh @ W1.T).flatten()           # gradient w.r.t. input
    return x + eps * np.sign(dx)         # FGSM perturbation

# Attack success rate vs epsilon
for eps in [0.01, 0.05, 0.1, 0.2, 0.3]:
    flipped = 0
    for i in range(len(X)):
        x_adv = fgsm(X[i], y[i], eps)
        pred = softmax((np.maximum(0, x_adv.reshape(1,-1) @ W1 + b1) @ W2 + b2)).argmax()
        if pred != y[i]:
            flipped += 1
    print(f"eps={eps:.2f}: {flipped}/{len(X)} flipped ({100*flipped/len(X):.0f}% attack success)")

Even at ε = 0.1 — a 10% perturbation per pixel, barely visible in an image — a significant fraction of inputs get misclassified. At ε = 0.3, the attack succeeds on nearly everything. And this is a single-step attack. What if we iterate?

Try It: Adversarial Playground

Click anywhere on the canvas to place a test point. The model classifies it, then FGSM or PGD computes the adversarial perturbation (shown as an arrow). Toggle between the standard model (wiggly boundary) and the adversarially-trained model (smoother boundary).


PGD — Making Attacks Stronger

FGSM takes a single, maximal step. But what if the loss landscape is curved, and a single step doesn’t find the worst-case perturbation? Projected Gradient Descent (PGD), introduced by Aleksander Madry and colleagues in 2018, solves this by iterating: take small FGSM-like steps and project back onto the ε-ball after each one.

The algorithm is elegant:

  1. Start from a random point inside the ε-ball: x_0 = x + uniform noise in [−ε, ε]
  2. At each step t, take an FGSM step with step size α: x_{t+1} = x_t + α · sign(∇_x L(θ, x_t, y))
  3. Project back: clip x_{t+1} to stay within [x − ε, x + ε]
  4. Repeat for T steps (typically 20–40)

Madry et al. argued that PGD is the “universal” first-order adversary: empirically, a model robust to PGD tends to be robust to other attacks that rely only on gradient information. PGD finds adversarial examples that FGSM misses, especially near curved decision boundaries.

import numpy as np

def pgd_attack(x, label, eps, steps=20, step_size=None):
    """Projected Gradient Descent: iterated FGSM with projection."""
    if step_size is None:
        step_size = eps / 5

    # Start from random point in epsilon-ball
    x_adv = x + np.random.uniform(-eps, eps, x.shape)
    x_adv = np.clip(x_adv, x - eps, x + eps)
    losses = []

    for _ in range(steps):
        z1 = x_adv.reshape(1, -1) @ W1 + b1
        h = np.maximum(0, z1)
        logits = h @ W2 + b2
        probs = softmax(logits).flatten()
        losses.append(-np.log(probs[label] + 1e-10))

        one_hot = np.zeros((1, 3));  one_hot[0, label] = 1
        dz2 = probs.reshape(1, -1) - one_hot
        dh = dz2 @ W2.T;  dh[h == 0] = 0
        dx = (dh @ W1.T).flatten()

        x_adv = x_adv + step_size * np.sign(dx)      # FGSM step
        x_adv = np.clip(x_adv, x - eps, x + eps)     # project onto ball

    return x_adv, losses

# Compare FGSM vs PGD at the same epsilon
np.random.seed(42)
eps = 0.15
fgsm_flips, pgd_flips = 0, 0
n_test = 50
for i in range(n_test):
    idx = i * 6
    x_adv_f = fgsm(X[idx], y[idx], eps)
    if softmax((np.maximum(0, x_adv_f.reshape(1,-1) @ W1 + b1) @ W2 + b2)).argmax() != y[idx]:
        fgsm_flips += 1

    x_adv_p, _ = pgd_attack(X[idx], y[idx], eps)
    if softmax((np.maximum(0, x_adv_p.reshape(1,-1) @ W1 + b1) @ W2 + b2)).argmax() != y[idx]:
        pgd_flips += 1

print(f"Attack success rate at eps={eps}:")
print(f"  FGSM (1 step):   {fgsm_flips}/{n_test} = {100*fgsm_flips/n_test:.0f}%")
print(f"  PGD  (20 steps): {pgd_flips}/{n_test} = {100*pgd_flips/n_test:.0f}%")

# Show loss climbing during PGD
_, losses = pgd_attack(X[0], y[0], eps=0.15, steps=20)
print(f"\nPGD loss trajectory: {losses[0]:.3f} → {losses[-1]:.3f} (grew {losses[-1]/max(losses[0],1e-6):.1f}x)")

PGD consistently achieves higher attack success rates than FGSM at the same perturbation budget. The loss typically climbs step after step — each iteration pushes the input a little further into adversarial territory. This is gradient ascent on the loss, projected onto a constraint set — the mirror image of how we train neural networks.

Adversarial Training — Fighting Fire with Fire

If the attack uses gradients to maximize loss, the natural defense is to train on adversarial examples. At each training step, generate adversarial perturbations of the current batch using FGSM or PGD, then update the weights to correctly classify those perturbed inputs. This is adversarial training, and it’s formulated as a min-max optimization:

min_θ E_(x,y)[ max_{||δ||_∞ ≤ ε} L(θ, x + δ, y) ]

The inner maximization (finding the worst-case perturbation) is solved by PGD. The outer minimization (updating the model) is standard gradient descent. This is the same min-max structure that appears in GANs and game theory — the model and the attacker playing against each other, with the model getting stronger in response to increasingly clever attacks.

Adversarial training works remarkably well, but comes with real costs:

import numpy as np

def train_classifier(X, y, epochs=500, adversarial=False, eps=0.15):
    """Train a 64->32->3 MLP, optionally with adversarial training."""
    np.random.seed(7)
    W1 = np.random.randn(64, 32) * np.sqrt(2 / 64)
    b1 = np.zeros(32)
    W2 = np.random.randn(32, 3) * np.sqrt(2 / 32)
    b2 = np.zeros(3)

    for _ in range(epochs):
        X_batch = X.copy()

        if adversarial:
            # Inner max: FGSM on current batch
            h = np.maximum(0, X @ W1 + b1)
            probs = softmax(h @ W2 + b2)
            one_hot = np.zeros_like(probs)
            one_hot[np.arange(len(y)), y] = 1
            dz = probs - one_hot
            dh = dz @ W2.T;  dh[h == 0] = 0
            dx = dh @ W1.T
            X_batch = X + eps * np.sign(dx)    # adversarial inputs

        # Outer min: standard training on (possibly adversarial) inputs
        h = np.maximum(0, X_batch @ W1 + b1)
        probs = softmax(h @ W2 + b2)
        one_hot = np.zeros_like(probs)
        one_hot[np.arange(len(y)), y] = 1
        dz = (probs - one_hot) / len(X)
        W2 -= 1.0 * h.T @ dz;   b2 -= 1.0 * dz.sum(axis=0)
        dh = dz @ W2.T;          dh[h == 0] = 0
        W1 -= 1.0 * X_batch.T @ dh;  b1 -= 1.0 * dh.sum(axis=0)

    return W1, b1, W2, b2

def accuracy(W1, b1, W2, b2, X, y, eps=None):
    h = np.maximum(0, X @ W1 + b1)
    probs = softmax(h @ W2 + b2)
    clean = (probs.argmax(axis=1) == y).mean()
    if eps is None:
        return clean, 0.0
    # FGSM attack
    one_hot = np.zeros_like(probs)
    one_hot[np.arange(len(y)), y] = 1
    dz = probs - one_hot
    dh = dz @ W2.T;  dh[h == 0] = 0
    dx = dh @ W1.T
    X_adv = X + eps * np.sign(dx)
    h_a = np.maximum(0, X_adv @ W1 + b1)
    robust = (softmax(h_a @ W2 + b2).argmax(axis=1) == y).mean()
    return clean, robust

W1s, b1s, W2s, b2s = train_classifier(X, y, adversarial=False)
W1r, b1r, W2r, b2r = train_classifier(X, y, adversarial=True, eps=0.15)

split = int(0.7 * len(X))
X_te, y_te = X[split:], y[split:]   # note: both models saw these points during training; a toy comparison, not a held-out evaluation

for name, w1, b1, w2, b2 in [("Standard", W1s, b1s, W2s, b2s),
                               ("Adversarial", W1r, b1r, W2r, b2r)]:
    clean, robust = accuracy(w1, b1, w2, b2, X_te, y_te, eps=0.15)
    print(f"{name:>12s} | Clean: {clean:.1%} | FGSM (eps=0.15): {robust:.1%}")

The standard model achieves high clean accuracy but collapses under attack. The adversarially trained model sacrifices a few points of clean accuracy in exchange for dramatically improved robustness. This is the robustness-accuracy tradeoff — adversarial training acts as a powerful regularizer that forces the model to rely on robust, human-interpretable features.

Property                 | Standard Training  | Adversarial Training
Clean accuracy           | High               | Slightly lower
FGSM robustness          | Near zero          | Substantial
PGD robustness           | Zero               | Moderate-to-high
Training speed           | Fast (1x)          | Slow (3–10x)
Learned features         | Texture, high-freq | Shape, edges
Feature interpretability | Low                | High

Try It: Perturbation Budget Explorer

See how adversarial perturbations grow with ε. The left panel shows an 8×8 “image” pattern, the center shows which pixels matter most (gradient heatmap), and the right shows the adversarial version. Toggle “Random Noise” to compare: adversarial perturbations are structured, random noise is not.


What Adversarial Examples Teach Us About Neural Networks

Adversarial examples aren’t just a security concern — they’re a window into how neural networks think. And what we see through that window is surprising.

In 2019, Andrew Ilyas and colleagues at MIT published a landmark paper with a provocative title: “Adversarial Examples Are Not Bugs, They Are Features.” They ran an experiment that turned conventional thinking on its head. They took adversarial examples — images with perturbations designed to be classified as the wrong label — and trained a new model on them with those wrong labels. If adversarial perturbations were meaningless noise, this model should learn nothing useful.

Instead, the model achieved above-chance accuracy on clean, unperturbed test images.

This means the adversarial perturbations contain real, genuinely predictive features — patterns that are correlated with the “target” class. Humans simply cannot perceive them. Neural networks exploit these non-robust features: imperceptible high-frequency patterns, texture statistics, and subtle pixel correlations that happen to be genuinely predictive of the label on the training distribution.
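A toy version of this split can be constructed directly (hypothetical data of my own, not the paper's experiment): feature 1 is “robust” (large scale but noisy), feature 2 is “non-robust” (a tiny but nearly noiseless signal). A linear model leans heavily on the non-robust feature, so a perturbation that is negligible on the robust feature's scale destroys accuracy:

```python
import numpy as np

np.random.seed(0)
n = 2000
y = np.random.randint(0, 2, n) * 2 - 1              # labels in {-1, +1}
robust = y * 1.0  + np.random.randn(n) * 0.9        # visible but noisy
nonrob = y * 0.05 + np.random.randn(n) * 0.01       # invisible but nearly clean
X = np.c_[robust, nonrob]

# Least-squares linear classifier
w = np.linalg.lstsq(X, y.astype(float), rcond=None)[0]
clean_acc = (np.sign(X @ w) == y).mean()

# Perturb each point by at most 0.06 per feature, toward the wrong class.
# That is negligible on the robust feature's scale (~1.0) but exceeds the
# non-robust feature's entire signal (0.05).
delta = -0.06 * np.sign(w) * y[:, None]
adv_acc = (np.sign((X + delta) @ w) == y).mean()

print(f"weights: robust {w[0]:.2f}, non-robust {w[1]:.2f}")
print(f"clean accuracy: {clean_acc:.2f}")
print(f"accuracy after 0.06 perturbation: {adv_acc:.2f}")
```

The model's preference is rational on the clean distribution: the non-robust feature is the more reliable predictor. It just isn't robust.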

Standard models use both robust features (edges, shapes — the things humans see) and non-robust features (texture statistics, high-frequency patterns — things humans can’t see). Both help minimize training loss. But non-robust features, while predictive, are fragile — a tiny adversarial perturbation can flip them. Adversarial training forces models to rely only on robust features, which is why adversarially trained models learn shape- and edge-based representations, produce gradients that look meaningful to humans, and give up a few points of clean accuracy: the non-robust features they discard were genuinely predictive.

The meta-lesson is profound: neural networks are solving a different problem than we think. They don’t learn “what a cat looks like” the way we do. They find the shortest path to low loss, and that path runs through features we never intended. Adversarial examples expose this gap between the objective we optimize and the behavior we actually want.

The Arms Race — Beyond Gradient Attacks

FGSM and PGD are “white-box” attacks — they assume full access to the model’s weights and gradients. But the adversarial threat extends far beyond:

Black-box attacks work without model access. Transfer attacks exploit the fact that adversarial examples transfer: craft adversarial examples on a substitute model and they’ll often fool the target. Query-based attacks estimate gradients by observing how the model’s output changes as inputs are perturbed — typically requiring only a few thousand queries.
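A query-based attack can be sketched in a few lines. The “model” below is a hypothetical black box (a hidden linear-sigmoid scorer of my own; the attacker sees only the returned probability), attacked with an NES-style finite-difference estimate of the gradient:

```python
import numpy as np

np.random.seed(0)
d = 50
w_secret = np.random.randn(d)                # hidden model parameters

def query(x):
    """The black box: the attacker sees only this probability."""
    return 1 / (1 + np.exp(-x @ w_secret))

# A confidently classified starting point
x = 0.05 * np.sign(w_secret) + np.random.randn(d) * 0.01
p_clean = query(x)

# NES-style gradient estimate from queries alone: average central
# differences along random directions
sigma, k = 1e-3, 200
g = np.zeros(d)
for _ in range(k):
    u = np.random.randn(d)
    g += (query(x + sigma * u) - query(x - sigma * u)) / (2 * sigma) * u
g /= k

eps = 0.1
x_adv = x - eps * np.sign(g)                 # step DOWN the estimated gradient
p_adv = query(x_adv)
print(f"clean probability:       {p_clean:.3f}")
print(f"adversarial probability: {p_adv:.3f}  ({2 * k} queries)")
```

The attacker never touches w_secret; a few hundred probability queries yield a gradient estimate whose signs are accurate enough for a sign-based attack.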

Physical-world attacks work outside the computer. Researchers have printed adversarial patches that cause classifiers to misidentify stop signs. Adversarial T-shirts can confuse person detectors. 3D-printed adversarial objects fool classifiers from multiple angles. These aren’t academic curiosities — they demonstrate that adversarial vulnerability persists even when the attacker can’t control every pixel.

Certified defenses offer mathematical guarantees. Randomized smoothing provably certifies robustness within an L2 ball: if the smoothed classifier predicts class A, it’s guaranteed that no perturbation within the certified radius can change the prediction. Interval bound propagation provides similar guarantees for L_∞ perturbations. These methods are slower than empirical defenses but provide the only guarantees we have.
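The core of randomized smoothing fits in a few lines. The sketch below uses a toy 2-D base classifier of my own, and the raw empirical vote fraction stands in for the high-confidence lower bound the actual method requires; the certified radius follows the σ · Φ⁻¹(p) form from Cohen et al.:

```python
import numpy as np
from statistics import NormalDist

np.random.seed(0)

def base_classifier(x):
    # Toy base classifier with a wiggly decision boundary in 2-D
    return int(x[0] + 0.3 * np.sin(5 * x[1]) > 0)

def smoothed_predict(x, sigma=0.5, n=2000):
    """Majority vote of the base classifier under Gaussian noise."""
    votes = np.zeros(2)
    for _ in range(n):
        votes[base_classifier(x + sigma * np.random.randn(2))] += 1
    p = min(votes.max() / n, 1 - 1e-6)       # empirical top-class probability
    # Certified L2 radius in the sigma * inv_Phi(p) form; the real method
    # replaces p with a high-confidence lower bound on it
    radius = sigma * NormalDist().inv_cdf(p)
    return int(votes.argmax()), radius

pred, radius = smoothed_predict(np.array([1.0, 0.0]))
print(f"smoothed prediction: class {pred}")
print(f"certified L2 radius: {radius:.2f}")
```

Averaging over noise trades the base classifier's wiggly boundary for a smooth one whose predictions provably cannot change within the certified radius.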

Adversarial NLP reveals that text models are vulnerable too, just with different perturbation types. Synonym substitution, character-level perturbations, universal adversarial triggers (“trigger phrases” that cause any input to be misclassified), and prompt injection attacks all exploit the same core vulnerability: models rely on non-robust features of their input.

The open question that drives ongoing research: is true robustness achievable at scale? Adversarial training improves robustness dramatically but doesn’t eliminate the problem. Certified defenses provide guarantees but at the cost of accuracy. And if we can’t make models robust to imperceptible perturbations — the simplest possible attack — what does that mean for deploying AI in safety-critical systems like autonomous vehicles, medical diagnosis, and content moderation?

References & Further Reading