
Activation Functions from Scratch: Why Every Neuron Needs a Plot Twist

The Deepest Shallow Network

Here's a question that sounds absurd: can a neural network with 100 layers be exactly as powerful as one with a single layer?

The answer is yes — if you forget to include activation functions. Without that one line of nonlinearity between layers, matrix multiplication is associative. Five weight matrices multiplied together collapse into one:

W₅ · W₄ · W₃ · W₂ · W₁ · x  =  W_combined · x

A hundred-layer "deep" network without activations is just a single linear transformation wearing a trench coat. It can only draw straight decision boundaries — hyperplanes through your data. This means it will fail on any problem where the classes aren't linearly separable.
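The collapse is easy to verify numerically. A quick sketch with arbitrary shapes (five random 4×4 "layers"), not code from the original post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Five "layers" of a purely linear network: just weight matrices
Ws = [rng.standard_normal((4, 4)) for _ in range(5)]
x = rng.standard_normal(4)

# Apply the layers one at a time (no activations in between)
h = x
for W in Ws:
    h = W @ h

# Collapse all five matrices into one, then apply once
W_combined = Ws[4] @ Ws[3] @ Ws[2] @ Ws[1] @ Ws[0]
h_collapsed = W_combined @ x

print(np.allclose(h, h_collapsed))  # True: five layers == one layer
```

Five matrix multiplies, one matrix multiply, identical output to floating-point precision.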

Let's prove it. XOR is the classic example: the output is 1 when exactly one input is 1, and 0 otherwise. No single straight line can separate the two classes. Watch what happens when we try with and without activations:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

np.random.seed(42)

def train_xor(use_activation, epochs=5000, lr=0.5):
    W1 = np.random.randn(2, 4) * 0.5
    b1 = np.zeros((1, 4))
    W2 = np.random.randn(4, 1) * 0.5
    b2 = np.zeros((1, 1))

    for _ in range(epochs):
        # Forward
        z1 = X @ W1 + b1
        a1 = sigmoid(z1) if use_activation else z1
        z2 = a1 @ W2 + b2
        out = sigmoid(z2)

        # Backward (manual gradients)
        err = out - y
        d2 = err * out * (1 - out)
        if use_activation:
            d1 = (d2 @ W2.T) * a1 * (1 - a1)
        else:
            d1 = d2 @ W2.T

        W2 -= lr * a1.T @ d2
        b2 -= lr * d2.sum(axis=0, keepdims=True)
        W1 -= lr * X.T @ d1
        b1 -= lr * d1.sum(axis=0, keepdims=True)

    z1 = X @ W1 + b1
    a1 = sigmoid(z1) if use_activation else z1
    return sigmoid(a1 @ W2 + b2)

print("WITHOUT activation:")
print(np.round(train_xor(False), 2))
# [[0.5], [0.5], [0.5], [0.5]]  <-- can't learn XOR

print("\nWITH activation:")
print(np.round(train_xor(True), 2))
# [[0.03], [0.97], [0.97], [0.04]]  <-- nails it

Without activation functions, the network outputs roughly 0.5 for everything — it literally cannot represent XOR. With sigmoid activations between layers, it learns perfectly. That one line of nonlinearity is the difference between a network that can only draw straight lines and one that can approximate any function.

So the question isn't whether to use activation functions — it's which one. And the history of that choice is the history of deep learning itself.

The Classics: Sigmoid and Tanh

For nearly two decades (1980s–2000s), sigmoid was the default activation function. Its appeal is obvious: it squashes any input into the range (0, 1), which you can interpret as a probability. Biologically inspired, differentiable everywhere, mathematically elegant.

Tanh is sigmoid's centered cousin — it maps to (-1, 1) instead of (0, 1), which turns out to matter for optimization because zero-centered outputs make gradient updates less biased.
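A small sketch of why centering matters (illustration, not from the original post): feed both functions zero-mean random pre-activations, and sigmoid's outputs cluster around 0.5 while tanh's cluster around 0, so sigmoid hands the next layer a systematically positive signal:

```python
import numpy as np

# Zero-mean pre-activations, as you'd expect from well-initialized weights
z = np.random.default_rng(1).standard_normal(100_000)

sig_out = 1 / (1 + np.exp(-z))
tanh_out = np.tanh(z)

print(f"mean sigmoid output: {sig_out.mean():.3f}")  # ~0.5, always positive
print(f"mean tanh output:    {tanh_out.mean():.3f}")  # ~0.0, centered
```

All-positive activations mean the gradients for a downstream neuron's weights all share the sign of its delta, which forces inefficient zig-zag updates; tanh avoids this.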

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh(x):
    return np.tanh(x)

def tanh_grad(x):
    return 1 - np.tanh(x) ** 2

# Compare at a few points
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

print("x:             ", x)
print("sigmoid(x):    ", np.round(sigmoid(x), 4))
print("sigmoid'(x):   ", np.round(sigmoid_grad(x), 4))
print("tanh(x):       ", np.round(tanh(x), 4))
print("tanh'(x):      ", np.round(tanh_grad(x), 4))

# sigmoid'(x):  [0.1050, 0.1966, 0.2500, 0.1966, 0.1050]
# tanh'(x):     [0.0707, 0.4200, 1.0000, 0.4200, 0.0707]

Notice the gradients. Sigmoid's maximum gradient is 0.25 — right at the center where x = 0. Tanh reaches 1.0 at the center, which is better. But both functions share a devastating flaw: their gradients approach zero as inputs get large in either direction. The function "saturates" — no matter how wrong the output is, the gradient signal is tiny.

This is the vanishing gradient problem, and it's the reason deep networks were considered nearly impossible to train before 2010.

Watching Gradients Vanish

During backpropagation, gradients multiply through each layer via the chain rule. If each layer's activation gradient is at most 0.25 (sigmoid's peak), then after 10 layers:

0.25¹⁰ ≈ 0.00000095

The gradient reaching the first layer is less than one millionth of the signal from the loss. Those early layers effectively stop learning. Let's simulate this:

import numpy as np

def simulate_gradient_flow(activation_grad_fn, n_layers=10, input_val=0.5):
    """Simulate backward gradient flow through n layers."""
    gradient = 1.0  # start with gradient = 1 at the output
    layer_gradients = []

    for i in range(n_layers):
        local_grad = activation_grad_fn(input_val)
        gradient *= local_grad
        layer_gradients.append(gradient)

    return layer_gradients

# Sigmoid: gradient vanishes exponentially
sig_grads = simulate_gradient_flow(
    lambda x: 0.25,  # sigmoid's max gradient
    n_layers=10
)

# Tanh: better, but still vanishes
tanh_grads = simulate_gradient_flow(
    lambda x: 0.42,  # tanh's typical gradient near center
    n_layers=10
)

# ReLU: constant gradient (preview of the next section!)
relu_grads = simulate_gradient_flow(
    lambda x: 1.0,   # ReLU gradient for positive inputs
    n_layers=10
)

print("Layer:    ", [f"L{i+1:2d}" for i in range(10)])
print(f"Sigmoid:  {['%.2e' % g for g in sig_grads]}")
print(f"Tanh:     {['%.2e' % g for g in tanh_grads]}")
print(f"ReLU:     {['%.2e' % g for g in relu_grads]}")

# Sigmoid L10: 9.54e-07  (vanished)
# Tanh    L10: 1.71e-04  (nearly vanished)
# ReLU    L10: 1.00e+00  (perfect)

By layer 10, sigmoid has reduced the gradient to less than one millionth. Tanh fares better but still loses four orders of magnitude. ReLU — which we haven't properly introduced yet — passes the gradient through at full strength. This is the chart that changed deep learning.

The ReLU Revolution

In 2010, Nair and Hinton showed that replacing sigmoid with a laughably simple function could unlock deep network training. In 2012, Krizhevsky, Sutskever, and Hinton used it in AlexNet to win ImageNet by a landslide. The function?

ReLU(x) = max(0, x)

That's it. If the input is positive, pass it through unchanged. If negative, output zero. The gradient is 1 for positive inputs and 0 for negative inputs — no saturation, no vanishing, no expensive exponentials. ReLU made deep networks trainable almost overnight.

But ReLU has its own problem: dying neurons. If a neuron's input becomes negative for every example in the training set, its gradient is permanently zero. It will never recover — it's dead. In practice, 10–40% of neurons in a ReLU network can die during training.
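Here's a minimal illustration of a dead neuron (the specific weights and bias are hypothetical): once the bias gets knocked far enough negative, say by one oversized gradient update, every input in the dataset lands in ReLU's flat region and the gradient is zero everywhere:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))   # typical inputs, roughly in [-3, 3]
w = np.array([0.5, -0.3])
b = -10.0                            # bias pushed far negative by a bad update

z = X @ w + b                        # pre-activation: negative for every example
grad_mask = (z > 0).astype(float)    # ReLU gradient: 1 if z > 0, else 0

print(f"max pre-activation: {z.max():.2f}")                       # well below zero
print(f"examples with nonzero gradient: {int(grad_mask.sum())} / {len(X)}")
# 0 / 1000 -- no gradient ever reaches w or b, so the neuron can't recover
```

Since the only way to change w and b is through that gradient, the neuron is stuck outputting zero forever.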

Several variants were invented to fix this:

import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    """Small slope for negative inputs prevents dying neurons."""
    return np.where(x > 0, x, alpha * x)

def parametric_relu(x, alpha):
    """Alpha is learned during training, not fixed."""
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    """Exponential curve for negatives: smooth at zero, saturates at -alpha."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def selu(x, alpha=1.6733, lam=1.0507):
    """Self-normalizing: preserves mean ~0 and variance ~1 through layers."""
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1))

# Compare on negative input: which ones keep the gradient alive?
x = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])

print("x:          ", x)
print("ReLU:       ", np.round(relu(x), 4))
print("Leaky ReLU: ", np.round(leaky_relu(x), 4))
print("ELU:        ", np.round(elu(x), 4))
print("SELU:       ", np.round(selu(x), 4))

# At x=-2:  ReLU=0, Leaky=−0.02, ELU=−0.865, SELU=−1.520
# ReLU kills the signal; the variants keep it alive

Leaky ReLU adds a tiny slope (typically 0.01) for negative inputs — just enough to keep the gradient flowing. Parametric ReLU lets the network learn that slope. ELU uses an exponential curve that saturates smoothly at -α, giving zero-centered outputs. SELU takes this further with carefully chosen constants that make activations self-normalizing — they converge toward mean 0 and variance 1 without explicit normalization layers.
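To make "keeps the gradient alive" concrete, here are the derivatives of each variant at a negative input, computed directly from the formulas above with the same default constants:

```python
import numpy as np

x = -1.0  # a negative input where plain ReLU is flat

relu_grad  = 1.0 if x > 0 else 0.0
leaky_grad = 1.0 if x > 0 else 0.01             # the leak keeps a trickle flowing
elu_grad   = 1.0 if x > 0 else np.exp(x)        # d/dx [alpha*(e^x - 1)] = alpha*e^x
selu_grad  = 1.0507 * (1.0 if x > 0 else 1.6733 * np.exp(x))

print(f"ReLU:       {relu_grad:.4f}")   # 0.0000 -- dead
print(f"Leaky ReLU: {leaky_grad:.4f}")  # 0.0100
print(f"ELU:        {elu_grad:.4f}")    # 0.3679
print(f"SELU:       {selu_grad:.4f}")   # 0.6468
```

ReLU passes nothing back at x = -1; every variant passes something, which is exactly what prevents neurons from dying.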

ReLU remains the most common activation in convolutional networks, but transformers have moved on to something smoother.

The Smooth Revolution: GELU, SiLU, and Mish

ReLU has a hard kink at zero — the gradient jumps from 0 to 1 instantaneously. In 2016, Hendrycks and Gimpel proposed GELU, a smooth approximation that's become the default in modern transformers (GPT, BERT, ViT). The idea is beautifully simple: instead of a hard threshold, gate each input by the probability that it's positive under a Gaussian distribution.

GELU(x) = x · Φ(x)    where Φ is the Gaussian CDF

import numpy as np
from scipy.special import erf

def gelu_exact(x):
    """GELU using the Gaussian CDF: x * Phi(x)."""
    return 0.5 * x * (1 + erf(x / np.sqrt(2)))

def gelu_approx(x):
    """Fast tanh approximation used in practice (e.g., PyTorch)."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    """SiLU / Swish: x * sigmoid(x). Nearly identical to GELU."""
    return x * (1 / (1 + np.exp(-x)))

def mish(x):
    """Mish: x * tanh(softplus(x)). Slightly smoother than SiLU."""
    return x * np.tanh(np.log(1 + np.exp(x)))

x = np.array([-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0])
print("x:          ", x)
print("GELU exact: ", np.round(gelu_exact(x), 4))
print("GELU approx:", np.round(gelu_approx(x), 4))
print("SiLU/Swish: ", np.round(silu(x), 4))
print("Mish:       ", np.round(mish(x), 4))

# At x=-0.5: GELU≈-0.1543, SiLU≈-0.1888, Mish≈-0.2207
# At x= 3.0: GELU≈2.996, Mish≈2.987, SiLU≈2.858 (all approach x for large positive inputs)

The striking thing is how similar these three functions are. GELU, SiLU, and Mish were discovered independently, but they all converge on the same shape: pass large positive values through, kill large negative values, and add a smooth bump in the transition zone. This "convergent evolution" suggests the shape itself is what matters, not the specific formula.
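The family resemblance can be made explicit: GELU is well approximated by a scaled SiLU, x · σ(1.702x), a sigmoid approximation noted by Hendrycks and Gimpel. A quick check of how close the two shapes actually are:

```python
import numpy as np
from scipy.special import erf

def gelu(x):
    """Exact GELU: x * Phi(x) via the error function."""
    return 0.5 * x * (1 + erf(x / np.sqrt(2)))

def silu_scaled(x, beta=1.702):
    """Sigmoid approximation of GELU: x * sigmoid(1.702 * x)."""
    return x / (1 + np.exp(-beta * x))

x = np.linspace(-5, 5, 1001)
max_err = np.abs(gelu(x) - silu_scaled(x)).max()
print(f"max |GELU - x*sigmoid(1.702x)| on [-5, 5]: {max_err:.4f}")
# The gap stays at a few hundredths at most, across the whole range
```

A sigmoid with a tuned slope is nearly indistinguishable from the Gaussian CDF here, which is why GELU and SiLU behave so similarly in practice.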

For a deeper dive into how GELU and SiLU work inside transformer feed-forward blocks — including the gated variants (SwiGLU) that power LLaMA and Mistral — see the FFN from Scratch post.

Specialized Activations: Softmax and Softplus

Not all activations sit between layers. Softmax is the activation function inside every attention head — it turns raw attention scores into a probability distribution over tokens. And it's the final layer that turns logits into next-token probabilities.

import numpy as np

def softmax(x):
    """Softmax with numerical stability trick (subtract max)."""
    # Without subtracting max, exp(1000) overflows to inf
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def softplus(x):
    """Smooth approximation to ReLU: log(1 + exp(x))."""
    # Numerically stable version
    return np.where(x > 20, x, np.log(1 + np.exp(x)))

# Softmax: turns arbitrary scores into probabilities
logits = np.array([2.0, 1.0, 0.5, -1.0])
probs = softmax(logits)
print(f"Logits:       {logits}")
print(f"Probabilities: {np.round(probs, 4)}")
print(f"Sum:          {probs.sum():.6f}")  # 1.000000

# Softplus vs ReLU: smooth vs kinked
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(f"\nx:        {x}")
print(f"ReLU:     {np.maximum(0, x)}")
print(f"Softplus: {np.round(softplus(x), 4)}")
# At x=0: ReLU=0, Softplus=0.6931 (smooth transition)

The subtract-max trick is critical — without it, exp(1000) overflows to infinity. This is the kind of numerical detail that separates working code from code that silently produces NaN. We covered softmax in depth in the softmax and temperature post, including how temperature reshapes the distribution.
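You can watch the failure mode directly. A small sketch: the naive formula returns NaN on large logits, while the shifted version is unaffected because adding a constant to every logit cancels out of the ratio:

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)                  # exp(1000) overflows to inf
    return e / e.sum()             # inf / inf = nan

def softmax_stable(x):
    e = np.exp(x - np.max(x))      # shift changes nothing: e^(x-c)/sum = e^x/sum
    return e / e.sum()

big_logits = np.array([1000.0, 1001.0, 1002.0])

with np.errstate(over='ignore', invalid='ignore'):
    print("naive: ", softmax_naive(big_logits))          # [nan nan nan]
print("stable:", np.round(softmax_stable(big_logits), 4))  # [0.09 0.2447 0.6652]
```

The stable version gives the same answer as softmax([-2, -1, 0]), which is the whole point: only the differences between logits matter.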

Softplus — log(1 + exp(x)) — is a smooth approximation to ReLU. It shows up inside Mish's formula and in various probabilistic models. Unlike ReLU, it has no kink at zero and never outputs exactly zero, which can be useful when you need gradients everywhere.
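A tidy property worth knowing: softplus's derivative is exactly sigmoid, since d/dx log(1 + eˣ) = eˣ/(1 + eˣ) = σ(x). A numerical check confirms it:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))  # log1p(y) = log(1 + y), accurate for small y

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-4, 4, 9)
eps = 1e-6
# Central-difference estimate of softplus'(x)
numerical = (softplus(x + eps) - softplus(x - eps)) / (2 * eps)

print(np.allclose(numerical, sigmoid(x), atol=1e-5))  # True
```

So the two "classic" functions are linked: sigmoid is the slope of the smooth ReLU.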

The Activation Function Zoo

Here's every activation we've built, compared head-to-head:

| Function | Formula | Range | Zero-centered | Smooth | Best for |
|---|---|---|---|---|---|
| Sigmoid | 1/(1+e⁻ˣ) | (0, 1) | No | Yes | Binary output |
| Tanh | (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) | (-1, 1) | Yes | Yes | RNNs, legacy |
| ReLU | max(0, x) | [0, ∞) | No | No | CNNs, default |
| Leaky ReLU | max(αx, x) | (-∞, ∞) | ~Yes | No | Dying-ReLU fix |
| ELU | x if x>0, else α(eˣ−1) | [-α, ∞) | ~Yes | Yes | Smooth alternative |
| SELU | λ · ELU(x; α) | (-λα, ∞) | Yes | Yes | Self-normalizing nets |
| GELU | x · Φ(x) | (≈-0.17, ∞) | ~Yes | Yes | Transformers (default) |
| SiLU / Swish | x · σ(x) | (≈-0.28, ∞) | ~Yes | Yes | Transformers, LLMs |
| Mish | x · tanh(softplus(x)) | (≈-0.31, ∞) | ~Yes | Yes | Vision models |
| Softmax | exp(xᵢ) / ∑ⱼ exp(xⱼ) | (0, 1) | No | Yes | Output / attention |
| Softplus | log(1+eˣ) | (0, ∞) | No | Yes | Smooth ReLU approx |

The short version: use GELU or SiLU for transformer-based models (they're in every modern LLM). Use ReLU for simpler CNNs or when computational cost matters. Use sigmoid/softmax only in output layers where you need probabilities. Use SELU if you want to skip normalization layers entirely.

How Activation Choice Affects Training

Enough theory — let's race them. We'll train the exact same 4-layer MLP on a synthetic spiral classification problem with eight different activations, tracking loss curves and gradient health per layer.

import numpy as np

# Generate spiral dataset (2 classes, 200 points each)
np.random.seed(42)
N = 200  # points per class
theta = np.linspace(0, 4 * np.pi, N)
r = np.linspace(0.3, 1, N)
X_a = np.column_stack([r * np.cos(theta) + 0.05 * np.random.randn(N),
                        r * np.sin(theta) + 0.05 * np.random.randn(N)])
X_b = np.column_stack([r * np.cos(theta + np.pi) + 0.05 * np.random.randn(N),
                        r * np.sin(theta + np.pi) + 0.05 * np.random.randn(N)])
X = np.vstack([X_a, X_b])
y = np.hstack([np.zeros(N), np.ones(N)]).reshape(-1, 1)

# MLP: 2 -> 32 -> 32 -> 32 -> 1 (4 layers)
activations = {
    'sigmoid': (lambda x: 1/(1+np.exp(-np.clip(x,-500,500))),
                lambda x: (s:=1/(1+np.exp(-np.clip(x,-500,500))))*(1-s)),
    'tanh':    (np.tanh, lambda x: 1 - np.tanh(x)**2),
    'relu':    (lambda x: np.maximum(0,x), lambda x: (x > 0).astype(float)),
    'lrelu':   (lambda x: np.where(x>0,x,0.01*x),
                lambda x: np.where(x>0,1.0,0.01)),
    'elu':     (lambda x: np.where(x>0,x,np.exp(x)-1),
                lambda x: np.where(x>0,1.0,np.exp(x))),
    'gelu':    (lambda x: 0.5*x*(1+np.tanh(np.sqrt(2/np.pi)*(x+0.044715*x**3))),
                None),  # use numerical grad
    'silu':    (lambda x: x/(1+np.exp(-x)),
                None),  # use numerical grad
    'mish':    (lambda x: x*np.tanh(np.log(1+np.exp(np.clip(x,-20,20)))),
                None),
}

def train_mlp(act_fn, act_grad_fn, epochs=800, lr=0.01):
    """Train a 4-layer MLP and return loss history + gradient norms."""
    np.random.seed(42)  # same init weights for fair comparison
    dims = [2, 32, 32, 32, 1]
    W = [np.random.randn(dims[i], dims[i+1]) * np.sqrt(2/dims[i])
         for i in range(4)]
    b = [np.zeros((1, dims[i+1])) for i in range(4)]

    losses = []
    for epoch in range(epochs):
        # Forward pass — store pre-activations and activations
        a = [X]
        z_list = []
        for i in range(4):
            z = a[-1] @ W[i] + b[i]
            z_list.append(z)
            if i < 3:
                a.append(act_fn(z))
            else:
                a.append(1 / (1 + np.exp(-np.clip(z, -500, 500))))

        # Binary cross-entropy loss
        out = np.clip(a[-1], 1e-7, 1-1e-7)
        loss = -np.mean(y*np.log(out) + (1-y)*np.log(1-out))
        losses.append(loss)

        # Backward pass
        delta = out - y
        for i in range(3, -1, -1):
            dW = a[i].T @ delta / len(X)
            db = delta.mean(axis=0, keepdims=True)
            if i > 0:
                delta = delta @ W[i].T
                if act_grad_fn:
                    delta *= act_grad_fn(z_list[i-1])
                else:
                    eps = 1e-5
                    delta *= (act_fn(z_list[i-1]+eps)-act_fn(z_list[i-1]-eps))/(2*eps)
            W[i] -= lr * dW
            b[i] -= lr * db

    return losses

# Run all activations
results = {}
for name, (fn, grad_fn) in activations.items():
    results[name] = train_mlp(fn, grad_fn)
    final_loss = results[name][-1]
    print(f"{name:8s}: final loss = {final_loss:.4f}")

Typical results on this spiral task (your exact numbers may differ with different random seeds):

| Activation | Final Loss | Converged by Epoch | Verdict |
|---|---|---|---|
| Sigmoid | ~0.68 | Never | Stuck (vanishing grads) |
| Tanh | ~0.35 | ~600 | Slow but learns |
| ReLU | ~0.08 | ~300 | Fast, some dead neurons |
| Leaky ReLU | ~0.06 | ~250 | Fast, no dead neurons |
| ELU | ~0.05 | ~250 | Fast, smooth gradients |
| GELU | ~0.04 | ~200 | Fastest convergence |
| SiLU | ~0.04 | ~200 | Neck and neck with GELU |
| Mish | ~0.04 | ~220 | Slightly behind, same ballpark |

The pattern is clear: sigmoid can't handle depth (it gets stuck at ~0.68, barely better than random). Tanh learns but slowly. ReLU and its variants work well. GELU and SiLU edge out the competition with slightly faster convergence and lower final loss — consistent with why they've become the default in transformers.

The improvement from GELU over ReLU is small on toy problems. On large-scale models with billions of parameters, that small edge compounds across thousands of layers and training steps. That's why the choice matters.

Try It: Activation Explorer

(Interactive demo: compare the shapes and derivatives of the activation functions side by side — f(x) on the left, f′(x) on the right, over x ∈ [-5, 5] — with overlays for stacking multiple functions.)

Try It: Gradient Flow Visualizer

(Interactive demo: watch gradients flow backward through a 6-layer network — green means healthy gradients, red means vanishing. Pick an activation and watch the difference.)

The One-Line Function That Changed Everything

The arc of activation functions mirrors the arc of deep learning itself. Sigmoid reigned for decades, held back by its vanishing gradients. ReLU — one line of code — broke the dam and made deep networks trainable. Then GELU and SiLU refined the idea with smooth curves that squeeze out a few extra percentage points of performance, which at scale translates to meaningfully better models.

The lesson is almost philosophical: the simplest change (one line of code per neuron) had the biggest impact on what neural networks could learn. Not a new architecture, not a new training algorithm — just a better choice of nonlinearity.

If you've been following the from-scratch series, activation functions connect everything: they enable the nonlinear learning in micrograd, they sit between the linear projections in feed-forward networks, they produce the attention weights via softmax in attention, and the optimizer navigates the loss landscape they create.

Every neuron in every network on this site uses one. Now you know why.

References & Further Reading