
Micrograd from Scratch: Teaching Python to Do Calculus

Why Gradients Matter

Every neural network ever trained owes its existence to one trick: numbers that remember how they were made.

Here's the problem. Training a neural network means adjusting thousands (or billions) of parameters to minimize a loss function. To know which direction to adjust each parameter, you need derivatives — the gradient of the loss with respect to every single parameter. Computing those derivatives by hand is impossible for any real network. Computing them numerically (wiggle each parameter, see what happens) is absurdly slow.
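To see why the wiggle-each-parameter approach doesn't scale, here is a standalone sketch of finite-difference gradients (the loss function and values are made up for illustration): it costs two evaluations of the loss per parameter, which is hopeless when there are billions of them.

```python
# Finite-difference gradients: ~2 loss evaluations PER parameter.
def numeric_grad(f, params, h=1e-6):
    grads = []
    for i in range(len(params)):
        bumped_up = params[:i] + [params[i] + h] + params[i+1:]
        bumped_dn = params[:i] + [params[i] - h] + params[i+1:]
        grads.append((f(bumped_up) - f(bumped_dn)) / (2 * h))  # central difference
    return grads  # 2 * len(params) calls to f — autograd needs one backward pass

# A toy quadratic loss with two parameters
loss = lambda p: (p[0] * 2.0 + p[1] - 1.0) ** 2
print(numeric_grad(loss, [0.5, 0.3]))  # ≈ [1.2, 0.6]
```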

The solution is automatic differentiation: build your computation so that every number remembers the operations that created it, then walk backwards through that history to compute all gradients in a single pass. This is what PyTorch, TensorFlow, and JAX do under the hood — and it's what we're going to build from scratch in about 100 lines of Python.

Our engine is inspired by Andrej Karpathy's micrograd — a beautifully minimal autograd engine that operates on individual scalar values. We'll write our own implementation from zero, with our own explanations and our own code. By the end, you'll have a working autograd engine and a tiny neural network that learns.

A Number with Memory

In normal Python, numbers forget their history immediately:

a = 2.0
b = -3.0
c = a * b  # c is -6.0, but it has no idea it came from a * b

We need something richer. We need a number that carries a receipt — a record of where it came from, who its parents were, and what operation created it. Let's call it Value:

class Value:
    def __init__(self, data, _parents=(), _op=''):
        self.data = data              # the actual number
        self.grad = 0.0               # derivative of the output w.r.t. this value
        self._backward = lambda: None # function to compute local gradients
        self._prev = set(_parents)    # parent Values that produced this one
        self._op = _op                # the operation that created this (for debugging)

    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"

Every Value holds its numeric data, a grad field (initially zero — we'll fill it during the backward pass), and pointers to the _prev Values that were combined to create it. The _backward function is a closure that knows how to push gradients back to those parents. Think of it as a receipt stapled to the number: "I was made by multiplying these two values together."

Right now our Values can't do any math. Let's fix that.

Teaching Arithmetic

Python's operator overloading lets us define what +, *, and other operators do for our Value class. Each operator does three things: compute the result, link it to its parents, and store a _backward closure that knows the local gradient rule.

Let's start with addition and multiplication:

def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data + other.data, (self, other), '+')

    def _backward():
        self.grad += out.grad     # d(a+b)/da = 1, so gradient passes through
        other.grad += out.grad    # d(a+b)/db = 1, same thing
    out._backward = _backward
    return out

def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')

    def _backward():
        self.grad += other.data * out.grad   # d(a*b)/da = b
        other.grad += self.data * out.grad   # d(a*b)/db = a
    out._backward = _backward
    return out

Notice the pattern: each _backward function applies the local gradient rule from calculus and multiplies by out.grad (the upstream gradient flowing in from later in the computation). This is the chain rule in action.

The += is critical here, not =. If a Value is used in more than one operation, gradients arrive from multiple paths and must be summed. This is the multivariate chain rule — each path through the graph contributes independently.
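Here is a quick standalone check of that accumulation rule using plain floats: in f(a) = a·a the value a feeds both operands of the multiply, each path contributes d(a·b)/da = b = a, and the two contributions must sum to 2a — exactly what a central difference reports.

```python
# f(a) = a*a uses `a` twice, so gradients arrive along two paths.
def f(a):
    return a * a

a, h = 3.0, 1e-6
numeric = (f(a + h) - f(a - h)) / (2 * h)  # central-difference estimate

path1 = a   # contribution through the left operand of a*a  (d(a*b)/da = b = a)
path2 = a   # contribution through the right operand
print(numeric)        # ≈ 6.0
print(path1 + path2)  # 6.0 — the summed path contributions match
```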

The multiplication gate has a beautiful property: it swaps values. The gradient for self is other.data, and vice versa. If you multiply a tiny number by a huge number, the tiny number gets a huge gradient (it's very sensitive) and the huge number gets a tiny gradient.

We need a few more operations to build a neural network. Power, negation, subtraction, division, and the tanh activation function:

import math

def __pow__(self, other):
    assert isinstance(other, (int, float))  # only constant exponents
    out = Value(self.data ** other, (self,), f'**{other}')

    def _backward():
        self.grad += (other * self.data ** (other - 1)) * out.grad
    out._backward = _backward
    return out

def tanh(self):
    t = math.tanh(self.data)
    out = Value(t, (self,), 'tanh')

    def _backward():
        self.grad += (1 - t ** 2) * out.grad
    out._backward = _backward
    return out

# Convenience methods — these build on the operators above
def __neg__(self):
    return self * -1

def __sub__(self, other):
    return self + (-other)

def __truediv__(self, other):
    return self * other ** -1

# Handle cases like  2.0 + Value(3.0)  where Python tries float's __add__ first
def __radd__(self, other):
    return self + other

def __rmul__(self, other):
    return self * other

The tanh gradient has an elegant property: 1 - t² where t is the output value. The gradient is strongest when the output is near zero (nearly linear behavior) and weakest when the output is near ±1 (saturation). This is the origin of the vanishing gradient problem — stack many tanh layers and gradients shrink exponentially.
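Both effects are easy to see with a few lines of plain math — the local derivative 1 − tanh²(x) at different inputs, and what happens when many such factors multiply (the choice of x = 2.0 and 10 layers below is illustrative):

```python
import math

def dtanh(x):
    """Local gradient of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1 - t ** 2

print(dtanh(0.0))  # 1.0 — strongest gradient, tanh is nearly linear here
print(dtanh(3.0))  # ≈ 0.0099 — saturated, almost no gradient gets through

# Stacking layers multiplies local gradients together:
g = 1.0
for _ in range(10):
    g *= dtanh(2.0)  # each mildly saturated tanh shrinks the gradient further
print(g)  # ≈ 3e-12 — the vanishing gradient problem in miniature
```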

Now we can write normal-looking math and a computation graph silently builds itself behind the scenes:

x1 = Value(2.0)
w1 = Value(-3.0)
x2 = Value(0.0)
w2 = Value(1.0)
b  = Value(6.8813735870195432)  # chosen so the output lands at a clean 0.7071

# A single neuron: weighted sum + tanh activation
n = x1*w1 + x2*w2 + b   # graph building silently...
o = n.tanh()

print(o)  # Value(data=0.7071, grad=0.0000)

The output is computed, and hiding behind o is a complete graph of every operation that produced it. Now we need to walk it backwards.

The Backward Pass

The backward pass answers one question: how sensitive is the output to each input? Or in calculus terms: what is ∂o/∂x for every value x in the graph?

The algorithm is beautifully simple:

  1. Set the output's gradient to 1.0 (the output is perfectly sensitive to itself: ∂o/∂o = 1)
  2. Walk through every node in reverse topological order (output first, inputs last)
  3. At each node, call its _backward() function to push gradients to its parents

Why topological order? Because a node can't propagate its gradient until it has received all gradient contributions from its children. Processing output-to-input guarantees this.

def backward(self):
    # Build topological order via depth-first search
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for parent in v._prev:
                build_topo(parent)
            topo.append(v)
    build_topo(self)

    # Seed the output gradient
    self.grad = 1.0

    # Walk in reverse: output → inputs
    for v in reversed(topo):
        v._backward()

The DFS visits all parents before appending the current node, so reversing the list gives us the correct output-to-input order. Let's run it on our single neuron:

o.backward()

print(f"x1.grad = {x1.grad:.4f}")  # -1.5000
print(f"w1.grad = {w1.grad:.4f}")  #  1.0000
print(f"x2.grad = {x2.grad:.4f}")  #  0.5000
print(f"w2.grad = {w2.grad:.4f}")  #  0.0000
print(f"b.grad  = {b.grad:.4f}")   #  0.5000

Read these gradients as recommendations: "To increase the output, decrease x1 (gradient is negative), increase x2 (gradient is positive), and increase b (gradient is positive)." The gradient for w2 is zero because x2 is zero — changing a weight on a zero input has no effect.

Let's verify with a concrete example. The gradient for x1 is −1.5. This means: if we nudge x1 from 2.0 to 2.001, the output should change by approximately −1.5 × 0.001 = −0.0015. You can verify this numerically — the autograd engine gets it exactly right.
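That numeric check can be run as a standalone sketch with plain math.tanh, using the example's weights (the long bias constant is the precise value that puts the output at 0.7071, where 1 − t² = 0.5 exactly):

```python
import math

# The single neuron from the example, as an ordinary function of x1
def neuron(x1, x2=0.0, w1=-3.0, w2=1.0, b=6.8813735870195432):
    return math.tanh(x1 * w1 + x2 * w2 + b)

# Central-difference estimate of do/dx1 at x1 = 2.0
h = 1e-6
numeric = (neuron(2.0 + h) - neuron(2.0 - h)) / (2 * h)
print(numeric)  # ≈ -1.5 — matches x1.grad from the backward pass
```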

Try It: Computation Graph Visualizer

Interactive: Gradient Flow in a Single Neuron

This visualizes o = tanh(x1·w1 + x2·w2 + b) with fixed weights w1=−3, w2=1, b≈6.88. Drag the sliders to change inputs and watch how gradients flow backward through the graph.

Play with the sliders and watch the gradient values update. Notice how changing x1 affects gradients throughout the graph — particularly how the multiplication gate swaps values (x1's gradient depends on w1's data, and vice versa). When x2 is zero, w2's gradient vanishes completely — you can't learn a weight when its input contributes nothing.

From Values to Neurons

With our autograd engine complete, building a neural network is surprisingly straightforward. A neuron is just a weighted sum plus a bias, passed through an activation function — exactly the expression we've been working with:

import random

class Neuron:
    def __init__(self, n_inputs):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(n_inputs)]
        self.b = Value(0.0)

    def __call__(self, x):
        # w1*x1 + w2*x2 + ... + b
        activation = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return activation.tanh()

    def parameters(self):
        return self.w + [self.b]

A Layer is a row of neurons that all take the same input. An MLP (multi-layer perceptron) stacks layers, feeding each layer's output into the next:

class Layer:
    def __init__(self, n_inputs, n_outputs):
        self.neurons = [Neuron(n_inputs) for _ in range(n_outputs)]

    def __call__(self, x):
        out = [neuron(x) for neuron in self.neurons]
        return out[0] if len(out) == 1 else out

    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]


class MLP:
    def __init__(self, n_inputs, layer_sizes):
        sizes = [n_inputs] + layer_sizes
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(layer_sizes))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

Every weight and bias is a Value. Every forward pass builds a fresh computation graph. Every call to .backward() computes gradients for the entire network in one pass.

model = MLP(3, [4, 4, 1])
print(f"Number of parameters: {len(model.parameters())}")
# Layer 1: 4 neurons × (3 weights + 1 bias) = 16
# Layer 2: 4 neurons × (4 weights + 1 bias) = 20
# Output:  1 neuron  × (4 weights + 1 bias) = 5
# Total: 41 parameters
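The breakdown generalizes: every layer contributes (fan_in + 1) parameters per neuron — its weights plus one bias. A small helper (the name `mlp_param_count` is made up for this sketch) to sanity-check the arithmetic:

```python
def mlp_param_count(n_inputs, layer_sizes):
    """Per-layer parameter counts: (fan_in + 1) parameters for each neuron."""
    counts = []
    fan_in = n_inputs
    for n_out in layer_sizes:
        counts.append((fan_in + 1) * n_out)  # weights + bias, times neurons
        fan_in = n_out                        # this layer's output feeds the next
    return counts

print(mlp_param_count(3, [4, 4, 1]))  # [16, 20, 5]
print(sum(mlp_param_count(3, [4, 4, 1])))  # 41
```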

Training: Gradient Descent in Action

Let's train our network on a tiny dataset. We have 4 input-output pairs and we want the network to learn the mapping:

# Training data: 4 examples with 3 features each
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]  # desired outputs

The training loop has five steps: forward pass, compute loss, zero gradients, backward pass, and parameter update. Step 3 is the one people forget — because backward() accumulates with +=, skipping it lets gradients pile up across iterations and produces nonsense updates:

for step in range(50):

    # 1. Forward pass — make predictions
    predictions = [model(x) for x in xs]

    # 2. Loss — mean squared error
    loss = sum((pred - target) ** 2 for pred, target in zip(predictions, ys))

    # 3. Zero all gradients (CRITICAL — backward() uses +=)
    for p in model.parameters():
        p.grad = 0.0

    # 4. Backward pass — compute all gradients
    loss.backward()

    # 5. Update — nudge each parameter in the direction that reduces loss
    learning_rate = 0.05
    for p in model.parameters():
        p.data -= learning_rate * p.grad

    if step % 10 == 0:
        print(f"Step {step:3d} | Loss: {loss.data:.6f}")

Running this produces something like:

Step   0 | Loss: 4.836271
Step  10 | Loss: 0.036174
Step  20 | Loss: 0.009653
Step  30 | Loss: 0.004763
Step  40 | Loss: 0.002930

The loss plummets from ~4.8 to ~0.003 in 50 steps. Let's check the predictions after training:

[model(x).data for x in xs]
# [0.9672, -0.9748, -0.9686, 0.9653]
# Target: [1.0, -1.0, -1.0, 1.0] — very close!

Forty-one parameters, fifty gradient descent steps, and our network learned a nonlinear decision boundary from scratch — using an autograd engine we built ourselves.

From 100 Lines to PyTorch

We built a complete autograd engine and a working neural network in roughly 100 lines of Python. The core insight is this: by giving numbers memory (tracking their parents and the operations that created them) and a backward function (the local gradient rule), we can differentiate arbitrarily complex expressions automatically.

The leap from micrograd to PyTorch is one of efficiency, not concepts. PyTorch operates on tensors instead of scalars, can run on GPUs as well as CPUs, and uses fused CUDA kernels instead of Python closures. But the mathematical machinery — the forward pass building a graph, the backward pass propagating gradients via the chain rule, the topological ordering — is identical.

Every time PyTorch computes a gradient, it's doing exactly what our little Value class does. Just faster, and with much bigger numbers.

If you want to go deeper, try extending the engine: add relu() and sigmoid() activations, implement a softmax output layer, or build a small classifier for real data. The beauty of understanding the fundamentals is that everything built on top becomes transparent.
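As a nudge toward that first exercise, here is one way relu() could look, following the same pattern as tanh() — shown with a minimal restatement of the Value constructor so the snippet runs on its own:

```python
class Value:
    # Minimal shell of the article's class — just enough to demo relu
    def __init__(self, data, _parents=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_parents)
        self._op = _op

    def relu(self):
        out = Value(max(0.0, self.data), (self,), 'relu')

        def _backward():
            # Gradient passes through only where the input was positive
            self.grad += (1.0 if self.data > 0 else 0.0) * out.grad
        out._backward = _backward
        return out

x = Value(-2.0); y = x.relu()
y.grad = 1.0; y._backward()
print(x.grad)  # 0.0 — relu blocks the gradient for negative inputs

x = Value(3.0); y = x.relu()
y.grad = 1.0; y._backward()
print(x.grad)  # 1.0 — gradient passes straight through
```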

References & Further Reading