Micrograd from Scratch: Teaching Python to Do Calculus
Why Gradients Matter
Every neural network ever trained owes its existence to one trick: numbers that remember how they were made.
Here's the problem. Training a neural network means adjusting thousands (or billions) of parameters to minimize a loss function. To know which direction to adjust each parameter, you need derivatives — the gradient of the loss with respect to every single parameter. Computing those derivatives by hand is impossible for any real network. Computing them numerically (wiggle each parameter, see what happens) is absurdly slow.
The solution is automatic differentiation: build your computation so that every number remembers the operations that created it, then walk backwards through that history to compute all gradients in a single pass. This is what PyTorch, TensorFlow, and JAX do under the hood — and it's what we're going to build from scratch in about 100 lines of Python.
Our engine is inspired by Andrej Karpathy's micrograd — a beautifully minimal autograd engine that operates on individual scalar values. We'll write our own implementation from zero, with our own explanations and our own code. By the end, you'll have a working autograd engine and a tiny neural network that learns.
A Number with Memory
In normal Python, numbers forget their history immediately:
```python
a = 2.0
b = -3.0
c = a * b  # c is -6.0, but it has no idea it came from a * b
```
We need something richer. We need a number that carries a receipt — a record of where it came from, who its parents were, and what operation created it. Let's call it Value:
```python
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data               # the actual number
        self.grad = 0.0                # derivative of the output w.r.t. this value
        self._backward = lambda: None  # function to compute local gradients
        self._prev = set(_children)    # parent Values that produced this one
        self._op = _op                 # the operation that created this (for debugging)

    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
```
Every Value holds its numeric data, a grad field (initially zero — we'll fill it during the backward pass), and pointers to the _prev Values that were combined to create it. The _backward function is a closure that knows how to push gradients back to those parents. Think of it as a receipt stapled to the number: "I was made by multiplying these two values together."
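To make the bookkeeping concrete, here is a receipt built by hand (the class definition is repeated so the snippet runs on its own). Once we overload the operators in the next section, this construction happens automatically:

```python
class Value:
    # same class as above, repeated so this snippet is self-contained
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"

a = Value(2.0)
b = Value(-3.0)
# A hand-stapled receipt: c records its parents and the op that made it
c = Value(a.data * b.data, _children=(a, b), _op='*')
print(c)      # Value(data=-6.0000, grad=0.0000)
print(c._op)  # *
```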
Right now our Values can't do any math. Let's fix that.
Teaching Arithmetic
Python's operator overloading lets us define what +, *, and other operators do for our Value class. Each operator does three things: compute the result, link it to its parents, and store a _backward closure that knows the local gradient rule.
Let's start with addition and multiplication:
```python
# inside the Value class
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            self.grad += out.grad   # d(a+b)/da = 1, so gradient passes through
            other.grad += out.grad  # d(a+b)/db = 1, same thing
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out
```
Notice the pattern: each _backward function applies the local gradient rule from calculus and multiplies by out.grad (the upstream gradient flowing in from later in the computation). This is the chain rule in action.
The += is critical here, not =. If a Value is used in more than one operation, gradients arrive from multiple paths and must be summed. This is the multivariate chain rule — each path through the graph contributes independently.
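To see the accumulation in action, here is a stripped-down sketch containing only the addition logic (repeated so it runs standalone), with one Value feeding the same sum through two paths:

```python
class Value:
    # minimal sketch: just enough of the class to show gradient accumulation
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad   # with plain '=', one path would overwrite the other
            other.grad += out.grad
        out._backward = _backward
        return out

a = Value(3.0)
b = a + a        # a contributes through two paths
b.grad = 1.0     # seed the output gradient by hand
b._backward()
print(a.grad)    # 2.0: one unit of gradient arrives from each path
```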
The multiplication gate has a beautiful property: it swaps values. The gradient for self is other.data, and vice versa. If you multiply a tiny number by a huge number, the tiny number gets a huge gradient (it's very sensitive) and the huge number gets a tiny gradient.
We need a few more operations to build a neural network. Power, negation, subtraction, division, and the tanh activation function:
```python
import math  # at the top of the file

# inside the Value class
    def __pow__(self, other):
        assert isinstance(other, (int, float))  # only constant exponents
        out = Value(self.data ** other, (self,), f'**{other}')

        def _backward():
            self.grad += (other * self.data ** (other - 1)) * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')

        def _backward():
            self.grad += (1 - t ** 2) * out.grad
        out._backward = _backward
        return out

    # Convenience methods — these build on the operators above
    def __neg__(self):
        return self * -1

    def __sub__(self, other):
        return self + (-other)

    def __truediv__(self, other):
        return self * other ** -1

    # Handle cases like 2.0 + Value(3.0), where Python tries float.__add__ first
    def __radd__(self, other):
        return self + other

    def __rmul__(self, other):
        return self * other
```
The tanh gradient has an elegant property: 1 - t² where t is the output value. The gradient is strongest when the output is near zero (nearly linear behavior) and weakest when the output is near ±1 (saturation). This is the origin of the vanishing gradient problem — stack many tanh layers and gradients shrink exponentially.
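You can watch this shrinkage with plain math. The sketch below chains tanh ten times (an idealized stack with no weights) and multiplies the local 1 - t² factors together, exactly as the chain rule would:

```python
import math

x = 0.9      # starting activation
grad = 1.0   # running product of local tanh gradients
for layer in range(10):
    x = math.tanh(x)
    grad *= 1 - x ** 2  # local gradient of this tanh "layer"
print(grad)  # about 0.055: roughly an 18x shrink in just 10 layers
```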
Now we can write normal-looking math and a computation graph silently builds itself behind the scenes:
```python
x1 = Value(2.0)
w1 = Value(-3.0)
x2 = Value(0.0)
w2 = Value(1.0)
b = Value(6.88137)  # bias chosen so the output lands at a tidy 0.7071

# A single neuron: weighted sum + tanh activation
n = x1*w1 + x2*w2 + b  # graph building silently...
o = n.tanh()
print(o)  # Value(data=0.7071, grad=0.0000)
```
The output is computed, and hiding behind o is a complete graph of every operation that produced it. Now we need to walk it backwards.
The Backward Pass
The backward pass answers one question: how sensitive is the output to each input? Or in calculus terms: what is ∂o/∂x for every value x in the graph?
The algorithm is beautifully simple:
- Set the output's gradient to 1.0 (the output is perfectly sensitive to itself: ∂o/∂o = 1)
- Walk through every node in reverse topological order (output first, inputs last)
- At each node, call its _backward() function to push gradients to its parents
Why topological order? Because a node can't propagate its gradient until it has received all gradient contributions from its children. Processing output-to-input guarantees this.
```python
# inside the Value class
    def backward(self):
        # Build a topological order via depth-first search
        topo = []
        visited = set()

        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for parent in v._prev:
                    build_topo(parent)
                topo.append(v)

        build_topo(self)

        # Seed the output gradient
        self.grad = 1.0

        # Walk in reverse: output → inputs
        for v in reversed(topo):
            v._backward()
```
The DFS visits all parents before appending the current node, so reversing the list gives us the correct output-to-input order. Let's run it on our single neuron:
```python
o.backward()
print(f"x1.grad = {x1.grad:.4f}")  # -1.5000
print(f"w1.grad = {w1.grad:.4f}")  #  1.0000
print(f"x2.grad = {x2.grad:.4f}")  #  0.5000
print(f"w2.grad = {w2.grad:.4f}")  #  0.0000
print(f"b.grad  = {b.grad:.4f}")   #  0.5000
```
Read these gradients as recommendations: "To increase the output, decrease x1 (gradient is negative), increase x2 (gradient is positive), and increase b (gradient is positive)." The gradient for w2 is zero because x2 is zero — changing a weight on a zero input has no effect.
Let's verify with a concrete example. The gradient for x1 is −1.5. This means: if we nudge x1 from 2.0 to 2.001, the output should change by approximately −1.5 × 0.001 = −0.0015. You can verify this numerically — the autograd engine gets it exactly right.
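That nudge test can be run directly with plain floats, independent of the autograd engine (the bias is written with a couple of extra decimals so the numbers line up with the printed gradients):

```python
import math

def neuron(x1, x2=0.0, w1=-3.0, w2=1.0, b=6.88137):
    # the same single neuron, recomputed with plain floats
    return math.tanh(x1 * w1 + x2 * w2 + b)

h = 0.001
numeric = (neuron(2.0 + h) - neuron(2.0)) / h
print(numeric)  # about -1.5, matching x1.grad (forward differences carry a small O(h) error)
```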
Try It: Computation Graph Visualizer
Interactive: Gradient Flow in a Single Neuron
This visualizes o = tanh(x1·w1 + x2·w2 + b) with fixed weights w1=−3, w2=1, b=6.88.
Drag the sliders to change inputs and watch how gradients flow backward through the graph.
Notice how changing x1 affects gradients throughout the graph — particularly how the multiplication gate swaps values (x1's gradient depends on w1's data, and vice versa). When x2 is zero, w2's gradient vanishes completely — you can't learn a weight when its input contributes nothing.
From Values to Neurons
With our autograd engine complete, building a neural network is surprisingly straightforward. A neuron is just a weighted sum plus a bias, passed through an activation function — exactly the expression we've been working with:
```python
import random

class Neuron:
    def __init__(self, n_inputs):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(n_inputs)]
        self.b = Value(0.0)

    def __call__(self, x):
        # w1*x1 + w2*x2 + ... + b, then squash with tanh
        activation = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return activation.tanh()

    def parameters(self):
        return self.w + [self.b]
```
A Layer is a row of neurons that all take the same input. An MLP (multi-layer perceptron) stacks layers, feeding each layer's output into the next:
```python
class Layer:
    def __init__(self, n_inputs, n_outputs):
        self.neurons = [Neuron(n_inputs) for _ in range(n_outputs)]

    def __call__(self, x):
        out = [neuron(x) for neuron in self.neurons]
        return out[0] if len(out) == 1 else out

    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]

class MLP:
    def __init__(self, n_inputs, layer_sizes):
        sizes = [n_inputs] + layer_sizes
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(layer_sizes))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
```
Every weight and bias is a Value. Every forward pass builds a fresh computation graph. Every call to .backward() computes gradients for the entire network in one pass.
```python
model = MLP(3, [4, 4, 1])
print(f"Number of parameters: {len(model.parameters())}")
# Layer 1: 4 neurons × (3 weights + 1 bias) = 16
# Layer 2: 4 neurons × (4 weights + 1 bias) = 20
# Output:  1 neuron  × (4 weights + 1 bias) = 5
# Total: 41 parameters
```
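As a sanity check, the same count falls out of a one-line formula: each neuron owns one weight per input plus one bias.

```python
sizes = [3, 4, 4, 1]  # inputs, two hidden layers of 4, one output
n_params = sum((sizes[i] + 1) * sizes[i + 1] for i in range(len(sizes) - 1))
print(n_params)  # 41
```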
Training: Gradient Descent in Action
Let's train our network on a tiny dataset. We have 4 input-output pairs and we want the network to learn the mapping:
```python
# Training data: 4 examples with 3 features each
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]  # desired outputs
```
The training loop has five steps: forward pass, compute loss, zero gradients, backward pass, and parameter update. The gradient-zeroing in step 3 is critical: skip it and gradients accumulate across iterations, producing nonsense updates:
```python
for step in range(50):
    # 1. Forward pass — make predictions
    predictions = [model(x) for x in xs]

    # 2. Loss — sum of squared errors
    loss = sum((pred - target) ** 2 for pred, target in zip(predictions, ys))

    # 3. Zero all gradients (CRITICAL — backward() uses +=)
    for p in model.parameters():
        p.grad = 0.0

    # 4. Backward pass — compute all gradients
    loss.backward()

    # 5. Update — nudge each parameter in the direction that reduces loss
    learning_rate = 0.05
    for p in model.parameters():
        p.data -= learning_rate * p.grad

    if step % 10 == 0:
        print(f"Step {step:3d} | Loss: {loss.data:.6f}")
```
Running this produces something like:
```
Step   0 | Loss: 4.836271
Step  10 | Loss: 0.036174
Step  20 | Loss: 0.009653
Step  30 | Loss: 0.004763
Step  40 | Loss: 0.002930
```
The loss plummets from ~4.8 to ~0.003 in 50 steps. Let's check the predictions after training:
```python
[model(x).data for x in xs]
# [0.9672, -0.9748, -0.9686, 0.9653]
# Target: [1.0, -1.0, -1.0, 1.0] — very close!
```
Forty-one parameters, fifty gradient descent steps, and our network learned a nonlinear decision boundary from scratch — using an autograd engine we built ourselves.
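For convenience, everything above collapses into a single runnable script. This is a condensed restatement of the code from this post, not new machinery; the only addition is a fixed random seed so runs are reproducible:

```python
import math
import random

class Value:
    # scalar with autograd, condensed from the sections above
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
    def __pow__(self, other):
        out = Value(self.data ** other, (self,), f'**{other}')
        def _backward():
            self.grad += other * self.data ** (other - 1) * out.grad
        out._backward = _backward
        return out
    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad += (1 - t ** 2) * out.grad
        out._backward = _backward
        return out
    def __neg__(self): return self * -1
    def __sub__(self, other): return self + (-other)
    def __radd__(self, other): return self + other
    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for parent in v._prev:
                    build(parent)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

class Neuron:
    def __init__(self, n_inputs):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(n_inputs)]
        self.b = Value(0.0)
    def __call__(self, x):
        return sum((wi * xi for wi, xi in zip(self.w, x)), self.b).tanh()
    def parameters(self):
        return self.w + [self.b]

class Layer:
    def __init__(self, n_inputs, n_outputs):
        self.neurons = [Neuron(n_inputs) for _ in range(n_outputs)]
    def __call__(self, x):
        out = [n(x) for n in self.neurons]
        return out[0] if len(out) == 1 else out
    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]

class MLP:
    def __init__(self, n_inputs, layer_sizes):
        sizes = [n_inputs] + layer_sizes
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(layer_sizes))]
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

random.seed(1337)  # fixed seed, so the run is reproducible
model = MLP(3, [4, 4, 1])
xs = [[2.0, 3.0, -1.0], [3.0, -1.0, 0.5], [0.5, 1.0, 1.0], [1.0, 1.0, -1.0]]
ys = [1.0, -1.0, -1.0, 1.0]

first_loss = None
for step in range(50):
    loss = sum((model(x) - y) ** 2 for x, y in zip(xs, ys))
    if first_loss is None:
        first_loss = loss.data
    for p in model.parameters():
        p.grad = 0.0
    loss.backward()
    for p in model.parameters():
        p.data -= 0.05 * p.grad
print(first_loss, '->', loss.data)
```

With a different seed the exact numbers will differ, but the loss should fall steeply, just as in the trace above.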
From 100 Lines to PyTorch
We built a complete autograd engine and a working neural network in roughly 100 lines of Python. The core insight is this: by giving numbers memory (tracking their parents and the operations that created them) and a backward function (the local gradient rule), we can differentiate arbitrarily complex expressions automatically.
The leap from micrograd to PyTorch is one of efficiency, not concepts. PyTorch operates on tensors instead of scalars, executes on GPUs instead of CPUs, and uses fused CUDA kernels instead of Python closures. But the mathematical machinery — the forward pass building a graph, the backward pass propagating gradients via the chain rule, the topological ordering — is identical.
Every time PyTorch computes a gradient, it's doing exactly what our little Value class does. Just faster, and with much bigger numbers.
If you want to go deeper, try extending the engine: add relu() and sigmoid() activations, implement a softmax output layer, or build a small classifier for real data. The beauty of understanding the fundamentals is that everything built on top becomes transparent.
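As a nudge in that direction, here is one way relu() could look. It follows the same pattern as tanh; the scaffold class is repeated only so the sketch runs standalone:

```python
class Value:
    # minimal scaffold so the sketch runs on its own; in the full engine,
    # relu() is just another method on the Value class from this post
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def relu(self):
        out = Value(max(0.0, self.data), (self,), 'ReLU')
        def _backward():
            # gradient passes through only where the input was positive
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward
        return out

print(Value(-2.0).relu().data)  # 0.0: negative inputs are clipped
print(Value(3.0).relu().data)   # 3.0: positive inputs pass through
```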
References & Further Reading
- Andrej Karpathy — micrograd — The tiny autograd engine that inspired this post. 94 lines of pure Python magic.
- Andrej Karpathy — "The spelled-out intro to neural networks and backpropagation" — A 2.5-hour video building micrograd from scratch, with incredible clarity.
- Chris Olah — "Calculus on Computational Graphs: Backpropagation" — The best visual explanation of how backprop factors the chain rule across a graph.
- Stanford CS231n — Backpropagation, Intuitions — Gate-level intuitions (add gates distribute, multiply gates swap, max gates route).
- Rufflewind — Reverse-Mode Automatic Differentiation: A Tutorial — Clear comparison of forward-mode vs. reverse-mode autodiff.