
Symbolic Regression from Scratch

1. Why Discover Equations?

Neural networks are remarkable approximators. Give one enough parameters and training data, and it will learn to map inputs to outputs with eerie precision. But ask it what it learned, and you get millions of floating-point weights staring back at you. The function is there, buried in matrix multiplications — but you can't read it.

What if your model could hand you the actual equation?

That's the promise of symbolic regression: instead of fitting a fixed-architecture model to data, we search the space of mathematical expressions to find the formula that best explains the observations. The output isn't a prediction — it's y = 2.3x² + sin(x).

The payoff is interpretability: a formula can be read, checked against domain knowledge, and extrapolated with far more confidence than a black-box model.

The classic tool for this job is genetic programming (GP) — an evolutionary algorithm that breeds, mutates, and selects mathematical expressions encoded as trees. In this post, we'll build every piece from scratch: expression trees, population initialization, crossover and mutation operators, fitness with parsimony pressure, constant optimization, and the full GP loop. Then we'll watch it rediscover physics equations from noisy data.

2. Encoding Math as Trees

Every mathematical expression has a natural tree structure. Consider x² + sin(x): the root is +, with two children — * (whose children are both x) and sin (whose child is x). Internal nodes are operators (functions with known arity), and leaves are terminals (variables or constants).

We need two ingredient sets: a function set of operators with known arity (here +, -, *, /, and sin) and a terminal set of leaves (the variable x plus numeric constants).

One crucial detail: division by zero will crash our evolution. We use protected division, which swaps a near-zero denominator for 1.0 before dividing. This keeps every tree evaluable, no matter how wild the mutations get.

import numpy as np

class Node:
    """A single node in an expression tree."""
    def __init__(self, value, children=None):
        self.value = value          # operator string or float/str terminal
        self.children = children or []

    def evaluate(self, x):
        """Evaluate the tree for input value(s) x (numpy array)."""
        if not self.children:
            # Terminal: variable or constant
            if self.value == 'x':
                return x
            return np.full_like(x, float(self.value), dtype=float)

        args = [c.evaluate(x) for c in self.children]
        if self.value == '+':   return args[0] + args[1]
        if self.value == '-':   return args[0] - args[1]
        if self.value == '*':   return args[0] * args[1]
        if self.value == '/':   # protected division
            safe = np.where(np.abs(args[1]) < 1e-10, 1.0, args[1])
            return args[0] / safe
        if self.value == 'sin': return np.sin(args[0])
        return np.zeros_like(x)

    def size(self):
        return 1 + sum(c.size() for c in self.children)

    def depth(self):
        if not self.children:
            return 0
        return 1 + max(c.depth() for c in self.children)

    def __str__(self):
        if not self.children:
            if isinstance(self.value, float):
                return f"{self.value:.2f}"
            return str(self.value)
        if self.value in ('+', '-', '*', '/'):
            return f"({self.children[0]} {self.value} {self.children[1]})"
        return f"{self.value}({self.children[0]})"

Each Node stores an operator string or a terminal value, plus a list of children. The evaluate method recurses down the tree, computing the expression for an entire NumPy array at once — vectorized evaluation is essential when fitness calls must be fast across hundreds of individuals per generation.
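To make the recursion concrete, here is the same evaluation scheme in miniature, using nested tuples in place of Node objects. This is a standalone illustrative sketch, not part of the engine; OPS and eval_tree are names invented for this example.

```python
import numpy as np

# Operator table for the sketch: ('+', left, right) is an operator node,
# 'x' and bare floats are terminals.
OPS = {
    '+':   lambda a, b: a + b,
    '*':   lambda a, b: a * b,
    'sin': lambda a: np.sin(a),
}

def eval_tree(tree, x):
    if isinstance(tree, tuple):           # operator node: recurse on children
        op, *kids = tree
        return OPS[op](*(eval_tree(k, x) for k in kids))
    if tree == 'x':                       # variable terminal
        return x
    return np.full_like(x, float(tree))  # constant terminal, broadcast over x

# x^2 + sin(x), the running example from the text
expr = ('+', ('*', 'x', 'x'), ('sin', 'x'))
x = np.linspace(-2, 2, 5)
print(eval_tree(expr, x))   # same values as x**2 + np.sin(x)
```

The Node class above does exactly this, with the bookkeeping (size, depth, printing) that evolution needs layered on top.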

3. Growing a Population

We need to create an initial population of random expression trees. There are two basic strategies for growing a tree to a given depth: Full, which places an operator at every node until the depth limit so every branch reaches exactly that depth, and Grow, which may stop a branch early with a terminal, producing irregular, asymmetric trees.

John Koza's Ramped Half-and-Half (1992) combines both: split the population across depths 2 through max_depth, and at each depth, half use Full and half use Grow. This seeds the evolution with diverse tree shapes and sizes — critical for avoiding premature convergence.

FUNCTIONS = {'+': 2, '-': 2, '*': 2, '/': 2, 'sin': 1}
TERMINALS = ['x']  # plus ephemeral random constants

def random_terminal(rng):
    if rng.random() < 0.5:
        return Node('x')
    return Node(round(rng.uniform(-3, 3), 2))

def grow_tree(depth, max_depth, rng, full=False):
    """Recursively grow a random expression tree."""
    if depth == max_depth or (not full and depth > 0 and rng.random() < 0.3):
        return random_terminal(rng)
    func = rng.choice(list(FUNCTIONS.keys()))
    arity = FUNCTIONS[func]
    children = [grow_tree(depth + 1, max_depth, rng, full) for _ in range(arity)]
    return Node(func, children)

def init_population(pop_size, max_depth, rng):
    """Ramped Half-and-Half initialization."""
    population = []
    depths = range(2, max_depth + 1)
    for i in range(pop_size):
        d = depths[i % len(depths)]
        use_full = (i // len(depths)) % 2 == 0
        population.append(grow_tree(0, d, rng, full=use_full))
    return population

Notice the 0.3 termination probability in grow_tree — this controls how aggressively Grow trees prune themselves. Too high and you get mostly trivial single-node trees; too low and Grow behaves like Full.
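The effect is easy to measure. The sketch below is a standalone re-implementation of the Grow strategy on nested tuples (grow, size, and p_term are illustrative names for this example); it grows 500 trees at several termination probabilities and reports the mean tree size:

```python
import numpy as np

FUNCS = {'+': 2, '-': 2, '*': 2, '/': 2, 'sin': 1}

def grow(depth, max_depth, rng, p_term):
    """Grow-style random tree as nested tuples; same stopping rule as grow_tree."""
    if depth == max_depth or (depth > 0 and rng.random() < p_term):
        return 'x' if rng.random() < 0.5 else round(rng.uniform(-3, 3), 2)
    f = rng.choice(list(FUNCS))
    return (f,) + tuple(grow(depth + 1, max_depth, rng, p_term)
                        for _ in range(FUNCS[f]))

def size(t):
    """Node count of a tuple tree (terminals count as 1)."""
    return 1 if not isinstance(t, tuple) else 1 + sum(size(k) for k in t[1:])

rng = np.random.RandomState(0)
mean_size = {}
for p in (0.1, 0.3, 0.6):
    sizes = [size(grow(0, 6, rng, p)) for _ in range(500)]
    mean_size[p] = float(np.mean(sizes))
    print(f"p_term={p}: mean tree size = {mean_size[p]:.1f}")
```

Higher termination probability means smaller trees, which is exactly the knob described above.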

4. Evolution — Selection, Crossover, and Mutation

With a population of trees in hand, we evolve them through the same cycle that drives all genetic algorithms: select the fittest, recombine their genetic material, and mutate for fresh variation. But expression trees need their own specialized operators.

Tournament selection picks k individuals at random and keeps the best. The tournament size k controls selection pressure: k = 2 is gentle, k = 7 is aggressive.
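The pressure difference is easy to see numerically. Here is a standalone simulation (illustrative setup, not part of the engine): 100 individuals with fitnesses 0 through 99, where a size-k tournament keeps the minimum of k uniform draws, just as tournament_select does:

```python
import numpy as np

rng = np.random.RandomState(0)
fitnesses = np.arange(100.0)   # individual i has fitness i (lower is better)

mean_winner = {}
for k in (2, 7):
    # Each tournament: draw k distinct individuals, keep the best (minimum)
    wins = [fitnesses[rng.choice(100, size=k, replace=False)].min()
            for _ in range(2000)]
    mean_winner[k] = float(np.mean(wins))
    print(f"k={k}: mean winning fitness = {mean_winner[k]:.1f}")
```

With k = 2 the average winner sits around the 33rd percentile; with k = 7 it is pulled down near the 12th, which is why large tournaments converge fast but lose diversity.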

Subtree crossover is the dominant operator (~90% of offspring). Pick a random crossover point in each parent, then graft the second parent's subtree into the first at that point. The child inherits structure from both parents — a powerful recombination that can assemble complex expressions from simpler building blocks.

Point mutation replaces a single node with a compatible alternative: a function with the same arity, or a new terminal for leaves. Hoist mutation replaces a subtree with one of its own descendants — a shrinking operator that fights bloat.

import copy

def get_random_node(tree, rng):
    """Return a random node and its parent (None if root)."""
    nodes = []
    def collect(node, parent, idx):
        nodes.append((node, parent, idx))
        for i, c in enumerate(node.children):
            collect(c, node, i)
    collect(tree, None, 0)
    return nodes[rng.randint(len(nodes))]

def tournament_select(population, fitnesses, k, rng):
    idxs = rng.choice(len(population), size=k, replace=False)
    best = idxs[0]
    for i in idxs[1:]:
        if fitnesses[i] < fitnesses[best]:
            best = i
    return copy.deepcopy(population[best])

def subtree_crossover(p1, p2, rng, max_depth=8):
    """Replace a random subtree of p1 with a random subtree of p2."""
    c1 = copy.deepcopy(p1)
    _, par1, idx1 = get_random_node(c1, rng)
    donor, _, _ = get_random_node(p2, rng)
    donor = copy.deepcopy(donor)      # don't share nodes with p2
    if par1 is None:
        c1 = donor                    # crossover point was the root
    else:
        par1.children[idx1] = donor
    if c1.depth() > max_depth:
        return copy.deepcopy(p1)      # reject if too deep
    return c1

def point_mutation(tree, rng):
    tree = copy.deepcopy(tree)
    node, parent, idx = get_random_node(tree, rng)
    if not node.children:
        new = random_terminal(rng)
        if parent is None:
            return new
        parent.children[idx] = new
    else:
        arity = len(node.children)
        candidates = [f for f, a in FUNCTIONS.items() if a == arity]
        node.value = rng.choice(candidates)
    return tree

def hoist_mutation(tree, rng):
    tree = copy.deepcopy(tree)
    node, parent, idx = get_random_node(tree, rng)
    if node.children:
        child_idx = rng.randint(len(node.children))
        if parent is None:
            return node.children[child_idx]
        parent.children[idx] = node.children[child_idx]
    return tree

The depth limit in subtree_crossover is our first line of defense against bloat — the tendency for trees to grow unboundedly. If a crossover produces a tree deeper than the limit, we reject it and return the original parent unchanged.

5. Fitness — Trading Accuracy for Simplicity

The simplest fitness measure is mean squared error (MSE) between the tree's predictions and the target data. But raw MSE creates a perverse incentive: trees grow larger and larger, accumulating useless "intron" branches that don't hurt accuracy but add complexity. This is the bloat problem, and it's the central challenge of genetic programming.

The classic countermeasure is parsimony pressure — adding a penalty proportional to tree size:

fitness = MSE + α × tree_size

The coefficient α controls how much we value simplicity over accuracy. Too small and bloat runs unchecked; too large and the population collapses to trivially small trees like x or 1.0.
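A quick numeric illustration shows the tradeoff; the MSE and size values below are made up, chosen only to show how the winner flips as α grows:

```python
# Two hypothetical candidates competing under fitness = MSE + alpha * size.
big   = {'mse': 0.010, 'size': 25}   # accurate but bloated
small = {'mse': 0.015, 'size': 5}    # slightly worse, far simpler

for alpha in (0.0001, 0.001, 0.01):
    f_big   = big['mse']   + alpha * big['size']
    f_small = small['mse'] + alpha * small['size']
    winner = 'big' if f_big < f_small else 'small'
    print(f"alpha={alpha}: big={f_big:.4f}, small={f_small:.4f} -> {winner}")
```

At α = 0.0001 the bloated tree wins on raw accuracy; by α = 0.001 the size penalty has already flipped the ranking in favor of the simpler tree.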

A more principled approach tracks the Pareto front of non-dominated solutions on the accuracy-vs-complexity plane. A solution is Pareto-optimal if no other solution is both more accurate and simpler. This lets the GP maintain a diverse set of candidate expressions across the complexity spectrum.

def evaluate_fitness(tree, x_data, y_data, alpha=0.001):
    """MSE + parsimony pressure. Lower is better."""
    try:
        y_pred = tree.evaluate(x_data)
        if np.any(np.isnan(y_pred)) or np.any(np.isinf(y_pred)):
            return 1e12
        mse = np.mean((y_pred - y_data) ** 2)
    except Exception:
        return 1e12
    return mse + alpha * tree.size()

def pareto_front(population, fitnesses, alpha=0.001):
    """Return indices of Pareto-optimal individuals (MSE vs size)."""
    sizes = [tree.size() for tree in population]
    # Recover raw MSE by subtracting the parsimony term added in fitness
    mses = [fit - alpha * s for fit, s in zip(fitnesses, sizes)]
    front = []
    for i in range(len(population)):
        dominated = any(
            mses[j] <= mses[i] and sizes[j] <= sizes[i]
            and (mses[j] < mses[i] or sizes[j] < sizes[i])
            for j in range(len(population)) if j != i)
        if not dominated:
            front.append(i)
    return front

The try/except block and NaN/Inf checks are essential safety nets. During evolution, crossover and mutation will produce pathological trees — division chains that overflow, deeply nested functions that return garbage. We assign these a catastrophic fitness of 10¹² so they're immediately selected against.

6. Taming the Constants Problem

Here's a subtlety that trips up naive implementations: suppose the true relationship is y = 3.14 * x². Our GP might discover the structure c * x² early on, but the ephemeral random constant c was initialized to 1.73. Getting from 1.73 to 3.14 by random point mutations — one digit at a time — is painfully slow.

Linear scaling (Keijzer, 2003) elegantly sidesteps this: instead of evaluating f(x) directly, we fit ŷ = a · f(x) + b using ordinary least squares. This lets evolution focus on discovering the right structure while a cheap closed-form step handles the scale and offset.

def linear_scale_fitness(tree, x_data, y_data, alpha=0.001):
    """Evaluate fitness with linear scaling: y_hat = a * f(x) + b."""
    try:
        f = tree.evaluate(x_data)
        if np.any(np.isnan(f)) or np.any(np.isinf(f)):
            return 1e12
        # Closed-form least squares for a, b
        f_mean, y_mean = np.mean(f), np.mean(y_data)
        cov = np.mean((f - f_mean) * (y_data - y_mean))
        var_f = np.mean((f - f_mean) ** 2)
        if var_f < 1e-12:
            return 1e12
        a = cov / var_f
        b = y_mean - a * f_mean
        y_pred = a * f + b
        mse = np.mean((y_pred - y_data) ** 2)
    except Exception:
        return 1e12
    return mse + alpha * tree.size()

This is remarkably effective. A tree that discovers x * x (computing x² with a scale of 1) can now fit y = 3.14x² + 0.5 perfectly, because the linear scaling layer handles the 3.14 and 0.5 automatically. Modern systems like PySR go further, running full gradient-based optimization (BFGS) on all numeric constants after each generation.
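We can check that claim directly. The standalone sketch below applies the same closed-form a, b solution to the raw structure f(x) = x · x against the target y = 3.14x² + 0.5:

```python
import numpy as np

x = np.linspace(-2, 2, 50)
y = 3.14 * x**2 + 0.5
f = x * x                    # what a tree computing x*x would output

# Closed-form least squares for y ≈ a*f + b (same algebra as the fitness code)
a = np.mean((f - f.mean()) * (y - y.mean())) / np.mean((f - f.mean())**2)
b = y.mean() - a * f.mean()
mse = np.mean((a * f + b - y)**2)
print(f"a = {a:.4f}, b = {b:.4f}, mse = {mse:.2e}")   # a ≈ 3.14, b ≈ 0.5
```

The recovered slope and intercept are exactly 3.14 and 0.5, and the MSE is zero up to floating-point error: evolution only had to find x * x.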

7. The Full GP Loop

Now we assemble everything into a complete symbolic regression engine. The loop is straightforward: evaluate the population, select parents, create offspring through crossover and mutation, and repeat. Elitism ensures we never lose our best solution.

def symbolic_regression(x_data, y_data, pop_size=300, generations=50,
                        max_depth=6, tournament_k=5, seed=42):
    rng = np.random.RandomState(seed)
    population = init_population(pop_size, max_depth, rng)

    best_tree, best_fit = None, 1e12
    for gen in range(generations):
        # Evaluate
        fitnesses = [linear_scale_fitness(t, x_data, y_data)
                     for t in population]

        # Track best
        gen_best_idx = int(np.argmin(fitnesses))
        if fitnesses[gen_best_idx] < best_fit:
            best_fit = fitnesses[gen_best_idx]
            best_tree = copy.deepcopy(population[gen_best_idx])

        # Create next generation
        new_pop = [copy.deepcopy(best_tree)]  # elitism
        while len(new_pop) < pop_size:
            r = rng.random()
            if r < 0.9:  # 90% crossover
                p1 = tournament_select(population, fitnesses, tournament_k, rng)
                p2 = tournament_select(population, fitnesses, tournament_k, rng)
                child = subtree_crossover(p1, p2, rng, max_depth)
            elif r < 0.95:  # 5% point mutation
                p = tournament_select(population, fitnesses, tournament_k, rng)
                child = point_mutation(p, rng)
            else:  # 5% hoist mutation
                p = tournament_select(population, fitnesses, tournament_k, rng)
                child = hoist_mutation(p, rng)
            new_pop.append(child)
        population = new_pop

        if gen % 10 == 0:
            print(f"Gen {gen}: best fitness = {best_fit:.6f}, "
                  f"expr = {best_tree}, size = {best_tree.size()}")

    return best_tree, best_fit

The operator split (90% crossover, 5% point mutation, 5% hoist mutation) is a common GP default; Koza's original runs were similarly crossover-dominated. Crossover dominates because it's the main source of structural innovation — combining useful subtrees from different parents. Mutation provides fine-tuning and diversity maintenance.

Run this on y = x² + sin(x) with 300 individuals for 50 generations, and the GP will typically recover the exact structure within 20-30 generations. The linear scaling handles any remaining constant adjustments.

8. Modern Symbolic Regression

While our from-scratch GP captures the essential ideas, the state of the art has advanced significantly. Here are the key developments:

PySR (Cranmer, 2023) uses a multi-population island model with a Julia backend for speed. Islands evolve independently, periodically exchanging their best individuals — this parallelism explores more of the expression space. It also applies algebraic simplification (combining like terms, constant folding) between generations, keeping expressions clean.

AI Feynman (Udrescu & Tegmark, 2020) takes a physics-inspired approach: before running SR, it checks for symmetries (is the function even? periodic?), separability (does it factor as f(x)g(y)?), and dimensional constraints. These priors dramatically shrink the search space for scientific equations.

Neural-guided approaches flip the script entirely. Instead of evolving trees, they train neural networks to propose equation structures, for example by using reinforcement learning to train a recurrent generator of expression sequences, or by pretraining transformers that decode an equation directly from sampled data points.

There's also a beautiful connection to Kolmogorov-Arnold Networks (KANs): where GP discovers equations bottom-up by evolving tree structures, KANs learn equations top-down by fitting splines on graph edges, then matching those splines to a function library. If you've read our KANs post, you'll recognize this as the symbolic interpretation step — complementary approaches to the same goal.

Try It: Expression Tree Evolver

Watch genetic programming evolve an expression tree to fit a target function. The left panel shows the best tree's structure; the right shows how well it matches the data.


Try It: Rediscover Physics

Can evolution rediscover fundamental physics equations from noisy data? Pick a hidden law and watch the GP search for the formula. The Pareto front shows the accuracy-vs-complexity tradeoff.


9. Conclusion

Symbolic regression sits at a unique intersection of machine learning and scientific discovery. Where neural networks give you predictions, symbolic regression gives you understanding — the actual mathematical relationship hiding in your data.

We built every piece from scratch: expression trees that encode math as recursive structures, Ramped Half-and-Half initialization for population diversity, GP operators that breed and mutate formulas, parsimony pressure that fights bloat, and linear scaling that separates structure discovery from constant fitting. The full loop is surprisingly compact — under 200 lines of Python for a working symbolic regression engine.

The field is experiencing a renaissance. PySR makes production symbolic regression accessible. AI Feynman brings physics priors to bear. Neural approaches like transformers that generate equations are pushing the boundary of what's discoverable. And the connection to KANs shows how interpretability is becoming a first-class goal across ML.

The next time you fit a neural network to a dataset, ask yourself: is there a formula in there? Symbolic regression might find it.

References & Further Reading