
Neuroevolution from Scratch: Evolving Neural Networks Instead of Training Them

1. What If We Never Needed Gradients?

Every major deep learning framework — PyTorch, TensorFlow, JAX — is organized around one core idea: backpropagation. Compute a loss, flow gradients backward through the computation graph, nudge weights in the direction that reduces error. It works so well that it's easy to forget there's a completely different approach sitting right there, predating gradient descent by billions of years.

Evolution.

What if instead of computing gradients, you just... tried a bunch of different neural networks, kept the ones that worked, and bred the next generation from the survivors? No chain rule. No differentiable loss function. No computation graph. Just mutation, selection, and survival of the fittest.

This is neuroevolution: applying evolutionary algorithms to evolve neural networks. And it's not a toy idea. In 2017, OpenAI showed that evolution strategies could match reinforcement learning on hard robotics tasks. That same year, Uber AI evolved 4-million-parameter convolutional networks to play Atari games — beating DQN on nearly half the games tested.

In this post, we'll build neuroevolution from scratch. We'll start with the simplest approach (evolving just the weights), progress to NEAT (evolving the network topology alongside weights), peek at HyperNEAT's clever indirect encoding, and finish with evolution strategies at industrial scale. Along the way, you'll get working Python code, two interactive demos, and a clear picture of when evolution beats gradients — and when it doesn't.

If you haven't read our Genetic Algorithms from Scratch post, it covers the GA foundations we'll build on here. But don't worry — we'll recap the essentials as we go.

2. Weight Evolution — The Simplest Neuroevolution

The simplest form of neuroevolution doesn't touch the network architecture at all. You pick a fixed topology — say, a 2-layer MLP with 4 inputs, 8 hidden neurons, and 1 output — and flatten every weight and bias into a single vector. That vector is your chromosome. Each individual in your population is one such weight vector, and fitness is measured by how well the network performs on the task.

The evolutionary loop is the standard one: initialize a random population, evaluate fitness, select the best, create offspring through crossover and mutation, repeat. The only twist is that our chromosomes are vectors of real numbers (not binary strings), so we use Gaussian mutation (add small random noise) instead of bit flips, and uniform crossover (randomly pick each weight from one parent or the other).

Let's evolve a small MLP to balance a CartPole — the classic control task where you move a cart left or right to keep a pole balanced. The network takes 4 observations (cart position, cart velocity, pole angle, pole angular velocity) and outputs a single value: positive means push right, negative means push left.

import numpy as np

def mlp_forward(weights, x):
    """2-layer MLP: 4 inputs -> 8 hidden (tanh) -> 1 output."""
    W1 = weights[:32].reshape(8, 4)
    b1 = weights[32:40]
    W2 = weights[40:48].reshape(1, 8)
    b2 = weights[48:49]
    h = np.tanh(W1 @ x + b1)
    return (W2 @ h + b2)[0]

def evaluate(weights, env, episodes=3):
    """Average reward across episodes."""
    total = 0
    for _ in range(episodes):
        obs, _ = env.reset()
        for t in range(500):
            action = 1 if mlp_forward(weights, obs) > 0 else 0
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            if terminated or truncated:
                break
    return total / episodes

def evolve_cartpole(pop_size=50, generations=40, sigma=0.1):
    """Evolve CartPole weights with a simple GA."""
    import gym
    env = gym.make('CartPole-v1')
    n_params = 4*8 + 8 + 8*1 + 1  # 49 parameters

    pop = [np.random.randn(n_params) * 0.5 for _ in range(pop_size)]

    for gen in range(generations):
        fits = [evaluate(w, env) for w in pop]
        best = np.argmax(fits)
        print(f"Gen {gen:3d} | Best: {fits[best]:.0f} | Avg: {np.mean(fits):.0f}")

        # Elitism: keep the best individual
        next_pop = [pop[best].copy()]
        while len(next_pop) < pop_size:
            # Tournament selection (k=3)
            idx = np.random.choice(pop_size, 3, replace=False)
            p1 = pop[idx[np.argmax([fits[i] for i in idx])]]
            idx = np.random.choice(pop_size, 3, replace=False)
            p2 = pop[idx[np.argmax([fits[i] for i in idx])]]
            # Uniform crossover + Gaussian mutation
            mask = np.random.rand(n_params) < 0.5
            child = np.where(mask, p1, p2)
            child += np.random.randn(n_params) * sigma
            next_pop.append(child)
        pop = next_pop

    env.close()
    return pop[0]  # elite is always at index 0

# evolve_cartpole()
# Gen   0 | Best:  62 | Avg:  24
# Gen  10 | Best: 189 | Avg:  87
# Gen  20 | Best: 412 | Avg: 201
# Gen  30 | Best: 500 | Avg: 389
# Gen  39 | Best: 500 | Avg: 467

In about 40 generations, the population converges to networks that can balance the pole for the full 500 steps. No gradients were harmed in the making of this controller.

But notice the limitation: we hand-designed the architecture (4→8→1 with tanh activations). The GA is just doing continuous optimization over a 49-dimensional search space. For that, gradient descent would actually be more sample-efficient — it extracts directional information from each evaluation, while the GA treats each network as a black box. Weight evolution shines when you can't compute gradients (non-differentiable rewards, chaotic simulations) or when massive parallelism is available (evaluate thousands of candidates simultaneously).

3. NEAT — Evolving Topology AND Weights

Weight evolution left the biggest question on the table: what should the network architecture look like? How many layers? How many neurons? Which connections? In traditional deep learning, you answer these questions through trial and error, hyperparameter search, or neural architecture search. But in 2002, Kenneth Stanley and Risto Miikkulainen proposed something elegant: let evolution figure it out.

Their algorithm, NEAT (NeuroEvolution of Augmenting Topologies), starts every network as the simplest possible structure: direct connections from inputs to outputs, no hidden nodes at all. Then, over generations, structural mutations add complexity only when it helps. This "complexification" principle means NEAT searches the simplest effective topology first, adding nodes and connections only when they improve fitness.

Three innovations make NEAT work:

Innovation Numbers

When two different genomes have different structures, how do you align them for crossover? In a fixed-length chromosome, gene 17 in parent A and gene 17 in parent B represent the same weight. But in a variable-topology network, genomes have different lengths and connection sets. NEAT solves this with a global innovation counter: every time a new connection or node is created (anywhere in the population), it gets a unique innovation number. During crossover, you align genes by their innovation numbers — matching genes are inherited randomly from either parent, while excess and disjoint genes come from the fitter parent.

Speciation

A brand-new structural mutation is almost always harmful at first — the new node has random weights and hasn't been optimized yet. In a flat population, it would be eliminated immediately. NEAT protects innovations by grouping similar genomes into species and sharing fitness within each species. A new structural variant only competes against similar topologies, giving it time to optimize its weights before facing the general population.

Complexification

NEAT has two structural mutations: add a connection (wire up two previously unconnected nodes) and add a node (split an existing connection by inserting a new neuron). The add-node mutation disables the old connection and creates two new ones: one with weight 1.0 into the new node, and one with the original weight out of it. This approximately preserves the network's behavior immediately after the mutation — the new node acts as a near pass-through (only its activation function introduces a small distortion) until its weights are further evolved.

Here's a simplified NEAT genome representation with all the core operations:

import random
from dataclasses import dataclass, field

innovation_counter = 0

@dataclass
class ConnectionGene:
    in_node: int
    out_node: int
    weight: float
    enabled: bool = True
    innovation: int = 0

@dataclass
class Genome:
    node_genes: list
    connections: list = field(default_factory=list)

    def add_connection(self, in_node, out_node):
        global innovation_counter
        innovation_counter += 1
        self.connections.append(ConnectionGene(
            in_node, out_node,
            weight=random.gauss(0, 1),
            innovation=innovation_counter
        ))

    def add_node(self):
        """Split a random enabled connection with a new neuron."""
        global innovation_counter
        enabled = [c for c in self.connections if c.enabled]
        if not enabled:
            return
        conn = random.choice(enabled)
        conn.enabled = False
        new_id = max(self.node_genes) + 1
        self.node_genes.append(new_id)
        # in -> new (weight 1.0), new -> out (original weight)
        innovation_counter += 1
        self.connections.append(ConnectionGene(
            conn.in_node, new_id, 1.0,
            innovation=innovation_counter))
        innovation_counter += 1
        self.connections.append(ConnectionGene(
            new_id, conn.out_node, conn.weight,
            innovation=innovation_counter))

    def mutate_weights(self, rate=0.8, sigma=0.2):
        for c in self.connections:
            if random.random() < rate:
                c.weight += random.gauss(0, sigma)  # perturb existing weight
            else:
                c.weight = random.gauss(0, 1)  # occasional full reset

def crossover(p1, p2, fit1, fit2):
    """Align genes by innovation number, inherit from fitter parent."""
    genes1 = {c.innovation: c for c in p1.connections}
    genes2 = {c.innovation: c for c in p2.connections}
    child_conns = []
    for innov in sorted(set(genes1) | set(genes2)):
        if innov in genes1 and innov in genes2:
            chosen = random.choice([genes1[innov], genes2[innov]])
        elif innov in genes1:
            chosen = genes1[innov] if fit1 >= fit2 else None
        else:
            chosen = genes2[innov] if fit2 >= fit1 else None
        if chosen:
            child_conns.append(ConnectionGene(
                chosen.in_node, chosen.out_node,
                chosen.weight, chosen.enabled, chosen.innovation))
    nodes = set(p1.node_genes)
    for c in child_conns:
        nodes.update([c.in_node, c.out_node])
    return Genome(sorted(nodes), child_conns)

def compatibility(g1, g2, c1=1.0, c2=0.4):
    """Distance metric for speciation."""
    genes1 = {c.innovation: c for c in g1.connections}
    genes2 = {c.innovation: c for c in g2.connections}
    matching = set(genes1) & set(genes2)
    disjoint = len(set(genes1) ^ set(genes2))
    avg_w = (sum(abs(genes1[i].weight - genes2[i].weight)
             for i in matching) / max(len(matching), 1))
    N = max(len(g1.connections), len(g2.connections), 1)
    return c1 * disjoint / N + c2 * avg_w

The crossover function is where innovation numbers pay off. Matching genes (same innovation number in both parents) are randomly inherited. Disjoint and excess genes (present in only one parent) come from the fitter parent. This means two genomes with completely different topologies can still produce valid offspring — the alignment happens through history, not structure.
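The compatibility metric is what makes speciation mechanical: cluster genomes by distance, then share fitness within each cluster. Here is a minimal sketch of that step (my own helper names, with the distance function passed in as a parameter; in NEAT it would be compatibility):

```python
def speciate(population, distance, threshold=3.0):
    """Greedy clustering: each species is represented by its first
    member; a genome joins the first species whose representative is
    within `threshold`, otherwise it founds a new species."""
    species = []
    for g in population:
        for s in species:
            if distance(g, s[0]) < threshold:
                s.append(g)
                break
        else:  # no compatible species found
            species.append([g])
    return species

def shared_fitness(fitnesses, population, species):
    """Explicit fitness sharing: divide each raw fitness by its species
    size, so one big species can't take over the whole population."""
    size_of = {id(g): len(s) for s in species for g in s}
    return [f / size_of[id(g)] for f, g in zip(fitnesses, population)]
```

Full NEAT additionally carries species representatives across generations and allocates each species an offspring quota proportional to its total shared fitness; this sketch shows just the clustering and sharing steps.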

Now let's use these building blocks to evolve a network that solves XOR — the classic NEAT benchmark. XOR can't be solved by a network without hidden nodes (it's not linearly separable), so NEAT must discover at least one hidden neuron to succeed:

import numpy as np

def feed_forward(genome, inputs):
    """Evaluate a NEAT genome with proper topological ordering."""
    values = {0: inputs[0], 1: inputs[1], 2: 1.0}  # x1, x2, bias
    active = [c for c in genome.connections if c.enabled]

    # Compute depth of each node (longest path from inputs)
    depth = {0: 0, 1: 0, 2: 0}
    changed = True
    while changed:
        changed = False
        for c in active:
            if c.in_node in depth:
                d = depth[c.in_node] + 1
                if c.out_node not in depth or depth[c.out_node] < d:
                    depth[c.out_node] = d
                    changed = True

    # Process non-input nodes in depth order, sigmoid per node
    ordered = sorted(
        [n for n in genome.node_genes if n not in (0, 1, 2)],
        key=lambda n: depth.get(n, 1)
    )
    for node in ordered:
        total = sum(values.get(c.in_node, 0) * c.weight
                    for c in active if c.out_node == node)
        values[node] = 1 / (1 + np.exp(-total))

    return values.get(3, 0.5)  # output node = 3

def evolve_xor(pop_size=150, generations=100):
    xor_data = [([0,0], 0), ([0,1], 1), ([1,0], 1), ([1,1], 0)]

    # Minimal starting topology: 3 inputs -> 1 output
    population = []
    for _ in range(pop_size):
        g = Genome([0, 1, 2, 3])
        g.add_connection(0, 3)
        g.add_connection(1, 3)
        g.add_connection(2, 3)
        g.mutate_weights()
        population.append(g)

    for gen in range(generations):
        fitnesses = []
        for genome in population:
            error = sum((feed_forward(genome, x) - y)**2
                        for x, y in xor_data)
            fitnesses.append(4.0 - error)

        best = max(fitnesses)
        avg_nodes = np.mean([len(g.node_genes) for g in population])
        print(f"Gen {gen:3d} | Best: {best:.3f} | Nodes: {avg_nodes:.1f}")

        if best > 3.9:
            winner = population[np.argmax(fitnesses)]
            print(f"Solved! Network has {len(winner.node_genes)} nodes")
            return winner

        # Selection + reproduction (simplified)
        ranked = sorted(range(pop_size), key=lambda i: fitnesses[i])
        next_pop = []
        for _ in range(pop_size):
            p = population[ranked[np.random.randint(pop_size//2, pop_size)]]
            child = Genome(list(p.node_genes), [
                ConnectionGene(c.in_node, c.out_node, c.weight,
                               c.enabled, c.innovation)
                for c in p.connections])
            child.mutate_weights()
            if random.random() < 0.1:
                # new links run input -> (output/hidden), keeping the graph acyclic
                child.add_connection(
                    random.choice(child.node_genes[:3]),
                    random.choice(child.node_genes[3:]))
            if random.random() < 0.05:
                child.add_node()
            next_pop.append(child)
        population = next_pop

# evolve_xor()
# Gen   0 | Best: 2.441 | Nodes: 4.0
# Gen  15 | Best: 3.102 | Nodes: 5.2
# Gen  30 | Best: 3.687 | Nodes: 6.1
# Gen  45 | Best: 3.924 | Nodes: 6.8
# Solved! Network has 6 nodes

Watch what happens: the population starts with 4-node networks (two inputs, a bias, and one output) that can't solve XOR. Around generations 15-20, the add_node mutation introduces hidden neurons. Networks with 5-6 nodes start appearing, and fitness jumps. By generation 45, a 6-node network (two hidden neurons) has learned XOR with total squared error below 0.1 — a topology that NEAT discovered on its own.
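The "not linearly separable" claim is easy to verify numerically (a quick aside, not part of the NEAT code): the best linear fit over the two inputs plus a bias predicts 0.5 for every XOR input, so a network with no hidden nodes can never do better than chance.

```python
import numpy as np

# Best least-squares linear model for XOR: columns are x1, x2, bias
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(X @ w, 3))  # [0.5 0.5 0.5 0.5]: chance level on every input
```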

Try It: NEAT Topology Evolution

Watch a neural network evolve to solve XOR. It starts with just 3 input→output connections and grows hidden nodes as needed. Blue connections have positive weights, red have negative. Thickness shows weight magnitude. Small gray numbers are innovation IDs.

4. HyperNEAT — Indirect Encoding for Scale

NEAT is elegant, but it has a fundamental scaling problem: every connection in the network requires its own gene. A 1000-neuron network with dense connectivity would need a million connection genes. That's a million-dimensional search space for evolution, and crossover between such large genomes becomes increasingly destructive.

Kenneth Stanley's follow-up, HyperNEAT (2009), solves this with a beautiful insight borrowed from developmental biology. Your DNA doesn't encode every synapse in your brain individually — it encodes a developmental program that generates the connectivity patterns. HyperNEAT does the same thing: instead of evolving the weight matrix directly, it evolves a small network (called a CPPN — Compositional Pattern-Producing Network) that generates the weight matrix.

A CPPN takes the coordinates of two neurons — (x1, y1) for the source and (x2, y2) for the target — and outputs the connection weight between them. By using activation functions like sin (for repetition), Gaussian (for symmetry), and linear (for gradients), the CPPN can express geometric patterns in the connectivity: left-right symmetry, center-surround receptive fields, repeating modules.

import numpy as np

def cppn_query(x1, y1, x2, y2, params):
    """CPPN: spatial coordinates -> connection weight."""
    d = np.sqrt((x2 - x1)**2 + (y2 - y1)**2)        # distance
    h1 = np.sin(params[0] * d + params[1])            # repetition
    h2 = np.exp(-params[2] * ((x1 + x2) / 2)**2)     # symmetry
    h3 = params[3] * (y2 - y1)                        # gradient
    return np.tanh(params[4]*h1 + params[5]*h2 + params[6]*h3 + params[7])

def generate_weight_matrix(cppn_params, src_pos, tgt_pos):
    """Query CPPN for every source-target pair."""
    W = np.zeros((len(tgt_pos), len(src_pos)))
    for i, (x1, y1) in enumerate(src_pos):
        for j, (x2, y2) in enumerate(tgt_pos):
            W[j, i] = cppn_query(x1, y1, x2, y2, cppn_params)
    return W

# Substrate: 4x1 input grid, 3x1 output grid
inputs  = [(-1.5, -1), (-0.5, -1), (0.5, -1), (1.5, -1)]
outputs = [(-1.0,  1), ( 0.0,  1), (1.0,  1)]

# These 8 CPPN params would normally be evolved by NEAT
params = np.array([3.0, 0.5, 2.0, 0.1, 1.2, -0.8, 0.5, 0.0])
W = generate_weight_matrix(params, inputs, outputs)

print("Generated 3x4 weight matrix from 8 CPPN params:")
print(np.round(W, 3))
# [[ 0.488  0.3    0.524 -0.877]
#  [ 0.773 -0.136 -0.136  0.773]
#  [-0.877  0.524  0.3    0.488]]
# Notice the symmetry: row 0 and row 2 are mirror images!

The compression is remarkable: 8 CPPN parameters generated a 3×4 weight matrix (12 values). For a real HyperNEAT system with a 100×100 input grid and a 50×50 hidden layer, a small CPPN with perhaps 50 parameters would generate 25 million connection weights. That's a 500,000× compression ratio. The CPPN itself is evolved by NEAT, so you get topology evolution of the generator network for free.

5. Evolution Strategies — Neuroevolution at Scale

In 2017, a team at OpenAI published a paper that turned heads: "Evolution Strategies as a Scalable Alternative to Reinforcement Learning." They showed that a simple evolutionary method — not NEAT, not even a genetic algorithm, but plain parameter perturbation — could solve hard RL tasks like MuJoCo humanoid locomotion in 10 minutes using 1,440 CPU cores.

The algorithm, called OpenAI ES, is beautifully simple. You have a single parameter vector θ. Each generation, you sample N random perturbations ε_i from a standard Gaussian, evaluate the fitness of each perturbed candidate θ + σε_i, then update θ in the direction that increases fitness:

θ_new = θ + (α / (N σ)) · Σ_i F(θ + σε_i) · ε_i

This is mathematically equivalent to estimating the gradient of a Gaussian-smoothed version of the fitness function using finite differences. The smoothing radius σ controls how "blurry" the landscape appears to the optimizer. Large σ smooths away local optima; small σ follows the raw surface like gradient descent.
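This equivalence is easy to check numerically. The sketch below (my own illustration, not from the ES paper) estimates the gradient of f(θ) = -||θ||² using only fitness evaluations. The analytic gradient is -2θ, and for a quadratic the Gaussian smoothing leaves the gradient unchanged, so the estimate should recover it up to sampling noise:

```python
import numpy as np

def es_gradient(f, theta, sigma=0.1, n=20000, seed=0):
    """ES estimate of the gradient of the Gaussian-smoothed f at theta."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n, theta.size))
    F = np.array([f(theta + sigma * e) for e in eps])
    F = F - F.mean()  # baseline subtraction: pure variance reduction
    return (F[:, None] * eps).mean(axis=0) / sigma

f = lambda th: -np.sum(th**2)  # analytic gradient: -2 * theta
print(es_gradient(f, np.array([1.0, -0.5])))  # close to [-2.0, 1.0]
```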

The parallelism trick is what makes ES scale: instead of communicating full parameter vectors (millions of floats), workers share only random seeds. Each worker reconstructs the perturbation εi locally from the seed and sends back a single scalar (the fitness). Communication drops from O(parameters) to O(1) per worker.
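A single-process sketch of the seed trick (function names are mine; a real system would run worker_eval on remote machines): the master never ships parameter vectors, only integer seeds out and scalar rewards back.

```python
import numpy as np

def worker_eval(theta, seed, sigma, fitness_fn):
    """Worker: rebuild its perturbation locally from the seed, return one scalar."""
    eps = np.random.default_rng(seed).standard_normal(theta.size)
    return fitness_fn(theta + sigma * eps)

def master_update(theta, seeds, rewards, sigma, alpha):
    """Master: replay every seed to rebuild its eps, then apply the ES update."""
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad = np.zeros_like(theta)
    for seed, r in zip(seeds, rewards):
        grad += r * np.random.default_rng(seed).standard_normal(theta.size)
    return theta + alpha / (len(seeds) * sigma) * grad

# Round trip on a toy fitness: only seeds and scalars cross the "network"
theta = np.array([1.0, 1.0])
fitness = lambda th: -np.sum(th**2)
for step in range(100):
    seeds = list(range(step * 100, step * 100 + 50))
    rewards = np.array([worker_eval(theta, s, 0.1, fitness) for s in seeds])
    theta = master_update(theta, seeds, rewards, 0.1, 0.01)
```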

import numpy as np

def es_cartpole(n_params=49, pop_size=50, generations=40,
                sigma=0.1, alpha=0.03):
    """OpenAI-style Evolution Strategies for CartPole."""
    import gym
    env = gym.make('CartPole-v1')
    theta = np.zeros(n_params)

    for gen in range(generations):
        # Sample perturbations
        epsilons = [np.random.randn(n_params) for _ in range(pop_size)]

        # Evaluate perturbed candidates
        rewards = np.array([
            evaluate(theta + sigma * eps, env) for eps in epsilons
        ])

        # Normalize rewards for stability
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

        # Estimated gradient: weighted sum of perturbations
        grad = sum(r * eps for r, eps in zip(rewards, epsilons))
        theta += (alpha / (pop_size * sigma)) * grad

        score = evaluate(theta, env)
        print(f"Gen {gen:3d} | Score: {score:.0f}")

    env.close()
    return theta

# es_cartpole()
# Gen   0 | Score:  31
# Gen  10 | Score: 156
# Gen  20 | Score: 378
# Gen  30 | Score: 500
# Gen  39 | Score: 500

Compare this to the weight GA from Section 2: similar convergence, but ES uses a fundamentally different update rule. The GA selects and recombines discrete individuals; ES computes a population-level gradient estimate and updates a single parameter vector. ES is closer in spirit to gradient descent than to traditional evolutionary algorithms — in fact, in the limit of infinitesimal perturbations, ES converges to the true gradient (Whitelam et al., 2021).

Deep Neuroevolution: Raw Evolutionary Power

That same year, Uber AI Labs pushed the envelope further with Deep Neuroevolution (Such et al., 2017). They applied a bare-bones genetic algorithm — no crossover, just Gaussian mutation — to evolve 4-million-parameter CNNs for Atari games. The key trick was random-seed encoding: instead of storing 4 million floats per individual, they stored only the sequence of random seeds used to generate the mutations. An individual's full parameter vector could be reconstructed by replaying the seed sequence, compressing storage by ~1000×.
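The seed-sequence encoding fits in a few lines (a minimal sketch with my own names; a production version would also cache partially reconstructed ancestors rather than replay long lineages from scratch):

```python
import numpy as np

def reconstruct(seed_history, n_params, sigma=0.005):
    """Rebuild a full parameter vector by replaying a genome's seeds.

    A genome is stored as [init_seed, mutation_seed_1, ...]: a handful
    of integers instead of millions of floats."""
    theta = np.random.default_rng(seed_history[0]).standard_normal(n_params)
    for s in seed_history[1:]:
        theta += sigma * np.random.default_rng(s).standard_normal(n_params)
    return theta

def mutate(seed_history, rng):
    """Mutation = append one seed; O(1) extra storage per generation."""
    return seed_history + [int(rng.integers(2**31))]
```

Reconstruction is deterministic, so any worker holding the seed list can rebuild the exact same network from a few integers.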

The result? A simple GA, with no gradient information whatsoever, beat DQN on roughly half of the Atari games tested. Not by being smarter, but by being massively parallel: with enough CPUs, you can evaluate millions of candidates per hour and let sheer search volume compensate for the lack of gradient signal.

Try It: ES vs Gradient Descent

Both optimizers start at the same point (white dot) on a multi-modal landscape. The red dot follows true gradient ascent; the blue cloud shows the ES population, with its center marked in cyan. Watch how ES smooths past local optima that trap gradient descent.

6. When Evolution Beats Gradients (and When It Doesn't)

Let's put all the methods side by side:

| Method    | Best Use Case                   | Param Scale | Topology? | Parallelism    |
|-----------|---------------------------------|-------------|-----------|----------------|
| Backprop  | Supervised / differentiable RL  | Billions    | No        | Data-parallel  |
| Weight GA | Small black-box control         | ~1K         | No        | Per-individual |
| NEAT      | Evolving network structure      | ~100s       | Yes       | Per-individual |
| ES        | Large-scale RL                  | Millions    | No        | Linear scaling |
| CMA-ES    | Precise continuous optimization | ~1K         | No        | Per-individual |

Evolution wins when:

- Gradients are unavailable or meaningless: non-differentiable rewards, discrete decisions, chaotic simulations.
- The architecture itself is part of the search, not just the weights.
- Rewards are sparse or deceptive, where a large smoothing radius σ helps step past local optima.
- Massive parallelism is available, so sheer search volume can compensate for the missing gradient signal.

Gradients win when:

- The objective is differentiable and dense, as in supervised learning.
- Sample efficiency matters: each evaluation yields a full descent direction rather than a single scalar.
- Parameter counts reach the billions, where black-box search cannot keep up.

The deepest insight: in the limit of small mutations (σ → 0), evolution strategies converge to gradient descent (Whitelam et al., 2021). The two approaches aren't opposing philosophies — they're endpoints on a continuum, with the perturbation scale σ controlling the tradeoff between global exploration and local precision.
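That continuum is easy to see on a toy 1-D landscape (my own example, echoing the interactive demo above): a quadratic with cosine ripples has local optima under small σ, and under a wide Gaussian blur the ripples average out, leaving only the global structure.

```python
import numpy as np

def smoothed(f, x, sigma, n=100_000, seed=0):
    """Monte-Carlo estimate of the Gaussian-smoothed f at x."""
    eps = np.random.default_rng(seed).standard_normal(n)
    return f(x + sigma * eps).mean()

f = lambda x: -x**2 + 0.5 * np.cos(8 * x)  # ripples create local optima

# Raw surface: the ripple near x ~ 0.78 beats the point x = 0.5 ...
print(f(0.78) > f(0.5))                                            # True
# ... but after heavy smoothing only -x^2 survives, and 0.5 wins
print(smoothed(f, 0.5, sigma=0.5) > smoothed(f, 0.78, sigma=0.5))  # True
```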

7. The Return of Evolution

We've traveled from the simplest neuroevolution (flatten weights, run a GA) to NEAT's topology evolution, HyperNEAT's indirect encoding, and evolution strategies at industrial scale. Along the way, a clear picture emerged: neuroevolution is not a curiosity or a historical footnote — it's a legitimate training paradigm with unique advantages.

The key insight is that neuroevolution searches the joint space of architectures and parameters. Gradient methods optimize weights for a topology you chose. Neural architecture search explores topologies but trains each candidate with gradients. Only neuroevolution does both simultaneously in a single evolutionary process — the same mutations that adjust weights can also add nodes, rewire connections, or discover entirely new network structures.

Modern neuroevolution continues to evolve (pun intended). Quality-diversity algorithms like MAP-Elites don't just find one good solution — they fill an archive with diverse, high-performing solutions across a space of behaviors. Differentiable neural architecture search (DARTS) blends the two paradigms, using gradients to optimize a continuous relaxation of the architecture space. And every time someone announces "we evolved a neural network to do X," the deep learning community does a collective double-take.

There's a satisfying irony here. Deep learning started with biologically-inspired neurons, borrowed its name from brains, then abandoned biology for calculus. Neuroevolution brings the biology back — not as a metaphor, but as a working training algorithm. Sometimes the most powerful optimization method is the one that's been running for 3.5 billion years.

References & Further Reading