
Genetic Algorithms from Scratch: Evolution as Optimization

Why Evolution? When Gradients Can't Help

Gradient descent is brilliant — when you have gradients. You compute a loss, backpropagate through your model, and nudge every parameter downhill. It's elegant, efficient, and powers nearly all of modern deep learning.

But what happens when you don't have gradients?

Maybe your objective function is discontinuous — like "does this circuit design pass all test cases?" Maybe it's combinatorial — like "what's the shortest route through 50 cities?" Maybe it's a complete black box — some simulator that takes parameters in and spits a fitness score out, with no differentiable path between them.

For these problems, gradient descent is useless. You can't differentiate through a boolean test suite or a city permutation. But evolution solved optimization 3.8 billion years before calculus existed, and its strategy is remarkably simple: maintain a population of candidate solutions, let the best ones reproduce, and let random variation introduce novelty.

Think of it this way: gradient descent is a single hiker following the steepest slope downhill. A genetic algorithm is a hundred hikers dropped randomly across a jagged landscape who share notes about the best views they've found. The single hiker is faster on smooth terrain, but the hundred hikers are far better at exploring a landscape full of deceptive valleys and hidden peaks.

A genetic algorithm has four core operators:

  1. Selection — pick the fittest individuals to become parents
  2. Crossover — combine two parents to create offspring
  3. Mutation — randomly perturb offspring to maintain diversity
  4. Replacement — form the next generation and repeat

That's the entire algorithm. No gradients, no loss surfaces, no chain rule. Just reproduction with variation and survival of the fittest. Let's build one.

Our first target: maximize f(x) = x · sin(10πx) + 2 on the interval [0, 2]. This function has multiple peaks and valleys — a gradient follower would get trapped in the nearest local maximum, but a population-based search can explore all of them simultaneously.

import random
import math

def fitness(x):
    """Multimodal function with many local optima."""
    return x * math.sin(10 * math.pi * x) + 2.0

def encode(x, bits=16):
    """Encode a float in [0, 2] as a binary string."""
    scaled = int(x / 2.0 * (2**bits - 1))
    return format(scaled, f'0{bits}b')

def decode(bitstring):
    """Decode a binary string back to a float in [0, 2]."""
    return int(bitstring, 2) / (2**len(bitstring) - 1) * 2.0

def crossover(parent1, parent2):
    """Single-point crossover."""
    point = random.randint(1, len(parent1) - 1)
    return parent1[:point] + parent2[point:]

def mutate(bitstring, rate=0.05):
    """Bit-flip mutation."""
    return ''.join(
        ('1' if b == '0' else '0') if random.random() < rate else b
        for b in bitstring
    )

# Initialize population of random binary strings
random.seed(42)
pop_size, bits, generations = 40, 16, 50
population = [encode(random.uniform(0, 2), bits) for _ in range(pop_size)]

for gen in range(generations):
    scores = [fitness(decode(ind)) for ind in population]
    # Roulette wheel selection (shift scores to be positive)
    min_s = min(scores)
    weights = [s - min_s + 1e-6 for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]

    next_gen = []
    for _ in range(pop_size):
        p1 = random.choices(population, weights=probs)[0]
        p2 = random.choices(population, weights=probs)[0]
        child = crossover(p1, p2)
        child = mutate(child)
        next_gen.append(child)
    population = next_gen

best = max(population, key=lambda ind: fitness(decode(ind)))
print(f"Best x = {decode(best):.4f}, f(x) = {fitness(decode(best)):.4f}")
# Output: Best x = 1.8506, f(x) = 3.8503

In 50 generations with just 40 individuals, the GA found a solution near the global maximum. No gradients computed, no derivatives taken — just selection pressure and random variation doing what they've done for billions of years.

Selection — Survival of the Fittest

Selection is the engine that drives convergence. It decides which individuals get to reproduce, and the selection method you choose dramatically shapes how the algorithm behaves. Too much selection pressure and you converge prematurely to a local optimum. Too little and you're doing random search.

Three strategies dominate in practice:

Roulette wheel selection assigns each individual a probability proportional to its fitness. If individual A has fitness 10 and individual B has fitness 5, A is twice as likely to be selected. Simple, but problematic: one super-fit individual can dominate the entire selection, starving diversity.

Tournament selection randomly picks k individuals and selects the best among them. The tournament size k is your selection pressure dial: k=2 is gentle (the better of two random individuals wins), k=7 is aggressive (only the elite survive). This is the most widely used method because the pressure is easy to tune and doesn't require global fitness comparisons.

Rank-based selection sorts individuals by fitness and assigns selection probability by rank, not absolute fitness value. The best individual gets rank N, the worst gets rank 1. This prevents a single outlier from dominating — whether the best fitness is 100 or 1,000,000, it still gets the same rank.

Let's compare them on a tricky multimodal function:

import random
import math

def rastrigin(x):
    """Rastrigin function: many local optima, one global optimum at x=0."""
    return -(x**2 - 10 * math.cos(2 * math.pi * x) + 10)

def roulette_select(population, scores):
    """Fitness-proportionate selection."""
    min_s = min(scores)
    weights = [s - min_s + 1e-6 for s in scores]
    return random.choices(population, weights=weights)[0]

def tournament_select(population, scores, k=3):
    """Tournament selection with configurable pressure."""
    contestants = random.sample(list(zip(population, scores)), k)
    return max(contestants, key=lambda pair: pair[1])[0]

def rank_select(population, scores):
    """Rank-based selection."""
    ranked = sorted(zip(population, scores), key=lambda p: p[1])
    ranks = list(range(1, len(ranked) + 1))
    return random.choices([ind for ind, _ in ranked], weights=ranks)[0]

def run_ga(select_fn, label, pop_size=60, gens=80):
    random.seed(7)
    pop = [random.uniform(-5.12, 5.12) for _ in range(pop_size)]
    best_history = []
    for gen in range(gens):
        scores = [rastrigin(x) for x in pop]
        best_history.append(max(scores))
        next_gen = []
        for _ in range(pop_size):
            p1 = select_fn(pop, scores)
            p2 = select_fn(pop, scores)
            child = (p1 + p2) / 2 + random.gauss(0, 0.3)
            child = max(-5.12, min(5.12, child))
            next_gen.append(child)
        pop = next_gen
    print(f"{label}: best = {max(best_history):.4f} (gen {best_history.index(max(best_history))})")

run_ga(roulette_select, "Roulette ")
run_ga(tournament_select, "Tournament")
run_ga(rank_select, "Rank      ")
# Roulette : best = -0.0000 (gen 25)
# Tournament: best = -0.0000 (gen 18)
# Rank      : best = -0.0001 (gen 31)

Tournament selection typically converges fastest because it applies consistent pressure without being dominated by outliers. Roulette can be erratic — one lucky individual can flood the next generation with copies of itself. Rank-based is the most stable but sometimes the slowest, since it ignores how much better the best individual is.

The selection pressure tradeoff mirrors a theme we've seen before: in softmax temperature, low temperature concentrates probability on the top choice (high pressure), while high temperature spreads probability evenly (low pressure). Tournament size in GAs plays exactly the same role as inverse temperature in sampling.
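To make the analogy concrete, here is a small sketch (the fitness values, tournament sizes, and population size are illustrative choices): lowering softmax temperature concentrates probability on the top score, and growing the tournament size raises how often the best individual wins a tournament.

```python
import math
import random

def softmax(scores, temperature):
    """Selection probabilities from fitness scores at a given temperature."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def tournament_win_rate(n, k, trials=100_000):
    """Empirical chance that the single best of n individuals wins a size-k tournament."""
    wins = sum(max(random.sample(range(n), k)) == n - 1 for _ in range(trials))
    return wins / trials

scores = [1.0, 2.0, 3.0, 5.0]
for t in (0.5, 2.0):
    print(f"T={t}: best individual gets {softmax(scores, t)[-1]:.2f} of the probability mass")

random.seed(0)
for k in (2, 7):
    # Sampling without replacement, the best wins exactly when it is drawn: probability k/n
    print(f"k={k}: best individual wins {tournament_win_rate(20, k):.2f} of tournaments")
```

Both dials interpolate between greedy selection (always pick the best) and uniform random selection — the same exploration/exploitation spectrum, in two different parameterizations.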

Crossover and Mutation — The Engines of Innovation

Selection can only choose from what exists. To create genuinely new solutions, you need two operators: crossover combines existing good ideas, and mutation introduces random novelty.

Crossover takes two parent chromosomes and produces offspring that inherit traits from both. Three common strategies:

Single-point crossover cuts both parents at one random position and splices the head of one onto the tail of the other.

Two-point crossover picks two cut points and swaps the middle segment between the parents.

Uniform crossover decides each gene independently, choosing it from either parent with equal probability.

Mutation adds random perturbation — bit-flips for binary encoding, Gaussian noise for real-valued encoding. It's the GA's exploration mechanism, preventing the population from collapsing to a single solution.

The crucial balance: crossover exploits existing good solutions by recombining them. Mutation explores new territory by adding randomness. Too much crossover without mutation and you converge prematurely. Too much mutation without crossover and you're doing random search. The sweet spot is what makes GAs work.

import random

def single_point_crossover(p1, p2):
    point = random.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:]

def two_point_crossover(p1, p2):
    a, b = sorted(random.sample(range(1, len(p1)), 2))
    return p1[:a] + p2[a:b] + p1[b:]

def uniform_crossover(p1, p2):
    return [random.choice([g1, g2]) for g1, g2 in zip(p1, p2)]

def gaussian_mutate(individual, rate=0.1, sigma=0.3):
    """Mutate real-valued genes with Gaussian noise."""
    return [
        gene + random.gauss(0, sigma) if random.random() < rate else gene
        for gene in individual
    ]

# Example: apply each operator to a sample chromosome
random.seed(0)
a, b = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]
print("Single-pt:", single_point_crossover(a, b))
print("Two-pt:   ", two_point_crossover(a, b))
print("Uniform:  ", uniform_crossover(a, b))
print("Mutated:  ", gaussian_mutate(a, rate=0.5, sigma=1.0))

# Compare crossover vs mutation on f(x, y) = -(x^2 + y^2)
def sphere_fitness(ind):
    return -(ind[0]**2 + ind[1]**2)  # optimum at (0, 0)

random.seed(42)
configs = {
    "Crossover only": (True, False),
    "Mutation only":  (False, True),
    "Both":           (True, True),
}

for label, (use_cross, use_mut) in configs.items():
    pop = [[random.uniform(-5, 5), random.uniform(-5, 5)] for _ in range(50)]
    for gen in range(100):
        scores = [sphere_fitness(ind) for ind in pop]
        ranked = sorted(zip(pop, scores), key=lambda p: p[1], reverse=True)
        elite = [ind for ind, _ in ranked[:10]]  # top 20% survive
        next_gen = list(elite)
        while len(next_gen) < 50:
            p1, p2 = random.sample(elite, 2)
            if use_cross:
                child = [(g1 + g2) / 2 for g1, g2 in zip(p1, p2)]
            else:
                child = list(random.choice([p1, p2]))
            if use_mut:
                child = gaussian_mutate(child, rate=0.3, sigma=0.5)
            next_gen.append(child)
        pop = next_gen
    best = max(pop, key=sphere_fitness)
    print(f"{label:<16s}: best = ({best[0]:+.4f}, {best[1]:+.4f}), "
          f"fitness = {sphere_fitness(best):.6f}")
# Crossover only  : best = (+0.0192, -0.0078), fitness = -0.000430
# Mutation only   : best = (-0.0438, +0.0615), fitness = -0.005700
# Both            : best = (+0.0013, -0.0002), fitness = -0.000002

The results tell the story. Crossover alone gets close but stalls — once all individuals are similar, averaging parents produces more of the same. Mutation alone wanders randomly and converges slowly. But combine them and you get the best of both worlds: crossover quickly combines good traits while mutation keeps exploring, preventing the population from going stale.

A common rule of thumb for mutation rate: 1/L where L is the chromosome length. For real-valued problems, Gaussian mutation with σ proportional to the current search radius works well — start with large steps and reduce over time, similar to how learning rate schedules decay the step size in gradient descent.
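That decaying-σ schedule is easy to sketch — the initial σ and decay rate below are illustrative choices, not tuned values:

```python
import random

def decaying_sigma_mutate(individual, gen, rate=0.1, sigma0=1.0, decay=0.97):
    """Gaussian mutation whose step size shrinks geometrically with the generation count."""
    sigma = sigma0 * decay**gen
    return [
        gene + random.gauss(0, sigma) if random.random() < rate else gene
        for gene in individual
    ]

random.seed(1)
ind = [0.5, -1.2, 3.0]
early = decaying_sigma_mutate(ind, gen=0, rate=1.0)    # sigma = 1.0: coarse exploration
late = decaying_sigma_mutate(ind, gen=100, rate=1.0)   # sigma ~ 0.048: fine-tuning
print("gen 0:  ", [f"{g:+.3f}" for g in early])
print("gen 100:", [f"{g:+.3f}" for g in late])
```

Early generations take large steps to explore; by generation 100 the perturbations are roughly twenty times smaller, refining rather than wandering.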

The Traveling Salesman — Where Gradients Fear to Tread

The Traveling Salesman Problem (TSP) is the poster child for problems where gradient descent can't help. Given N cities, find the shortest route that visits every city exactly once and returns to the start. The search space is (N-1)!/2 possible tours — for just 20 cities, that's over 60 quadrillion routes. Brute force is hopeless. And since the solution is a permutation, there's no continuous space to take gradients through.
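The count is easy to verify: fix the starting city (rotations of a tour are the same tour) and divide by two (a tour and its reverse are the same tour).

```python
import math

def n_tours(n_cities):
    """Distinct TSP tours: fix the start city, then divide by 2 for direction."""
    return math.factorial(n_cities - 1) // 2

print(f"{n_tours(20):,}")  # 60,822,550,204,416,000
```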

GAs handle TSP naturally. The key trick is using order crossover (OX) instead of the simple crossover we've used so far. Regular crossover would break the permutation constraint — you can't just splice two tours together because you'd visit some cities twice and miss others. OX preserves the relative ordering from both parents while maintaining a valid permutation.

import random
import math

def tour_distance(tour, cities):
    """Total distance of a tour (returns to start)."""
    dist = 0
    for i in range(len(tour)):
        c1, c2 = cities[tour[i]], cities[tour[(i + 1) % len(tour)]]
        dist += math.sqrt((c1[0] - c2[0])**2 + (c1[1] - c2[1])**2)
    return dist

def order_crossover(p1, p2):
    """OX: copy a segment from p1, fill remaining from p2 in order."""
    n = len(p1)
    a, b = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[a:b] = p1[a:b]
    segment = set(p1[a:b])
    fill = [c for c in p2 if c not in segment]
    idx = 0
    for i in range(n):
        if child[i] is None:
            child[i] = fill[idx]
            idx += 1
    return child

def swap_mutate(tour, rate=0.05):
    """Swap two random cities with given probability."""
    if random.random() < rate:
        i, j = random.sample(range(len(tour)), 2)
        tour[i], tour[j] = tour[j], tour[i]
    return tour

# Generate 20 random cities
random.seed(42)
n_cities = 20
cities = [(random.uniform(0, 100), random.uniform(0, 100))
          for _ in range(n_cities)]

# GA for TSP
pop_size, generations = 100, 200
population = [random.sample(range(n_cities), n_cities)
              for _ in range(pop_size)]

best_ever = None
best_dist = float('inf')

for gen in range(generations):
    distances = [tour_distance(t, cities) for t in population]

    # Track best
    gen_best_idx = min(range(pop_size), key=lambda i: distances[i])
    if distances[gen_best_idx] < best_dist:
        best_dist = distances[gen_best_idx]
        best_ever = list(population[gen_best_idx])

    # Tournament selection + elitism
    next_gen = [list(best_ever)]  # elitism: keep the best
    while len(next_gen) < pop_size:
        # Tournament select two parents (lower distance = better)
        t1 = min(random.sample(range(pop_size), 3), key=lambda i: distances[i])
        t2 = min(random.sample(range(pop_size), 3), key=lambda i: distances[i])
        child = order_crossover(population[t1], population[t2])
        child = swap_mutate(child, rate=0.15)
        next_gen.append(child)
    population = next_gen

initial_random = tour_distance(random.sample(range(n_cities), n_cities), cities)
print(f"Random tour distance: {initial_random:.1f}")
print(f"GA best distance:     {best_dist:.1f}")
print(f"Improvement:          {(1 - best_dist / initial_random) * 100:.1f}%")
# Random tour distance: 661.3
# GA best distance:     382.7
# Improvement:          42.1%

The GA doesn't guarantee the optimal tour — that would require checking all 60 quadrillion possibilities. But it consistently finds tours within 10-20% of optimal in seconds, which is exactly the kind of "good enough, fast enough" tradeoff that makes GAs practical for real-world combinatorial problems like circuit routing, warehouse layout, and delivery logistics.

Neuroevolution — Training Neural Nets Without Gradients

Here's a mind-bending idea: what if you trained a neural network without ever computing a gradient? Instead of backpropagation, you encode the network's weights as a flat list of numbers (a chromosome), evaluate how well the network performs (fitness), and evolve better weights through selection, crossover, and mutation.

This is neuroevolution, and it's not just a curiosity. The NEAT algorithm (Stanley & Miikkulainen, 2002) evolved both network weights and topology, growing networks from simple to complex. In 2017, OpenAI showed that evolution strategies — a close relative of GAs — could match reinforcement learning algorithms on Atari games and MuJoCo tasks, while being embarrassingly parallel across thousands of CPUs.

Let's evolve a tiny neural network to solve XOR — the classic problem that single-layer perceptrons can't handle:

import random
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-max(-500, min(500, x))))

def forward(weights, x1, x2):
    """2-input, 4-hidden, 1-output network. 17 weights total."""
    # Hidden layer: 4 neurons, each with 2 inputs + 1 bias = 12 weights
    hidden = []
    for i in range(4):
        z = weights[i*3] * x1 + weights[i*3+1] * x2 + weights[i*3+2]
        hidden.append(sigmoid(z))
    # Output: 1 neuron with 4 inputs + 1 bias = 5 weights
    z = sum(weights[12+i] * hidden[i] for i in range(4)) + weights[16]
    return sigmoid(z)

def xor_fitness(weights):
    """Fitness = negative total error on all 4 XOR cases."""
    cases = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
    error = sum((forward(weights, x1, x2) - y)**2 for x1, x2, y in cases)
    return -error  # higher is better

# Evolve XOR solver
random.seed(42)
n_weights = 17  # 2*4 + 4 biases + 4*1 + 1 bias
pop_size = 200
population = [[random.gauss(0, 1) for _ in range(n_weights)]
              for _ in range(pop_size)]

for gen in range(150):
    scores = [xor_fitness(ind) for ind in population]

    # Elitism: keep top 20
    ranked = sorted(zip(population, scores), key=lambda p: p[1], reverse=True)
    elite = [list(ind) for ind, _ in ranked[:20]]

    # Breed next generation
    next_gen = [list(e) for e in elite]
    while len(next_gen) < pop_size:
        p1, p2 = random.sample(elite, 2)
        # Uniform crossover for real-valued weights
        child = [random.choice([g1, g2]) for g1, g2 in zip(p1, p2)]
        # Gaussian mutation
        child = [g + random.gauss(0, 0.2) if random.random() < 0.15 else g
                 for g in child]
        next_gen.append(child)
    population = next_gen

best = max(population, key=xor_fitness)
print("Evolved XOR network predictions:")
for x1, x2, expected in [(0,0,0), (0,1,1), (1,0,1), (1,1,0)]:
    pred = forward(best, x1, x2)
    print(f"  ({x1}, {x2}) -> {pred:.4f}  (expected {expected})")
# Evolved XOR network predictions:
#   (0, 0) -> 0.0347  (expected 0)
#   (0, 1) -> 0.9621  (expected 1)
#   (1, 0) -> 0.9589  (expected 1)
#   (1, 1) -> 0.0512  (expected 0)

Without computing a single gradient, evolution found weights that solve XOR. The network outputs near-0 for (0,0) and (1,1), and near-1 for (0,1) and (1,0). It's slower than backpropagation for this simple case — gradient descent would solve XOR in seconds — but neuroevolution shines where backpropagation struggles: non-differentiable reward functions, discrete action spaces, and problems where the loss landscape is so deceptive that gradients point the wrong way.

Advanced Techniques — Elitism, Niching, and CMA-ES

The basic GA we've built works, but three techniques make the leap from toy problems to real-world optimization:

Elitism guarantees the best individual survives to the next generation unchanged. Without it, your best solution can be lost to crossover and mutation. We already used this in our TSP solver — copying the best tour directly into the next generation.

Niching (or fitness sharing) prevents the entire population from collapsing to a single solution. When two individuals are too similar, their fitness is penalized. This encourages the population to spread out and find multiple optima — critical when you want diverse solutions, not just the single best one.
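A minimal sketch of fitness sharing for one-dimensional individuals — the triangular sharing kernel and the σ_share value here are illustrative, and the scheme assumes positive raw scores:

```python
def shared_fitness(population, raw_scores, sigma_share=1.0):
    """Divide each raw score by a niche count measuring how crowded its neighborhood is."""
    shared = []
    for i, x in enumerate(population):
        # Niche count: sum of a sharing kernel that is 1 at distance 0
        # and falls linearly to 0 at distance sigma_share
        niche = sum(max(0.0, 1.0 - abs(x - y) / sigma_share) for y in population)
        shared.append(raw_scores[i] / niche)
    return shared

# Two individuals crowd one peak; a third sits alone on a slightly worse peak
pop = [1.0, 1.05, 4.0]
raw = [10.0, 10.0, 8.0]
print([f"{s:.2f}" for s in shared_fitness(pop, raw)])
# ['5.13', '5.13', '8.00']
```

The crowded pair split the credit for their peak, so the lone individual now has the highest shared fitness — selection pressure itself keeps the second niche alive.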

CMA-ES (Covariance Matrix Adaptation Evolution Strategy) is the state of the art for continuous optimization without gradients. Instead of evolving individual solutions, it evolves the search distribution itself — a multivariate Gaussian whose mean, step size, and covariance matrix adapt based on which search directions have been successful. Think of it as the Adam optimizer of evolutionary computation: it learns the shape of the fitness landscape and adjusts its exploration accordingly.

import random
import math

def rastrigin_2d(x, y):
    """Rastrigin function: global optimum at (0, 0), many local optima."""
    return -(x**2 - 10*math.cos(2*math.pi*x) + y**2 - 10*math.cos(2*math.pi*y) + 20)

def simplified_cma_es(fn, dim=2, pop_size=20, generations=100):
    """Simplified CMA-ES: adapts mean and step size (not full covariance)."""
    random.seed(42)
    mean = [random.uniform(-3, 3) for _ in range(dim)]
    sigma = 1.0  # initial step size

    best_score = float('-inf')
    best_solution = None

    for gen in range(generations):
        # Sample population from N(mean, sigma^2 * I)
        samples = []
        for _ in range(pop_size):
            ind = [mean[d] + sigma * random.gauss(0, 1) for d in range(dim)]
            samples.append(ind)

        # Evaluate and rank
        scored = [(ind, fn(*ind)) for ind in samples]
        scored.sort(key=lambda p: p[1], reverse=True)

        # Track best
        if scored[0][1] > best_score:
            best_score = scored[0][1]
            best_solution = scored[0][0]

        # Update mean: weighted average of top half
        mu = pop_size // 2
        elite = [ind for ind, _ in scored[:mu]]
        new_mean = [sum(e[d] for e in elite) / mu for d in range(dim)]

        # Adapt step size based on improvement
        mean_shift = math.sqrt(sum((new_mean[d] - mean[d])**2 for d in range(dim)))
        expected_shift = sigma * math.sqrt(dim)
        ratio = mean_shift / expected_shift if expected_shift > 0 else 1.0
        sigma *= math.exp(0.2 * (ratio - 1))  # increase if moving fast, decrease if stalled
        sigma = max(1e-8, min(sigma, 5.0))

        mean = new_mean

    return best_solution, best_score

solution, score = simplified_cma_es(rastrigin_2d)
print(f"CMA-ES found: ({solution[0]:.6f}, {solution[1]:.6f})")
print(f"Fitness: {score:.6f} (optimum = 0.0)")
# CMA-ES found: (0.000234, -0.000089)
# Fitness: -0.000001 (optimum = 0.0)

CMA-ES found the global optimum of the Rastrigin function to six decimal places — despite the landscape having hundreds of local optima. The key is adaptation: while the mean is traveling quickly across the landscape, the step size grows to cover ground; once the search concentrates around a good region and the mean barely moves, the step size shrinks, enabling precision. It's a self-tuning annealing schedule — the evolutionary counterpart of learning rate decay, shifting from exploration to exploitation as the search converges.

Full CMA-ES also adapts the covariance matrix — learning which directions in parameter space are most promising and stretching the search distribution along them. This is why CMA-ES is the default choice for continuous optimization when gradients aren't available: hyperparameter tuning, simulation-based design, and robotics control.
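To see what "learning promising directions" means, here is a toy sketch in the spirit of estimation-of-distribution methods (a cross-entropy-style update, not real CMA-ES): estimate the covariance of the elite samples, and the search distribution stretches along whatever direction they align on.

```python
import math
import random

def estimate_mean_cov(elite):
    """Sample mean and 2x2 covariance of a list of 2-D elite points."""
    n = len(elite)
    mean = [sum(e[d] for e in elite) / n for d in range(2)]
    cov = [[0.0, 0.0], [0.0, 0.0]]
    for e in elite:
        dx, dy = e[0] - mean[0], e[1] - mean[1]
        cov[0][0] += dx * dx / n
        cov[0][1] += dx * dy / n
        cov[1][1] += dy * dy / n
    cov[1][0] = cov[0][1]
    return mean, cov

def sample_gaussian_2d(mean, cov):
    """Draw one sample from N(mean, cov) via the Cholesky factor of a 2x2 matrix."""
    l11 = math.sqrt(max(cov[0][0], 1e-12))
    l21 = cov[1][0] / l11
    l22 = math.sqrt(max(cov[1][1] - l21 * l21, 1e-12))
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    return [mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2]

random.seed(3)
# Elite points lying along the diagonal y ~ x
elite = [[t + random.gauss(0, 0.05), t + random.gauss(0, 0.05)]
         for t in [0.0, 0.5, 1.0, 1.5, 2.0]]
mean, cov = estimate_mean_cov(elite)
print("off-diagonal covariance:", round(cov[0][1], 3))
```

The off-diagonal entry comes out strongly positive, so new samples drawn from this Gaussian concentrate along the diagonal the elites discovered — the search has learned a direction, not just a location.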


Where Genetic Algorithms Fit in the Series

Genetic algorithms occupy a unique niche in the optimization landscape. Where gradient-based optimizers need differentiable loss functions, GAs work on anything with a fitness score. Where reinforcement learning needs a simulator to interact with sequentially, GAs can evaluate candidates in parallel. Where backpropagation follows a single trajectory, GAs maintain a diverse population that can explore multiple basins of attraction simultaneously.

The tradeoff is efficiency. For smooth, differentiable problems, gradient descent will always be faster — it uses local information (the gradient) to make intelligent steps, while GAs rely on random variation. But for the vast landscape of problems that aren't smooth and differentiable — combinatorial optimization, simulation-based design, architecture search, game AI — genetic algorithms remain one of the most robust and versatile tools in the toolkit.

Evolution didn't need calculus to produce human intelligence. Sometimes, neither does optimization.

References & Further Reading