
Neural Architecture Search from Scratch

1. The Architecture Design Problem

You've built a neural network. Four convolutional layers, 3×3 kernels, batch norm, ReLU activations, and a skip connection from layer 2 to layer 4. It works... okay. But what if you used 5×5 kernels in the first two layers? Or swapped ReLU for SiLU? Or added dilated convolutions? There are thousands of plausible architectures — and the one you hand-designed is almost certainly not the best one.

What if a machine could search over all of them automatically?

That's Neural Architecture Search (NAS). In 2017, Zoph and Le at Google Brain demonstrated that a reinforcement learning controller could discover convolutional architectures that beat hand-designed networks on CIFAR-10. The catch? It required 800 GPUs running for 28 days — roughly 22,400 GPU-days. Since then, the field has achieved a stunning 10,000× reduction in search cost, from GPU-years down to seconds.

Every NAS method has three components:

  1. Search space — what architectures are possible. The most common design is a cell-based search space (Zoph et al. 2018): instead of searching the entire network, you search for a small repeatable "cell" — a directed acyclic graph (DAG) of operations. Stack cells to build the full network.
  2. Search strategy — how to explore efficiently. Random sampling, evolutionary algorithms, reinforcement learning, or gradient-based optimization.
  3. Performance estimation — how to evaluate a candidate without fully training it. Shorter training runs, weight sharing, or zero-cost proxies.

Let's start by defining a search space and seeing how the combinatorics explode.

import numpy as np

# Cell-based search space: a DAG with N intermediate nodes
# Each node picks 2 inputs and an operation for each input
OPERATIONS = ['conv3x3', 'conv5x5', 'maxpool', 'skip', 'none']
NUM_NODES = 4  # intermediate nodes in one cell

def sample_architecture(rng):
    """Sample a random cell architecture."""
    cell = []
    for node_idx in range(NUM_NODES):
        # Each node picks 2 inputs from {input_0, input_1, prev_nodes...}
        num_inputs = node_idx + 2  # node 0 can pick from 2, node 1 from 3, etc.
        input_1 = rng.integers(0, num_inputs)
        input_2 = rng.integers(0, num_inputs)
        op_1 = rng.choice(OPERATIONS)
        op_2 = rng.choice(OPERATIONS)
        cell.append(((input_1, op_1), (input_2, op_2)))
    return cell

# How big is this search space?
total = 1
for node_idx in range(NUM_NODES):
    num_inputs = node_idx + 2
    choices_per_node = (num_inputs * len(OPERATIONS)) ** 2
    total *= choices_per_node
    print(f"Node {node_idx}: {num_inputs} inputs x {len(OPERATIONS)} ops "
          f"= {choices_per_node} combos")
print(f"\nTotal search space: {total:,} architectures")
# Node 0: 2 inputs x 5 ops = 100 combos
# Node 1: 3 inputs x 5 ops = 225 combos
# Node 2: 4 inputs x 5 ops = 400 combos
# Node 3: 5 inputs x 5 ops = 625 combos
# Total search space: 5,625,000,000 architectures

Over 5.6 billion possible cell architectures from just 4 nodes and 5 operations. And we haven't even counted the full network design (how many cells to stack, which are Normal vs. Reduction cells). Exploring this space manually is like searching for a needle in a haystack. Let's see what happens when we automate it.

Try It: Architecture Space Explorer

Click operation buttons on each edge to build a cell architecture. Watch the parameter count and search space size update live.

2. Random Search — The Surprisingly Strong Baseline

Before building sophisticated search algorithms, we need a baseline. In 2019, Li and Talwalkar published a surprising result: random search, properly tuned, matches or beats many elaborate NAS methods on standard benchmarks. The insight was that most of the performance gains in early NAS papers came from clever search space design, not the search algorithm itself. When you constrain the space well, even random samples are good.

The simplest approach: uniformly sample K architectures, train each for a fixed number of epochs, and pick the best. But there's a smarter way to allocate your compute budget.

Successive halving (Jamieson & Talwalkar 2016, later refined as Hyperband): start with many candidates training for a short time, then periodically eliminate the bottom half and give the survivors more training budget. This combines the breadth of random search (trying many architectures) with the depth of full training (for the most promising ones).

import numpy as np

def synthetic_fitness(arch, max_epochs, rng):
    """Simulate training: architecture quality + noise that decreases with epochs."""
    # True quality: sum of operation 'goodness' scores
    op_scores = {'conv3x3': 0.8, 'conv5x5': 0.7, 'maxpool': 0.3,
                 'skip': 0.5, 'none': 0.0}
    quality = sum(op_scores[op] for node in arch for (_, op) in node)
    quality = quality / (2 * len(arch))  # normalize to [0, 1]
    # Noisy estimate: noise shrinks with more training
    noise = rng.normal(0, 0.15 / np.sqrt(max_epochs))
    return np.clip(quality + noise, 0, 1)

def successive_halving(n_initial, max_budget, rng):
    """Random search with successive halving."""
    archs = [sample_architecture(rng) for _ in range(n_initial)]
    budget = max_budget // (int(np.log2(n_initial)) + 1)  # per-round budget

    round_num = 0
    while len(archs) > 1:
        # Evaluate all surviving architectures at current budget
        scores = [synthetic_fitness(a, budget * (round_num + 1), rng)
                  for a in archs]
        # Keep the top half
        ranked = sorted(zip(scores, archs), reverse=True)
        archs = [a for _, a in ranked[:len(archs) // 2]]
        print(f"Round {round_num}: {len(ranked)} archs evaluated, "
              f"best={ranked[0][0]:.3f}, kept top {len(archs)}")
        round_num += 1

    return archs[0]

rng = np.random.default_rng(42)
best = successive_halving(n_initial=16, max_budget=100, rng=rng)
print(f"Winner: {[(op1, op2) for ((_,op1),(_,op2)) in best]}")
# Round 0: 16 archs evaluated, best=0.812, kept top 8
# Round 1: 8 archs evaluated, best=0.791, kept top 4
# Round 2: 4 archs evaluated, best=0.815, kept top 2
# Round 3: 2 archs evaluated, best=0.836, kept top 1
# Winner: [('conv3x3','conv3x3'), ('conv5x5','conv3x3'), ...]

Successive halving is remarkably efficient: it screens 16 candidates while concentrating most of the training budget on the few survivors, rather than training all 16 to completion. The key assumption is rank consistency: architectures that look good after short training tend to look good after long training. This holds reasonably well in practice, making random search + early stopping a formidable baseline.
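The rank-consistency assumption is easy to probe in our toy setting. Here's a minimal sketch — it re-inlines the noise model from synthetic_fitness and uses made-up quality values for 50 hypothetical candidates — that scores the same candidates at a small and a large epoch budget, then checks how well each ranking matches the true quality:

```python
import numpy as np

def noisy_fitness(quality, epochs, rng):
    # Same noise model as synthetic_fitness: noise shrinks as 1/sqrt(epochs)
    return np.clip(quality + rng.normal(0, 0.15 / np.sqrt(epochs)), 0, 1)

def spearman(a, b):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    ranks_a = np.argsort(np.argsort(a)).astype(float)
    ranks_b = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ranks_a, ranks_b)[0, 1]

rng = np.random.default_rng(0)
true_quality = rng.uniform(0.2, 0.9, size=50)  # 50 hypothetical architectures
short = np.array([noisy_fitness(q, epochs=5, rng=rng) for q in true_quality])
full = np.array([noisy_fitness(q, epochs=100, rng=rng) for q in true_quality])

print(f"rank correlation,   5 epochs vs truth: {spearman(short, true_quality):.2f}")
print(f"rank correlation, 100 epochs vs truth: {spearman(full, true_quality):.2f}")
```

In this simulation even the noisy 5-epoch estimates rank candidates well, which is exactly what successive halving exploits. Real learning curves are messier — some architectures start slow and finish strong — so treat this as an illustration, not evidence.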

3. Evolutionary Architecture Search

If random search is our baseline, can we do better by learning from previous evaluations? Evolutionary algorithms apply the same principles that drive biological evolution — mutation, selection, and survival of the fittest — to the architecture search problem.

Real et al. (2019) demonstrated this with AmoebaNet: an evolutionary approach that matched or exceeded the RL-based NAS on ImageNet while being conceptually simpler. Their key innovation was regularized evolution — instead of removing the worst-performing member from the population, they remove the oldest member. This "aging" mechanism prevents early lucky architectures from dominating the population forever, encouraging continuous exploration.

The algorithm works like this: maintain a population of N architectures, each with a recorded fitness (validation accuracy). To generate a new candidate, run a tournament: randomly sample K members, pick the fittest as the parent, apply a random mutation (change one operation, rewire one connection), train the child, and add it to the population. Remove the oldest member to keep the population size fixed.

import numpy as np
from copy import deepcopy

def mutate(arch, rng):
    """Apply one random mutation to an architecture."""
    child = deepcopy(arch)
    node_idx = rng.integers(0, len(child))
    branch = rng.integers(0, 2)  # mutate first or second input
    mutation_type = rng.choice(['op', 'input'])

    if mutation_type == 'op':
        new_op = rng.choice(OPERATIONS)
        inp, _ = child[node_idx][branch]
        child[node_idx] = list(child[node_idx])
        child[node_idx][branch] = (inp, new_op)
        child[node_idx] = tuple(child[node_idx])
    else:
        num_inputs = node_idx + 2
        new_inp = rng.integers(0, num_inputs)
        _, op = child[node_idx][branch]
        child[node_idx] = list(child[node_idx])
        child[node_idx][branch] = (new_inp, op)
        child[node_idx] = tuple(child[node_idx])
    return child

def evolutionary_search(pop_size, generations, tournament_k, rng):
    """Regularized evolution for architecture search."""
    # Initialize population with random architectures
    population = []
    for _ in range(pop_size):
        arch = sample_architecture(rng)
        fitness = synthetic_fitness(arch, max_epochs=50, rng=rng)
        population.append((arch, fitness))

    history = [max(f for _, f in population)]

    for gen in range(generations):
        # Tournament selection: pick best of K random members
        candidates = [population[i] for i in
                      rng.choice(len(population), size=tournament_k, replace=False)]
        parent_arch, _ = max(candidates, key=lambda x: x[1])

        # Mutate parent to create child
        child_arch = mutate(parent_arch, rng)
        child_fitness = synthetic_fitness(child_arch, max_epochs=50, rng=rng)

        # Add child, remove oldest (regularized evolution)
        population.append((child_arch, child_fitness))
        population.pop(0)  # remove oldest

        best = max(f for _, f in population)
        history.append(best)

    return max(population, key=lambda x: x[1]), history

rng = np.random.default_rng(42)
(best_arch, best_fit), history = evolutionary_search(
    pop_size=20, generations=100, tournament_k=5, rng=rng)
print(f"Best fitness: {best_fit:.3f}")
print(f"Improvement: {history[0]:.3f} -> {history[-1]:.3f}")
# Best fitness: 0.858
# Improvement: 0.777 -> 0.858

Evolution steadily improves the population by making small, targeted mutations to good architectures. The fitness landscape of neural architectures turns out to be surprisingly smooth — small mutations to a good architecture usually produce a decent architecture. This locality property is what makes evolutionary hill-climbing effective. In Real et al.'s controlled comparison, evolution matched RL-based search in final accuracy and converged faster in the early stages.

4. Differentiable Architecture Search — DARTS

Here's the problem with both random and evolutionary search: they treat the architecture as a black box. Each evaluation requires training a candidate from scratch, so you can only try a few hundred to a few thousand architectures. Liu et al. (2019) had an elegant insight: what if we could take gradients through the architecture itself?

This is DARTS (Differentiable Architecture Search), and it reduced search cost from thousands of GPU-days to about 1.5 — a day and a half on a single GPU, a three-order-of-magnitude improvement.

The trick is a continuous relaxation. Instead of choosing one operation per edge (a discrete choice), DARTS computes a softmax-weighted mixture of all candidate operations:

output(x) = Σ_i α_i · op_i(x),   where α = softmax(θ)

Here θ are learnable architecture parameters, separate from the network weights w. DARTS optimizes both simultaneously using bilevel optimization: update w on the training set (to learn good features), then update θ on the validation set (to select the best operations). Using the validation set for θ prevents the search from choosing architectures that simply overfit.

After the search converges, we discretize: for each edge, keep only the operation with the highest α weight. Then retrain the resulting discrete architecture from scratch.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def darts_search(n_edges, n_ops, steps, lr_w, lr_alpha, rng):
    """Simplified DARTS: bilevel optimization of weights and arch params."""
    # Architecture parameters: one logit vector per edge
    alphas = [np.zeros(n_ops) for _ in range(n_edges)]

    # Simulated network weights (one weight per edge-operation pair)
    weights = [[rng.normal(0, 0.1) for _ in range(n_ops)]
               for _ in range(n_edges)]

    # Target: conv3x3 (idx 0) and skip (idx 3) are the best operations
    target_ops = [0, 3, 0, 0, 3, 0]  # best op index per edge

    history = []
    for step in range(steps):
        # Step 1: Update weights on "training loss"
        for e in range(n_edges):
            probs = softmax(alphas[e])
            for i in range(n_ops):
                grad = 2 * probs[i] * (sum(probs[j] * weights[e][j]
                       for j in range(n_ops)) - 1.0)
                weights[e][i] -= lr_w * grad

        # Step 2: Update architecture params on "validation loss"
        for e in range(n_edges):
            probs = softmax(alphas[e])
            for i in range(n_ops):
                grad_a = probs[i] * (1.0 if i != target_ops[e] else
                         (1.0 - 1.0 / (probs[i] + 1e-8)))
                alphas[e][i] -= lr_alpha * grad_a

        # Record which operation is selected per edge
        selected = [OPERATIONS[np.argmax(a)] for a in alphas]
        history.append(selected)

    # Final architecture: argmax of alpha per edge
    final = [OPERATIONS[np.argmax(a)] for a in alphas]
    final_probs = [softmax(a) for a in alphas]
    return final, final_probs, history

rng = np.random.default_rng(42)
arch, probs, hist = darts_search(
    n_edges=6, n_ops=5, steps=200, lr_w=0.01, lr_alpha=0.05, rng=rng)
print("Final architecture:", arch)
print("Edge 0 probs:", {OPERATIONS[i]: f"{p:.3f}"
      for i, p in enumerate(probs[0])})
# Final architecture: ['conv3x3', 'skip', 'conv3x3', 'conv3x3', 'skip', 'conv3x3']
# Edge 0 probs: {'conv3x3': 0.981, 'conv5x5': 0.004, ...}

Watch what happens to the architecture parameters during search: they start uniform (each operation has equal weight ~0.2) and gradually become sharply peaked, with the best operation getting >95% of the weight. This is the continuous relaxation "deciding" which operations to keep. DARTS achieved 2.76% test error on CIFAR-10 — competitive with evolution and RL methods that cost 1000× more compute.

The tradeoff: DARTS requires holding all candidate operations in GPU memory simultaneously (since it computes a weighted sum of all), which limits the search space size. It can also suffer from "architecture collapse" — converging to skip connections everywhere because they're the easiest to train through, not the most expressive. Variants like Progressive DARTS and Fair DARTS address these failure modes.

Try It: NAS Strategy Comparison

Compare Random Search, Evolutionary Search, and DARTS-style Gradient Search on a 2D fitness landscape. Adjust the landscape ruggedness and watch how each strategy explores differently.

5. Weight Sharing and One-Shot NAS

Even with DARTS, we're training a full network during search. The ultimate efficiency trick is weight sharing: train a single "supernet" that contains all possible architectures as subgraphs. Then evaluate any candidate by simply running a forward pass with the shared weights — no retraining needed.

Pham et al. (2018) demonstrated this with ENAS (Efficient Neural Architecture Search), reducing the cost to roughly 0.5 GPU-days — about 1000× cheaper than the original NAS. The idea: the supernet is a large DAG where every possible edge-operation pair has its own parameters. To evaluate a specific architecture, you "activate" only its subgraph and forward data through those specific operations and weights.

The training loop alternates between two steps: (1) sample a random architecture (subgraph) from the supernet, (2) forward-backward through only the sampled operations, updating only those weights. After thousands of iterations, every operation's weights have been trained across many different architectural contexts.

import numpy as np

class Supernet:
    """A supernet with shared weights for all candidate architectures."""
    def __init__(self, n_edges, n_ops, rng):
        self.n_edges = n_edges
        self.n_ops = n_ops
        # Shared weights: one parameter per (edge, operation) pair
        self.weights = rng.normal(0, 0.1, (n_edges, n_ops))
        # Track how often each weight is updated
        self.update_counts = np.zeros((n_edges, n_ops))

    def forward(self, arch_indices, x=1.0):
        """Forward pass through a specific architecture (subgraph)."""
        output = x
        for edge, op_idx in enumerate(arch_indices):
            output = output * self.weights[edge, op_idx]
        return output

    def train_step(self, arch_indices, target, lr=0.01):
        """Train only the weights used by this architecture."""
        pred = self.forward(arch_indices)
        loss = (pred - target) ** 2

        # Gradient for each active weight
        for edge, op_idx in enumerate(arch_indices):
            grad = 2 * (pred - target)
            # Chain rule: d(loss)/d(w_i) = grad * product of other weights
            for e2, o2 in enumerate(arch_indices):
                if e2 != edge:
                    grad *= self.weights[e2, o2]
            self.weights[edge, op_idx] -= lr * np.clip(grad, -1, 1)
            self.update_counts[edge, op_idx] += 1
        return loss

def train_supernet(n_edges, n_ops, n_steps, rng):
    """Train supernet by sampling random architectures each step."""
    supernet = Supernet(n_edges, n_ops, rng)
    target = 0.5  # target output

    for step in range(n_steps):
        # Sample random architecture
        arch = [rng.integers(0, n_ops) for _ in range(n_edges)]
        supernet.train_step(arch, target, lr=0.005)

    # Rank each operation: evaluate the architecture that uses it on every edge
    rankings = {}
    for op_idx in range(n_ops):
        arch = [op_idx] * n_edges  # same op everywhere
        score = abs(supernet.forward(arch) - target)
        rankings[OPERATIONS[op_idx]] = 1.0 - min(score, 1.0)

    return supernet, rankings

rng = np.random.default_rng(42)
supernet, rankings = train_supernet(n_edges=4, n_ops=5, n_steps=2000, rng=rng)
print("Supernet rankings (higher = closer to target):")
for op, score in sorted(rankings.items(), key=lambda x: -x[1]):
    print(f"  {op:<10s} {score:.3f}")
print(f"\nTotal weight updates: {int(supernet.update_counts.sum())}")
# Supernet rankings (higher = closer to target):
#   conv3x3    0.923
#   skip       0.867
#   conv5x5    0.845
#   maxpool    0.712
#   none       0.534
# Total weight updates: 8000

The supernet trains all 20 operation-weights (4 edges × 5 operations) using just 2000 random architecture samples, with 8000 total weight updates. The key question is whether these shared-weight rankings reliably predict stand-alone accuracy. In practice, the correlation is good but imperfect — a phenomenon called weight coupling, where different subgraphs interfere during shared training. Modern approaches like few-shot NAS (training multiple smaller supernets) and progressive shrinking mitigate this.

The one-shot pipeline then becomes: (1) train the supernet, (2) use any search strategy (random, evolutionary, RL) to find the best architecture according to supernet performance, (3) retrain only the winner from scratch. Step 2 is nearly free since it only requires forward passes.
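Step 2 of that pipeline can be sketched in a few lines. The shared weights below are hypothetical stand-ins for a trained supernet (in the real pipeline they come from the sampled-subgraph training above); the search itself is nothing but forward passes:

```python
import numpy as np

rng = np.random.default_rng(0)
N_EDGES, N_OPS = 4, 5
TARGET = 0.5

# Hypothetical stand-in for already-trained supernet weights,
# one shared parameter per (edge, operation) pair
shared_w = rng.normal(0.85, 0.1, size=(N_EDGES, N_OPS))

def supernet_score(arch):
    """Score a candidate with a forward pass only -- no training."""
    out = 1.0
    for edge, op_idx in enumerate(arch):
        out *= shared_w[edge, op_idx]
    return -abs(out - TARGET)  # higher = output closer to target

# The 'search strategy' here is plain random sampling; evolution or RL
# would slot in the same way, since evaluation is just a forward pass
candidates = [tuple(rng.integers(0, N_OPS, size=N_EDGES)) for _ in range(500)]
best = max(candidates, key=supernet_score)
print(f"best candidate ops per edge: {best}")
print(f"score: {supernet_score(best):.4f}")
```

Because each evaluation is a single forward pass, trying 500 candidates costs roughly what a handful of training steps did before — which is why one-shot methods can afford to search much larger spaces.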

6. Hardware-Aware NAS and Modern Approaches

So far we've optimized for one thing: accuracy. But in the real world, a 0.5% accuracy improvement means nothing if the model takes 10× longer to run on a phone. Hardware-aware NAS (Tan et al. 2019, MnasNet) optimizes a multi-objective reward that balances accuracy and real-world inference latency:

reward = accuracy(m) × [latency(m) / target]^w

where w = -0.07 penalizes architectures that exceed the target latency. MnasNet measured latency directly on a Pixel phone — not FLOPs, not parameter count, but actual wall-clock time. The result: 75.2% top-1 ImageNet accuracy at 78ms, 1.8× faster than MobileNetV2 with higher accuracy.

The same search framework produced the base architecture for EfficientNet (Tan & Le 2019), which introduced compound scaling: once NAS finds a good small architecture (B0, 77.1% accuracy), scale width, depth, and resolution together using a single coefficient. EfficientNet-B7 reached 84.4% top-1 accuracy while being 8.4× smaller than the previous state of the art.
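Compound scaling is simple enough to write out. The multipliers below (α=1.2 for depth, β=1.1 for width, γ=1.15 for resolution) are the ones reported in the EfficientNet paper, chosen by grid search so that α·β²·γ² ≈ 2; the base depth/width/resolution values are illustrative placeholders, not B0's actual configuration:

```python
# EfficientNet compound scaling: one coefficient phi scales depth, width,
# and resolution together. Multipliers are from the paper; base values
# here are illustrative placeholders.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution

def scaled_config(phi, base_depth=16, base_width=1.0, base_res=224):
    return {
        "depth": round(base_depth * ALPHA ** phi),
        "width_mult": round(base_width * BETA ** phi, 2),
        "resolution": round(base_res * GAMMA ** phi),
        "flops_x": round((ALPHA * BETA**2 * GAMMA**2) ** phi, 1),
    }

for phi in range(4):
    print(f"phi={phi}: {scaled_config(phi)}")
```

Because α·β²·γ² ≈ 2, each increment of φ roughly doubles FLOPs — B0 through B7 are, roughly, increasing values of this one coefficient.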

The most dramatic acceleration came from zero-cost proxies — predicting architecture quality without any training. SynFlow (Tanaka et al. 2020) computes a score from parameter magnitudes in a single forward-backward pass, requiring no training data at all. NASWOT (Mellor et al. 2021) measures how separable different inputs are in the network's activation space using a single minibatch. These proxies achieve ~0.82 Spearman rank correlation with full training accuracy — enough to filter candidates in seconds.
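To make the NASWOT idea concrete, here's a toy version under simplifying assumptions: a random two-layer ReLU MLP stands in for a candidate network, and the score is log|K| of the agreement kernel built from each input's binary activation pattern on one unlabeled minibatch. A network whose activations separate inputs well gets a finite, larger score; one that collapses inputs onto identical patterns gets a singular kernel:

```python
import numpy as np

def naswot_score(layer_widths, X, rng):
    """NASWOT-style score: log-det of the binary activation agreement kernel."""
    h = X
    codes = []
    in_dims = [X.shape[1]] + layer_widths[:-1]
    for w_in, w_out in zip(in_dims, layer_widths):
        h = np.maximum(h @ rng.normal(size=(w_in, w_out)), 0)  # random ReLU layer
        codes.append((h > 0).astype(float))                    # which units fired
    C = np.concatenate(codes, axis=1)      # one binary code per input
    K = C @ C.T + (1 - C) @ (1 - C).T     # pairwise agreement counts
    return np.linalg.slogdet(K)[1]        # log|K|; -inf if codes collide

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 16))             # a single unlabeled minibatch
wide = naswot_score([64, 64], X, rng)
tiny = naswot_score([2, 2], X, rng)
print("wide net score:", wide)
print("tiny net score:", tiny)
```

The tiny network has only 16 possible 4-bit activation codes for 32 inputs, so some inputs necessarily collide and log|K| is -inf — the proxy flags it as unable to distinguish its inputs, with no training at all.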

import numpy as np

def hardware_aware_search(pop_size, generations, target_latency, rng):
    """Evolutionary NAS with multi-objective: accuracy + latency."""
    # Latency cost per operation (milliseconds, simulated)
    op_latency = {'conv3x3': 2.5, 'conv5x5': 5.0, 'maxpool': 1.0,
                  'skip': 0.1, 'none': 0.0}
    # Accuracy contribution per operation
    op_quality = {'conv3x3': 0.85, 'conv5x5': 0.80, 'maxpool': 0.35,
                  'skip': 0.50, 'none': 0.05}

    def arch_latency(arch):
        return sum(op_latency[op] for node in arch for (_, op) in node)

    def arch_accuracy(arch):
        quality = sum(op_quality[op] for node in arch for (_, op) in node)
        return quality / (2 * len(arch)) + rng.normal(0, 0.02)

    def reward(arch):
        acc = arch_accuracy(arch)
        lat = arch_latency(arch)
        # Soft MnasNet-style penalty: < 1 when slower than target, > 1 when faster
        lat_penalty = (lat / target_latency) ** (-0.07) if lat > 0 else 1.0
        return acc * lat_penalty

    # Evolution with multi-objective reward
    population = [(sample_architecture(rng), None) for _ in range(pop_size)]
    population = [(a, reward(a)) for a, _ in population]

    for gen in range(generations):
        # Tournament selection
        idxs = rng.choice(len(population), size=5, replace=False)
        parent = max([population[i] for i in idxs], key=lambda x: x[1])[0]
        child = mutate(parent, rng)
        child_reward = reward(child)
        population.append((child, child_reward))
        population.pop(0)

    # Final Pareto analysis
    all_results = [(arch_accuracy(a), arch_latency(a), a) for a, _ in population]
    all_results.sort(key=lambda x: x[1])  # sort by latency

    print(f"{'Accuracy':>10s} {'Latency':>10s}  Architecture ops")
    for acc, lat, arch in all_results[:5]:
        ops = [op for node in arch for (_, op) in node]
        print(f"{acc:10.3f} {lat:8.1f}ms  {ops}")

    return all_results

rng = np.random.default_rng(42)
results = hardware_aware_search(
    pop_size=20, generations=80, target_latency=15.0, rng=rng)
#   Accuracy    Latency  Architecture ops
#      0.468      1.2ms  ['skip', 'none', 'maxpool', 'skip', ...]
#      0.529      3.7ms  ['skip', 'skip', 'maxpool', 'conv3x3', ...]
#      0.710     10.5ms  ['conv3x3', 'conv3x3', 'skip', 'maxpool', ...]
#      0.802     16.3ms  ['conv3x3', 'conv5x5', 'conv3x3', 'skip', ...]
#      0.831     22.8ms  ['conv3x3', 'conv5x5', 'conv3x3', 'conv3x3', ...]

The Pareto frontier reveals the fundamental tradeoff: fast architectures (lots of skip connections and pooling) sacrifice accuracy, while accurate architectures (lots of convolutions) are slower. Hardware-aware NAS finds the sweet spot automatically. The compute timeline tells the story of the entire field: 22,400 GPU-days for the original RL search (2017), about 1.5 GPU-days for DARTS, roughly 0.5 for ENAS-style weight sharing, and mere seconds for zero-cost proxies.

A 10,000× reduction in four years. Today, NAS has expanded beyond vision: searching for Transformer architectures (AutoFormer), using LLMs to propose candidate designs, and building "once-for-all" networks that adapt to any hardware constraint without retraining.

7. Conclusion

We've built Neural Architecture Search from the ground up: defined a cell-based search space with billions of candidates, established random search as a strong baseline, improved on it with evolutionary search (mutation + tournament selection), then made the leap to gradient-based DARTS by relaxing discrete choices into a continuous optimization. Weight sharing and zero-cost proxies pushed the cost from GPU-years to seconds.

The central insight is that architecture design is itself an optimization problem — and optimization is what machines do best. But NAS doesn't replace the human ML engineer. Someone still needs to design the search space, choose what to optimize for, curate the training data, and decide whether that extra 0.3% accuracy is worth 2× the inference cost. NAS amplifies expertise; it doesn't substitute for it.

References & Further Reading