Neural Architecture Search from Scratch
1. The Architecture Design Problem
You've built a neural network. Four convolutional layers, 3×3 kernels, batch norm, ReLU activations, and a skip connection from layer 2 to layer 4. It works... okay. But what if you used 5×5 kernels in the first two layers? Or swapped ReLU for SiLU? Or added dilated convolutions? There are thousands of plausible architectures — and the one you hand-designed is almost certainly not the best one.
What if a machine could search over all of them automatically?
That's Neural Architecture Search (NAS). In 2017, Zoph and Le at Google Brain demonstrated that a reinforcement learning controller could discover convolutional architectures that beat hand-designed networks on CIFAR-10. The catch? It required 800 GPUs running for 28 days — roughly 22,400 GPU-days. Since then, the field has achieved a stunning 10,000× reduction in search cost, from GPU-years down to seconds.
Every NAS method has three components:
- Search space — what architectures are possible. The most common design is a cell-based search space (Zoph et al. 2018): instead of searching the entire network, you search for a small repeatable "cell" — a directed acyclic graph (DAG) of operations. Stack cells to build the full network.
- Search strategy — how to explore efficiently. Random sampling, evolutionary algorithms, reinforcement learning, or gradient-based optimization.
- Performance estimation — how to evaluate a candidate without fully training it. Shorter training runs, weight sharing, or zero-cost proxies.
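The three components compose into one generic loop. Here is a minimal sketch of that framing (a hypothetical interface, not from any NAS library; the bit-string search space and ones-counting "estimator" are toy stand-ins):

```python
import random

def nas_loop(sample_fn, propose_fn, estimate_fn, budget):
    """Generic NAS loop wiring the three components together."""
    history = []                       # (architecture, estimated score) pairs
    best_arch, best_score = None, float("-inf")
    for _ in range(budget):
        # Search strategy: propose from history, or sample cold
        arch = propose_fn(history) if history else sample_fn()
        score = estimate_fn(arch)      # performance estimation
        history.append((arch, score))
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

# Toy instantiation: search space = 8-bit strings, strategy = pure random
# sampling, estimator = count of ones (a stand-in for accuracy).
random.seed(0)
sample = lambda: [random.randint(0, 1) for _ in range(8)]
propose = lambda history: sample()
estimate = lambda arch: sum(arch)

arch, score = nas_loop(sample, propose, estimate, budget=50)
print(f"best score {score}/8 with arch {arch}")
```

Every method in this article is an instance of this loop with a smarter `propose_fn` or a cheaper `estimate_fn`.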
Let's start by defining a search space and seeing how the combinatorics explode.
import numpy as np

# Cell-based search space: a DAG with N intermediate nodes
# Each node picks 2 inputs and an operation for each input
OPERATIONS = ['conv3x3', 'conv5x5', 'maxpool', 'skip', 'none']
NUM_NODES = 4  # intermediate nodes in one cell

def sample_architecture(rng):
    """Sample a random cell architecture."""
    cell = []
    for node_idx in range(NUM_NODES):
        # Each node picks 2 inputs from {input_0, input_1, prev_nodes...}
        num_inputs = node_idx + 2  # node 0 can pick from 2, node 1 from 3, etc.
        input_1 = rng.integers(0, num_inputs)
        input_2 = rng.integers(0, num_inputs)
        op_1 = rng.choice(OPERATIONS)
        op_2 = rng.choice(OPERATIONS)
        cell.append(((input_1, op_1), (input_2, op_2)))
    return cell

# How big is this search space?
total = 1
for node_idx in range(NUM_NODES):
    num_inputs = node_idx + 2
    choices_per_node = (num_inputs * len(OPERATIONS)) ** 2
    total *= choices_per_node
    print(f"Node {node_idx}: {num_inputs} inputs x {len(OPERATIONS)} ops "
          f"= {choices_per_node} combos")
print(f"\nTotal search space: {total:,} architectures")

# Node 0: 2 inputs x 5 ops = 100 combos
# Node 1: 3 inputs x 5 ops = 225 combos
# Node 2: 4 inputs x 5 ops = 400 combos
# Node 3: 5 inputs x 5 ops = 625 combos
# Total search space: 5,625,000,000 architectures
Over 5.6 billion possible cell architectures from just 4 nodes and 5 operations. And we haven't even counted the full network design (how many cells to stack, which are Normal vs. Reduction cells). Manual exploration of this space is like searching for a needle in a haystack by hand. Let's see what happens when we automate it.
2. Random Search — The Surprisingly Strong Baseline
Before building sophisticated search algorithms, we need a baseline. In 2020, Li and Talwalkar published a surprising result: random search, properly tuned, matches or beats many elaborate NAS methods on standard benchmarks. The insight was that most of the performance gains in early NAS papers came from clever search space design, not the search algorithm itself. When you constrain the space well, even random samples are good.
The simplest approach: uniformly sample K architectures, train each for a fixed number of epochs, and pick the best. But there's a smarter way to allocate your compute budget.
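The vanilla baseline can be sketched directly. The fitness model below (per-operation "goodness" plus noise that shrinks with training epochs) is an assumption standing in for real training:

```python
import numpy as np

# Vanilla random search: sample K architectures uniformly, score each with a
# fixed budget, keep the best. The fitness model is a toy assumption.
OPS = ['conv3x3', 'conv5x5', 'maxpool', 'skip', 'none']
OP_GOODNESS = {'conv3x3': 0.8, 'conv5x5': 0.7, 'maxpool': 0.3,
               'skip': 0.5, 'none': 0.0}

def noisy_fitness(arch, epochs, rng):
    """Toy proxy: mean op goodness + noise that shrinks with training time."""
    quality = np.mean([OP_GOODNESS[op] for op in arch])
    return quality + rng.normal(0, 0.15 / np.sqrt(epochs))

def vanilla_random_search(k, epochs, rng, n_edges=8):
    best_arch, best_score = None, -np.inf
    for _ in range(k):
        arch = list(rng.choice(OPS, size=n_edges))  # uniform sample
        score = noisy_fitness(arch, epochs, rng)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

rng = np.random.default_rng(0)
arch, score = vanilla_random_search(k=32, epochs=25, rng=rng)
print(f"best of 32: {score:.3f}")
```

Every candidate here gets the same budget, whether it looks promising or not. That waste is exactly what successive halving fixes.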
Successive halving (Jamieson & Talwalkar 2016, later refined as Hyperband): start with many candidates training for a short time, then periodically eliminate the bottom half and give the survivors more training budget. This combines the breadth of random search (trying many architectures) with the depth of full training (for the most promising ones).
import numpy as np

def synthetic_fitness(arch, max_epochs, rng):
    """Simulate training: architecture quality + noise that decreases with epochs."""
    # True quality: sum of operation 'goodness' scores
    op_scores = {'conv3x3': 0.8, 'conv5x5': 0.7, 'maxpool': 0.3,
                 'skip': 0.5, 'none': 0.0}
    quality = sum(op_scores[op] for node in arch for (_, op) in node)
    quality = quality / (2 * len(arch))  # normalize to [0, 1]
    # Noisy estimate: noise shrinks with more training
    noise = rng.normal(0, 0.15 / np.sqrt(max_epochs))
    return np.clip(quality + noise, 0, 1)

def successive_halving(n_initial, max_budget, rng):
    """Random search with successive halving."""
    archs = [sample_architecture(rng) for _ in range(n_initial)]
    budget = max_budget // (int(np.log2(n_initial)) + 1)  # per-round budget
    round_num = 0
    while len(archs) > 1:
        # Evaluate all surviving architectures at current budget
        scores = [synthetic_fitness(a, budget * (round_num + 1), rng)
                  for a in archs]
        # Keep the top half (sort on score only, so ties never compare archs)
        ranked = sorted(zip(scores, archs), key=lambda t: t[0], reverse=True)
        archs = [a for _, a in ranked[:len(archs) // 2]]
        print(f"Round {round_num}: {len(ranked)} archs evaluated, "
              f"best={ranked[0][0]:.3f}, kept top {len(archs)}")
        round_num += 1
    return archs[0]

rng = np.random.default_rng(42)
best = successive_halving(n_initial=16, max_budget=100, rng=rng)
print(f"Winner: {[(op1, op2) for ((_, op1), (_, op2)) in best]}")
# Round 0: 16 archs evaluated, best=0.812, kept top 8
# Round 1: 8 archs evaluated, best=0.791, kept top 4
# Round 2: 4 archs evaluated, best=0.815, kept top 2
# Round 3: 2 archs evaluated, best=0.836, kept top 1
# Winner: [('conv3x3','conv3x3'), ('conv5x5','conv3x3'), ...]
Successive halving is remarkably efficient. With a budget that could fully train 4 architectures, it samples 16 and still identifies a strong candidate. The key assumption is rank consistency: architectures that look good after short training tend to look good after long training. This holds reasonably well in practice, making random search + early stopping a formidable baseline.
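We can test rank consistency directly in the same spirit as the toy model: score identical candidates at a short and a long budget, then measure the Spearman rank correlation between the two. The latent-quality-plus-noise model here is an assumption, not real training curves:

```python
import numpy as np

# Rank consistency sketch: does a cheap, noisy evaluation rank architectures
# the same way an expensive one does? Latent quality + budget-dependent
# noise is a toy assumption.
rng = np.random.default_rng(0)
true_quality = rng.uniform(0.3, 0.9, size=50)  # latent stand-alone accuracy

def noisy_eval(quality, epochs, rng):
    return quality + rng.normal(0, 0.15 / np.sqrt(epochs), size=quality.shape)

short = noisy_eval(true_quality, epochs=5, rng=rng)    # cheap estimate
long_ = noisy_eval(true_quality, epochs=100, rng=rng)  # expensive estimate

def spearman(a, b):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    ranks = lambda v: np.argsort(np.argsort(v))
    return np.corrcoef(ranks(a), ranks(b))[0, 1]

rho = spearman(short, long_)
print(f"short-vs-long Spearman rank correlation: {rho:.2f}")
```

A high correlation means early elimination rarely kills the eventual winner, which is the assumption successive halving leans on.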
3. Evolutionary Architecture Search
If random search is our baseline, can we do better by learning from previous evaluations? Evolutionary algorithms apply the same principles that drive biological evolution — mutation, selection, and survival of the fittest — to the architecture search problem.
Real et al. (2019) demonstrated this with AmoebaNet: an evolutionary approach that matched or exceeded RL-based NAS on ImageNet while being conceptually simpler. Their key innovation was regularized evolution — instead of removing the worst-performing member from the population, they remove the oldest. This "aging" mechanism prevents early lucky architectures from dominating the population forever, encouraging continuous exploration.
The algorithm works like this: maintain a population of N architectures, each with a recorded fitness (validation accuracy). To generate a new candidate, run a tournament: randomly sample K members, pick the fittest as the parent, apply a random mutation (change one operation, rewire one connection), train the child, and add it to the population. Remove the oldest member to keep the population size fixed.
import numpy as np
from copy import deepcopy

def mutate(arch, rng):
    """Apply one random mutation to an architecture."""
    child = deepcopy(arch)
    node_idx = rng.integers(0, len(child))
    branch = rng.integers(0, 2)  # mutate first or second input
    mutation_type = rng.choice(['op', 'input'])
    if mutation_type == 'op':
        new_op = rng.choice(OPERATIONS)
        inp, _ = child[node_idx][branch]
        child[node_idx] = list(child[node_idx])
        child[node_idx][branch] = (inp, new_op)
        child[node_idx] = tuple(child[node_idx])
    else:
        num_inputs = node_idx + 2
        new_inp = rng.integers(0, num_inputs)
        _, op = child[node_idx][branch]
        child[node_idx] = list(child[node_idx])
        child[node_idx][branch] = (new_inp, op)
        child[node_idx] = tuple(child[node_idx])
    return child

def evolutionary_search(pop_size, generations, tournament_k, rng):
    """Regularized evolution for architecture search."""
    # Initialize population with random architectures
    population = []
    for _ in range(pop_size):
        arch = sample_architecture(rng)
        fitness = synthetic_fitness(arch, max_epochs=50, rng=rng)
        population.append((arch, fitness))
    history = [max(f for _, f in population)]
    for gen in range(generations):
        # Tournament selection: pick best of K random members
        candidates = [population[i] for i in
                      rng.choice(len(population), size=tournament_k, replace=False)]
        parent_arch, _ = max(candidates, key=lambda x: x[1])
        # Mutate parent to create child
        child_arch = mutate(parent_arch, rng)
        child_fitness = synthetic_fitness(child_arch, max_epochs=50, rng=rng)
        # Add child, remove oldest (regularized evolution)
        population.append((child_arch, child_fitness))
        population.pop(0)  # remove oldest
        best = max(f for _, f in population)
        history.append(best)
    return max(population, key=lambda x: x[1]), history

rng = np.random.default_rng(42)
(best_arch, best_fit), history = evolutionary_search(
    pop_size=20, generations=100, tournament_k=5, rng=rng)
print(f"Best fitness: {best_fit:.3f}")
print(f"Improvement: {history[0]:.3f} -> {history[-1]:.3f}")
# Best fitness: 0.858
# Improvement: 0.777 -> 0.858
Evolution steadily improves the population by making small, targeted mutations to good architectures. The fitness landscape of neural architectures turns out to be surprisingly smooth — small mutations to a good architecture usually produce a decent architecture. This locality property is what makes evolutionary hill-climbing effective. In Real et al.'s controlled comparison, evolution matched RL-based search in final accuracy and converged faster in the early stages.
4. Differentiable Architecture Search — DARTS
Here's the problem with both random and evolutionary search: they treat the architecture as a black box. Each evaluation requires training a candidate from scratch, so you can only try a few hundred to a few thousand architectures. Liu et al. (2019) had an elegant insight: what if we could take gradients through the architecture itself?
This is DARTS (Differentiable Architecture Search), and it reduced search cost from thousands of GPU-days to about 1.5 GPU-days on a single GPU — a 3-orders-of-magnitude improvement.
The trick is a continuous relaxation. Instead of choosing one operation per edge (a discrete choice), DARTS computes a softmax-weighted mixture of all candidate operations:
output(x) = Σᵢ αᵢ · opᵢ(x),  where α = softmax(θ)
Here θ are learnable architecture parameters, separate from the network weights w. DARTS optimizes both simultaneously using bilevel optimization: update w on the training set (to learn good features), then update θ on the validation set (to select the best operations). Using the validation set for θ prevents the search from choosing architectures that simply overfit.
After the search converges, we discretize: for each edge, keep only the operation with the highest α weight. Then retrain the resulting discrete architecture from scratch.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def darts_search(n_edges, n_ops, steps, lr_w, lr_alpha, rng):
    """Simplified DARTS: bilevel optimization of weights and arch params."""
    # Architecture parameters: one logit vector per edge
    alphas = [np.zeros(n_ops) for _ in range(n_edges)]
    # Simulated network weights (one weight per edge-operation pair)
    weights = [[rng.normal(0, 0.1) for _ in range(n_ops)]
               for _ in range(n_edges)]
    # Target: conv3x3 (idx 0) and skip (idx 3) are the best operations
    target_ops = [0, 3, 0, 0, 3, 0]  # best op index per edge
    history = []
    for step in range(steps):
        # Step 1: Update weights on "training loss"
        for e in range(n_edges):
            probs = softmax(alphas[e])
            for i in range(n_ops):
                grad = 2 * probs[i] * (sum(probs[j] * weights[e][j]
                                           for j in range(n_ops)) - 1.0)
                weights[e][i] -= lr_w * grad
        # Step 2: Update architecture params on "validation loss"
        for e in range(n_edges):
            probs = softmax(alphas[e])
            for i in range(n_ops):
                grad_a = probs[i] * (1.0 if i != target_ops[e] else
                                     (1.0 - 1.0 / (probs[i] + 1e-8)))
                alphas[e][i] -= lr_alpha * grad_a
        # Record which operation is selected per edge
        selected = [OPERATIONS[np.argmax(a)] for a in alphas]
        history.append(selected)
    # Final architecture: argmax of alpha per edge
    final = [OPERATIONS[np.argmax(a)] for a in alphas]
    final_probs = [softmax(a) for a in alphas]
    return final, final_probs, history

rng = np.random.default_rng(42)
arch, probs, hist = darts_search(
    n_edges=6, n_ops=5, steps=200, lr_w=0.01, lr_alpha=0.05, rng=rng)
print("Final architecture:", arch)
print("Edge 0 probs:", {OPERATIONS[i]: f"{p:.3f}"
                        for i, p in enumerate(probs[0])})
# Final architecture: ['conv3x3', 'skip', 'conv3x3', 'conv3x3', 'skip', 'conv3x3']
# Edge 0 probs: {'conv3x3': 0.981, 'conv5x5': 0.004, ...}
Watch what happens to the architecture parameters during search: they start uniform (each operation has equal weight ~0.2) and gradually become sharply peaked, with the best operation getting >95% of the weight. This is the continuous relaxation "deciding" which operations to keep. DARTS achieved 2.76% test error on CIFAR-10 — competitive with evolution and RL methods that cost 1000× more compute.
The tradeoff: DARTS requires holding all candidate operations in GPU memory simultaneously (since it computes a weighted sum of all), which limits the search space size. It can also suffer from "architecture collapse" — converging to skip connections everywhere because they're the easiest to train through, not the most expressive. Variants like Progressive DARTS and Fair DARTS address these failure modes.
5. Weight Sharing and One-Shot NAS
Even with DARTS, we're training a full network during search. The ultimate efficiency trick is weight sharing: train a single "supernet" that contains all possible architectures as subgraphs. Then evaluate any candidate by simply running a forward pass with the shared weights — no retraining needed.
Pham et al. (2018) demonstrated this with ENAS (Efficient Neural Architecture Search), reducing the cost to roughly 0.5 GPU-days — about 1000× cheaper than the original NAS. The idea: the supernet is a large DAG where every possible edge-operation pair has its own parameters. To evaluate a specific architecture, you "activate" only its subgraph and forward data through those specific operations and weights.
The training loop alternates between two steps: (1) sample a random architecture (subgraph) from the supernet, (2) forward-backward through only the sampled operations, updating only those weights. After thousands of iterations, every operation's weights have been trained across many different architectural contexts.
import numpy as np

class Supernet:
    """A supernet with shared weights for all candidate architectures."""
    def __init__(self, n_edges, n_ops, rng):
        self.n_edges = n_edges
        self.n_ops = n_ops
        # Shared weights: one parameter per (edge, operation) pair
        self.weights = rng.normal(0, 0.1, (n_edges, n_ops))
        # Track how often each weight is updated
        self.update_counts = np.zeros((n_edges, n_ops))

    def forward(self, arch_indices, x=1.0):
        """Forward pass through a specific architecture (subgraph)."""
        output = x
        for edge, op_idx in enumerate(arch_indices):
            output = output * self.weights[edge, op_idx]
        return output

    def train_step(self, arch_indices, target, lr=0.01):
        """Train only the weights used by this architecture."""
        pred = self.forward(arch_indices)
        loss = (pred - target) ** 2
        # Gradient for each active weight
        for edge, op_idx in enumerate(arch_indices):
            grad = 2 * (pred - target)
            # Chain rule: d(loss)/d(w_i) = grad * product of other weights
            for e2, o2 in enumerate(arch_indices):
                if e2 != edge:
                    grad *= self.weights[e2, o2]
            self.weights[edge, op_idx] -= lr * np.clip(grad, -1, 1)
            self.update_counts[edge, op_idx] += 1
        return loss

def train_supernet(n_edges, n_ops, n_steps, rng):
    """Train supernet by sampling random architectures each step."""
    supernet = Supernet(n_edges, n_ops, rng)
    target = 0.5  # target output
    for step in range(n_steps):
        # Sample random architecture
        arch = [rng.integers(0, n_ops) for _ in range(n_edges)]
        supernet.train_step(arch, target, lr=0.005)
    # Rank each single-op architecture using only the shared weights
    rankings = {}
    for op_idx in range(n_ops):
        arch = [op_idx] * n_edges  # same op everywhere
        score = abs(supernet.forward(arch) - target)
        rankings[OPERATIONS[op_idx]] = 1.0 - min(score, 1.0)
    return supernet, rankings

rng = np.random.default_rng(42)
supernet, rankings = train_supernet(n_edges=4, n_ops=5, n_steps=2000, rng=rng)
print("Supernet rankings (higher = closer to target):")
for op, score in sorted(rankings.items(), key=lambda x: -x[1]):
    print(f"  {op:<10s} {score:.3f}")
print(f"\nTotal weight updates: {int(supernet.update_counts.sum())}")
# Supernet rankings (higher = closer to target):
#   conv3x3    0.923
#   skip       0.867
#   conv5x5    0.845
#   maxpool    0.712
#   none       0.534
# Total weight updates: 8000
The supernet trains all 20 operation-weights (4 edges × 5 operations) using just 2000 random architecture samples, with 8000 total weight updates. The key question is whether these shared-weight rankings reliably predict stand-alone accuracy. In practice, the correlation is good but imperfect — a phenomenon called weight coupling, where different subgraphs interfere during shared training. Modern approaches like few-shot NAS (training multiple smaller supernets) and progressive shrinking mitigate this.
The one-shot pipeline then becomes: (1) train the supernet, (2) use any search strategy (random, evolutionary, RL) to find the best architecture according to supernet performance, (3) retrain only the winner from scratch. Step 2 is nearly free since it only requires forward passes.
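Step 2 is cheap enough to sketch end to end. The shared weights below are a made-up stand-in for the output of supernet training (in the real pipeline they come from step 1), and the network is the same toy product-of-weights model as the Supernet class above:

```python
import numpy as np

# One-shot search, step 2: score candidates with forward passes only.
# shared_w is a fabricated stand-in for trained supernet weights.
rng = np.random.default_rng(0)
n_edges, n_ops, target = 4, 5, 0.5
shared_w = rng.uniform(0.5, 1.2, size=(n_edges, n_ops))  # "trained" weights

def forward(arch):
    """Forward pass through one subgraph: product of its active weights."""
    out = 1.0
    for edge, op in enumerate(arch):
        out *= shared_w[edge, op]
    return out

# Evaluate thousands of candidates: no gradients, no retraining
best_arch, best_err = None, np.inf
for _ in range(5000):
    arch = rng.integers(0, n_ops, size=n_edges)
    err = abs(forward(arch) - target)
    if err < best_err:
        best_arch, best_err = arch, err

print(f"best |output - target| = {best_err:.4f}, arch = {best_arch.tolist()}")
```

The winner would then be retrained from scratch (step 3); the search itself never touched a training loop.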
6. Hardware-Aware NAS and Modern Approaches
So far we've optimized for one thing: accuracy. But in the real world, a 0.5% accuracy improvement means nothing if the model takes 10× longer to run on a phone. Hardware-aware NAS (Tan et al. 2019, MnasNet) optimizes a multi-objective reward that balances accuracy and real-world inference latency:
reward = accuracy(m) × [latency(m) / target]^w
where w = -0.07 penalizes architectures that exceed the target latency. MnasNet measured latency directly on a Pixel phone — not FLOPs, not parameter count, but actual wall-clock time. The result: 75.2% top-1 ImageNet accuracy at 78ms, 1.8× faster than MobileNetV2 with higher accuracy.
This search produced the base architecture for EfficientNet (Tan & Le 2019), which introduced compound scaling: once NAS finds a good small architecture (B0, 77.1% ImageNet top-1 accuracy), scale width, depth, and resolution together using a single coefficient. EfficientNet-B7 reached 84.4% top-1 accuracy while being 8.4× smaller than the previous state of the art.
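The scaling rule itself is tiny. Below, α, β, γ are the grid-searched coefficients reported in the EfficientNet paper, chosen so that α·β²·γ² ≈ 2 (FLOPs roughly double per unit of φ); the base depth/width/resolution values are illustrative placeholders, not B0's actual configuration:

```python
# Compound scaling sketch. alpha/beta/gamma are the EfficientNet paper's
# coefficients; base_depth/base_width/base_res are illustrative placeholders.
alpha, beta, gamma = 1.2, 1.1, 1.15  # depth, width, resolution multipliers

def compound_scale(phi, base_depth=18, base_width=32, base_res=224):
    """Scale depth, width, and resolution together with one coefficient phi."""
    return {
        'depth':      round(base_depth * alpha ** phi),
        'width':      round(base_width * beta ** phi),
        'resolution': round(base_res * gamma ** phi),
        'flops_mult': round((alpha * beta ** 2 * gamma ** 2) ** phi, 2),  # ≈ 2^phi
    }

for phi in range(4):
    print(phi, compound_scale(phi))
```

One search, one knob: turning φ up walks you from a phone-sized model toward a server-sized one without re-running NAS.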
The most dramatic acceleration came from zero-cost proxies — predicting architecture quality without any training. SynFlow (Tanaka et al. 2020) computes a score from parameter magnitudes in a single forward-backward pass, requiring no training data at all. NASWOT (Mellor et al. 2021) measures how separable different inputs are in the network's activation space using a single minibatch. These proxies achieve ~0.82 Spearman rank correlation with full training accuracy — enough to filter candidates in seconds.
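The NASWOT idea can be sketched on a toy random ReLU MLP (the real method scores convolutional nets from NAS benchmarks; the layer widths, batch size, and ridge term below are our own choices). One minibatch is pushed through an untrained network, each input gets a binary code from its ReLU activation pattern, and the score is log|K|, where K counts the units on which two inputs agree. Higher scores mean inputs land in more distinct linear regions:

```python
import numpy as np

# NASWOT-style zero-cost proxy, sketched: score an untrained random MLP
# from one minibatch, with no labels and no training.
def naswot_score(layer_widths, x, rng):
    codes = []
    h = x
    for w_in, w_out in zip([x.shape[1]] + layer_widths[:-1], layer_widths):
        W = rng.normal(0, 1.0 / np.sqrt(w_in), (w_in, w_out))
        h = h @ W
        codes.append((h > 0).astype(float))  # binary activation pattern
        h = np.maximum(h, 0)                 # ReLU
    c = np.concatenate(codes, axis=1)        # one code vector per input
    n_units = c.shape[1]
    # Hamming distance between codes, via disagreement counting
    hamming = c @ (1 - c).T + (1 - c) @ c.T
    K = n_units - hamming + 1e-3 * np.eye(len(x))  # small ridge for stability
    sign, logdet = np.linalg.slogdet(K)
    return logdet

rng = np.random.default_rng(0)
x = rng.normal(0, 1, (16, 8))                # one minibatch of 16 inputs
wide = naswot_score([64, 64, 64], x, rng)
narrow = naswot_score([8, 8, 8], x, rng)
print(f"wide net score: {wide:.1f}, narrow net score: {narrow:.1f}")
```

In a real pipeline this score replaces training entirely: compute it for thousands of candidates in seconds and send only the top-ranked few to full training.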
import numpy as np

def hardware_aware_search(pop_size, generations, target_latency, rng):
    """Evolutionary NAS with multi-objective: accuracy + latency."""
    # Latency cost per operation (milliseconds, simulated)
    op_latency = {'conv3x3': 2.5, 'conv5x5': 5.0, 'maxpool': 1.0,
                  'skip': 0.1, 'none': 0.0}
    # Accuracy contribution per operation
    op_quality = {'conv3x3': 0.85, 'conv5x5': 0.80, 'maxpool': 0.35,
                  'skip': 0.50, 'none': 0.05}

    def arch_latency(arch):
        return sum(op_latency[op] for node in arch for (_, op) in node)

    def arch_accuracy(arch):
        quality = sum(op_quality[op] for node in arch for (_, op) in node)
        return quality / (2 * len(arch)) + rng.normal(0, 0.02)

    def reward(arch):
        acc = arch_accuracy(arch)
        lat = arch_latency(arch)
        # MnasNet-style soft penalty (w = -0.07): reward shrinks as latency
        # grows relative to the target
        lat_penalty = (lat / target_latency) ** (-0.07) if lat > 0 else 1.0
        return acc * lat_penalty

    # Evolution with multi-objective reward
    population = [(sample_architecture(rng), None) for _ in range(pop_size)]
    population = [(a, reward(a)) for a, _ in population]
    for gen in range(generations):
        # Tournament selection
        idxs = rng.choice(len(population), size=5, replace=False)
        parent = max([population[i] for i in idxs], key=lambda x: x[1])[0]
        child = mutate(parent, rng)
        child_reward = reward(child)
        population.append((child, child_reward))
        population.pop(0)
    # Final Pareto analysis
    all_results = [(arch_accuracy(a), arch_latency(a), a) for a, _ in population]
    all_results.sort(key=lambda x: x[1])  # sort by latency
    print(f"{'Accuracy':>10s} {'Latency':>10s} Architecture ops")
    for acc, lat, arch in all_results[:5]:
        ops = [op for node in arch for (_, op) in node]
        print(f"{acc:10.3f} {lat:8.1f}ms {ops}")
    return all_results

rng = np.random.default_rng(42)
results = hardware_aware_search(
    pop_size=20, generations=80, target_latency=15.0, rng=rng)
#   Accuracy    Latency Architecture ops
#      0.468      1.2ms ['skip', 'none', 'maxpool', 'skip', ...]
#      0.529      3.7ms ['skip', 'skip', 'maxpool', 'conv3x3', ...]
#      0.710     10.5ms ['conv3x3', 'conv3x3', 'skip', 'maxpool', ...]
#      0.802     16.3ms ['conv3x3', 'conv5x5', 'conv3x3', 'skip', ...]
#      0.831     22.8ms ['conv3x3', 'conv5x5', 'conv3x3', 'conv3x3', ...]
The Pareto frontier reveals the fundamental tradeoff: fast architectures (lots of skip connections and pooling) sacrifice accuracy, while accurate architectures (lots of convolutions) are slower. Hardware-aware NAS finds the sweet spot automatically. The compute timeline tells the story of the entire field:
- 2017: Zoph & Le — 22,400 GPU-days (800 GPUs × 28 days)
- 2018: NASNet — ~2,000 GPU-days (cell-based search)
- 2018: ENAS — ~0.5 GPU-days (weight sharing)
- 2019: DARTS — ~1.5 GPU-days (differentiable search)
- 2021: Zero-cost proxies — seconds (no training at all)
A 10,000× reduction in four years. Today, NAS has expanded beyond vision: searching for Transformer architectures (AutoFormer), using LLMs to propose candidate architectures, and building "once-for-all" networks that adapt to any hardware constraint without retraining.
7. Conclusion
We've built Neural Architecture Search from the ground up: defined a cell-based search space with billions of candidates, established random search as a strong baseline, improved on it with evolutionary search (mutation + tournament selection), then made the leap to gradient-based DARTS by relaxing discrete choices into a continuous optimization. Weight sharing and zero-cost proxies pushed the cost from GPU-years to seconds.
The central insight is that architecture design is itself an optimization problem — and optimization is what machines do best. But NAS doesn't replace the human ML engineer. Someone still needs to design the search space, choose what to optimize for, curate the training data, and decide whether that extra 0.3% accuracy is worth 2× the inference cost. NAS amplifies expertise; it doesn't substitute for it.
References & Further Reading
- Zoph & Le — Neural Architecture Search with Reinforcement Learning (2017) — the seminal NAS paper, 800 GPUs for 28 days
- Zoph et al. — Learning Transferable Architectures for Scalable Image Recognition (2018) — NASNet, cell-based search spaces
- Liu et al. — DARTS: Differentiable Architecture Search (2019) — continuous relaxation, bilevel optimization
- Pham et al. — Efficient Neural Architecture Search via Parameter Sharing (2018) — ENAS, weight sharing
- Real et al. — Regularized Evolution for Image Classifier Architecture Search (2019) — AmoebaNet, evolutionary NAS
- Tan et al. — MnasNet: Platform-Aware Neural Architecture Search for Mobile (2019) — hardware-aware multi-objective NAS
- Tan & Le — EfficientNet: Rethinking Model Scaling for CNNs (2019) — compound scaling from NAS-found base
- Li & Talwalkar — Random Search and Reproducibility for NAS (2020) — random search as a strong baseline
- Mellor et al. — Neural Architecture Search without Training (2021) — zero-cost NASWOT proxy
- Elsken et al. — Neural Architecture Search: A Survey (2019) — comprehensive overview of the field