Program Synthesis from Scratch: Teaching Machines to Write Code
1. The Program Synthesis Problem
Every time you ask ChatGPT to write a Python function, you're invoking a program synthesizer — a system that takes a specification of what code should do and produces code that does it. But LLMs are just the latest chapter in a 50-year quest to automate programming. The deeper question: given a precise specification, can we systematically search the space of all possible programs to find one that is provably correct?
This is program synthesis, and its difficulty stems from a brutal combinatorial reality. Programs are discrete, structured objects where a single character change can flip a correct program to a wrong one. Unlike continuous optimization — where gradient descent navigates smooth loss landscapes — synthesis must search a jagged, discontinuous space where nearby points in syntax are arbitrarily far apart in behavior.
There are three flavors of specification:
- Input-output examples (Programming by Example): the user provides 3–5 examples like (3, 7), (5, 11), and the system generalizes to f(x) = 2x + 1.
- Logical specifications: pre/post-conditions like “for all x: if x > 0 then P(x) > 0 and P(x) < x.”
- Natural language: the LLM paradigm — inherently ambiguous, but the most accessible.
To make synthesis tractable, we restrict the search space with a Domain-Specific Language (DSL) — a grammar of allowed operations. A Turing-complete language makes synthesis undecidable (Rice’s theorem), but a DSL with 10 operations at depth 5 produces roughly 10⁵ candidates. At depth 10, that’s 10¹⁰. The art of synthesis is in how you constrain, guide, and prune that search.
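The blowup is easy to see with a back-of-the-envelope recurrence (a rough sketch that ignores pruning and type constraints): with t terminals and k binary operations, the number of syntactically distinct expression trees of depth at most d satisfies N(1) = t and N(d) = t + k·N(d−1)².

```python
def tree_count(depth, n_terminals=5, n_binops=3):
    """Rough count of syntactically distinct expression trees of
    depth <= `depth` for a DSL with `n_terminals` leaves and
    `n_binops` binary operations (no pruning applied)."""
    count = n_terminals  # depth 1: just the leaves
    for _ in range(depth - 1):
        # each binary op combines any two shallower trees
        count = n_terminals + n_binops * count ** 2
    return count

for d in range(1, 5):
    print(d, tree_count(d))
# 5, 80, 19205, ... the count roughly squares at each level;
# depth 4 already exceeds a billion syntactic candidates
```

The squaring at each level is why every serious synthesizer invests heavily in pruning.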
Every candidate must be executed against the examples to check correctness. Synthesis = search + execution, and both are expensive. If the concepts of searching over discrete structures ring a bell, they should — genetic algorithms tackle the same fundamental problem, and symbolic regression is literally program synthesis over arithmetic DSLs using evolutionary search.
2. Enumerative Search — Brute Force Over Programs
The simplest synthesis strategy is also the most honest: enumerate every program from smallest to largest, execute each on the examples, and return the first one that works. No heuristics, no learning — just raw exhaustion.
We define a minimal integer arithmetic DSL as an abstract syntax tree (AST):
- Terminals: constants (0, 1, 2) and variables (x, y)
- Operations: +, *, -, and a conditional if a == b then c else d
Programs are expression trees. We enumerate by depth: depth 1 gives us just constants and variables. Depth 2 applies one operation to depth-1 subexpressions. Depth 3 composes further. The key optimization is observational equivalence pruning: if two programs produce identical outputs on all provided inputs, they’re interchangeable for synthesis purposes. We keep only one representative per equivalence class. This means x + 0 collapses to x, 1 + 1 collapses to 2, and the effective search space shrinks dramatically.
import itertools
# AST node types for our integer DSL
class Const:
def __init__(self, v): self.v = v
def eval(self, env): return self.v
def __repr__(self): return str(self.v)
class Var:
def __init__(self, name): self.name = name
def eval(self, env): return env[self.name]
def __repr__(self): return self.name
class BinOp:
def __init__(self, op, l, r): self.op, self.l, self.r = op, l, r
def eval(self, env):
a, b = self.l.eval(env), self.r.eval(env)
if self.op == '+': return a + b
if self.op == '*': return a * b
if self.op == '-': return a - b
def __repr__(self): return f"({self.l} {self.op} {self.r})"
def enumerate_programs(max_depth, var_names, ops=['+', '*', '-']):
"""Bottom-up enumeration with observational equivalence pruning."""
    # Two probe environments: matching signatures on these is a heuristic
    # proxy for behavioral equivalence (collisions are possible but rare)
    test_envs = [{v: i + 1 for i, v in enumerate(var_names)},
                 {v: i * 3 + 2 for i, v in enumerate(var_names)}]
seen_outputs = set()
def signature(prog):
try: return tuple(prog.eval(e) for e in test_envs)
        except Exception: return None
def add_if_new(prog, bank):
sig = signature(prog)
if sig is not None and sig not in seen_outputs:
seen_outputs.add(sig)
bank.append(prog)
# Depth 1: constants and variables
bank = []
for c in [0, 1, 2]: add_if_new(Const(c), bank)
for v in var_names: add_if_new(Var(v), bank)
yield from bank
prev_banks = [bank]
for depth in range(2, max_depth + 1):
new_bank = []
all_prev = [p for b in prev_banks for p in b]
for op in ops:
for l, r in itertools.product(all_prev, repeat=2):
add_if_new(BinOp(op, l, r), new_bank)
prev_banks.append(new_bank)
yield from new_bank
def synthesize(examples, var_names=['x'], max_depth=4, ops=['+', '*', '-']):
"""Find the simplest program consistent with all I/O examples."""
count = 0
for prog in enumerate_programs(max_depth, var_names, ops):
count += 1
if all(prog.eval({var_names[0]: x}) == y for x, y in examples):
return prog, count
return None, count
# Example: synthesize f(x) = 2x + 1 from three examples
prog, n = synthesize([(1, 3), (3, 7), (5, 11)])
print(f"Found: f(x) = {prog} (explored {n} candidates)")
# Found: f(x) = (1 + (x + x)) (explored 20 candidates)
This works — and for simple functions, it’s surprisingly fast. But the exponential blowup is real: depth 3 takes milliseconds, depth 5 takes seconds, and by depth 7 we’re looking at millions of candidates even with pruning. We need something smarter.
3. Constraint-Based Synthesis — CEGIS
The key insight behind constraint-based synthesis is to reformulate the problem as a logical query: “Does there exist a program P in the DSL such that for all examples (xi, yi), eval(P, xi) = yi?” This is a ∃∀ formula — existentially quantify over program space, universally quantify over inputs.
Solving the full ∃∀ problem directly is expensive. CEGIS (Counter-Example Guided Inductive Synthesis, Solar-Lezama 2008) offers an elegant alternative: an iterative loop that avoids the universal quantifier entirely.
- Start with a small subset of examples.
- Synthesize: find any program P consistent with the current examples (using enumeration, an SMT solver, or any other method).
- Verify: check if P satisfies the full specification. If yes, return P.
- Refine: the verifier produces a counterexample — an input x* where P(x*) ≠ spec(x*). Add x* to the example set and go to step 2.
CEGIS converges because each iteration adds a genuinely new constraint. In practice, most specifications are captured by 5–10 critical examples, so synthesis completes after just a handful of rounds — even when the DSL contains millions of candidate programs.
The most famous CEGIS success story is FlashFill (Gulwani 2011), which ships in Microsoft Excel. When you type a string transformation in one column and Excel autocompletes the rest, that’s program synthesis — FlashFill infers a program over a string DSL (concatenation, substring extraction, case conversion, regex matching) from just 1–3 examples.
import random
def cegis_synthesize(ground_truth, var_names=['x'], max_depth=4,
ops=['+', '*', '-'], seed=42):
"""CEGIS loop: synthesize, verify, add counterexamples, repeat."""
rng = random.Random(seed)
# Start with a small seed set of examples (just 2 — ambiguous!)
examples = [(0, ground_truth(0)), (1, ground_truth(1))]
round_num = 0
total_candidates = 0
while True:
round_num += 1
# SYNTHESIZE: find a program consistent with current examples
prog, count = synthesize(examples, var_names, max_depth, ops)
total_candidates += count
if prog is None:
return None, round_num, total_candidates
# VERIFY: test against many random inputs
counterexample = None
for _ in range(200):
x = rng.randint(-50, 50)
expected = ground_truth(x)
try:
actual = prog.eval({'x': x})
            except Exception:
counterexample = x
break
if actual != expected:
counterexample = x
break
if counterexample is None:
# No counterexample found — program is likely correct
return prog, round_num, total_candidates
# REFINE: add the counterexample and loop
cx = counterexample
try:
got = prog.eval({'x': cx})
        except Exception:
got = "error"
examples.append((cx, ground_truth(cx)))
print(f" Round {round_num}: found {prog}, "
f"counterexample x={cx} (expected {ground_truth(cx)}, "
f"got {got})")
# Synthesize f(x) = x^2 + 1 via CEGIS
target = lambda x: x * x + 1
prog, rounds, candidates = cegis_synthesize(target)
print(f"\nCEGIS result: f(x) = {prog}")
print(f"Converged in {rounds} rounds, {candidates} total candidates")
# Round 1: found (1 + x), counterexample x=31 (expected 962, got 32)
# CEGIS result: f(x) = (1 + (x * x))
# Converged in 2 rounds, 27 total candidates
Compare that to pure enumeration, which might explore thousands of candidates against all examples simultaneously. CEGIS uses just 2 rounds and 27 total evaluations to find x² + 1 — the first round finds x + 1 (correct for the seed examples but wrong in general), the verifier catches it with a counterexample at x = 31, and the second round nails the quadratic.
FlashFill-Style String Synthesis
The same CEGIS principle applies to string transformations. Here’s a simplified string DSL that captures common text manipulations:
# String DSL: template-based candidates for text transformation
TEMPLATES = [
("first_word", lambda s: s.split(" ")[0]),
("last_word", lambda s: s.split(" ")[-1]),
("initial_dot_last", lambda s: s[0] + ". " + s.split(" ")[-1]),
("upper_first", lambda s: s.split(" ")[0].upper()),
("last_comma_first", lambda s: s.split(" ")[-1] + ", " + s.split(" ")[0]),
("first_initial", lambda s: s[0] + "."),
("initials", lambda s: ". ".join(w[0] for w in s.split(" ")) + "."),
("swap_words", lambda s: " ".join(s.split(" ")[::-1])),
("lower_all", lambda s: s.lower()),
("title_case", lambda s: s.title()),
]
def synth_string(examples):
"""Find the first template consistent with all I/O examples."""
for name, fn in TEMPLATES:
if all(fn(inp) == out for inp, out in examples):
return name, fn
return None, None
# "John Smith" -> "J. Smith", "Jane Doe" -> "J. Doe"
examples = [("John Smith", "J. Smith"), ("Jane Doe", "J. Doe")]
name, fn = synth_string(examples)
print(f"Synthesized: {name}")
print(f"Test: 'Alice Cooper' -> '{fn('Alice Cooper')}'")
# Synthesized: initial_dot_last
# Test: 'Alice Cooper' -> 'A. Cooper'
Real FlashFill uses a much richer DSL with regex matching and positional indexing, but the principle is identical: enumerate candidate programs from a structured grammar, verify against examples, and return the first consistent match. With just 2 examples, the correct transformation is disambiguated from all 10 templates.
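To get a feel for the compositional style, here is a sketch that replaces the fixed template menu with a tiny grammar: atomic expressions (fixed-index substrings, word selection, a constant string) combined by concatenation. The specific primitives, the substring-length cap, and the three-part limit are illustrative assumptions, not FlashFill's actual grammar:

```python
from itertools import product

def make_atoms(sample_input, max_sub_len=2):
    """Atomic string expressions plausible for the sample input:
    Word(k), SubStr(i, j) with fixed indices, and a constant."""
    atoms = []
    n_words = len(sample_input.split(" "))
    for k in list(range(n_words)) + [-1]:
        atoms.append((f"word[{k}]", lambda s, k=k: s.split(" ")[k]))
    for i in range(len(sample_input)):
        for j in range(i + 1, min(i + max_sub_len, len(sample_input)) + 1):
            atoms.append((f"s[{i}:{j}]", lambda s, i=i, j=j: s[i:j]))
    atoms.append(('". "', lambda s: ". "))
    return atoms

def synth_concat(examples, max_parts=3):
    """Enumerate concatenations of up to max_parts atoms; return the
    first one consistent with every I/O example."""
    atoms = make_atoms(examples[0][0])
    for n in range(1, max_parts + 1):
        for combo in product(atoms, repeat=n):
            try:
                if all("".join(f(inp) for _, f in combo) == out
                       for inp, out in examples):
                    return " + ".join(name for name, _ in combo), combo
            except IndexError:  # word index out of range for some input
                continue
    return None, None

examples = [("John Smith", "J. Smith"), ("Jane Doe", "J. Doe")]
name, combo = synth_concat(examples)
print("Synthesized:", name)
print("Generalizes:", "".join(f("Alice Cooper") for _, f in combo))
# Generalizes: A. Cooper
```

Unlike the template list, this grammar composes: the same atoms that build "initial dot last-name" also build transformations nobody hand-coded in advance.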
4. Neural-Guided Synthesis — Learning to Search
Constraint-based synthesis works beautifully for small DSLs, but what if the DSL has hundreds of operations, or the target program is deeply nested? Enumeration — even CEGIS-accelerated — hits a wall. Neural-guided synthesis uses a trained neural network to predict which program fragments are likely to satisfy the specification, pruning the search tree before we even start enumerating.
DeepCoder (Balog et al. 2017) pioneered this approach: train a classifier on thousands of (input-output examples, correct program) pairs. The network learns to predict which DSL operations appear in the target program. At synthesis time, if the network says “this task probably needs sort and filter,” we try programs containing those operations first — reducing search time by 10–100x.
RobustFill (Devlin et al. 2017) went further: an encoder-decoder architecture takes I/O examples as input and directly generates program tokens autoregressively — essentially a specialized language model trained on synthesis problems.
And here’s the punchline: GPT-4 and Claude writing Python from natural language descriptions is neural program synthesis. The “DSL” is Python, the “specification” is the prompt, and the model learned to predict likely programs from a massive code training corpus. The difference from DeepCoder: LLMs use natural language specs (not I/O examples), generate in a Turing-complete language (not a restricted DSL), and have no formal correctness guarantee.
AlphaCode (Li et al. 2022) showed how to compensate: generate roughly one million candidate programs, cluster them by behavior, execute against test cases, and return the most common correct cluster. This is neural-guided enumeration at breathtaking scale.
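A miniature of that generate, filter, cluster pipeline, with hand-written lambdas standing in for the roughly one million sampled programs:

```python
from collections import defaultdict

# Stand-ins for sampled model outputs: candidate implementations of
# "absolute difference", some buggy, some duplicated in behavior.
candidates = [
    lambda a, b: a - b,                  # buggy: sign error
    lambda a, b: abs(a - b),             # correct
    lambda a, b: abs(b - a),             # correct, different syntax
    lambda a, b: max(a, b) - min(a, b),  # correct again
    lambda a, b: a + b,                  # buggy
    lambda a, b: a - b,                  # duplicate of the sign bug
]
visible_tests = [((5, 3), 2), ((0, 0), 0)]   # released test cases
probe_inputs = [(3, 7), (-2, 3), (10, 1)]    # extra inputs for clustering

# 1. FILTER: keep only candidates that pass the visible tests
passing = [f for f in candidates
           if all(f(*i) == o for i, o in visible_tests)]

# 2. CLUSTER: group survivors by behavior on the probe inputs
clusters = defaultdict(list)
for f in passing:
    clusters[tuple(f(*i) for i in probe_inputs)].append(f)

# 3. SELECT: submit a representative of the largest behavioral cluster
best = max(clusters.values(), key=len)[0]
print(best(3, 7))  # -> 4
```

Note that the sign-error bug survives the visible tests (5 − 3 = 2 happens to be right), but clustering exposes it: the three correct variants agree with each other on the probes, so the correct cluster outvotes the buggy one.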
import numpy as np
# A simple neural-guided synthesizer: train an MLP to predict
# which DSL operations a target program likely uses, then
# prioritize enumeration accordingly.
def generate_training_data(n_tasks=5000, seed=42):
"""Generate synthetic (I/O examples -> ops used) training pairs."""
rng = np.random.RandomState(seed)
all_ops = ['+', '*', '-']
X, Y = [], []
for _ in range(n_tasks):
# Random program: pick 1-2 ops, build a small expression
ops_used = list(rng.choice(all_ops, size=rng.randint(1, 3), replace=False))
a, b = rng.randint(1, 5), rng.randint(0, 4)
if '+' in ops_used and '*' in ops_used:
fn = lambda x, a=a, b=b: a * x + b
elif '*' in ops_used:
fn = lambda x, a=a, b=b: a * x * b
elif '-' in ops_used:
fn = lambda x, a=a, b=b: a - x + b
else:
fn = lambda x, a=a, b=b: x + a + b
# Generate I/O feature vector: outputs for x = 0..4
io_features = [fn(x) for x in range(5)]
X.append(io_features)
Y.append([1 if op in ops_used else 0 for op in all_ops])
return np.array(X, dtype=float), np.array(Y, dtype=float)
def train_guide(X, Y, lr=0.05, epochs=1000):
"""Train a tiny MLP: 5 inputs -> 16 hidden -> 3 outputs (sigmoid)."""
rng = np.random.RandomState(0)
mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-8
X = (X - mu) / sigma # normalize
W1 = rng.randn(5, 16) * 0.3
b1 = np.zeros(16)
W2 = rng.randn(16, 3) * 0.3
b2 = np.zeros(3)
for _ in range(epochs):
h = np.maximum(0, X @ W1 + b1) # ReLU hidden
logits = h @ W2 + b2
pred = 1 / (1 + np.exp(-logits)) # sigmoid output
grad_out = (pred - Y) / len(X)
grad_W2 = h.T @ grad_out
grad_b2 = grad_out.sum(axis=0)
grad_h = grad_out @ W2.T
grad_h[h <= 0] = 0
W2 -= lr * grad_W2; b2 -= lr * grad_b2
W1 -= lr * X.T @ grad_h; b1 -= lr * grad_h.sum(axis=0)
    def predict(io):
        h = np.maximum(0, ((np.array(io) - mu) / sigma) @ W1 + b1)
        return 1 / (1 + np.exp(-(h @ W2 + b2)))
    return predict
X, Y = generate_training_data()
predict_ops = train_guide(X, Y)
# Test: for f(x) = 2x + 1, I/O is [1, 3, 5, 7, 9]
scores = predict_ops([1, 3, 5, 7, 9])
ops = ['+', '*', '-']
ranked = sorted(zip(ops, scores), key=lambda t: -t[1])
print("Predicted op priorities:", [(op, f"{s:.2f}") for op, s in ranked])
# Predicted op priorities: [('*', 0.95), ('+', 0.69), ('-', 0.14)]
# -> Prioritize programs using + and * first, skip - heavy programs
The neural guide doesn’t replace enumeration — it reorders it. By trying the most likely operations first, we find the correct program in a fraction of the candidates. In practice, DeepCoder achieved 5–20x speedups on its benchmark, and the principle extends to much larger DSLs.
5. Program Synthesis Meets LLMs
Traditional synthesis requires precise specifications — input-output examples, logical formulas. LLM synthesis flips this: the spec is natural language, which makes synthesis accessible to everyone but introduces a fundamental problem: specification ambiguity. “Sort this list” — ascending or descending? Stable or unstable? In-place or returning a new copy?
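A two-line illustration of how easily examples (or vague prose) underdetermine intent, using ascending and descending sorts as the two candidate "programs":

```python
asc = lambda xs: sorted(xs)                  # ascending sort
desc = lambda xs: sorted(xs, reverse=True)   # descending sort

# Both programs are consistent with these examples...
ambiguous = [([3], [3]), ([5, 5], [5, 5])]
assert all(asc(i) == o and desc(i) == o for i, o in ambiguous)

# ...and only an input with two distinct values tells them apart.
print(asc([2, 1]), desc([2, 1]))  # [1, 2] [2, 1]
```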
The bigger issue is the verification gap. CEGIS verifies against a formal spec and can prove correctness. LLM-generated code has no formal spec to verify against. The emerging solution looks remarkably like CEGIS with an LLM as the synthesizer:
- Generate: the LLM produces candidate code from a natural language prompt.
- Test: execute the code against test cases (the tests serve as the formal spec).
- Repair: if tests fail, feed the error message back to the LLM and ask it to fix the code.
- Repeat until all tests pass or a budget is exhausted.
This is exactly the self-repair loop described in Reflexion (Shinn et al. 2023) and practiced by every developer who pastes an error into ChatGPT. The HumanEval benchmark (Chen et al. 2021) quantifies performance: on 164 Python problems with test suites, GPT-4 achieves ~67% pass@1 (first attempt correct), while the original Codex model reaches ~78% pass@100 (at least one of 100 attempts correct). Generate-and-filter is powerful but sample-inefficient.
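The pass@k numbers come from the unbiased estimator in Chen et al. (2021): sample n completions per problem, count the c that pass the tests, and estimate the probability that at least one of k draws would have passed:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al. 2021):
    1 - C(n-c, k) / C(n, k), the probability that at least one of
    k samples drawn from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

print(pass_at_k(200, 10, 1))    # ~0.05: one try, 5% of samples correct
print(pass_at_k(200, 10, 100))  # close to 1.0: 100 tries almost always hit one
```

The gap between those two numbers is the whole argument for generate-and-filter: even a model that is right only 5% of the time looks nearly perfect given enough samples and a way to check them.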
# The LLM-as-Synthesizer self-repair loop
# (Uses a mock LLM for reproducibility — swap in an API call for real use)
def mock_llm(prompt, attempt=0):
"""Simulates LLM code generation with intentional first-attempt bugs."""
if "absolute difference" in prompt:
if attempt == 0:
return "def solve(a, b): return a - b" # Bug: no abs()
return "def solve(a, b): return abs(a - b)" # Fixed
if "fibonacci" in prompt:
if attempt == 0:
return "def solve(n):\n if n <= 1: return n\n return solve(n-1) + solve(n-2)"
return "def solve(n):\n a, b = 0, 1\n for _ in range(n): a, b = b, a+b\n return a"
return "def solve(x): return x"
def run_tests(code, tests):
"""Execute code against test cases, return (pass, error_msg)."""
namespace = {}
try:
exec(code, namespace)
except Exception as e:
return False, f"Syntax error: {e}"
fn = namespace.get('solve')
for inputs, expected in tests:
try:
result = fn(*inputs) if isinstance(inputs, tuple) else fn(inputs)
except Exception as e:
return False, f"Runtime error on input {inputs}: {e}"
if result != expected:
return False, f"Wrong answer: solve({inputs}) = {result}, expected {expected}"
return True, "All tests passed"
def synthesis_loop(spec, tests, max_attempts=3):
"""CEGIS-style loop with LLM as the synthesizer."""
error_context = ""
for attempt in range(max_attempts):
prompt = f"Write a function `solve` that: {spec}"
if error_context:
prompt += f"\nPrevious attempt failed: {error_context}\nFix the bug."
code = mock_llm(prompt, attempt)
print(f"Attempt {attempt+1}: {code.split(chr(10))[0]}...")
passed, msg = run_tests(code, tests)
print(f" Result: {msg}")
if passed:
return code, attempt + 1
error_context = msg
return None, max_attempts
# Synthesize "absolute difference" with self-repair
code, attempts = synthesis_loop(
"computes the absolute difference of two numbers",
[((5, 3), 2), ((3, 7), 4), ((0, 0), 0), ((-2, 3), 5)]
)
print(f"\nSynthesized in {attempts} attempt(s)")
# Attempt 1: def solve(a, b): return a - b...
# Result: Wrong answer: solve((3, 7)) = -4, expected 4
# Attempt 2: def solve(a, b): return abs(a - b)...
# Result: All tests passed
# Synthesized in 2 attempt(s)
The mock LLM intentionally produces a bug on the first attempt (missing abs()), receives the failing test as feedback, and generates the correct solution on attempt 2. This is exactly how modern coding assistants work under the hood — what changes is the scale and sophistication of the LLM, not the loop structure.
6. The Frontier — Neurosymbolic Synthesis and Beyond
The most exciting direction in program synthesis is learning to synthesize better over time. DreamCoder (Ellis et al. 2021) exemplifies this: after solving a batch of synthesis tasks, it abstracts common program fragments into new DSL operations. A system that solves 100 list-manipulation tasks might discover map, filter, and fold as reusable components — and then use these higher-level operations to solve harder problems in fewer steps. This mirrors how human programmers build up abstractions over their careers.
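A minimal sketch of that library-learning step, representing programs as nested tuples and promoting the most frequent compound subtree to a new primitive. DreamCoder's actual compression is Bayesian and far more sophisticated; this naive frequency count is only an illustration:

```python
from collections import Counter

# Programs as nested tuples: (op, arg1, arg2), with leaves as strings.
solved = [
    ('+', ('*', '2', 'x'), '1'),   # 2x + 1
    ('+', ('*', '2', 'x'), 'y'),   # 2x + y
    ('-', ('*', '2', 'x'), '3'),   # 2x - 3
]

def subtrees(prog):
    """Yield every compound sub-expression of a program tree."""
    if isinstance(prog, tuple):
        yield prog
        for child in prog[1:]:
            yield from subtrees(child)

# Count how often each compound fragment recurs across solved tasks
counts = Counter(t for p in solved for t in subtrees(p))
fragment, freq = counts.most_common(1)[0]
print(f"Promote {fragment} (seen {freq} times) to a new DSL primitive")
# -> Promote ('*', '2', 'x') (seen 3 times) to a new DSL primitive
```

Once `('*', '2', 'x')` becomes a single primitive, every future program that doubles its input is one node shallower, which is exactly what makes harder tasks reachable within the same depth budget.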
Program synthesis for science has produced remarkable results. Eureqa (Schmidt & Lipson 2009) rediscovered Newton’s laws and conservation of energy from raw data by treating physical law discovery as synthesis over an arithmetic DSL — the same approach explored in symbolic regression from scratch, which uses genetic programming to evolve expression trees.
The table below compares the four approaches on key dimensions:
| Approach | Spec Type | Correctness | Scalability | DSL Size |
|---|---|---|---|---|
| Enumerative | I/O examples | Guaranteed (exhaustive) | Depth ≤ 5–7 | Small (< 20 ops) |
| CEGIS | I/O + verifier | Guaranteed (with verifier) | Depth ≤ 8–12 | Medium (< 50 ops) |
| Neural-guided | I/O examples | Probabilistic | Depth ≤ 15–20 | Large (100+ ops) |
| LLM | Natural language | No guarantee | Arbitrary length | Turing-complete |
The open frontiers are tantalizing: scaling synthesis to programs beyond ~50 lines (current methods struggle), handling stateful programs with side effects (I/O, databases, network calls), synthesizing concurrent programs (where interleaving creates exponential behavior), and the grand challenge — closing the loop between natural language understanding and formal verification so LLM-generated code can be proved correct.
7. Conclusion
Program synthesis is the deep thread connecting symbolic AI’s dream of automatic programming (Manna & Waldinger 1971) to the neural era’s practical code generation (GitHub Copilot, 2021). The three paradigms we built — enumerative search, constraint-based CEGIS, and neural guidance — aren’t competitors. They’re layers: neural models prune the search space, CEGIS-style loops verify and refine with counterexamples, and enumerative completeness guarantees you eventually find a solution if one exists.
Every time you use Copilot or ask an LLM to write code, you’re running a program synthesizer. The test cases you write are CEGIS counterexamples. The error messages you paste back are verification feedback. Understanding these foundations doesn’t just help you build better AI systems — it helps you be a better collaborator with the AI systems you already use every day.
References & Further Reading
- Gulwani (2011) — Automating String Processing in Spreadsheets Using Input-Output Examples — the FlashFill paper that put program synthesis in every copy of Excel
- Solar-Lezama (2008) — Program Synthesis by Sketching — introduced CEGIS and the sketch-based synthesis paradigm
- Balog et al. (2017) — DeepCoder: Learning to Write Programs — neural-guided synthesis that learns to predict DSL operations from I/O
- Devlin et al. (2017) — RobustFill: Neural Program Learning Under Noisy I/O — encoder-decoder architecture for string program synthesis
- Ellis et al. (2021) — DreamCoder: Growing Generalizable, Interpretable Knowledge with Wake-Sleep Bayesian Program Learning — library learning for progressively harder synthesis tasks
- Chen et al. (2021) — Evaluating Large Language Models Trained on Code — the HumanEval benchmark and Codex
- Li et al. (2022) — Competition-Level Code Generation with AlphaCode — massive generate-and-filter at competition scale
- Manna & Waldinger (1971) — Toward Automatic Program Synthesis — one of the founding papers of the entire field