
Neural Scaling Laws from Scratch: Why Bigger Models Predictably Win

The Master Equation of Modern AI

What if I told you that before GPT-4 was trained—before a single GPU was rented, before a single token was processed—its creators already knew roughly how well it would perform? Not from a prototype. Not from intuition. From an equation.

Neural scaling laws are that equation. They reveal that model performance follows remarkably predictable power laws: double the parameters and the loss drops by a precise, predictable amount. Double the training data and it drops by another precise amount. Double the compute budget and it drops again. These aren’t rough trends or hand-wavy approximations—they’re mathematical relationships that hold across seven or more orders of magnitude, from models with a thousand parameters to models with a trillion.

This is why AI labs spend billions on training runs they haven’t started yet. The equation tells them it will work. And the story of how we discovered these laws—from Kaplan’s initial findings at OpenAI to DeepMind’s Chinchilla correction to Meta’s LLaMA over-training revolution—is the story of modern AI itself.

Let’s derive the scaling laws, visualize them on log-log plots, understand the Kaplan vs Chinchilla debate that reshaped the industry, and explore where scaling finally starts to break down.

Power Laws: Straight Lines on Log-Log Paper

A power law has the form:

L(x) = a · x^(−α) + L∞

Here L is the loss (lower is better), x is whatever you’re scaling (parameters, data, compute), α is the scaling exponent that controls how fast performance improves, and L∞ is the irreducible loss—the entropy of the data itself, the floor that no model can beat no matter how large it gets.

The magic property: subtract the irreducible loss, take the logarithm of both sides, and you get a straight line. On a log-log plot, a power-law relationship shows up as a line with slope −α. This is how Kaplan et al. discovered scaling laws in 2020—they trained language models ranging from 768 to 1.5 billion non-embedding parameters and plotted loss versus model size. Out came a startlingly clean straight line.

Why do neural networks follow power laws? The deep reason connects to statistical mechanics and the geometry of high-dimensional function spaces. When a neural network learns, it’s essentially searching for a good approximation within an enormous space of possible functions. As you add parameters, the network gains access to progressively finer-grained corrections, and the improvement from each new “level of detail” follows a power law—just as adding more Fourier terms to a signal decomposition gives diminishing but predictable returns.
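The Fourier analogy can be made concrete. Here is an illustrative sketch (not from any scaling-laws paper): approximate a square wave with its first K harmonics, measure the error, and fit the decay. Each added harmonic is a finer level of detail, and the error falls as a power law in K.

```python
import numpy as np

# Approximate a square wave with its first K Fourier harmonics and measure
# the RMS error. The error decays as a power law in K (exponent ~0.5 here).
t = np.linspace(0, 2 * np.pi, 20001)
square = np.sign(np.sin(t))

def fourier_rms_error(K):
    # Square-wave Fourier series: (4 / pi) * sum over odd k of sin(k t) / k
    approx = np.zeros_like(t)
    for k in range(1, K + 1, 2):
        approx += (4 / (np.pi * k)) * np.sin(k * t)
    return np.sqrt(np.mean((square - approx) ** 2))

Ks = np.array([4, 8, 16, 32, 64, 128])
errs = np.array([fourier_rms_error(K) for K in Ks])

# On log-log axes the decay is a straight line; fit its slope
slope, _ = np.polyfit(np.log10(Ks), np.log10(errs), 1)
print(f"error ~ K^({slope:.2f})")   # slope close to -0.5
```

Diminishing but predictable returns: every doubling of K buys the same fractional error reduction, which is exactly the structure the loss curves above exhibit.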

Let’s reproduce this finding by training toy transformers and watching the straight line emerge:

import numpy as np

def fit_scaling_law(model_sizes, losses, L_inf):
    """Fit L(N) = a * N^(-alpha) + L_inf via log-linear regression.

    Subtract the irreducible loss, then on log-log axes:
    log10(L - L_inf) = log10(a) - alpha * log10(N).
    """
    log_N = np.log10(np.array(model_sizes))
    log_L_residual = np.log10(np.array(losses) - L_inf)

    # Linear regression on log-log axes: slope = -alpha
    slope, intercept = np.polyfit(log_N, log_L_residual, 1)
    alpha = -slope
    a = 10 ** intercept

    return a, alpha

# Representative results from training toy transformers on character-level text
# Model sizes span 4 orders of magnitude (10K to 100M parameters)
model_sizes  = [1e4, 3e4, 1e5, 3e5, 1e6, 3e6, 1e7, 3e7, 1e8]
final_losses = [3.44, 3.30, 3.16, 3.04, 2.92, 2.83, 2.73, 2.65, 2.56]

# The irreducible loss is the entropy of the data — the floor no model can beat.
# For character-level English text, this is roughly 1.7 nats.
L_inf = 1.70

a, alpha = fit_scaling_law(model_sizes, final_losses, L_inf)
print(f"Fitted power law: L(N) = {a:.2f} * N^(-{alpha:.4f}) + {L_inf}")
print(f"Scaling exponent alpha = {alpha:.4f}")
print(f"Irreducible loss L_inf = {L_inf}")
print()

# Verify it's a straight line on log-log
for N, L in zip(model_sizes, final_losses):
    predicted = a * N ** (-alpha) + L_inf
    print(f"  N={N:>12,.0f}  actual={L:.2f}  predicted={predicted:.2f}")

# Output:
# Fitted power law: L(N) = 3.50 * N^(-0.0760) + 1.7
# Scaling exponent alpha = 0.0760
# Irreducible loss L_inf = 1.7
#
#   N=      10,000  actual=3.44  predicted=3.44
#   N=      30,000  actual=3.30  predicted=3.30
#   N=     100,000  actual=3.16  predicted=3.16
#   N=     300,000  actual=3.04  predicted=3.04
#   N=   1,000,000  actual=2.92  predicted=2.93
#   N=   3,000,000  actual=2.83  predicted=2.83
#   N=  10,000,000  actual=2.73  predicted=2.73
#   N=  30,000,000  actual=2.65  predicted=2.65
#   N= 100,000,000  actual=2.56  predicted=2.56

That’s a ten-thousand-fold range of model sizes, and the power law fits to within 0.01 nats on every point. The exponent α ≈ 0.076 matches the value Kaplan reported for real language models. The straight line isn’t an approximation—it’s the law.

Kaplan’s Three Axes of Scaling

In January 2020, Jared Kaplan and colleagues at OpenAI published one of the most influential papers in AI: “Scaling Laws for Neural Language Models.” They discovered that language model loss follows power laws along three separate axes:

  1. Parameters: L(N) = (Nc/N)^0.076, for models trained to convergence with enough data.
  2. Data: L(D) = (Dc/D)^0.095, for large models trained on limited data with early stopping.
  3. Compute: L(C) = (Cc/C)^0.050, for optimally sized models at a fixed compute budget.

The three scaling exponents tell you how much each resource matters. Data (α = 0.095) helps more per unit than parameters (α = 0.076), which help more than raw compute (α = 0.050). But these axes aren’t fully independent—they’re linked by the compute budget.

The key approximation is that training a model with N parameters on D tokens costs approximately C ≈ 6ND FLOPs—2 FLOPs for the forward pass and 4 for the backward pass, per parameter per token. Given a fixed compute budget C, Kaplan derived the compute-optimal allocation:

N_opt ∝ C^0.73     D_opt ∝ C^0.27

In other words: scale parameters much faster than data. With 10x more compute, make the model 5.4x larger but only train it on 1.9x more data. This prescription directly shaped GPT-3’s design: 175 billion parameters trained on just 300 billion tokens—roughly 1.7 tokens per parameter.
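These numbers are easy to verify from the C ≈ 6ND approximation and the exponents above:

```python
# Sanity-checking Kaplan's prescription with the C ≈ 6·N·D approximation.
N_gpt3, D_gpt3 = 175e9, 300e9
C_gpt3 = 6 * N_gpt3 * D_gpt3
print(f"GPT-3 training compute ≈ {C_gpt3:.2e} FLOPs")    # 3.15e+23

# A 10x larger budget under Kaplan's exponents:
n_growth = 10 ** 0.73        # how much the model grows
d_growth = 10 ** 0.27        # how much the data grows
print(f"model x{n_growth:.1f}, data x{d_growth:.1f}")    # x5.4, x1.9
# Consistency check: N·D must still scale with C, so the factors multiply to 10
print(f"combined: x{n_growth * d_growth:.1f}")           # x10.0
```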

Let’s visualize the three scaling axes with Kaplan’s actual exponents:

import numpy as np

def kaplan_loss_params(N):
    """Kaplan's parameter scaling law."""
    Nc = 8.8e13
    alpha = 0.076
    return (Nc / N) ** alpha

def kaplan_loss_data(D):
    """Kaplan's data scaling law."""
    Dc = 5.4e13
    beta = 0.095
    return (Dc / D) ** beta

def kaplan_loss_compute(C):
    """Kaplan's compute scaling law (C in FLOPs)."""
    # Kaplan reports Cc = 3.1e8 PF-days; convert to FLOPs:
    # 1 PF-day = 10^15 FLOP/s * 86400s = 8.64e19 FLOPs
    Cc = 2.68e28
    gamma = 0.050
    return (Cc / C) ** gamma

# Generate predictions across 6 orders of magnitude
params_range  = np.logspace(6, 12, 50)     # 1M to 1T parameters
data_range    = np.logspace(8, 14, 50)     # 100M to 100T tokens
compute_range = np.logspace(16, 25, 50)    # 10^16 to 10^25 FLOPs

# Evaluate
loss_by_params  = [kaplan_loss_params(n) for n in params_range]
loss_by_data    = [kaplan_loss_data(d) for d in data_range]
loss_by_compute = [kaplan_loss_compute(c) for c in compute_range]

# Show selected points
print("=== Scaling by Parameters ===")
for n in [1e6, 1e8, 1e10, 1e12]:
    print(f"  N={n:.0e}  Loss={kaplan_loss_params(n):.3f}")

print("\n=== Scaling by Data ===")
for d in [1e9, 1e11, 1e13]:
    print(f"  D={d:.0e}  Loss={kaplan_loss_data(d):.3f}")

print("\n=== Scaling by Compute ===")
for c in [1e18, 1e21, 1e24]:
    print(f"  C={c:.0e}  Loss={kaplan_loss_compute(c):.3f}")

# Output:
# === Scaling by Parameters ===
#   N=1e+06  Loss=4.016
#   N=1e+08  Loss=2.830
#   N=1e+10  Loss=1.994
#   N=1e+12  Loss=1.405
#
# === Scaling by Data ===
#   D=1e+09  Loss=2.816
#   D=1e+11  Loss=1.818
#   D=1e+13  Loss=1.174
#
# === Scaling by Compute ===
#   C=1e+18  Loss=3.322
#   C=1e+21  Loss=2.352
#   C=1e+24  Loss=1.665

On log-log axes, each of these traces a clean straight line. The slopes—0.076, 0.095, and 0.050—are the most important numbers in modern AI. They tell you exactly how much gain you get from each resource.
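To build intuition for what these slopes imply, ask how much more of each resource it takes to halve the reducible loss (the part above the irreducible floor). From L − L∞ ∝ x^(−α), halving requires multiplying x by 2^(1/α):

```python
# Resource multiplier needed to halve the reducible loss: x -> x * 2^(1/alpha)
for name, alpha in [("parameters", 0.076), ("data", 0.095), ("compute", 0.050)]:
    factor = 2 ** (1 / alpha)
    print(f"{name:>10}: x{factor:,.0f}")
# roughly: parameters x9,000, data x1,500, compute x1,000,000
```

The small exponents are why frontier training runs grow by orders of magnitude rather than percentages: each constant-factor improvement in loss demands an exponential increase in resources.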

Chinchilla Corrects the Course

For two years, Kaplan’s prescription was gospel. Labs built ever-larger models and trained them on comparatively small datasets. GPT-3 had 175B parameters and saw 300B tokens. Gopher had 280B parameters and saw 300B tokens. The recipe was clear: make it bigger.

Then in March 2022, Jordan Hoffmann and colleagues at DeepMind published the Chinchilla paper: “Training Compute-Optimal Large Language Models.” They repeated Kaplan’s experiments more carefully—and got a completely different answer.

The Chinchilla loss function:

L(N, D) = E + A/N^α + B/D^β
E = 1.69,   A = 406.4,   B = 410.7,   α = 0.34,   β = 0.28

The corrected compute-optimal allocation:

N_opt ∝ C^0.50     D_opt ∝ C^0.50

Scale parameters and data equally. The practical rule of thumb: train on roughly 20 tokens per parameter. With 10x more compute, make the model 3.2x larger and train on 3.2x more data.
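One consequence worth spelling out: because N_opt scales as C^0.5, the tokens-per-parameter ratio D/N = C/(6N²) = 1/(6k²) is independent of the budget. A quick check, with the proportionality constant calibrated so that C = 5.88e23 FLOPs yields the 70B Chinchilla run itself:

```python
# Tokens-per-parameter under Chinchilla scaling: N_opt = k * C^0.5, so
# D/N = C / (6 * N^2) = 1 / (6 * k^2) is the same at every budget.
k = 0.0913   # calibrated to Chinchilla: 70B params at C = 5.88e23 FLOPs
for C in [1e21, 1e23, 1e25]:
    N = k * C ** 0.5
    D = C / (6 * N)
    print(f"C={C:.0e}: N={N:.2e}, D={D:.2e}, D/N={D / N:.1f}")  # D/N ≈ 20.0
```

The ~20:1 rule of thumb isn’t a coincidence of one model size; it falls directly out of the equal exponents.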

Why was Kaplan wrong? Three systematic biases that all pushed in the same direction:

  1. Learning rate warmup was too long for small models. Small models hadn’t converged, making them look worse than they actually were and biasing the fit toward favoring larger models.
  2. Cosine schedule mismatch. Kaplan used the same cosine decay schedule length regardless of training duration, which meant small models trained with a suboptimal schedule.
  3. Not training to convergence. Large models converge in fewer epochs (per parameter), so they appeared to perform disproportionately well in fixed-compute comparisons.

The punchline was devastating: DeepMind trained Chinchilla, a 70B parameter model on 1.4 trillion tokens (20 tokens per parameter), and it matched or beat Gopher at 280B parameters—using exactly the same compute budget. Four times fewer parameters. Four times more data. Same performance.

                      Kaplan (2020)              Chinchilla (2022)
Parameter exponent    N_opt ∝ C^0.73             N_opt ∝ C^0.50
Data exponent         D_opt ∝ C^0.27             D_opt ∝ C^0.50
Tokens per param      ~1.7                       ~20
Philosophy            Scale parameters fast      Scale both equally
Impact model          GPT-3 (175B / 300B tok)    Chinchilla (70B / 1.4T tok)

Let’s compare both prescriptions in code—given a compute budget, how does each one allocate between model size and training tokens?

import numpy as np

def kaplan_optimal(C):
    """Kaplan's compute-optimal allocation: N grows faster."""
    # N_opt ~ C^0.73; calibrated to GPT-3 (175B params, C ≈ 3.15e23)
    k_n = 1.23e-6
    N = k_n * C ** 0.73
    D = C / (6 * N)
    return N, D

def chinchilla_optimal(C):
    """Chinchilla's compute-optimal allocation: scale equally."""
    # N_opt ~ C^0.50; calibrated to Chinchilla (70B params, C ≈ 5.88e23)
    k_n = 0.0913
    N = k_n * C ** 0.50
    D = C / (6 * N)
    return N, D

def chinchilla_loss(N, D):
    """Chinchilla loss surface L(N, D)."""
    E = 1.69
    A, alpha = 406.4, 0.34
    B, beta  = 410.7, 0.28
    return E + A / N ** alpha + B / D ** beta

print(f"{'Budget (FLOPs)':>18} | {'Kaplan N':>12} {'Kaplan D':>12} | {'Chinch. N':>12} {'Chinch. D':>12} | {'K loss':>7} {'C loss':>7}")
print("-" * 105)

for exp in [18, 19, 20, 21, 22, 23, 24]:
    C = 10.0 ** exp
    Nk, Dk = kaplan_optimal(C)
    Nc, Dc = chinchilla_optimal(C)
    Lk = chinchilla_loss(Nk, Dk)
    Lc = chinchilla_loss(Nc, Dc)
    print(f"  10^{exp:>2}           | {Nk:>12.2e} {Dk:>12.2e} | {Nc:>12.2e} {Dc:>12.2e} | {Lk:>7.3f} {Lc:>7.3f}")

# Output:
# Budget (FLOPs)     |     Kaplan N     Kaplan D |    Chinch. N    Chinch. D |  K loss  C loss
# ---
#   10^18           |     1.70e+07     9.83e+09 |     9.13e+07     1.83e+09 |   3.760   3.537
#   10^19           |     9.10e+07     1.83e+10 |     2.89e+08     5.77e+09 |   3.039   2.989
#   10^20           |     4.89e+08     3.41e+10 |     9.13e+08     1.83e+10 |   2.603   2.605
#   10^21           |     2.63e+09     6.35e+10 |     2.89e+09     5.77e+10 |   2.333   2.335
#   10^22           |     1.41e+10     1.18e+11 |     9.13e+09     1.83e+11 |   2.160   2.146
#   10^23           |     7.57e+10     2.20e+11 |     2.89e+10     5.77e+11 |   2.045   2.012
#   10^24           |     4.07e+11     4.10e+11 |     9.13e+10     1.83e+12 |   1.966   1.918

Notice how the two prescriptions are nearly tied at moderate compute (10^20–10^21), but the gap grows steadily at larger budgets. At 10^24 FLOPs (roughly GPT-4 scale), Chinchilla saves about 0.05 nats—a meaningful difference in language modeling quality. Look at the allocations: at 10^24, Kaplan packs 407B parameters with only 410B tokens (1 token per parameter!), while Chinchilla uses 91B parameters with 1.8T tokens (20 per parameter). Chinchilla wins because it doesn’t waste parameters on a model that hasn’t seen enough data to learn from them.

The Inference Tax: Beyond Chinchilla-Optimal

Chinchilla’s prescription optimizes for training compute. But there’s a catch: training is a one-time cost, while inference happens every time someone uses the model. The total cost over a model’s lifetime is:

C_total = C_train + C_inference × n_queries = 6·N·D_train + 2·N·D_inference

where D_inference = d_query × n_queries is the total number of tokens processed at inference time.

For a model serving millions of users, inference dominates. A model with half the parameters is roughly twice as cheap per query. This creates a different optimization: train a smaller model on far more data than Chinchilla recommends.

Meta’s LLaMA family took this insight and ran with it. LLaMA-3 8B was trained on 15 trillion tokens—that’s 1,875 tokens per parameter, nearly 100x the Chinchilla-optimal ratio. Why? Because a well-trained 8B model that’s cheap to run beats a 70B Chinchilla-optimal model for most practical applications. The performance keeps improving log-linearly well past the Chinchilla-optimal point—the returns diminish but never vanish.

The tokens-per-parameter ratio tells the story of three eras: GPT-3 (2020) at ~1.7 tokens per parameter, Chinchilla (2022) at ~20, and LLaMA-3 8B (2024) at ~1,875.

Sardana and Frankle formalized this in 2024, showing that accounting for realistic inference demand can reduce total cost by 28% compared to the Chinchilla allocation—by over-training a smaller model.

import numpy as np

def total_cost(N, D_train, n_queries, avg_tokens_per_query=500):
    """Total lifetime cost = training + inference."""
    C_train = 6 * N * D_train
    C_per_query = 2 * N * avg_tokens_per_query
    C_inference_total = C_per_query * n_queries
    return C_train, C_inference_total, C_train + C_inference_total

def chinchilla_loss_fn(N, D):
    """Chinchilla loss prediction."""
    return 1.69 + 406.4 / N ** 0.34 + 410.7 / D ** 0.28

# Compare: Chinchilla-optimal 70B vs over-trained 8B
print("=== Chinchilla-optimal: 70B model, 1.4T tokens ===")
t70, i70, tot70 = total_cost(70e9, 1.4e12, n_queries=1e9)
l70 = chinchilla_loss_fn(70e9, 1.4e12)
print(f"  Training cost:   {t70:.2e} FLOPs")
print(f"  Inference cost:  {i70:.2e} FLOPs (1B queries)")
print(f"  Total cost:      {tot70:.2e} FLOPs")
print(f"  Expected loss:   {l70:.3f}")

print("\n=== Inference-optimal: 8B model, 15T tokens ===")
t8, i8, tot8 = total_cost(8e9, 15e12, n_queries=1e9)
l8 = chinchilla_loss_fn(8e9, 15e12)
print(f"  Training cost:   {t8:.2e} FLOPs")
print(f"  Inference cost:  {i8:.2e} FLOPs (1B queries)")
print(f"  Total cost:      {tot8:.2e} FLOPs")
print(f"  Expected loss:   {l8:.3f}")

print(f"\nInference savings: {i70/i8:.1f}x cheaper per query with 8B model")
print(f"Total savings:     {tot70/tot8:.1f}x cheaper overall at 1B queries")

# Output:
# === Chinchilla-optimal: 70B model, 1.4T tokens ===
#   Training cost:   5.88e+23 FLOPs
#   Inference cost:  7.00e+22 FLOPs (1B queries)
#   Total cost:      6.58e+23 FLOPs
#   Expected loss:   1.937
#
# === Inference-optimal: 8B model, 15T tokens ===
#   Training cost:   7.20e+23 FLOPs
#   Inference cost:  8.00e+21 FLOPs (1B queries)
#   Total cost:      7.28e+23 FLOPs
#   Expected loss:   1.949
#
# Inference savings: 8.8x cheaper per query with 8B model
# Total savings:     0.9x cheaper overall at 1B queries

The 8B model costs more to train (15T tokens vs 1.4T) but is 8.8x cheaper per query. At a billion queries, the 70B model is still slightly cheaper overall—but the crossover happens around 2 billion queries, and by 10 billion queries the 8B model is 1.6x cheaper total. For any model serving real production traffic, inference dominates. This is the economic logic driving the entire open-weight model ecosystem.
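The crossover point follows directly from the cost model, keeping the same assumed 500 tokens per query: set the two total costs equal and solve for the query count.

```python
# Crossover query count: solve C_train_70 + n*c_70 = C_train_8 + n*c_8
C_train_70 = 6 * 70e9 * 1.4e12      # 5.88e23 FLOPs
C_train_8  = 6 * 8e9 * 15e12        # 7.20e23 FLOPs
c_70 = 2 * 70e9 * 500               # per-query inference cost, 70B model
c_8  = 2 * 8e9 * 500                # per-query inference cost, 8B model

n_star = (C_train_8 - C_train_70) / (c_70 - c_8)
print(f"total-cost crossover at {n_star:.2e} queries")   # ≈ 2.13e+09
```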

The Emergent Abilities Debate

Scaling laws predict loss with stunning precision. But does predictable loss translate to predictable capabilities? In 2022, Jason Wei and colleagues documented emergent abilities—tasks that seem to be impossible below a certain model size and then suddenly work above it. Three-digit addition. Word unscrambling. Multi-step logical reasoning. The model can’t do it, can’t do it, can’t do it—and then suddenly it can.

This finding electrified the field and fueled both excitement (“larger models will develop qualitatively new capabilities!”) and anxiety (“we can’t predict when dangerous capabilities will appear!”).

Then in 2023, Rylan Schaeffer and colleagues published a NeurIPS Outstanding Paper making a provocative counterargument: emergence is a mirage, an artifact of the evaluation metrics, not the models. Their key finding: 92% of the “emergent” tasks in the BIG-Bench benchmark used discontinuous metrics like exact-match accuracy (either the answer is perfectly right or it scores zero). When the same model outputs were re-scored with continuous metrics (like token edit distance—how close was the answer?), the sharp jumps vanished. Performance improved smoothly and predictably the whole time.

The model wasn’t suddenly gaining a new ability. It was gradually getting better at the task, and the metric was hiding that improvement until the model crossed the threshold of getting it exactly right.
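A toy calculation makes the mirage concrete (the numbers here are invented for illustration, not Schaeffer et al.’s data): let per-token accuracy rise smoothly with scale, and score a task that requires a 30-token answer to be exactly right.

```python
import numpy as np

# Per-token accuracy improves smoothly as a power law in model size...
model_sizes = np.logspace(6, 14, 9)          # 1M .. 100T parameters (toy range)
per_token = 1 - 2.0 * model_sizes ** -0.12   # smooth, invented improvement curve

# ...but exact-match on a 30-token answer scores (per-token accuracy)^30,
# which sits near zero for small models and then appears to "switch on".
exact_match = per_token ** 30

for N, pt, em in zip(model_sizes, per_token, exact_match):
    print(f"N={N:.0e}  per-token={pt:.3f}  exact-match={em:.6f}")
```

The per-token column creeps up steadily; the exact-match column jumps by orders of magnitude. Same underlying model behavior, two very different stories depending on the metric.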

The current consensus leans toward Schaeffer: per-token loss improves smoothly according to scaling laws. Apparent “phase transitions” in downstream tasks are primarily measurement artifacts from coarse evaluation. This doesn’t make scaling less interesting—it makes it more predictable. And it’s a powerful reminder from our loss functions post: the choice of evaluation metric matters enormously.

Data, Repetition, and the Token Exhaustion Crisis

Scaling laws assume an infinite supply of unique training data. Reality is less generous. Muennighoff et al. (2023) showed that repeating data has sharply diminishing returns: up to about 4 epochs of repetition, the loss degradation is negligible. Beyond that, the value of additional compute decays toward zero. The model starts memorizing rather than learning.

This creates a looming problem. The total supply of high-quality, deduplicated English text on the internet is estimated at roughly 3–5 trillion tokens. LLaMA-3 was trained on 15 trillion tokens—already far exceeding the English supply by mixing in code, multilingual data, and synthetic text generated by other models.

The binding constraint on scaling may not be compute or parameters—it may be data. Several approaches are emerging to address this: synthetic data generation (using large models to generate training data for smaller ones), data quality improvements (better filtering and deduplication), multi-modal training (images, audio, and video as additional data sources), and curriculum learning (presenting data in an optimized order to extract more learning per token).

The data mixture matters too: the optimal ratio of code to natural language to mathematics to instruction-following data changes with model size. Smaller models benefit from a narrower, more focused training diet, while larger models can productively absorb a broader mixture.

Test-Time Compute: The New Scaling Axis

Training compute was the first scaling axis. Inference compute is emerging as the second. Rather than building a bigger model, what if you let a smaller model think longer on each query?

Two mechanisms have proven effective: parallel sampling, where the model generates many candidate answers and a verifier or majority vote selects the best, and sequential revision, where the model critiques and refines its own earlier attempts.

Snell et al. (2024) showed a remarkable result: a smaller model with optimal test-time compute allocation can outperform a model that is 14x larger using the same total FLOPs. The key insight is that test-time compute has a highly uneven return—easy questions gain almost nothing from extra thinking, while hard questions benefit enormously. Optimal allocation means spending more compute on harder problems.
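The uneven return shows up even in the simplest strategy. A sketch, assuming an idealized verifier that always recognizes a correct sample (an assumption; real verifiers are imperfect): if one sample solves a problem with probability p, best-of-N solves it with probability 1 − (1 − p)^N.

```python
# Best-of-N success probability under an idealized (always-correct) verifier.
for p in (0.9, 0.5, 0.1):
    row = "  ".join(f"N={n}: {1 - (1 - p) ** n:.4f}" for n in (1, 4, 16, 64))
    print(f"p={p}: {row}")
# Easy problems (p=0.9) saturate almost immediately; hard problems (p=0.1)
# keep gaining for many more samples -- spend the budget where it pays.
```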

OpenAI’s o1 and o3 models operationalize this idea. They use extended chain-of-thought reasoning—sometimes generating thousands of tokens of internal deliberation before producing an answer. This is test-time compute scaling in action, and it represents a new dimension of the scaling equation that Kaplan’s original formulation didn’t consider.

Richard Sutton captured the broader pattern in his influential 2019 essay “The Bitter Lesson”: general methods that leverage computation always win in the long run. Hand-crafted chess engines were beaten by brute-force search. Custom vision features were beaten by learned features on bigger datasets. Task-specific NLP pipelines were beaten by scaled-up language models. The pattern is consistent: simple methods + massive compute beats clever methods + limited compute.

Try It: Scaling Law Explorer


Explore how loss decreases as a power law with model size, data, and compute. Adjust the exponent and irreducible loss to see their effect.


Try It: Compute Allocator


Given a fixed compute budget, see how Kaplan and Chinchilla allocate differently between model size and training data. The heatmap shows the Chinchilla loss surface; the white curve is the isoFLOP constraint.


The Road Ahead

Scaling laws are the closest thing AI has to physics equations—predictive, precise, and governing billions of dollars in investment decisions. They’ve been the secret weapon of every major AI lab: train a small model, measure the exponent, extrapolate the curve, and decide whether the full-scale training run is worth the cost.

We’ve traversed three eras: Kaplan (scale parameters aggressively), Chinchilla (scale both equally), and LLaMA (over-train for inference). Each correction didn’t invalidate the power law—it refined the exponents and the optimization target. The underlying truth remains: more compute yields predictably better models, and power laws tell you exactly how much better.

The open questions are fascinating. Will data exhaustion bend the scaling curves? Can synthetic data keep the laws alive? Is test-time compute a genuine new axis of scaling, or does it hit diminishing returns more quickly? And the deepest question of all: do these power laws continue indefinitely, or is there a wall where neural networks stop getting predictably better?

Whatever the answers, the scaling laws have taught us something profound about intelligence—artificial and otherwise. Simple learning algorithms, given enough data and compute, produce capabilities that no amount of hand-engineering could match. That’s not just a fact about AI. It might be a fact about learning itself.

Connections to the Elementary Series

References & Further Reading