Instruction Tuning from Scratch: How Raw LLMs Learn to Follow Directions
The Gap Between Pre-Training and Usefulness
Imagine you've spent millions of dollars training a language model on trillions of tokens. You type in your first question: "What is the capital of France?" The model responds:
# What a base model actually does
prompt = "What is the capital of France?"
# Base model output (next-token prediction):
# "What is the capital of Germany?
# What is the capital of Italy?
# What is the capital of Spain?"
It continued the pattern of questions instead of answering yours. The model isn't stupid -- it absolutely knows Paris is the capital of France. It just doesn't know you wanted an answer.
This is the fundamental gap between pre-training and usefulness. A pre-trained model is a next-token predictor: given some text, it predicts what text comes next. And when the input looks like a list of questions, the most likely next tokens form... another question. Here are three more failure modes:
- Code comments instead of code: You write "# Function to sort a list" and the model generates "# Function to reverse a list", "# Function to find the maximum" -- more comments, not the actual function.
- Conversation transcription: You format a dialogue with "User:" and "Assistant:" labels. The model continues with "User: What time is it?" -- it's predicting a transcript, not participating in the conversation.
- Rambling instead of conciseness: You ask for a one-word answer and get four paragraphs of context before the actual answer appears, buried at the end.
Here's the key insight that makes the rest of this post click: instruction tuning is format teaching, not knowledge teaching. The model already has nearly all the knowledge it needs from pre-training -- it has seen countless capitals, programming patterns, and conversational structures. What it lacks is the behavioral protocol: when someone writes a question, respond with an answer. When someone asks for code, write code. When someone says "be concise," be concise.
This is why instruction tuning works with shockingly little data. You're not teaching the model new facts (that would require millions of examples). You're teaching a new interaction pattern (that requires surprisingly few). If you've read our post on in-context learning, think of instruction tuning as permanent format teaching -- burning the behavioral pattern into the weights so it doesn't need to be demonstrated every time in the prompt.
What Instruction Tuning Data Looks Like
If instruction tuning is just format teaching, what does the "textbook" look like? It's a collection of (instruction, response) pairs showing the model the behavior you want. Here's the catch: there are multiple data formats because the field evolved rapidly, and each format reflects a different assumption about what instructions look like.
import numpy as np
# Two common instruction tuning data formats
# 1. Alpaca format -- simple, single-turn
alpaca_examples = [
{"instruction": "What is the capital of France?",
"input": "",
"output": "The capital of France is Paris."},
{"instruction": "Translate the following to Spanish.",
"input": "The weather is beautiful today.",
"output": "El clima está hermoso hoy."},
{"instruction": "Write a Python function that checks if a number is prime.",
"input": "",
"output": "def is_prime(n):\n if n < 2:\n return False\n for i in range(2, int(n**0.5) + 1):\n if n % i == 0:\n return False\n return True"},
{"instruction": "Summarize the key idea in one sentence.",
"input": "Photosynthesis converts sunlight into chemical energy...",
"output": "Plants convert sunlight into usable energy through photosynthesis."},
{"instruction": "List three benefits of regular exercise.",
"input": "",
"output": "1. Improved cardiovascular health\n2. Better mental clarity\n3. Stronger immune system"},
]
# 2. ChatML / OpenAI format -- multi-turn conversations
chatml_example = [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "How do I read a CSV file in Python?"},
{"role": "assistant", "content": "Use pandas: pd.read_csv('file.csv')"},
{"role": "user", "content": "What if the file has no header?"},
{"role": "assistant", "content": "Pass header=None: pd.read_csv('file.csv', header=None)"},
]
# Dataset statistics
instructions = [ex["instruction"] for ex in alpaca_examples]
responses = [ex["output"] for ex in alpaca_examples]
avg_inst_len = np.mean([len(s.split()) for s in instructions])
avg_resp_len = np.mean([len(s.split()) for s in responses])
print(f"Dataset size: {len(alpaca_examples)} examples")
print(f"Avg instruction length: {avg_inst_len:.1f} words")
print(f"Avg response length: {avg_resp_len:.1f} words")
print(f"Ratio (response/instruction): {avg_resp_len/avg_inst_len:.1f}x")
# Dataset size: 5 examples
# Avg instruction length: 7.0 words
# Avg response length: 11.2 words
# Ratio (response/instruction): 1.6x
The Alpaca format is the simplest: an instruction, an optional input, and the expected output. The ChatML format is richer -- it supports multi-turn conversations and a system prompt that shapes the model's personality across all interactions.
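The two formats carry the same information, so converting between them is mechanical. Here is a minimal sketch, assuming a hypothetical helper named alpaca_to_messages and an assumed convention (not part of either format) of joining the instruction and the input with a blank line:

```python
def alpaca_to_messages(example, system_prompt=None):
    """Convert one Alpaca-style record into a ChatML-style message list.

    Hypothetical helper for illustration. Joining instruction and input
    with a blank line is an assumed convention, not a standard.
    """
    user_content = example["instruction"]
    if example.get("input"):
        user_content += "\n\n" + example["input"]

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": example["output"]})
    return messages


record = {
    "instruction": "Translate the following to Spanish.",
    "input": "The weather is beautiful today.",
    "output": "El clima está hermoso hoy.",
}
for message in alpaca_to_messages(record, system_prompt="You are a helpful assistant."):
    print(message["role"], "->", repr(message["content"]))
```

Going the other way is lossy: a multi-turn conversation has no single "instruction", which is one reason chat-style formats became the default for assistants.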
Now here's the result that changed the field. In 2023, Zhou et al. published the LIMA paper with a finding that made everyone rethink their dataset strategies: roughly 1,000 high-quality examples can be enough to instruction-tune a capable model. In human preference comparisons, a 65B LLaMA fine-tuned on just 1,000 carefully curated examples outperformed a same-size model trained on Alpaca's 52,000 machine-generated ones. Their "Superficial Alignment Hypothesis" states it plainly: a model's knowledge is learned almost entirely during pre-training; instruction tuning mostly teaches the format. This means 1,000 perfect examples of "how to be an assistant" can beat 50,000 noisy ones -- because you're teaching a behavioral pattern, not a knowledge base.
The Training Pipeline -- SFT in Practice
The training loop for instruction tuning looks deceptively similar to pre-training -- it's still next-token prediction with cross-entropy loss. But there's one critical difference that separates a well-trained assistant from a parrot: loss masking on the instruction tokens.
Think about it: given the sequence [INST] What is 2+2? [/INST] The answer is 4., we only want the model to learn to generate "The answer is 4." We don't want it learning to reproduce "What is 2+2?" -- that's the user's job. So we mask the instruction tokens: they still participate in the forward pass (the model attends to them to understand the context), but their positions are excluded from the loss, so the model is never trained to predict the instruction itself.
import numpy as np
def sft_training_step(model_logits, token_ids, response_start_idx, vocab_size):
    """One step of supervised fine-tuning with instruction masking.
    model_logits: (seq_len, vocab_size) -- model predictions
    token_ids: (seq_len,) -- ground truth token IDs
    response_start_idx: where the response begins (mask everything before)
    """
    # Cross-entropy loss only on response tokens.
    # logits[t] predicts token t + 1, so the first response token is
    # predicted at position response_start_idx - 1.
    first_pred = max(response_start_idx - 1, 0)
    total_loss = 0.0
    count = 0
    for t in range(first_pred, len(token_ids) - 1):
        # Softmax over vocabulary
        logits = model_logits[t]
        logits = logits - np.max(logits)  # numerical stability
        probs = np.exp(logits) / np.sum(np.exp(logits))
        # Cross-entropy: -log(probability of correct token)
        target = token_ids[t + 1]
        total_loss += -np.log(probs[target] + 1e-10)
        count += 1
    return total_loss / max(count, 1)
# Demonstrate on a toy example
np.random.seed(42)
seq_len, vocab_size = 20, 100
logits = np.random.randn(seq_len, vocab_size)
tokens = np.random.randint(0, vocab_size, seq_len)
# Instruction is tokens 0-9, response is tokens 10-19
loss_full = sft_training_step(logits, tokens, 0, vocab_size)
loss_masked = sft_training_step(logits, tokens, 10, vocab_size)
print(f"Loss on full sequence: {loss_full:.4f}")
print(f"Loss on response only: {loss_masked:.4f}")
print(f"Tokens used for loss (full): {len(tokens) - 1}")
print(f"Tokens used for loss (masked): {len(tokens) - 10}")
print(f"\nThe masked loss only optimizes response generation.")
print(f"Instruction tokens are read as context but add no loss terms.")
# (Loss values depend on the random logits and are omitted here.)
# Tokens used for loss (full): 19
# Tokens used for loss (masked): 10
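The per-token loop above is easy to read, but the same idea is usually written with a boolean mask over token positions. That form also covers multi-turn conversations, where several separate assistant spans are trained on. Here is a sketch in NumPy, assuming a hypothetical masked_cross_entropy helper and a hand-written mask (a real pipeline would derive the mask from the prompt template):

```python
import numpy as np

def masked_cross_entropy(logits, token_ids, loss_mask):
    """Mean next-token cross-entropy over the positions selected by loss_mask.

    logits:    (seq_len, vocab_size) model predictions
    token_ids: (seq_len,) ground-truth token IDs
    loss_mask: (seq_len,) bool, True where the token is one we train on
    """
    # logits[t] predicts token_ids[t + 1], so drop the last logit row
    # and the first token, then align the mask with the targets.
    pred_logits = logits[:-1]
    targets = token_ids[1:]
    target_mask = loss_mask[1:]

    # Numerically stable log-softmax
    shifted = pred_logits - pred_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    token_nll = -log_probs[np.arange(len(targets)), targets]
    n_trained = target_mask.sum()
    return float((token_nll * target_mask).sum() / max(n_trained, 1))


rng = np.random.default_rng(0)
seq_len, vocab_size = 12, 50
logits = rng.normal(size=(seq_len, vocab_size))
tokens = rng.integers(0, vocab_size, size=seq_len)

# Two-turn conversation: user tokens 0-3, assistant 4-6, user 7-8, assistant 9-11
mask = np.zeros(seq_len, dtype=bool)
mask[4:7] = True
mask[9:12] = True

print(f"Trained positions: {int(mask[1:].sum())} of {seq_len - 1}")
print(f"Masked loss: {masked_cross_entropy(logits, tokens, mask):.4f}")
```

Dividing by the number of trained positions rather than the sequence length keeps the loss scale comparable between examples with long and short instructions.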
The second critical detail is the learning rate. Pre-training often uses peak learning rates around 3e-4 to 1e-3. Instruction tuning typically uses 1e-5 to 5e-5 -- roughly 10-100x smaller. Why? Because you don't want to disturb the pre-trained knowledge. You're making small, precise adjustments to behavioral patterns while leaving the vast knowledge base intact. Too high a learning rate and you'll suffer catastrophic forgetting (more on that in "When Instruction Tuning Goes Wrong" below).
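A small peak learning rate is commonly paired with a short warmup and a decay schedule. Below is a minimal sketch of linear warmup followed by cosine decay. The function name sft_learning_rate, the 2e-5 peak, and the 30 warmup steps are illustrative assumptions, not values from any particular paper:

```python
import math

def sft_learning_rate(step, total_steps, peak_lr=2e-5, warmup_steps=30, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    All defaults are illustrative assumptions; tune them for your run.
    """
    warmup_steps = max(1, warmup_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = 1000
for step in [0, 15, 29, 30, 500, 999]:
    print(f"step {step:>4}: lr = {sft_learning_rate(step, total):.2e}")
```

The warmup avoids large updates while optimizer statistics are still noisy, and the decay lets the final steps make progressively smaller adjustments.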
The loss curve during SFT training often tells an interesting story: it drops fast in roughly the first 100 steps (the model quickly learns the instruction-response format), then improves slowly over thousands of steps (content quality and nuance refinement). This two-phase curve is consistent with the idea that format learning and knowledge learning are separable -- the format is easy, the refinement is hard.
Prompt Templates and Chat Formats
The most underappreciated aspect of instruction tuning isn't the training algorithm -- it's the template. Get it wrong, and your perfectly trained model produces gibberish.
A prompt template defines the "communication protocol" between user and model. It tells the model where instructions end and where responses should begin. Here are the three templates that dominate the field:
def apply_template(instruction, response, template_name):
    """Convert an instruction-response pair into a formatted training sequence.
    Returns (formatted_text, response_start_position), measured in characters."""
    if template_name == "chatml":
        # OpenAI / ChatML format
        prefix = (
            "<|im_start|>system\n"
            "You are a helpful assistant.<|im_end|>\n"
            "<|im_start|>user\n"
            f"{instruction}<|im_end|>\n"
            "<|im_start|>assistant\n"
        )
        full = prefix + response + "<|im_end|>"
    elif template_name == "llama2":
        # Meta Llama-2 chat format
        prefix = (
            f"[INST] <<SYS>>\nYou are a helpful assistant.\n"
            f"<</SYS>>\n\n{instruction} [/INST] "
        )
        full = prefix + response
    elif template_name == "alpaca":
        # Stanford Alpaca format
        prefix = (
            f"### Instruction:\n{instruction}\n\n"
            f"### Response:\n"
        )
        full = prefix + response
    else:
        raise ValueError(f"Unknown template: {template_name}")
    return full, len(prefix)
# Compare templates for the same instruction
instruction = "Explain recursion in one sentence."
response = "Recursion is when a function calls itself to solve smaller subproblems."
for tmpl in ["chatml", "llama2", "alpaca"]:
    text, split = apply_template(instruction, response, tmpl)
    prefix_part = text[:split]
    response_part = text[split:]
    print(f"--- {tmpl.upper()} (response starts at char {split}) ---")
    print(f"INSTRUCTION: {prefix_part[:80]}...")
    print(f"RESPONSE: {response_part}")
    print(f"Total chars: {len(text)}\n")
# --- CHATML (response starts at char 142) ---
# INSTRUCTION: <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# Expla...
# RESPONSE: Recursion is when a function calls itself to solve smaller subproblems.<|im_end|>
# Total chars: 223
# (LLAMA2 and ALPACA output omitted)
Those markers -- <|im_start|>, [INST], ### Instruction: -- have no built-in meaning. Some are added to the vocabulary as dedicated special tokens and others are ordinary text that gets tokenized like anything else; either way, the model learns during training that they signal "switch modes now." This is why template mismatch is a common production bug: a model trained with ChatML but served with Llama-2 format can produce incoherent output because the mode-switching markers don't match what it learned.
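One cheap guard against template mismatch is to check, at serving time, that a rendered prompt contains the markers of the template the model was trained on and none from the others. Here is a sketch, assuming a hypothetical TEMPLATE_MARKERS table built from the three templates shown above:

```python
# Marker strings taken from the three templates in this post.
# The table and the helper below are illustrative, not a library API.
TEMPLATE_MARKERS = {
    "chatml": ["<|im_start|>", "<|im_end|>"],
    "llama2": ["[INST]", "[/INST]"],
    "alpaca": ["### Instruction:", "### Response:"],
}

def check_template(prompt, trained_template):
    """Return a list of problems found in a rendered prompt (empty if none)."""
    problems = []
    for marker in TEMPLATE_MARKERS[trained_template]:
        if marker not in prompt:
            problems.append(f"missing expected marker {marker!r}")
    for name, markers in TEMPLATE_MARKERS.items():
        if name == trained_template:
            continue
        for marker in markers:
            if marker in prompt:
                problems.append(f"found {name} marker {marker!r}")
    return problems


served_prompt = "[INST] Explain recursion in one sentence. [/INST] "
for problem in check_template(served_prompt, trained_template="chatml"):
    print("MISMATCH:", problem)
```

A string check like this only catches gross mismatches. It will not notice subtler drift such as a missing newline after a role name, which can also degrade output.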
The system prompt deserves special attention. "You are a helpful, concise assistant" during training teaches conciseness for all instructions, not just specific ones. It's the persistent behavioral modifier that shapes every response. Multi-turn templates extend this by encoding conversation history with alternating delimiters, teaching the model the rhythm of dialogue.
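To make the multi-turn case concrete, here is a sketch that renders a whole message list in the ChatML layout used above and records which character spans belong to the assistant. The function name render_chatml and the add_generation_prompt flag are assumptions for illustration:

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts in a ChatML-style layout.

    Returns the text plus (start, end) character spans of assistant turns.
    Each span includes the closing <|im_end|>, so a model trained on these
    spans also learns when to stop.
    """
    text = ""
    assistant_spans = []
    for message in messages:
        if text:
            text += "\n"
        text += f"<|im_start|>{message['role']}\n"
        start = len(text)
        text += message["content"] + "<|im_end|>"
        if message["role"] == "assistant":
            assistant_spans.append((start, len(text)))
    if add_generation_prompt:
        # Open an assistant turn so the model writes the next reply
        text += ("\n" if text else "") + "<|im_start|>assistant\n"
    return text, assistant_spans


conversation = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a CSV file in Python?"},
    {"role": "assistant", "content": "Use pandas: pd.read_csv('file.csv')"},
    {"role": "user", "content": "What if the file has no header?"},
]
text, spans = render_chatml(conversation, add_generation_prompt=True)
print(text)
for start, end in spans:
    print("train on:", repr(text[start:end]))
```

During training you would render the full conversation without the generation prompt and compute loss only inside the recorded spans. At inference you render the history with the generation prompt and let the model complete the open assistant turn.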
The Before and After -- Measuring the Transformation
Time for the payoff. Let's build an evaluation framework and see what instruction tuning actually changes -- and illustrate the claim that it teaches format, not facts.
import numpy as np
def evaluate_instruction_following(responses, references, mode="sft"):
    """Evaluate model responses on 5 dimensions.
    Returns scores per dimension (0.0 to 1.0)."""
    np.random.seed(42 if mode == "base" else 99)
    n = len(responses)
    # Simulated scoring (in practice, use LLM-as-judge or human eval)
    if mode == "base":
        format_scores = np.clip(np.random.normal(0.15, 0.10, n), 0, 1)
        adherence = np.clip(np.random.normal(0.20, 0.12, n), 0, 1)
        conciseness = np.clip(np.random.normal(0.10, 0.08, n), 0, 1)
        accuracy = np.clip(np.random.normal(0.62, 0.15, n), 0, 1)
        safety = np.clip(np.random.normal(0.45, 0.20, n), 0, 1)
    else:  # instruction-tuned
        format_scores = np.clip(np.random.normal(0.88, 0.08, n), 0, 1)
        adherence = np.clip(np.random.normal(0.85, 0.10, n), 0, 1)
        conciseness = np.clip(np.random.normal(0.82, 0.10, n), 0, 1)
        accuracy = np.clip(np.random.normal(0.64, 0.14, n), 0, 1)
        safety = np.clip(np.random.normal(0.91, 0.06, n), 0, 1)
    return {
        "Format compliance": np.mean(format_scores),
        "Instruction adherence": np.mean(adherence),
        "Conciseness": np.mean(conciseness),
        "Factual accuracy": np.mean(accuracy),
        "Safety": np.mean(safety),
    }
# Evaluate 100 diverse instructions (placeholders, since the scores are simulated)
dummy = [""] * 100
base_scores = evaluate_instruction_following(dummy, dummy, mode="base")
sft_scores = evaluate_instruction_following(dummy, dummy, mode="sft")
print(f"{'Dimension':<25} {'Base Model':>12} {'After SFT':>12} {'Change':>10}")
print("-" * 62)
for dim in base_scores:
    b, s = base_scores[dim], sft_scores[dim]
    delta = s - b
    # The "+" flag prints an explicit sign directly in front of the number
    print(f"{dim:<25} {b:>11.1%} {s:>11.1%} {delta:>+10.1%}")
#
# Dimension Base Model After SFT Change
# --------------------------------------------------------------
# Format compliance 14.9% 87.7% +72.8%
# Instruction adherence 20.6% 84.5% +63.9%
# Conciseness 10.0% 81.4% +71.4%
# Factual accuracy 60.5% 63.2% +2.7%
# Safety 43.5% 90.8% +47.3%
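Look at that table. Format compliance jumps from 15% to 88%. Instruction adherence goes from 21% to 85%. But factual accuracy barely moves -- 60.5% to 63.2%. The scores here are simulated, so treat the table as an illustration rather than a measurement, but the pattern is the one the Superficial Alignment Hypothesis predicts: instruction tuning transforms how the model responds, not what it knows.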
Let's make this visceral with specific examples:
| Instruction | Base Model | After SFT |
|---|---|---|
| "Write a haiku about programming" | Writes an essay about haiku as a poetic form, mentioning that programming haiku are popular... | Bugs hide in the code Silent semicolons weep It works -- ship it now |
| "List 3 benefits of exercise" | List 3 benefits of reading. List 3 benefits of sleep. List 3 benefits of... | 1. Improved cardiovascular health 2. Better mental clarity 3. Stronger immune system |
| "Translate 'hello' to French" | Translate 'goodbye' to French. Translate 'thank you' to French. Translate... | Bonjour |
The base model's responses aren't wrong -- they're logical next-token predictions. A list of questions continues with more questions. A list of translation tasks continues with more tasks. Instruction tuning breaks this pattern by teaching the model: "When you see a question, answer it. When you see a command, execute it."
When Instruction Tuning Goes Wrong
Instruction tuning can fail in three spectacular ways -- and each one is invisible until you know where to look.
import numpy as np
def detect_sft_failures(training_log):
    """Diagnose three common SFT failure modes from training metrics.
    training_log: list of dicts with keys:
    'step', 'loss', 'avg_response_length', 'perplexity_held_out',
    'unique_trigram_ratio'
    """
    warnings = []
    # Failure 1: Catastrophic forgetting
    # Signal: held-out perplexity increases during training
    ppl = [e["perplexity_held_out"] for e in training_log]
    if len(ppl) > 5 and ppl[-1] > ppl[0] * 1.25:
        warnings.append(
            f"FORGETTING: Held-out perplexity rose from {ppl[0]:.1f} to "
            f"{ppl[-1]:.1f} (+{(ppl[-1]/ppl[0]-1)*100:.0f}%). "
            f"Reduce learning rate or use LoRA."
        )
    # Failure 2: Verbose rambling
    # Signal: response length grows while loss plateaus
    lengths = [e["avg_response_length"] for e in training_log]
    losses = [e["loss"] for e in training_log]
    if len(lengths) > 5:
        length_growth = lengths[-1] / max(lengths[0], 1)
        loss_improvement = losses[0] - losses[-1]
        if length_growth > 1.5 and loss_improvement < 0.1:
            warnings.append(
                f"VERBOSITY: Avg response grew {length_growth:.1f}x but loss "
                f"only improved by {loss_improvement:.3f}. "
                f"Balance response lengths in training data."
            )
    # Failure 3: Style collapse
    # Signal: unique trigram ratio decreases over training
    diversity = [e["unique_trigram_ratio"] for e in training_log]
    if len(diversity) > 5 and diversity[-1] < diversity[0] * 0.7:
        warnings.append(
            f"STYLE COLLAPSE: Trigram diversity dropped from "
            f"{diversity[0]:.2f} to {diversity[-1]:.2f}. "
            f"Diversify training data or reduce epochs."
        )
    return warnings if warnings else ["All diagnostics healthy."]
# Simulate a healthy training run
np.random.seed(42)
healthy_log = []
for step in range(0, 1001, 100):
    healthy_log.append({
        "step": step,
        "loss": 2.5 * np.exp(-step / 400) + 0.8,
        "avg_response_length": 45 + np.random.normal(0, 3),
        "perplexity_held_out": 15.0 + np.random.normal(0, 0.5),
        "unique_trigram_ratio": 0.72 + np.random.normal(0, 0.02),
    })
# Simulate a failing run (forgetting + verbosity + style collapse)
failing_log = []
for step in range(0, 1001, 100):
    failing_log.append({
        "step": step,
        # Loss has already plateaued, so the verbosity check can fire
        "loss": 1.0 + 0.05 * np.exp(-step / 400),
        "avg_response_length": 45 + step * 0.08,
        "perplexity_held_out": 15.0 + step * 0.015,
        "unique_trigram_ratio": 0.72 - step * 0.0003,
    })
print("=== Healthy Run ===")
for w in detect_sft_failures(healthy_log):
    print(f" {w}")
print("\n=== Failing Run ===")
for w in detect_sft_failures(failing_log):
    print(f" {w}")
# === Healthy Run ===
# All diagnostics healthy.
#
# === Failing Run ===
# FORGETTING: Held-out perplexity rose from 15.0 to 30.0 (+100%). ...
# VERBOSITY: Avg response grew 2.8x but loss only improved by ...
# STYLE COLLAPSE: Trigram diversity dropped from 0.72 to 0.42. ...
Catastrophic forgetting is the scariest failure: the model loses its pre-training knowledge during fine-tuning. You tune it to answer questions, and it forgets how to write coherent English. The fix? Use a smaller learning rate, limit training to 1-3 epochs, or use LoRA instead of full fine-tuning -- LoRA only updates a tiny fraction of parameters, leaving the vast majority of pre-trained knowledge intact.
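To see why LoRA touches so few parameters, here is a NumPy sketch of a single adapted layer. The pre-trained weight W stays frozen and only the two low-rank factors A and B would be trained. The dimensions, rank, and alpha are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 512, 512, 8, 16  # illustrative sizes

W = rng.normal(scale=0.02, size=(d_out, d_in))  # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable, small random init
B = np.zeros((d_out, rank))                     # trainable, zero init

def lora_forward(x):
    """y = W x + (alpha / rank) * B (A x); only A and B would receive updates."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B = 0 the adapter is a no-op, so training starts exactly at the base model.
print("Matches base model at init:", np.allclose(lora_forward(x), W @ x))

full_params = W.size
lora_params = A.size + B.size
print(f"Full fine-tuning params: {full_params:,}")
print(f"LoRA params:             {lora_params:,} ({lora_params / full_params:.1%} of full)")
```

Because W itself never changes, removing the adapter (or zeroing B) restores the original model exactly, which is a useful safety net when a fine-tune goes wrong.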
Verbose rambling happens when the training data is biased toward long responses. The model learns that more words = better, regardless of the question. A simple "What's 2+2?" generates a three-paragraph explanation of number theory. The fix is straightforward: balance response lengths in your training data.
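One simple way to balance response lengths is to bucket examples by word count and cap how many each bucket contributes. This is a sketch under assumptions: the helper name balance_by_length, the bucket edges, and the per-bucket cap are all illustrative choices:

```python
import random
from collections import defaultdict

def balance_by_length(examples, bucket_edges=(20, 100), max_per_bucket=2, seed=0):
    """Cap how many examples each response-length bucket contributes.

    bucket_edges are word-count thresholds; the values here are
    illustrative assumptions, not recommendations.
    """
    buckets = defaultdict(list)
    for example in examples:
        n_words = len(example["output"].split())
        # Number of edges reached: 0 = short, 1 = medium, 2 = long
        bucket = sum(n_words >= edge for edge in bucket_edges)
        buckets[bucket].append(example)

    rng = random.Random(seed)
    balanced = []
    for bucket in sorted(buckets):
        items = buckets[bucket]
        rng.shuffle(items)
        balanced.extend(items[:max_per_bucket])
    return balanced


data = (
    [{"instruction": f"q{i}", "output": "word " * 150} for i in range(6)]  # long
    + [{"instruction": f"s{i}", "output": "Paris."} for i in range(2)]     # short
)
balanced = balance_by_length(data)
print(f"Before: {len(data)} examples, after: {len(balanced)}")
```

Capping discards data, so with a small curated set it may be preferable to write more short examples instead of dropping long ones.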
Style collapse is the most subtle: every response starts sounding the same. "Certainly! I'd be happy to help..." regardless of whether you asked for code, poetry, or a yes-or-no answer. This happens when training data is too homogeneous or you train for too many epochs. The model memorizes a single "assistant voice" and applies it universally.
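The detector above reads a unique_trigram_ratio from the training log without showing how to compute it. One plausible definition, offered as an assumption rather than the exact metric used there, is the fraction of word trigrams across a batch of sampled responses that are distinct:

```python
def unique_trigram_ratio(responses):
    """Fraction of word trigrams across all responses that are distinct.

    Near 1.0 means varied wording; lower values mean repeated phrasing.
    Whitespace tokenization and lowercasing are simplifying assumptions.
    """
    trigrams = []
    for response in responses:
        words = response.lower().split()
        trigrams.extend(zip(words, words[1:], words[2:]))
    if not trigrams:
        return 0.0
    return len(set(trigrams)) / len(trigrams)


varied = [
    "Use a loop to sum the list.",
    "Paris is the capital of France.",
    "Recursion means a function calls itself.",
]
collapsed = [
    "Certainly! I'd be happy to help with loops.",
    "Certainly! I'd be happy to help with France.",
    "Certainly! I'd be happy to help with recursion.",
]
print(f"Varied:    {unique_trigram_ratio(varied):.2f}")
print(f"Collapsed: {unique_trigram_ratio(collapsed):.2f}")
# Varied:    1.00
# Collapsed: 0.44
```

Tracking this on a fixed set of evaluation prompts at each checkpoint makes a slide toward a single "assistant voice" visible well before it is obvious in manual spot checks.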
The Instruction Tuning Landscape
Instruction tuning didn't arrive fully formed. It evolved through four landmark papers, each revealing something surprising about what it takes to make a language model useful.
import numpy as np
# The key papers and what they proved
papers = [
{"year": 2022, "name": "InstructGPT", "authors": "Ouyang et al.",
"finding": "1.3B tuned model beats 175B base model",
"insight": "Alignment is more cost-effective than scale",
"sft_examples": 13000},
{"year": 2023, "name": "Self-Instruct", "authors": "Wang et al.",
"finding": "LLMs can generate their own training data",
"insight": "Instruction diversity matters more than human curation",
"sft_examples": 52000},
{"year": 2023, "name": "Alpaca", "authors": "Taori et al.",
"finding": "7B model matches much larger model behaviors",
"insight": "Instruction tuning is accessible to everyone",
"sft_examples": 52000},
{"year": 2023, "name": "LIMA", "authors": "Zhou et al.",
"finding": "1,000 examples beat 52,000 examples",
"insight": "Data quality is everything",
"sft_examples": 1000},
]
# Data scaling illustration: quality vs dataset size
dataset_sizes = [100, 500, 1000, 5000, 10000, 50000]
# Illustrative scores, not measured data: steep early gains followed by
# a plateau, the shape the LIMA result suggests
quality_scores = [0.45, 0.68, 0.82, 0.87, 0.89, 0.90]
print("=== Landmark Papers ===")
for p in papers:
    print(f"\n{p['year']} - {p['name']} ({p['authors']})")
    print(f" Finding: {p['finding']}")
    print(f" Insight: {p['insight']}")
    print(f" SFT examples: {p['sft_examples']:,}")
print("\n\n=== Data Scaling: Quality vs Dataset Size ===")
print(f"{'Examples':>10} {'Quality':>10} {'Marginal gain':>15}")
print("-" * 38)
for i, (n, q) in enumerate(zip(dataset_sizes, quality_scores)):
    # The first row has no smaller dataset to compare against
    gain_str = f"+{q - quality_scores[i - 1]:.0%}" if i > 0 else "n/a"
    bar = "#" * int(q * 30)
    print(f"{n:>10,} {q:>9.0%} {gain_str:>14} {bar}")
# The massive jump is 100 -> 1000 examples (+37 points)
# After 1000, gains are marginal: 1000 -> 50000 gives only +8 points
The evolution tells a story: from "you need massive human annotation" (InstructGPT) to "the model can annotate itself" (Self-Instruct) to "you barely need any data at all" (LIMA). Each step peeled back a layer of complexity, revealing that instruction tuning was simpler than anyone thought.
InstructGPT established the full pipeline: SFT first to teach format, then a reward model trained on human preferences, then RLHF to refine. The shocking result was that human labelers preferred outputs from the 1.3B InstructGPT model over those from 175B GPT-3 -- strong evidence that alignment can matter more than raw scale. Self-Instruct showed you don't even need human-written instructions: you can bootstrap a dataset by asking the model to generate its own examples. Alpaca democratized this by fine-tuning a 7B model on 52K machine-generated instructions for under $600. And then LIMA dropped the bombshell: 1,000 carefully curated examples were enough to beat a same-size model trained on all 52K.
The data scaling numbers above are illustrative rather than measured, but they show the shape the LIMA finding implies: quality rises steeply at first and then flattens. The jump from 100 to 1,000 examples is massive. From 1,000 to 50,000? Barely noticeable. The practical takeaway: invest in 1,000 perfect examples rather than 50,000 mediocre ones.
Conclusion
Instruction tuning is the most important single step in making a language model useful. Not pre-training (that gives knowledge), not RLHF (that refines preferences), but instruction tuning -- the step that teaches a raw text predictor to be a helpful assistant.
The elegance is in the simplicity: take a pre-trained model bursting with knowledge, show it 1,000 examples of the behavior you want, and watch it transform from a pattern-continuation engine into something that answers questions, writes code, and follows complex instructions. No new architecture. No special training tricks. Just supervised learning on well-formatted examples, with the instruction tokens masked so the model learns to respond rather than repeat.
The key lessons: format and knowledge are separable (instruction tuning changes how the model responds, not what it knows). Quality dominates quantity (1,000 perfect examples beat 50,000 mediocre ones). And the template matters more than you think (train with ChatML, serve with Llama format, and watch your model fall apart).
If you've been following the elementary series, instruction tuning fills the missing piece between building a transformer and aligning it with human preferences. It's the bridge that makes everything else possible. Now go forth and tune some models -- starting with 1,000 really good examples.
References & Further Reading
- Ouyang et al. (2022) -- "Training language models to follow instructions with human feedback" -- the InstructGPT paper that established the SFT + reward model + RLHF pipeline
- Wang et al. (2023) -- "Self-Instruct: Aligning Language Models with Self-Generated Instructions" -- bootstrapping instruction datasets from the model itself
- Taori et al. (2023) -- "Stanford Alpaca: An Instruction-following LLaMA Model" -- democratizing instruction tuning with a 7B model for under $600
- Zhou et al. (2023) -- "LIMA: Less Is More for Alignment" -- the 1,000-example bombshell and the Superficial Alignment Hypothesis
- Chung et al. (2022) -- "Scaling Instruction-Finetuned Language Models" -- the FLAN-T5 paper on scaling instruction tuning across tasks
- Fine-Tuning Language Models (DadOps) -- practical fine-tuning with LoRA and adapters
- RLHF from Scratch (DadOps) -- what comes after instruction tuning in the alignment pipeline
- DPO from Scratch (DadOps) -- the simpler alignment alternative that skips the reward model