Synthetic Data Generation: Using LLMs to Build Your Own Training Datasets
The Data Bottleneck — Why Synthetic Data Changes Everything
You've read every fine-tuning tutorial. You understand LoRA, you've picked your base model, you've got your training script ready. Then you hit the wall that stops most fine-tuning projects cold: you don't have any labeled training data.
This is the labeled data bottleneck, and it's the single biggest barrier between "I want to fine-tune a model" and actually doing it. Most teams have plenty of raw text — customer emails, support tickets, product descriptions — but almost no labeled pairs mapping inputs to desired outputs. And hiring human annotators to create those pairs? That's $5,000–$20,000 and 2–4 weeks per thousand examples.
Synthetic data generation flips this equation. Instead of paying humans to create training data, you use a strong model (like GPT-4o or Claude) to generate training data for a weaker, cheaper model. You're distilling intelligence into data — and the economics are staggering:
| Approach | Cost / 1K Examples | Time | Quality | Best For |
|---|---|---|---|---|
| Human Annotation | $5,000 – $20,000 | 2–4 weeks | Gold standard | Regulatory, safety-critical |
| LLM-Generated | $10 – $50 | 2–4 hours | Good (85–92%) | Rapid prototyping, iteration |
| Hybrid (LLM + Human QA) | $200 – $500 | 3–5 days | Very high (95%+) | Production fine-tuning |
The hybrid approach is the sweet spot for most teams: generate thousands of examples with an LLM, then have a human review a 10–20% sample to catch systematic errors. This post walks through four progressively sophisticated generation techniques, shows you how to filter for quality, and ties them together into a complete pipeline you can run today.
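The human-QA sample in the hybrid approach is worth stratifying by label so rare categories aren't missed. A minimal sketch; `sample_for_review` is a hypothetical helper, not part of any library:

```python
import random
from collections import defaultdict

def sample_for_review(dataset, fraction=0.15, label_key="label", seed=0):
    """Pick a stratified sample for human QA so every label is represented."""
    by_label = defaultdict(list)
    for ex in dataset:
        by_label[ex[label_key]].append(ex)
    rng = random.Random(seed)
    review = []
    for label, items in by_label.items():
        k = max(1, round(len(items) * fraction))  # at least one per label
        review.extend(rng.sample(items, k))
    return review
```

Feeding the reviewer a fixed fraction of every label, rather than a uniform random sample, is what catches systematic errors concentrated in one category.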
Three common starting points determine which technique you need:
- Zero labeled data — you only have a task description. Use Self-Instruct (Section 2).
- A handful of seeds — you have 10–50 hand-labeled examples. Use Few-Shot Amplification (Section 3).
- An easy dataset — you have simple examples but need harder ones. Use Evol-Instruct (Section 4).
Self-Instruct — Generating from Task Descriptions Alone
The breakthrough insight of Self-Instruct (Wang et al., 2023) is beautifully simple: if a large language model can solve a task, it can also generate examples of that task. Describe what you want, and the model produces (input, output) pairs that match your description.
The algorithm works in a loop: start with a task description, generate a batch of examples, feed recent examples back as negative examples (to encourage diversity), filter duplicates, and repeat until you hit your target count. Stanford's Alpaca project used this approach to generate 52,000 instruction-following examples for under $500 — proving it works at scale.
The naive version has a problem, though: LLMs naturally generate repetitive, easy examples. Ask for 100 sentiment examples and you'll get 40 variations of "This product is great!" We fix this with diversity pressure — showing the model its own recent output and explicitly asking for different patterns:
```python
import openai, json, random
from collections import Counter

client = openai.OpenAI()

TASK = """Generate diverse sentiment analysis training examples.
Each example: a JSON object with "text" and "label" fields.
Labels: positive, negative, neutral.
Vary length (5-50 words), domain (products, food, travel,
services, entertainment), and difficulty (obvious to subtle)."""

def self_instruct(task_desc, target=100, batch_size=5):
    """Generate labeled examples from a task description alone."""
    pool, seen = [], set()
    for _ in range(target // batch_size):
        # Show recent examples as negative examples for diversity
        recent = json.dumps(pool[-6:], indent=2) if pool else "[]"
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # Cheap model for bulk generation
            temperature=1.0,      # High temp = more diversity
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": task_desc},
                {"role": "user", "content": f"""Generate {batch_size} NEW examples.
They must be DIFFERENT from these existing ones:
{recent}
Mix: sarcasm, subtle opinions, mixed feelings, obvious cases.
Return JSON with an "examples" array."""},
            ],
        )
        batch = json.loads(resp.choices[0].message.content)
        for ex in batch.get("examples", []):
            key = ex["text"].strip().lower()
            if key not in seen:
                seen.add(key)
                pool.append(ex)
    labels = Counter(ex["label"] for ex in pool)
    print(f"Generated {len(pool)} examples: {dict(labels)}")
    return pool

# Generate 100 sentiment examples from scratch -- no labeled data needed
examples = self_instruct(TASK, target=100)
# Cost: ~$0.03 with gpt-4o-mini (Feb 2026 pricing)
```
Two design choices make this work. First, high temperature (1.0) pushes the model toward less probable completions, which means more diverse examples. Second, showing recent output as negative examples creates an implicit diversity loop — the model sees what it already generated and actively avoids repeating those patterns.
Self-Instruct is your cold-start tool. No labeled data needed, just a well-written task description. The quality ceiling is moderate (75–82% label accuracy), but the cost floor is as low as it gets: roughly $0.03 per 100 examples with gpt-4o-mini.
Few-Shot Amplification — Scaling Seed Examples
Self-Instruct works when you're starting from zero. But if you have even 10–50 hand-labeled examples — from a pilot annotation, from your own manual labeling, or from production logs — you can do significantly better. Few-Shot Amplification uses your real examples as demonstrations, grounding the generated data in the actual style, difficulty, and distribution of your task.
The key technique is stratified random sampling: for each generation call, randomly select 3–5 seed examples from different categories. This prevents the model from fixating on one pattern while keeping it anchored to your real data. The result is synthetic data that feels like it came from the same distribution as your seeds — just hundreds of times more of it.
```python
import random, json

SEEDS = [
    {"text": "I need to change my delivery address", "intent": "address_change"},
    {"text": "Where's my package? It's been two weeks", "intent": "order_status"},
    {"text": "Can I get a refund for the broken item?", "intent": "refund_request"},
    {"text": "How do I reset my password?", "intent": "account_help"},
    {"text": "Do you ship internationally?", "intent": "shipping_info"},
    {"text": "I want to cancel my subscription", "intent": "cancellation"},
    # ... imagine 14 more spanning all intent categories
]
INTENTS = list(set(s["intent"] for s in SEEDS))

def few_shot_amplify(seeds, target=500, shots=3):
    """Scale seed examples into a larger dataset via few-shot prompting."""
    synthetic = []
    for _ in range(target // 5):
        # Stratified sampling: pick seeds from different intents
        sampled = []
        for _ in range(shots):
            intent = random.choice(INTENTS)
            candidates = [s for s in seeds if s["intent"] == intent]
            sampled.append(random.choice(candidates))
        examples_block = "\n".join(
            f'  "{s["text"]}" -> {s["intent"]}' for s in sampled
        )
        resp = client.chat.completions.create(
            model="gpt-4o",  # Stronger model for quality
            temperature=0.9,
            messages=[{"role": "user", "content": f"""Real customer messages:
{examples_block}
Generate 5 NEW messages with correct intents.
Same style and difficulty. Valid intents: {INTENTS}
Return a JSON array of {{"text": "...", "intent": "..."}} objects."""}],
        )
        batch = json.loads(resp.choices[0].message.content)
        items = batch if isinstance(batch, list) else batch.get("examples", [])
        for ex in items:
            if ex.get("intent") in INTENTS:
                # Novelty check: how different from nearest seed?
                seed_words = [set(s["text"].lower().split()) for s in seeds]
                ex_words = set(ex["text"].lower().split())
                max_overlap = max(
                    len(ex_words & sw) / max(len(ex_words), 1)
                    for sw in seed_words
                )
                ex["novelty"] = round(1.0 - max_overlap, 2)
                synthetic.append(ex)
    avg_novelty = sum(e["novelty"] for e in synthetic) / len(synthetic)
    print(f"Generated {len(synthetic)} examples (avg novelty: {avg_novelty:.2f})")
    return synthetic
```
The novelty score (0–1) measures word-level overlap between each generated example and its nearest seed. A score of 0.0 means it's a near-copy; 1.0 means completely novel vocabulary. In practice, you want examples between 0.3 and 0.8 — similar enough to be on-task, different enough to add training signal.
This is the diversity-fidelity tradeoff: too much diversity and the generated examples drift from your actual task; too little and you're just paraphrasing your seeds. Few-Shot Amplification typically produces 85–90% label accuracy — a meaningful jump over Self-Instruct — because the seeds ground the model in real examples.
Evol-Instruct — Making Simple Data Complex
Here's a pattern you'll notice in synthetic datasets: they're easy. LLMs naturally generate straightforward, unambiguous examples because those are the most common patterns in their training data. But easy examples carry little training signal — your model already handles them. The examples that actually improve a fine-tuned model are the hard ones: edge cases, ambiguous inputs, multi-step reasoning.
Evol-Instruct (Xu et al., 2023, from the WizardLM paper) solves this by systematically evolving simple examples into harder variants. The technique applies five "evolution operators" that each push complexity in a different direction:
- Add Constraints — inject specific conditions ("only if the order is over $50 and within 30 days")
- Deepen Reasoning — require multi-step logic or comparison
- Increase Specificity — add domain context (industry, technology, scenario)
- Inject Ambiguity — make the question require clarification or interpretation
- Require Tradeoffs — reframe so the answer must weigh pros and cons
```python
EVOLUTION_OPS = {
    "add_constraints": "Add 2-3 specific constraints or conditions.",
    "deepen_reasoning": "Require multi-step reasoning or comparison.",
    "increase_specificity": "Add domain-specific context and details.",
    "inject_ambiguity": "Make it require clarification or interpretation.",
    "require_tradeoffs": "Reframe so the answer weighs pros and cons.",
}

def evolve_example(example, op_name, op_prompt):
    """Apply one evolution operator to increase difficulty."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.7,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": f"""Evolve this Q&A to be harder.
Original Question: {example["question"]}
Original Answer: {example["answer"]}
Evolution strategy: {op_prompt}
Return JSON with "question" and "answer" fields.
The evolved question must be harder but still answerable."""}],
    )
    return json.loads(resp.choices[0].message.content)

def evol_instruct(simple_examples, rounds=2):
    """Evolve simple examples into progressively harder variants."""
    all_examples = list(simple_examples)  # Keep originals
    current = simple_examples
    for round_num in range(rounds):
        evolved_batch = []
        operators = list(EVOLUTION_OPS.items())
        for ex in current:
            op_name, op_prompt = random.choice(operators)
            evolved = evolve_example(ex, op_name, op_prompt)
            evolved["round"] = round_num + 1
            evolved["operator"] = op_name
            evolved_batch.append(evolved)
        # Quality gate: reject if too similar to original
        for orig, evol in zip(current, evolved_batch):
            orig_words = set(orig["question"].lower().split())
            evol_words = set(evol["question"].lower().split())
            overlap = len(orig_words & evol_words) / len(orig_words | evol_words)
            if overlap < 0.7:  # Sufficiently different
                all_examples.append(evol)
        current = evolved_batch  # Feed evolved into next round
    print(f"Evolved {len(simple_examples)} -> {len(all_examples)} examples")
    return all_examples
```
Watch the progression. A simple question like "What is machine learning?" evolves through rounds into something like "Compare supervised and unsupervised approaches for time-series anomaly detection in IoT sensor data, considering labeled data availability and real-time inference constraints." Each round makes it harder, and the quality gate rejects evolutions that don't actually add complexity (measured by word-level change).
The key insight: run evolution in rounds, feeding each round's output as the next round's input. Two rounds typically produce a good difficulty spread — easy originals, medium first-round evolutions, and hard second-round evolutions. This gives your fine-tuned model training signal across the full difficulty spectrum.
Quality Filtering — The Make-or-Break Step
Here's the counterintuitive truth about synthetic data: the generation step is the easy part. The filtering step determines whether your dataset actually works.
Raw synthetic data is noisy. In a typical batch of 1,000 generated examples, you'll find wrong labels (the model assigned "positive" to a clearly negative text), incoherent garbage (the model hallucinated a sentence fragment), near-duplicates (three examples that all say essentially the same thing), and trivially easy cases (the model generated simple examples that any baseline can already handle). Filtering these out is essential — 600 high-quality examples will outperform 1,000 unfiltered ones every time.
We build a four-stage filter pipeline, each stage catching a different failure mode:
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def get_embeddings(texts, model="text-embedding-3-small"):
    """Get embeddings for a list of texts."""
    resp = client.embeddings.create(input=texts, model=model)
    return np.array([e.embedding for e in resp.data])

def quality_filter(examples, model="gpt-4o-mini"):
    """Four-stage quality filter for synthetic data."""
    print(f"Starting with {len(examples)} examples")
    # Stage 1: Coherence -- is the example well-formed?
    coherent = []
    for ex in examples:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content":
                f'Rate this example\'s coherence 1-5.\n'
                f'Text: "{ex["text"]}"\nReply with just the number.'}],
        )
        score = int(resp.choices[0].message.content.strip())
        if score >= 3:
            ex["coherence"] = score
            coherent.append(ex)
    print(f"  After coherence filter: {len(coherent)}")
    # Stage 2: Label verification -- does output match input?
    verified = []
    for ex in coherent:
        label = ex.get("label", ex.get("intent"))  # handle both schemas
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content":
                f'Is this label correct?\nText: "{ex["text"]}"\n'
                f'Label: "{label}"\nReply: correct or incorrect'}],
        )
        # split() makes this a whole-word match, so "incorrect" doesn't pass
        if "correct" in resp.choices[0].message.content.lower().split():
            verified.append(ex)
    print(f"  After label verification: {len(verified)}")
    # Stage 3: Embedding-based deduplication
    embs = get_embeddings([ex["text"] for ex in verified])
    sim_matrix = cosine_similarity(embs)
    np.fill_diagonal(sim_matrix, 0)
    keep = [True] * len(verified)
    for i in range(len(verified)):
        if not keep[i]:
            continue
        for j in range(i + 1, len(verified)):
            if sim_matrix[i][j] > 0.95:
                keep[j] = False  # Mark duplicate for removal
    deduped = [ex for ex, k in zip(verified, keep) if k]
    print(f"  After deduplication: {len(deduped)}")
    # Stage 4: Difficulty calibration -- remove too-easy examples
    final = []
    for ex in deduped:
        baseline = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": ex["text"]}],
        )
        answer = baseline.choices[0].message.content
        label = ex.get("label", ex.get("intent", ""))
        if not answer.strip().lower().startswith(label.lower()):
            final.append(ex)  # Keep what the baseline gets wrong
    print(f"  After difficulty filter: {len(final)}")
    return final

# Typical result: 1000 raw -> 850 coherent -> 780 verified
# -> 650 deduped -> 550 difficulty-filtered
```
A cost-saving trick: use cheap models for filtering, expensive ones for generation. The coherence checker and label verifier above use gpt-4o-mini (roughly $0.15 per million input tokens) rather than gpt-4o ($2.50 per million). Filtering 1,000 examples costs about $0.50 with the mini model, a tiny fraction of the generation cost.
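To sanity-check that kind of figure for your own prompts, a rough cost model is enough. The sketch below is illustrative: the default token count and calls-per-example are assumptions to replace with measurements from your pipeline.

```python
def filtering_cost(n_examples, tokens_per_call=150,
                   price_per_m_input=0.15, calls_per_example=2):
    """Rough input-token cost of judge-based filtering stages.

    Assumes ~150 input tokens per judge call and two calls per example
    (coherence check + label check); price is $ per 1M input tokens.
    Output tokens (a few per call) are ignored here.
    """
    total_tokens = n_examples * calls_per_example * tokens_per_call
    return total_tokens / 1_000_000 * price_per_m_input

print(f"${filtering_cost(1000):.2f} for 1,000 examples at default assumptions")
```

Real prompts are usually longer than 150 tokens (the example text is embedded in each judge prompt), so measure a few calls and plug in your own averages before trusting any estimate.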
Stage 4 (difficulty calibration) is the most powerful but also the most aggressive. It removes examples that a weak baseline model can already handle, which means your training data focuses entirely on the hard cases where fine-tuning actually adds value. If you're losing too many examples here, lower the bar by using an even weaker baseline or skipping this stage.
Measuring Synthetic Data Quality
Before shipping a synthetic dataset to fine-tuning, you need to answer one question: is this data good enough? Gut feeling doesn't cut it. You need concrete metrics that tell you where the data is strong, where it's weak, and whether it's safe to train on.
The Quality Scorecard computes five metrics that together give a comprehensive picture:
```python
def quality_scorecard(dataset, test_set=None, judge="gpt-4o"):
    """Compute comprehensive quality metrics for a synthetic dataset."""
    def label_of(ex):
        return ex.get("label", ex.get("intent", "unknown"))
    # 1. Label Accuracy -- sample and verify with a judge model
    sample = random.sample(dataset, min(50, len(dataset)))
    correct = sum(
        1 for ex in sample
        if "correct" in client.chat.completions.create(
            model=judge,
            messages=[{"role": "user", "content":
                f'Is this label correct?\nText: "{ex["text"]}"\n'
                f'Label: {label_of(ex)}\nReply: correct or incorrect'}],
        ).choices[0].message.content.lower().split()  # whole word, not substring
    )
    label_accuracy = correct / len(sample)
    # 2. Diversity -- average pairwise embedding distance
    embs = get_embeddings([ex["text"] for ex in dataset])
    pairwise = 1 - cosine_similarity(embs)
    idx = np.triu_indices(len(dataset), k=1)
    diversity = float(np.mean(pairwise[idx]))
    # 3. Difficulty Distribution -- spread of coherence scores
    scores = [ex.get("coherence", 3) for ex in dataset]
    difficulty_spread = float(np.std(scores))
    # 4. Category Coverage
    categories = set(label_of(ex) for ex in dataset)
    # 5. Test Set Contamination Check
    contamination = 0.0
    if test_set:
        test_embs = get_embeddings([t["text"] for t in test_set])
        cross_sim = cosine_similarity(embs, test_embs)
        contamination = float(np.mean(cross_sim.max(axis=1) > 0.9))
    scorecard = {
        "Label Accuracy": f"{label_accuracy:.0%}",
        "Diversity Score": f"{diversity:.3f}",
        "Difficulty Spread": f"{difficulty_spread:.2f}",
        "Category Coverage": f"{len(categories)} categories",
        "Contamination Risk": f"{contamination:.1%} overlap",
    }
    print("\nData Quality Scorecard")
    print("=" * 44)
    for metric, value in scorecard.items():
        dots = "." * (32 - len(metric))
        print(f"  {metric} {dots} {value}")
    return scorecard
```
The most overlooked metric is contamination. If your synthetic training examples are too similar to your test set, your evaluation metrics will be artificially inflated — the model isn't generalizing, it's memorizing. The contamination check catches this by computing cross-similarity between training and test embeddings.
Here's how the three generation techniques, plus the full pipeline, compare on the same task (customer intent classification, 1K examples each):
| Technique | Starting Data | Cost / 1K | Label Accuracy | Diversity | Best For |
|---|---|---|---|---|---|
| Self-Instruct | None | ~$0.25 | 75–82% | Moderate | Cold start, prototyping |
| Few-Shot Amplification | 10–50 seeds | ~$0.50 | 85–90% | Good | Domain-specific tasks |
| Evol-Instruct | Easy dataset | ~$1.00 | 82–88% | Very High | Difficulty / complexity boost |
| Full Pipeline | Minimal seeds | ~$2–5 | 90–95% | Highest | Production fine-tuning |
Notice that the Full Pipeline (combining all three techniques with quality filtering) achieves the highest accuracy and diversity, but costs 10–20x more than Self-Instruct alone. The right choice depends on your quality requirements and how much you're willing to spend on data versus inference — a recurring theme in model routing decisions.
End-to-End Workflow — From Zero to Fine-Tuned Model
Let's put the entire pipeline together. The complete workflow has five stages: bootstrap seeds with Self-Instruct (if needed), amplify with Few-Shot, evolve with Evol-Instruct for difficulty diversity, filter for quality, then export in fine-tuning format. Each stage feeds into the next, and the quality scorecard tells you whether the output is ready for training.
```python
class SyntheticDataPipeline:
    """End-to-end: generate -> amplify -> evolve -> filter -> export."""

    def __init__(self, task_description, seed_examples=None):
        self.task = task_description
        self.seeds = seed_examples or []
        self.generated = []
        self.filtered = []

    def run(self, target_count=1000):
        # Stage 1: Bootstrap with self-instruct if no seeds
        if not self.seeds:
            print("No seeds -- bootstrapping with self-instruct...")
            self.seeds = self_instruct(self.task, target=50)
        # Stage 2: Amplify seeds to target volume
        print(f"Amplifying {len(self.seeds)} seeds...")
        amplified = few_shot_amplify(self.seeds, target=target_count)
        # Stage 3: Evolve a subset for difficulty diversity
        novel = [e for e in amplified if e.get("novelty", 0) > 0.5]
        subset = novel[:200]
        print(f"Evolving {len(subset)} examples for difficulty...")
        # evol_instruct expects question/answer pairs, so adapt the
        # classification examples on the way in and back out; evolved
        # labels get re-checked by the label-verification filter
        qa = [{"question": e["text"], "answer": e["intent"]} for e in subset]
        evolved = [
            {"text": e["question"], "intent": e["answer"],
             **{k: e[k] for k in ("round", "operator") if k in e}}
            for e in evol_instruct(qa, rounds=2)
        ]
        self.generated = amplified + evolved
        print(f"Total raw examples: {len(self.generated)}")
        # Stage 4: Quality filter
        self.filtered = quality_filter(self.generated)
        print(f"After filtering: {len(self.filtered)}")
        # Stage 5: Quality report
        return self.filtered, quality_scorecard(self.filtered)

    def export(self, path="training_data.jsonl"):
        """Export in JSONL format for fine-tuning."""
        with open(path, "w") as f:
            for ex in self.filtered:
                f.write(json.dumps(ex) + "\n")
        print(f"Exported {len(self.filtered)} examples to {path}")

# Run the full pipeline
pipeline = SyntheticDataPipeline(
    task_description="Customer intent classification for e-commerce",
    seed_examples=SEEDS[:20]
)
dataset, scores = pipeline.run(target_count=1000)
pipeline.export()

# Typical output:
# Amplifying 20 seeds...
# Generated 985 examples (avg novelty: 0.64)
# Evolving 200 examples for difficulty...
# Evolved 200 -> 487 examples
# Total raw examples: 1472
# After filtering: 847
# Exported 847 examples to training_data.jsonl
```
The numbers tell the story: we start with 20 seed examples, amplify to ~1,000, add ~300 evolved variants, and after aggressive filtering end up with ~850 high-quality training examples. The cost? About $3–6 in API calls. Compare that to the $10,000+ you'd spend on human annotation for the same volume.
The key detail is the novelty threshold in Stage 3. We only evolve examples with novelty above 0.5, which means they're already somewhat diverse. Evolving near-copies of seeds wastes API calls and produces redundant hard examples. For the connection to fine-tuning, see the fine-tuning post — once you have this dataset, that post shows you how to train on it.
Production Patterns and Pitfalls
Everything above works beautifully in a notebook. Production introduces new failure modes that can silently degrade your fine-tuned model. Here are the big ones and how to defend against them:
Topic Drift
Synthetic data slowly diverges from real user queries. The model generates examples that are plausible but don't match actual usage patterns. Your classifier handles synthetic edge cases perfectly but stumbles on the boring, repetitive queries that make up 80% of real traffic.
Fix: Mix synthetic with real data. Even 10% real examples from production logs anchors the distribution. Periodically regenerate synthetic data using recent production queries as seeds.
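That mixing step can be as simple as swapping a fraction of synthetic examples for real ones. A sketch under the assumption that both sets share a schema; `mix_datasets` is a hypothetical helper:

```python
import random

def mix_datasets(synthetic, real, real_fraction=0.1, seed=0):
    """Blend real production examples into a synthetic training set.

    Keeps the final size equal to len(synthetic): real examples replace
    a matching number of synthetic ones rather than being appended.
    """
    rng = random.Random(seed)
    n_real = min(len(real), int(len(synthetic) * real_fraction))
    kept = rng.sample(synthetic, len(synthetic) - n_real)
    mixed = kept + rng.sample(real, n_real)
    rng.shuffle(mixed)  # avoid real examples clustering at the end
    return mixed
```

Replacing rather than appending keeps the training budget fixed while still anchoring the distribution with real traffic.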
Label Noise Amplification
When you fine-tune on wrong labels, the model doesn't just miss those examples — it learns the wrong pattern and applies it to similar inputs. A 5% label error rate in training data can cause a 15% accuracy drop on related examples.
Fix: Multi-model verification. Generate with Model A, verify labels with Model B. If two models disagree on a label, flag it for manual review rather than including it in the dataset.
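The disagreement-routing logic is easy to sketch. Here the judges are injected as plain callables so the flow can be tested without API calls; in practice each would wrap a different model:

```python
def cross_verify(examples, judge_a, judge_b):
    """Keep examples both judges accept; flag disagreements for human review.

    judge_a / judge_b are callables (example -> bool), e.g. thin wrappers
    around two different LLMs verifying the label.
    """
    accepted, flagged = [], []
    for ex in examples:
        votes = (judge_a(ex), judge_b(ex))
        if all(votes):
            accepted.append(ex)       # unanimous accept
        elif any(votes):
            flagged.append(ex)        # disagreement -> human review
        # unanimous reject -> drop silently
    return accepted, flagged
```

Routing only the disagreements to humans keeps the manual-review queue small while still catching the labels where a single model's error would otherwise slip through.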
Evaluation Contamination
This is the silent killer. If your synthetic training examples overlap with your test set, your evaluation metrics are meaningless — the model is memorizing, not generalizing. It looks great in your notebook and fails in production.
Fix: Always run the contamination check from the quality scorecard. Create your test set before generating training data, and explicitly exclude test-like examples during filtering. The evaluation post covers this in depth.
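The exclusion step can reuse the same embeddings as the contamination check: drop any training example whose nearest test neighbor exceeds the similarity threshold. A sketch; `remove_test_like` is a hypothetical helper:

```python
import numpy as np

def remove_test_like(train_embs, test_embs, train_examples, threshold=0.9):
    """Drop training examples whose closest test embedding is too similar."""
    T = np.asarray(train_embs, dtype=float)
    S = np.asarray(test_embs, dtype=float)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    nearest = (T @ S.T).max(axis=1)  # cosine sim to nearest test example
    return [ex for ex, sim in zip(train_examples, nearest) if sim <= threshold]
```

Run this after deduplication but before export, using the same 0.9 threshold as the scorecard's contamination metric so the two stay consistent.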
Mode Collapse
LLMs over-represent certain styles and patterns. Without diversity pressure, your synthetic dataset might contain 200 examples that all follow the same sentence structure, just with different nouns swapped in. The model learns the template, not the task.
Fix: Track embedding diversity during generation. If the average pairwise distance drops below a threshold, increase temperature or add explicit diversity prompts. The Evol-Instruct technique helps here by pushing examples in different complexity directions.
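A minimal version of that diversity tracker, using plain NumPy rather than scikit-learn; the 0.3 threshold is an illustrative default, not a calibrated value:

```python
import numpy as np

def mean_pairwise_distance(embeddings):
    """Average pairwise cosine distance; low values signal mode collapse."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sims = X @ X.T                                    # cosine similarity matrix
    iu = np.triu_indices(len(X), k=1)                 # upper triangle, no diagonal
    return float(np.mean(1.0 - sims[iu]))

def should_diversify(embeddings, threshold=0.3):
    """True when it's time to raise temperature or add diversity prompts."""
    return mean_pairwise_distance(embeddings) < threshold
```

Calling this every few batches during generation turns mode collapse from a post-hoc discovery into a mid-run correction.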
The most dangerous synthetic dataset is one that looks good on the surface but contains subtle, systematic errors. That's why the quality inspector skill — learning to read synthetic data critically — is worth practicing.
Choosing Your Approach
Not sure which technique to start with? Walk through this decision tree:
1. Do you have ANY labeled data?
   - No → Start with Self-Instruct. Generate 50–100 examples, manually review 20, fix systematic issues, then use the corrected set as seeds for Few-Shot Amplification.
   - Yes → Continue to step 2.
2. Do you have 10+ labeled examples?
   - No → Self-Instruct with manual review to bootstrap up to 20 seeds, then Few-Shot Amplify.
   - Yes → Go directly to Few-Shot Amplification.
3. Do you need hard / complex examples?
   - Yes → Add Evol-Instruct after amplification. Two evolution rounds typically produce a good difficulty spread.
   - No → Skip to filtering.
4. Is this for production fine-tuning?
   - Yes → Run the Full Pipeline with all four quality filters. Budget for $3–6 per 1K examples. Consider hybrid validation: have a human review 10% of the filtered dataset.
   - No → Self-Instruct or Few-Shot with basic deduplication is enough for prototyping.
And a few scenarios where synthetic data is not the right answer:
- You already have thousands of labeled examples. Fine-tune on real data first; add synthetic data only if you need more diversity or difficulty.
- The task requires real-world grounding. Generating synthetic medical records, legal documents, or financial data risks producing plausible-looking fiction that fails in practice.
- Regulatory compliance requires human-verified labels. In healthcare, finance, and safety-critical applications, synthetic labels may not satisfy audit requirements even if they're accurate.
References & Further Reading
- Wang et al. — Self-Instruct: Aligning Language Models with Self-Generated Instructions (ACL 2023) — the foundational paper on generating training data from task descriptions
- Xu et al. — WizardLM: Empowering Large Language Models to Follow Complex Instructions (ICLR 2024) — introduces Evol-Instruct for complexity evolution
- Taori et al. — Stanford Alpaca (2023) — demonstrated Self-Instruct at scale with 52K generated examples
- Mukherjee et al. — Orca: Progressive Learning from Complex Explanation Traces (Microsoft, 2023) — using step-by-step reasoning traces as synthetic training data
- Xu et al. — Magpie: Alignment Data Synthesis from Scratch (ICLR 2025) — generating instruction data by exploiting model auto-complete behavior
- Chan et al. — Scaling Synthetic Data Creation with 1B Personas (2024) — Persona Hub approach to diverse synthetic data through billions of persona descriptions
- Gunasekar et al. — Textbooks Are All You Need (Microsoft, 2023) — the Phi model series trained primarily on synthetic textbook-quality data
- Mitra et al. — AgentInstruct (Microsoft, 2024) — multi-agent synthetic data generation with agentic flows
- Ben Allal et al. — Cosmopedia (Hugging Face, 2024) — 25B tokens of synthetic textbook and blog content for pre-training
- Distilabel — Argilla — open-source framework for synthetic data generation and labeling
- InstructLab — Red Hat / IBM — community-driven synthetic data generation for LLM alignment