
Synthetic Data Generation: Using LLMs to Build Your Own Training Datasets

The Data Bottleneck — Why Synthetic Data Changes Everything

You've read every fine-tuning tutorial. You understand LoRA, you've picked your base model, you've got your training script ready. Then you hit the wall that stops 90% of fine-tuning projects cold: you don't have any labeled training data.

This is the labeled data bottleneck, and it's the single biggest barrier between "I want to fine-tune a model" and actually doing it. Most teams have plenty of raw text — customer emails, support tickets, product descriptions — but almost no labeled pairs mapping inputs to desired outputs. And hiring human annotators to create those pairs? That's $5,000–$20,000 and 2–4 weeks per thousand examples.

Synthetic data generation flips this equation. Instead of paying humans to create training data, you use a strong model (like GPT-4o or Claude) to generate training data for a weaker, cheaper model. You're distilling intelligence into data — and the economics are staggering:

| Approach | Cost / 1K Examples | Time | Quality | Best For |
| --- | --- | --- | --- | --- |
| Human Annotation | $5,000 – $20,000 | 2–4 weeks | Gold standard | Regulatory, safety-critical |
| LLM-Generated | $10 – $50 | 2–4 hours | Good (85–92%) | Rapid prototyping, iteration |
| Hybrid (LLM + Human QA) | $200 – $500 | 3–5 days | Very high (95%+) | Production fine-tuning |

The hybrid approach is the sweet spot for most teams: generate thousands of examples with an LLM, then have a human review a 10–20% sample to catch systematic errors. This post walks through four progressively sophisticated generation techniques, shows you how to filter for quality, and ties them together into a complete pipeline you can run today.
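That 10–20% human review works best as a stratified sample, so systematic errors in rare categories get caught too. A minimal sketch (the function name, the `review_fraction` default, and the `label` key are illustrative assumptions, not part of any standard API):

```python
import random
from collections import defaultdict

def sample_for_review(examples, review_fraction=0.15, key="label", seed=0):
    """Draw a stratified sample for human QA: the same fraction from
    each label, so rare classes are reviewed as carefully as common ones."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[key]].append(ex)
    sample = []
    for label, group in by_label.items():
        k = max(1, round(len(group) * review_fraction))
        sample.extend(rng.sample(group, k))
    return sample

data = [{"text": f"t{i}", "label": "pos" if i % 2 else "neg"} for i in range(100)]
review = sample_for_review(data, review_fraction=0.1)
print(len(review))  # 10: five from each label
```

The `max(1, ...)` floor guarantees every category gets at least one reviewed example, even tiny ones.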

Three common starting points determine which technique you need:

  1. Zero labeled data — you only have a task description. Use Self-Instruct (Section 2).
  2. A handful of seeds — you have 10–50 hand-labeled examples. Use Few-Shot Amplification (Section 3).
  3. An easy dataset — you have simple examples but need harder ones. Use Evol-Instruct (Section 4).

Self-Instruct — Generating from Task Descriptions Alone

The breakthrough insight of Self-Instruct (Wang et al., 2023) is beautifully simple: if a large language model can solve a task, it can also generate examples of that task. Describe what you want, and the model produces (input, output) pairs that match your description.

The algorithm works in a loop: start with a task description, generate a batch of examples, feed recent examples back as negative examples (to encourage diversity), filter duplicates, and repeat until you hit your target count. Stanford's Alpaca project used this approach to generate 52,000 instruction-following examples for under $500 — proving it works at scale.

The naive version has a problem, though: LLMs naturally generate repetitive, easy examples. Ask for 100 sentiment examples and you'll get 40 variations of "This product is great!" We fix this with diversity pressure — showing the model its own recent output and explicitly asking for different patterns:


import openai, json, random
from collections import Counter

client = openai.OpenAI()

TASK = """Generate diverse sentiment analysis training examples.
Each example: a JSON object with "text" and "label" fields.
Labels: positive, negative, neutral.
Vary length (5-50 words), domain (products, food, travel,
services, entertainment), and difficulty (obvious to subtle)."""

def self_instruct(task_desc, target=100, batch_size=5):
    """Generate labeled examples from a task description alone."""
    pool, seen = [], set()

    for _ in range(target // batch_size):
        # Show recent examples as negative examples for diversity
        recent = json.dumps(pool[-6:], indent=2) if pool else "[]"

        resp = client.chat.completions.create(
            model="gpt-4o-mini",   # Cheap model for bulk generation
            temperature=1.0,       # High temp = more diversity
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": task_desc},
                {"role": "user", "content": f"""Generate {batch_size} NEW examples.
They must be DIFFERENT from these existing ones:
{recent}

Mix: sarcasm, subtle opinions, mixed feelings, obvious cases.
Return JSON with an "examples" array."""}
            ]
        )

        batch = json.loads(resp.choices[0].message.content)
        for ex in batch.get("examples", []):
            key = ex["text"].strip().lower()
            if key not in seen:
                seen.add(key)
                pool.append(ex)

    labels = Counter(ex["label"] for ex in pool)
    print(f"Generated {len(pool)} examples: {dict(labels)}")
    return pool

# Generate 100 sentiment examples from scratch — no labeled data needed
examples = self_instruct(TASK, target=100)
# Cost: ~$0.03 with gpt-4o-mini (Feb 2026 pricing)
            

Two design choices make this work. First, high temperature (1.0) pushes the model toward less probable completions, which means more diverse examples. Second, showing recent output as negative examples creates an implicit diversity loop — the model sees what it already generated and actively avoids repeating those patterns.

Self-Instruct is your cold-start tool. No labeled data needed, just a well-written task description. The quality ceiling is moderate (75–82% label accuracy), but the cost floor is as low as it gets: roughly $0.03 per 100 examples with gpt-4o-mini.

Few-Shot Amplification — Scaling Seed Examples

Self-Instruct works when you're starting from zero. But if you have even 10–50 hand-labeled examples — from a pilot annotation, from your own manual labeling, or from production logs — you can do significantly better. Few-Shot Amplification uses your real examples as demonstrations, grounding the generated data in the actual style, difficulty, and distribution of your task.

The key technique is stratified random sampling: for each generation call, randomly select 3–5 seed examples from different categories. This prevents the model from fixating on one pattern while keeping it anchored to your real data. The result is synthetic data that feels like it came from the same distribution as your seeds — just hundreds of times more of it.


import random, json

SEEDS = [
    {"text": "I need to change my delivery address", "intent": "address_change"},
    {"text": "Where's my package? It's been two weeks", "intent": "order_status"},
    {"text": "Can I get a refund for the broken item?", "intent": "refund_request"},
    {"text": "How do I reset my password?", "intent": "account_help"},
    {"text": "Do you ship internationally?", "intent": "shipping_info"},
    {"text": "I want to cancel my subscription", "intent": "cancellation"},
    # ... imagine 14 more spanning all intent categories
]
INTENTS = list(set(s["intent"] for s in SEEDS))

def few_shot_amplify(seeds, target=500, shots=3):
    """Scale seed examples into a larger dataset via few-shot prompting."""
    synthetic = []

    for i in range(target // 5):
        # Stratified sampling: pick seeds from different intents
        sampled = []
        for _ in range(shots):
            intent = random.choice(INTENTS)
            candidates = [s for s in seeds if s["intent"] == intent]
            sampled.append(random.choice(candidates))

        examples_block = "\n".join(
            f'  "{s["text"]}" -> {s["intent"]}' for s in sampled
        )

        resp = client.chat.completions.create(
            model="gpt-4o",       # Stronger model for quality
            temperature=0.9,
            response_format={"type": "json_object"},  # Guarantees parseable JSON
            messages=[{"role": "user", "content": f"""Real customer messages:
{examples_block}

Generate 5 NEW messages with correct intents.
Same style and difficulty. Valid intents: {INTENTS}
Return JSON with an "examples" array of {{"text": "...", "intent": "..."}} objects."""}]
        )

        batch = json.loads(resp.choices[0].message.content)
        items = batch if isinstance(batch, list) else batch.get("examples", [])
        for ex in items:
            if ex.get("intent") in INTENTS:
                # Novelty check: how different from nearest seed?
                seed_words = [set(s["text"].lower().split()) for s in seeds]
                ex_words = set(ex["text"].lower().split())
                max_overlap = max(
                    len(ex_words & sw) / max(len(ex_words), 1)
                    for sw in seed_words
                )
                ex["novelty"] = round(1.0 - max_overlap, 2)
                synthetic.append(ex)

    avg_novelty = sum(e["novelty"] for e in synthetic) / max(len(synthetic), 1)
    print(f"Generated {len(synthetic)} examples (avg novelty: {avg_novelty:.2f})")
    return synthetic
            

The novelty score (0–1) measures word-level overlap between each generated example and its nearest seed. A score of 0.0 means it's a near-copy; 1.0 means completely novel vocabulary. In practice, you want examples between 0.3 and 0.8 — similar enough to be on-task, different enough to add training signal.

This is the diversity-fidelity tradeoff: too much diversity and the generated examples drift from your actual task; too little and you're just paraphrasing your seeds. Few-Shot Amplification typically produces 85–90% label accuracy — a meaningful jump over Self-Instruct — because the seeds ground the model in real examples.
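In practice, that band is a one-line filter over the `novelty` scores the amplifier attaches, a sketch assuming each example carries that field:

```python
def in_novelty_band(examples, low=0.3, high=0.8):
    """Keep examples that are on-task but not near-copies of the seeds."""
    return [ex for ex in examples if low <= ex.get("novelty", 0.0) <= high]

batch = [{"text": "a", "novelty": 0.1},   # near-copy of a seed
         {"text": "b", "novelty": 0.55},  # the sweet spot
         {"text": "c", "novelty": 0.95}]  # likely off-task drift
kept = in_novelty_band(batch)
print([ex["text"] for ex in kept])  # ['b']
```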

Evol-Instruct — Making Simple Data Complex

Here's a pattern you'll notice in synthetic datasets: they're easy. LLMs naturally generate straightforward, unambiguous examples because those are the most common patterns in their training data. But easy examples carry little training signal — your model already handles them. The examples that actually improve a fine-tuned model are the hard ones: edge cases, ambiguous inputs, multi-step reasoning.

Evol-Instruct (Xu et al., 2023, from the WizardLM paper) solves this by systematically evolving simple examples into harder variants. The technique applies five "evolution operators" that each push complexity in a different direction:


EVOLUTION_OPS = {
    "add_constraints": "Add 2-3 specific constraints or conditions.",
    "deepen_reasoning": "Require multi-step reasoning or comparison.",
    "increase_specificity": "Add domain-specific context and details.",
    "inject_ambiguity": "Make it require clarification or interpretation.",
    "require_tradeoffs": "Reframe so the answer weighs pros and cons.",
}

def evolve_example(example, op_name, op_prompt):
    """Apply one evolution operator to increase difficulty."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.7,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": f"""Evolve this Q&A to be harder.

Original Question: {example["question"]}
Original Answer: {example["answer"]}

Evolution strategy: {op_prompt}
Return JSON with "question" and "answer" fields.
The evolved question must be harder but still answerable."""}]
    )
    return json.loads(resp.choices[0].message.content)

def evol_instruct(simple_examples, rounds=2):
    """Evolve simple examples into progressively harder variants."""
    all_examples = list(simple_examples)  # Keep originals
    current = simple_examples

    for round_num in range(rounds):
        evolved_batch = []
        operators = list(EVOLUTION_OPS.items())

        for ex in current:
            op_name, op_prompt = random.choice(operators)
            evolved = evolve_example(ex, op_name, op_prompt)
            evolved["round"] = round_num + 1
            evolved["operator"] = op_name
            evolved_batch.append(evolved)

        # Quality gate: reject if too similar to original
        for orig, evol in zip(current, evolved_batch):
            orig_words = set(orig["question"].lower().split())
            evol_words = set(evol["question"].lower().split())
            overlap = len(orig_words & evol_words) / len(orig_words | evol_words)
            if overlap < 0.7:  # Sufficiently different
                all_examples.append(evol)

        current = evolved_batch  # Feed evolved into next round

    print(f"Evolved {len(simple_examples)} -> {len(all_examples)} examples")
    return all_examples
            

Watch the progression. A simple question like "What is machine learning?" evolves through rounds into something like "Compare supervised and unsupervised approaches for time-series anomaly detection in IoT sensor data, considering labeled data availability and real-time inference constraints." Each round makes it harder, and the quality gate rejects evolutions that don't actually add complexity (measured by word-level change).

The key insight: run evolution in rounds, feeding each round's output as the next round's input. Two rounds typically produce a good difficulty spread — easy originals, medium first-round evolutions, and hard second-round evolutions. This gives your fine-tuned model training signal across the full difficulty spectrum.

Quality Filtering — The Make-or-Break Step

Here's the counterintuitive truth about synthetic data: the generation step is the easy part. The filtering step determines whether your dataset actually works.

Raw synthetic data is noisy. In a typical batch of 1,000 generated examples, you'll find wrong labels (the model assigned "positive" to a clearly negative text), incoherent garbage (the model hallucinated a sentence fragment), near-duplicates (three examples that all say essentially the same thing), and trivially easy cases (the model generated simple examples that any baseline can already handle). Filtering these out is essential — 600 high-quality examples will outperform 1,000 unfiltered ones every time.

We build a four-stage filter pipeline, each stage catching a different failure mode:


import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def get_embeddings(texts, model="text-embedding-3-small"):
    """Get embeddings for a list of texts."""
    resp = client.embeddings.create(input=texts, model=model)
    return np.array([e.embedding for e in resp.data])

def _text_of(ex):
    """Examples may use "text"/"label", "text"/"intent", or "question"/"answer"."""
    return ex.get("text") or ex.get("question", "")

def _label_of(ex):
    return str(ex.get("label") or ex.get("intent") or ex.get("answer", ""))

def quality_filter(examples, model="gpt-4o-mini"):
    """Four-stage quality filter for synthetic data."""
    print(f"Starting with {len(examples)} examples")

    # Stage 1: Coherence — is the example well-formed?
    coherent = []
    for ex in examples:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content":
                f'Rate this example\'s coherence 1-5.\n'
                f'Text: "{_text_of(ex)}"\nReply with just the number.'}]
        )
        raw = resp.choices[0].message.content.strip()
        score = int(raw) if raw.isdigit() else 0  # Unparseable reply counts as failing
        if score >= 3:
            ex["coherence"] = score
            coherent.append(ex)
    print(f"  After coherence filter: {len(coherent)}")

    # Stage 2: Label verification — does output match input?
    verified = []
    for ex in coherent:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content":
                f'Is this label correct?\nText: "{_text_of(ex)}"\n'
                f'Label: "{_label_of(ex)}"\nReply: correct or incorrect'}]
        )
        # split() so "incorrect" doesn't match the substring "correct"
        if "correct" in resp.choices[0].message.content.lower().split():
            verified.append(ex)
    print(f"  After label verification: {len(verified)}")

    # Stage 3: Embedding-based deduplication
    embs = get_embeddings([_text_of(ex) for ex in verified])
    sim_matrix = cosine_similarity(embs)
    np.fill_diagonal(sim_matrix, 0)
    keep = [True] * len(verified)
    for i in range(len(verified)):
        if not keep[i]:
            continue
        for j in range(i + 1, len(verified)):
            if sim_matrix[i][j] > 0.95:
                keep[j] = False  # Mark duplicate for removal
    deduped = [ex for ex, k in zip(verified, keep) if k]
    print(f"  After deduplication: {len(deduped)}")

    # Stage 4: Difficulty calibration — remove too-easy examples
    final = []
    for ex in deduped:
        baseline = client.chat.completions.create(
            model="gpt-4o-mini",
            # In production, list the valid labels in this prompt
            messages=[{"role": "user", "content":
                f'Predict the label for this input. Reply with the label only.\n'
                f'"{_text_of(ex)}"'}]
        )
        answer = baseline.choices[0].message.content
        if not answer.strip().lower().startswith(_label_of(ex).lower()):
            final.append(ex)  # Keep what the baseline gets wrong
    print(f"  After difficulty filter: {len(final)}")
    return final

# Typical result: 1000 raw -> 850 coherent -> 780 verified
#   -> 650 deduped -> 550 difficulty-filtered
            

A cost-saving trick: use cheap models for filtering, expensive ones for generation. The coherence checker and label verifier above use gpt-4o-mini (roughly $0.15 per million input tokens) rather than gpt-4o ($2.50 per million). Filtering 1,000 examples costs about $0.50 with the mini model — a tiny fraction of the generation cost.
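That figure is easy to sanity-check with back-of-envelope arithmetic. A rough estimator, assuming two LLM checks per example, roughly 250 input tokens per check, and gpt-4o-mini's $0.15 per million input tokens (the token count is an assumption to tune):

```python
def filter_cost_estimate(n_examples, checks_per_example=2,
                         tokens_per_check=250, price_per_m_tokens=0.15):
    """Rough input-token cost for the LLM-based filter stages."""
    total_tokens = n_examples * checks_per_example * tokens_per_check
    return total_tokens / 1_000_000 * price_per_m_tokens

cost = filter_cost_estimate(1000)
print(f"~${cost:.2f} to filter 1,000 examples")  # well under a dollar
```

Output tokens add a little on top (the filters elicit one-word replies), so the real bill lands between this floor and the $0.50 ballpark.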

Stage 4 (difficulty calibration) is the most powerful but also the most aggressive. It removes examples that a weak baseline model can already handle, which means your training data focuses entirely on the hard cases where fine-tuning actually adds value. If you're losing too many examples here, lower the bar by using an even weaker baseline or skipping this stage.

Measuring Synthetic Data Quality

Before shipping a synthetic dataset to fine-tuning, you need to answer one question: is this data good enough? Gut feeling doesn't cut it. You need concrete metrics that tell you where the data is strong, where it's weak, and whether it's safe to train on.

The Quality Scorecard computes five metrics that together give a comprehensive picture:


def quality_scorecard(dataset, test_set=None, judge="gpt-4o"):
    """Compute comprehensive quality metrics for a synthetic dataset."""

    # 1. Label Accuracy — sample and verify with a judge model
    sample = random.sample(dataset, min(50, len(dataset)))
    correct = sum(
        1 for ex in sample
        if "correct" in client.chat.completions.create(
            model=judge,
            messages=[{"role": "user", "content":
                f'Is this label correct?\n'
                f'Text: "{ex.get("text", ex.get("question", ""))}"\n'
                f'Label: {ex.get("label", ex.get("answer", ""))}\n'
                f'Reply: correct or incorrect'}]
        ).choices[0].message.content.lower().split()  # avoid matching "incorrect"
    )
    label_accuracy = correct / len(sample)

    # 2. Diversity — average pairwise embedding distance
    embs = get_embeddings([ex.get("text", ex.get("question", ""))
                           for ex in dataset])
    pairwise = 1 - cosine_similarity(embs)
    idx = np.triu_indices(len(dataset), k=1)
    diversity = float(np.mean(pairwise[idx]))

    # 3. Difficulty Distribution — spread of coherence scores
    scores = [ex.get("coherence", 3) for ex in dataset]
    difficulty_spread = float(np.std(scores))

    # 4. Category Coverage
    categories = set(
        ex.get("label", ex.get("intent", "unknown")) for ex in dataset
    )

    # 5. Test Set Contamination Check
    contamination = 0.0
    if test_set:
        test_embs = get_embeddings([t["text"] for t in test_set])
        cross_sim = cosine_similarity(embs, test_embs)
        contamination = float(np.mean(cross_sim.max(axis=1) > 0.9))

    scorecard = {
        "Label Accuracy":     f"{label_accuracy:.0%}",
        "Diversity Score":    f"{diversity:.3f}",
        "Difficulty Spread":  f"{difficulty_spread:.2f}",
        "Category Coverage":  f"{len(categories)} categories",
        "Contamination Risk": f"{contamination:.1%} overlap",
    }

    print("\nData Quality Scorecard")
    print("=" * 44)
    for metric, value in scorecard.items():
        dots = "." * (32 - len(metric))
        print(f"  {metric} {dots} {value}")
    return scorecard
            

The most overlooked metric is contamination. If your synthetic training examples are too similar to your test set, your evaluation metrics will be artificially inflated — the model isn't generalizing, it's memorizing. The contamination check catches this by computing cross-similarity between training and test embeddings.

Here's how the three generation techniques compare on the same task (customer intent classification, 1K examples each):

| Technique | Starting Data | Cost / 1K | Label Accuracy | Diversity | Best For |
| --- | --- | --- | --- | --- | --- |
| Self-Instruct | None | ~$0.25 | 75–82% | Moderate | Cold start, prototyping |
| Few-Shot Amplification | 10–50 seeds | ~$0.50 | 85–90% | Good | Domain-specific tasks |
| Evol-Instruct | Easy dataset | ~$1.00 | 82–88% | Very High | Difficulty / complexity boost |
| Full Pipeline | Minimal seeds | ~$2–5 | 90–95% | Highest | Production fine-tuning |

Notice that the Full Pipeline (combining all three techniques with quality filtering) achieves the highest accuracy and diversity, but costs 10–20x more than Self-Instruct alone. The right choice depends on your quality requirements and how much you're willing to spend on data versus inference — a recurring theme in model routing decisions.


End-to-End Workflow — From Zero to Fine-Tuned Model

Let's put the entire pipeline together. The complete workflow has five stages: bootstrap seeds with Self-Instruct (if needed), amplify with Few-Shot, evolve with Evol-Instruct for difficulty diversity, filter for quality, then export in fine-tuning format. Each stage feeds into the next, and the quality scorecard tells you whether the output is ready for training.


class SyntheticDataPipeline:
    """End-to-end: generate -> amplify -> evolve -> filter -> export."""

    def __init__(self, task_description, seed_examples=None):
        self.task = task_description
        self.seeds = seed_examples or []
        self.generated = []
        self.filtered = []

    def run(self, target_count=1000):
        # Stage 1: Bootstrap with self-instruct if no seeds
        if not self.seeds:
            print("No seeds -- bootstrapping with self-instruct...")
            self.seeds = self_instruct(self.task, target=50)

        # Stage 2: Amplify seeds to target volume
        print(f"Amplifying {len(self.seeds)} seeds...")
        amplified = few_shot_amplify(self.seeds, target=target_count)

        # Stage 3: Evolve a subset for difficulty diversity
        novel = [e for e in amplified if e.get("novelty", 0) > 0.5]
        print(f"Evolving {len(novel[:200])} examples for difficulty...")
        evolved = evol_instruct(novel[:200], rounds=2)

        self.generated = amplified + evolved
        print(f"Total raw examples: {len(self.generated)}")

        # Stage 4: Quality filter
        self.filtered = quality_filter(self.generated)
        print(f"After filtering: {len(self.filtered)}")

        # Stage 5: Quality report
        return self.filtered, quality_scorecard(self.filtered)

    def export(self, path="training_data.jsonl"):
        """Export in JSONL format for fine-tuning."""
        with open(path, "w") as f:
            for ex in self.filtered:
                f.write(json.dumps(ex) + "\n")
        print(f"Exported {len(self.filtered)} examples to {path}")

# Run the full pipeline
pipeline = SyntheticDataPipeline(
    task_description="Customer intent classification for e-commerce",
    seed_examples=SEEDS[:20]
)
dataset, scores = pipeline.run(target_count=1000)
pipeline.export()

# Typical output:
# Amplifying 20 seeds...
# Generated 1023 examples (avg novelty: 0.64)
# Evolving 200 examples for difficulty...
# Evolved 200 -> 487 examples
# Total raw examples: 1510
# After filtering: 847
# Exported 847 examples to training_data.jsonl
            

The numbers tell the story: we start with 20 seed examples, amplify to ~1,000, add ~300 evolved variants, and after aggressive filtering end up with ~850 high-quality training examples. The cost? About $3–6 in API calls. Compare that to the $10,000+ you'd spend on human annotation for the same volume.

The key detail is the novelty threshold in Stage 3. We only evolve examples with novelty above 0.5, which means they're already somewhat diverse. Evolving near-copies of seeds wastes API calls and produces redundant hard examples. For the connection to fine-tuning, see the fine-tuning post — once you have this dataset, that post shows you how to train on it.
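If the target is OpenAI-style chat fine-tuning, each filtered example also needs to be wrapped in a `messages` list before export. A minimal conversion sketch (the system prompt and the `text`/`intent` keys mirror this post's examples; adjust both for your schema):

```python
import json

def to_chat_jsonl(examples, path="chat_training_data.jsonl",
                  system="Classify the customer's intent."):
    """Convert {"text", "intent"} examples into chat-format JSONL rows."""
    with open(path, "w") as f:
        for ex in examples:
            row = {"messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": ex["text"]},
                {"role": "assistant", "content": ex["intent"]},
            ]}
            f.write(json.dumps(row) + "\n")
    return path

to_chat_jsonl([{"text": "Where is my order?", "intent": "order_status"}],
              path="demo.jsonl")
```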

Production Patterns and Pitfalls

Everything above works beautifully in a notebook. Production introduces new failure modes that can silently degrade your fine-tuned model. Here are the big ones and how to defend against them:

Topic Drift

Synthetic data slowly diverges from real user queries. The model generates examples that are plausible but don't match actual usage patterns. Your classifier handles synthetic edge cases perfectly but stumbles on the boring, repetitive queries that make up 80% of real traffic.

Fix: Mix synthetic with real data. Even 10% real examples from production logs anchors the distribution. Periodically regenerate synthetic data using recent production queries as seeds.
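A sketch of that anchoring step, sized so real examples make up a chosen fraction of the final mix (the 10% default follows the text above; the `source` key is purely illustrative):

```python
import random

def mix_datasets(synthetic, real, real_fraction=0.10, seed=0):
    """Blend real production examples into a synthetic dataset.

    Sizes the real slice so it makes up `real_fraction` of the final mix,
    capped by how much real data is actually available."""
    rng = random.Random(seed)
    n_real = min(len(real),
                 round(real_fraction * len(synthetic) / (1 - real_fraction)))
    mixed = synthetic + rng.sample(real, n_real)
    rng.shuffle(mixed)
    return mixed

synthetic = [{"text": f"s{i}", "source": "synthetic"} for i in range(900)]
real = [{"text": f"r{i}", "source": "real"} for i in range(500)]
mixed = mix_datasets(synthetic, real)
n_real = sum(1 for ex in mixed if ex["source"] == "real")
print(len(mixed), n_real)  # 1000 100
```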

Label Noise Amplification

When you fine-tune on wrong labels, the model doesn't just miss those examples — it learns the wrong pattern and applies it to similar inputs. A 5% label error rate in training data can cause a 15% accuracy drop on related examples.

Fix: Multi-model verification. Generate with Model A, verify labels with Model B. If two models disagree on a label, flag it for manual review rather than including it in the dataset.
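The routing logic is simple enough to sketch without live API calls. Here each judge is a callable returning True if the label looks correct; in production each would wrap the label-verification prompt sent to a different model (the stub judges below are purely illustrative):

```python
def cross_verify(example, judges):
    """Run each judge and route the example: keep on unanimous approval,
    otherwise send it to the human review queue."""
    votes = [judge(example) for judge in judges]
    return "keep" if all(votes) else "review"

# Stub judges standing in for two different LLM verifiers
strict = lambda ex: ex["label"] in {"refund_request", "order_status"}
lenient = lambda ex: True

ok = {"text": "Where's my package?", "label": "order_status"}
bad = {"text": "Where's my package?", "label": "cancellation"}
print(cross_verify(ok, [strict, lenient]))   # keep
print(cross_verify(bad, [strict, lenient]))  # review
```

Requiring unanimity is deliberately conservative: disagreement is exactly the signal that an example is ambiguous enough to deserve human eyes.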

Evaluation Contamination

This is the silent killer. If your synthetic training examples overlap with your test set, your evaluation metrics are meaningless — the model is memorizing, not generalizing. It looks great in your notebook and fails in production.

Fix: Always run the contamination check from the quality scorecard. Create your test set before generating training data, and explicitly exclude test-like examples during filtering. The evaluation post covers this in depth.
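The exclusion step itself is a few lines of NumPy once both sets are embedded. A sketch using the same 0.9 cosine threshold as the scorecard's contamination check (toy 2-D vectors stand in for real embeddings):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def exclude_test_like(train_embs, test_embs, threshold=0.9):
    """Boolean mask over training rows: True = safe to keep,
    False = too similar to some test example (potential contamination)."""
    cross = cosine_similarity(train_embs, test_embs)
    return cross.max(axis=1) <= threshold

# Toy embeddings: train[0] nearly duplicates test[0]
train = np.array([[1.0, 0.01], [0.0, 1.0]])
test = np.array([[1.0, 0.0]])
mask = exclude_test_like(train, test)
print(mask.tolist())  # [False, True]
```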

Mode Collapse

LLMs over-represent certain styles and patterns. Without diversity pressure, your synthetic dataset might contain 200 examples that all follow the same sentence structure, just with different nouns swapped in. The model learns the template, not the task.

Fix: Track embedding diversity during generation. If the average pairwise distance drops below a threshold, increase temperature or add explicit diversity prompts. The Evol-Instruct technique helps here by pushing examples in different complexity directions.
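A sketch of such a monitor, meant to be called on the embeddings of the last few batches (the 0.25 threshold is an assumption to tune per task):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def diversity_alarm(recent_embs, min_distance=0.25):
    """Average pairwise embedding distance over the recent window.
    Returns (distance, collapsed?) so the generator can react, e.g.
    by raising temperature or strengthening the diversity prompt."""
    dist = 1 - cosine_similarity(recent_embs)
    idx = np.triu_indices(len(recent_embs), k=1)
    avg = float(dist[idx].mean())
    return avg, avg < min_distance

# Toy check: three near-identical vectors trip the alarm
collapsed = np.array([[1.0, 0.0], [0.999, 0.01], [1.0, 0.001]])
avg, alarm = diversity_alarm(collapsed)
print(alarm)  # True
```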

The most dangerous synthetic dataset is one that looks good on the surface but contains subtle, systematic errors. That's why the quality inspector skill — learning to read synthetic data critically — is worth practicing.


Choosing Your Approach

Not sure which technique to start with? Walk through this decision tree:

  1. Do you have ANY labeled data?
    • No → Start with Self-Instruct. Generate 50–100 examples, manually review 20, fix systematic issues, then use the corrected set as seeds for Few-Shot Amplification.
    • Yes → Continue to step 2.
  2. Do you have 10+ labeled examples?
    • No → Self-Instruct with manual review to bootstrap up to 20 seeds, then Few-Shot Amplify.
    • Yes → Go directly to Few-Shot Amplification.
  3. Do you need hard / complex examples?
    • Yes → Add Evol-Instruct after amplification. Two evolution rounds typically produce a good difficulty spread.
    • No → Skip to filtering.
  4. Is this for production fine-tuning?
    • Yes → Run the Full Pipeline with all four quality filters. Budget for $3–6 per 1K examples. Consider hybrid validation: have a human review 10% of the filtered dataset.
    • No → Self-Instruct or Few-Shot with basic deduplication is enough for prototyping.

And a few scenarios where synthetic data is not the right answer:

References & Further Reading