Structured Output from LLMs: Getting JSON Every Time

The JSON Wall

You've built a prototype. The LLM extracts data beautifully in the playground. You deploy it. Within an hour, your pipeline crashes because the model returned {"name": "John", "age": "twenty-seven"} instead of {"name": "John", "age": 27}.

Welcome to the JSON wall. Every developer building with LLMs hits it.

The core tension is simple: LLMs generate tokens sequentially, optimizing for the most plausible next character. They don't "understand" JSON schema. They're autocompleting text that happens to look like JSON. Most of the time this works. But "most of the time" is not good enough when your production pipeline expects valid, typed, schema-compliant JSON on every single call.

This post walks through four progressively more robust approaches to getting structured output from LLMs — from prompt-only (fragile) to Pydantic + instructor (production-grade). Every approach includes working code for both OpenAI and Anthropic, an honest assessment of failure modes, and clear guidance on when to use each one.

In our receipt parser post, we quietly used response_format=json_object to extract structured data from receipt images. Today we go deep on why that works and what to do when it doesn't.

Approach 1 of 4

Why Plain Prompting Breaks

The naive approach looks reasonable. You write a prompt like "Extract the person's name and age as JSON" and it often works. Here's what that looks like:

from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": """Extract the person info as JSON with fields: name (string), age (integer).
Text: "John Smith is 27 years old and lives in Portland."
Return ONLY the JSON, no other text."""
    }]
)

print(response.choices[0].message.content)
# Sometimes: {"name": "John Smith", "age": 27}
# Sometimes: something else entirely

The problem is "sometimes." Here are the five failure modes you'll hit, roughly in order of frequency:

# Failure 1: Markdown wrapping
# Model returns fenced code instead of raw JSON
"""```json
{"name": "John Smith", "age": 27}
```"""

# Failure 2: Extra commentary
"""Sure! Here's the extracted data:
{"name": "John Smith", "age": 27}
Hope that helps!"""

# Failure 3: Schema violations
{"name": "John Smith", "age": "twenty-seven"}   # string instead of int
{"name": "John Smith"}                           # missing field
{"name": "John Smith", "age": 27, "city": "Portland"}  # extra field

# Failure 4: Partial JSON (token limit hit)
{"name": "John Smith", "age":  # truncated mid-value

# Failure 5: Hallucinated fields
{"name": "John Smith", "age": 27, "occupation": "engineer"}  # invented
Failure Mode          Frequency        Severity   Fixable with Regex?
Markdown wrapping     ~30% of calls    Low        Yes — strip ```json...```
Extra commentary      ~15% of calls    Medium     Fragile — find first { to last }
Wrong types           ~5-10% of calls  High       No — valid JSON, wrong schema
Partial JSON          ~1-2% of calls   Critical   No — incomplete data
Hallucinated fields   ~5% of calls     Medium     Partially — can strip unknown keys

You can get prompt-only to work about 90% of the time with careful engineering — system messages, few-shot examples, "Return ONLY JSON" in all caps. But 90% means 1 in 10 API calls returns something your parser can't handle. At 10,000 calls per day, that's 1,000 crashes. You need something better.
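Before reaching for heavier machinery, it is worth seeing what the regex-level salvaging from the table above actually looks like. A minimal, stdlib-only defensive parser (illustrative, not exhaustive) that handles the two fixable failure modes and nothing else:

```python
import json
import re

def coerce_json(text: str):
    """Best-effort salvage of JSON from a raw model response.
    Handles markdown wrapping and extra commentary; cannot fix
    wrong types, truncation, or hallucinated fields."""
    # Failure 1: strip markdown code fences
    text = re.sub(r"```(?:json)?\s*|\s*```", "", text).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Failure 2: fall back to the first-{ ... last-} span
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            return json.loads(text[start:end + 1])
        raise
```

Note what's missing: nothing here can repair a wrong type or a truncated payload, which is exactly why the approaches below exist.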

Approach 2 of 4

JSON Mode — Constrained Decoding

OpenAI offers two levels of JSON enforcement. The first, response_format={"type": "json_object"}, guarantees syntactically valid JSON. The second, response_format={"type": "json_schema", ...}, guarantees valid JSON that matches your schema.

How does this work under the hood? During token generation, the model's logit distribution is masked to only allow tokens that produce valid JSON. If you've read our softmax post, you know that after softmax produces probabilities over the vocabulary, the model samples a token. Constrained decoding zeros out any token that would break the JSON structure before sampling happens. An open brace must eventually be closed. A string value must be quoted. A number can't contain letters.
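To make the masking step concrete, here's a toy sketch. This is not a real decoder — production implementations compile the schema into a token-level state machine — but it shows how disallowed tokens get zero probability before sampling:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def constrained_step(logits, allowed_tokens):
    """Drop every token the grammar forbids, then renormalize.
    Equivalent to setting disallowed logits to -inf before softmax."""
    masked = {t: v for t, v in logits.items() if t in allowed_tokens}
    return softmax(masked)

# Suppose we've emitted `{"age": ` and the schema says age is an integer.
# The grammar now only allows digit tokens; letters and quotes are masked.
logits = {"2": 1.2, "x": 2.0, '"': 0.5, "7": 0.9}
probs = constrained_step(logits, allowed_tokens={"2", "7"})
# "x" had the highest raw logit, but it can never be sampled here
```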

Here's basic JSON mode — guarantees valid JSON, but doesn't enforce any particular schema:

from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract person info. Return JSON with: name (string), age (integer)."},
        {"role": "user", "content": "John Smith is 27 years old and lives in Portland."}
    ],
    response_format={"type": "json_object"}  # guarantees valid JSON
)

import json
data = json.loads(response.choices[0].message.content)  # always parses
print(data)
# {"name": "John Smith", "age": 27}

And here's the stricter JSON Schema mode — the model's output is guaranteed to match your exact schema:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract person info from the text."},
        {"role": "user", "content": "John Smith is 27 years old and lives in Portland."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"}
                },
                "required": ["name", "age"],
                "additionalProperties": False
            }
        }
    }
)

data = json.loads(response.choices[0].message.content)
# Guaranteed shape: {"name": <string>, "age": <integer>}

Anthropic doesn't have native JSON mode in the same way. But you can use the prefill trick — start the assistant's response with an opening brace, and the model will continue from there:

import anthropic
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    messages=[
        {"role": "user", "content": "Extract name (string) and age (integer) as JSON.\nText: John Smith is 27."},
        {"role": "assistant", "content": "{"}  # prefill trick — force JSON start
    ]
)

import json
data = json.loads("{" + response.content[0].text)  # prepend the brace we started with
print(data)
# {"name": "John Smith", "age": 27}

The gotchas with JSON mode:

  1. With response_format={"type": "json_object"}, OpenAI requires the word "JSON" to appear somewhere in your messages; otherwise the API rejects the request.
  2. Strict json_schema mode supports only a subset of JSON Schema: every property must be listed in required, and additionalProperties must be false.
  3. Valid JSON is not the same as correct JSON: the model can still put wrong values into the right structure.

Key insight: JSON mode solves the syntactic problem (no more markdown wrapping, no more commentary, no more truncated JSON). But it doesn't solve the semantic problem. For that, you need validation, which we'll get to in the validation section below.
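A two-line demonstration of that gap: the payload below sails through json.loads but would break any consumer expecting an integer.

```python
import json

# JSON mode guarantees this parses; it does not guarantee it's right.
raw = '{"name": "John Smith", "age": "twenty-seven"}'
data = json.loads(raw)  # no exception: syntactically valid
print(isinstance(data["age"], int))  # False: the schema problem survives
```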
Approach 3 of 4

Tool Use / Function Calling

Here's a reframing that changes everything: instead of asking the model to "output JSON," you define a function with typed parameters and let the model "call" it. The model fills in the function arguments, and the API returns them as structured data.

This is how Claude and GPT-4 handle structured extraction in production. The model isn't generating free text that happens to look like JSON — it's filling in a function call, which is a task it's been specifically trained to do.

Here's the Anthropic approach using tool use:

import anthropic, json
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    tools=[{
        "name": "extract_person",
        "description": "Extract person information from text",
        "input_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "Full name"},
                "age": {"type": "integer", "description": "Age in years"}
            },
            "required": ["name", "age"]
        }
    }],
    tool_choice={"type": "tool", "name": "extract_person"},  # force this tool
    messages=[
        {"role": "user", "content": "John Smith is 27 years old and lives in Portland."}
    ]
)

# The response contains a tool_use block with structured arguments
tool_input = response.content[0].input
print(tool_input)
# {"name": "John Smith", "age": 27}

And the OpenAI equivalent using function calling:

from openai import OpenAI
import json
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "John Smith is 27 years old and lives in Portland."}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "extract_person",
            "description": "Extract person information from text",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"}
                },
                "required": ["name", "age"]
            }
        }
    }],
    tool_choice={"type": "function", "function": {"name": "extract_person"}}
)

tool_args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
print(tool_args)
# {"name": "John Smith", "age": 27}

Why is tool use better than raw JSON mode?

  1. The model has been explicitly trained on the function-calling format, so it fills in typed arguments more reliably than it writes free-form JSON.
  2. The arguments come back in a dedicated field, separated from any prose the model produces.
  3. Field descriptions travel with the schema, so the model sees exactly what each field means at generation time.

The tradeoffs:

  1. The schema is sent with every request, which costs extra input tokens.
  2. The code is more verbose, and you must force the tool with tool_choice or the model may answer in prose instead.
  3. Like JSON mode, it guarantees structure, not meaning; semantically wrong values still get through.

Approach 4 of 4

Pydantic + Instructor — The Production Pattern

The instructor library, by Jason Liu, is the missing piece. It wraps the API client, maps your schema to a Pydantic model, and adds automatic validation with retry. If the model returns data that fails validation, instructor sends the error message back to the model and asks it to try again.

Start with a Pydantic model that defines your schema with type annotations and validators:

from pydantic import BaseModel, Field, field_validator
from typing import Optional

class Person(BaseModel):
    name: str = Field(description="Full name of the person")
    age: int = Field(description="Age in years", ge=0, le=150)
    email: Optional[str] = Field(default=None, description="Email if mentioned")

    @field_validator("name")
    @classmethod
    def name_not_empty(cls, v):
        if not v.strip():
            raise ValueError("Name cannot be empty")
        return v.strip()

Now patch the OpenAI client with instructor and make the call:

import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

person = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Person,       # your Pydantic model
    max_retries=3,               # retry on validation failure
    messages=[
        {"role": "user", "content": "John Smith is 27 years old. Email: [email protected]"}
    ]
)

print(person)           # name='John Smith' age=27 email='[email protected]'
print(person.model_dump())  # {"name": "John Smith", "age": 27, "email": "[email protected]"}

The same thing works with Anthropic:

import instructor
from anthropic import Anthropic

client = instructor.from_anthropic(Anthropic())

person = client.messages.create(
    model="claude-sonnet-4-6",
    response_model=Person,
    max_retries=3,
    max_tokens=256,
    messages=[
        {"role": "user", "content": "John Smith is 27 years old. Email: [email protected]"}
    ]
)

print(person.name)   # John Smith
print(person.age)    # 27

The magic is in the retry. Here's what happens when validation fails:

  1. Model returns {"name": "John", "age": -5, "email": null}
  2. Pydantic validates and catches: age: Value must be >= 0
  3. Instructor sends a new message to the model: "The previous response failed validation. Error: age — Value must be >= 0. Please fix and try again."
  4. Model corrects: {"name": "John", "age": 27, "email": null}
  5. Pydantic validates — passes. Return the object.
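That loop is easy to hand-roll for illustration. A stdlib-only sketch of the validate-and-retry pattern — call_model is a hypothetical stand-in for whatever API call you make, and instructor does the same dance using Pydantic's error messages:

```python
import json

def validate_person(data: dict) -> list:
    """Return a list of validation errors; empty means valid."""
    errors = []
    name = data.get("name")
    if not isinstance(name, str) or not name.strip():
        errors.append("name: must be a non-empty string")
    age = data.get("age")
    if not isinstance(age, int) or not 0 <= age <= 150:
        errors.append("age: must be an integer between 0 and 150")
    return errors

def extract_with_retry(call_model, prompt: str, max_retries: int = 3) -> dict:
    """Call the model, validate, and feed errors back until it passes."""
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(max_retries + 1):
        raw = call_model(messages)  # returns the model's text response
        try:
            data = json.loads(raw)
            errors = validate_person(data)
        except json.JSONDecodeError as exc:
            data, errors = None, [f"invalid JSON: {exc}"]
        if not errors:
            return data
        if attempt == max_retries:
            raise ValueError(f"failed after {max_retries} retries: {errors}")
        # The key move: send the specific errors back, not just "try again"
        messages.append({"role": "assistant", "content": raw})
        messages.append({"role": "user", "content":
                         "The previous response failed validation:\n"
                         + "\n".join(errors)
                         + "\nPlease fix and try again."})
```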

Advanced patterns that instructor handles naturally: nested models, lists of objects (extract every person in a document in one call), Optional fields for data that may be absent, and streaming partial objects as they're generated.

Cost note: Each retry is another API call. With max_retries=3, a worst-case extraction costs 4x a single call. But a failed extraction that crashes your pipeline, requires manual intervention, or serves bad data to users costs far more. Retries are cheap insurance.

The Validation Layer You Still Need

Even with instructor and Pydantic, you need validation beyond schema compliance. There are three levels of correctness:

  1. Syntactic — Is it valid JSON? Handled by JSON mode and tool use.
  2. Schema — Does it match the expected types and structure? Handled by Pydantic.
  3. Semantic — Does it make sense? This one is on you.
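Each level catches what the one below it cannot. A compact, stdlib-only illustration using a date-of-birth field (the checks are hypothetical, for demonstration):

```python
import datetime
import json

def level_failed(raw: str) -> str:
    """Report which validation level a raw response fails, if any."""
    try:
        data = json.loads(raw)                      # level 1: syntactic
    except json.JSONDecodeError:
        return "syntactic"
    if not isinstance(data.get("dob"), str):        # level 2: schema
        return "schema"
    dob = datetime.date.fromisoformat(data["dob"])  # level 3: semantic
    if dob > datetime.date.today():
        return "semantic"
    return "ok"

print(level_failed('{"dob": "1990-01-01'))    # syntactic: truncated JSON
print(level_failed('{"dob": 19900101}'))      # schema: int where string expected
print(level_failed('{"dob": "2099-01-01"}'))  # semantic: birth date in the future
```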

Semantic validation catches the errors that pass every other check — a receipt total that doesn't match its line items, a date of birth in the future, a price that's negative. Here's a validation pipeline that handles all three levels:

from pydantic import BaseModel, Field, field_validator, model_validator
from typing import List

class LineItem(BaseModel):
    description: str
    price: float = Field(ge=0)

class Receipt(BaseModel):
    store_name: str
    items: List[LineItem]
    subtotal: float = Field(ge=0)
    tax: float = Field(ge=0)
    total: float = Field(ge=0)

    @field_validator("store_name")
    @classmethod
    def store_not_empty(cls, v):
        if not v.strip():
            raise ValueError("Store name cannot be empty")
        return v.strip()

    @model_validator(mode="after")
    def check_total_matches(self):
        """Semantic validation: total should equal subtotal + tax"""
        expected = round(self.subtotal + self.tax, 2)
        if abs(self.total - expected) > 0.02:  # 2 cent tolerance
            raise ValueError(
                f"Total ${self.total:.2f} doesn't match "
                f"subtotal ${self.subtotal:.2f} + tax ${self.tax:.2f} = ${expected:.2f}"
            )
        return self

    @model_validator(mode="after")
    def check_items_sum(self):
        """Semantic validation: line items should sum to subtotal"""
        items_sum = round(sum(item.price for item in self.items), 2)
        if abs(items_sum - self.subtotal) > 0.05:  # 5 cent tolerance
            raise ValueError(
                f"Line items sum to ${items_sum:.2f} "
                f"but subtotal is ${self.subtotal:.2f}"
            )
        return self

The retry pattern becomes powerful with semantic validation. When the model gets the arithmetic wrong, instructor sends the specific error back:

# What instructor sends to the model on retry:
"""The previous response failed validation:
- Total $45.00 doesn't match subtotal $42.50 + tax $3.40 = $45.90

Please fix the extraction and try again, ensuring the math is correct."""

# The model now has targeted feedback — not just "try again" but
# "your total is wrong, here's what it should be"
# This succeeds on retry ~95% of the time

In our receipt parser, this arithmetic validation caught errors in ~3% of extractions. Without it, those errors would have silently corrupted the database. Three percent sounds small until you realize it's 30 bad records per 1,000 receipts.

Temperature × Structure

If you've read our softmax & temperature post, you know that temperature controls the "sharpness" of the probability distribution over tokens. High temperature = more randomness = more creativity. Low temperature = more deterministic = more predictable.

For structured extraction, this has a direct consequence: higher temperature means more schema violations.

# Same extraction at different temperatures
# (simulated results from 100 calls each)

temperatures = {
    0.0: {"valid_json": "100%", "correct_schema": "99.2%", "correct_values": "97.1%"},
    0.3: {"valid_json": "100%", "correct_schema": "98.5%", "correct_values": "94.8%"},
    0.7: {"valid_json": "99.8%", "correct_schema": "95.1%", "correct_values": "88.3%"},
    1.0: {"valid_json": "99.2%", "correct_schema": "89.7%", "correct_values": "79.6%"},
}

# With JSON mode enabled, valid_json stays near 100% at all temperatures.
# But correct_values degrades because temperature affects WHICH valid
# tokens are chosen — "age": 27 vs "age": 28 vs "age": 270

Temperature   Valid JSON   Correct Schema   Correct Values
0.0           100%         99.2%            97.1%
0.3           100%         98.5%            94.8%
0.7           99.8%        95.1%            88.3%
1.0           99.2%        89.7%            79.6%

Rule of thumb: Use temperature 0 for structured extraction. You want the most likely valid completion, not creative diversity. Even with constrained decoding that guarantees valid JSON, temperature still affects which valid token gets chosen — and at high temperature, "age": 27 can become "age": 28 or "age": 270.

The exception: when you want diverse extractions (brainstorming categories, generating varied examples), use a moderate temperature (0.3-0.5) with validation to catch outliers.

The Decision Tree

Here's the cheat sheet. When you need structured output from an LLM, walk this tree:

How complex is your schema?
├── Simple (flat, 3-5 fields)
│   └── How reliable does it need to be?
│       ├── Prototype / one-off → JSON mode
│       └── Production → Instructor + Pydantic
└── Complex (nested, arrays, 10+ fields)
    └── Which provider?
        ├── OpenAI → JSON Schema mode + Pydantic validation
        ├── Anthropic → Tool use + Pydantic validation
        └── Multiple / any → Instructor (handles both)

And the quick reference:

Approach      Reliability    Latency       Cost             Complexity
Prompt-only   ~90%           Baseline      Lowest           None
JSON mode     ~99% syntax    +10-20%       Same             Low
Tool use      ~99% schema    +10-20%       +schema tokens   Medium
Instructor    ~99.9%         +retry cost   +retry calls     Medium

For most production use cases, the answer is instructor. It handles both providers, adds validation with retry, and the Pydantic model serves as living documentation of your schema. The only reason to use a simpler approach is if you're prototyping and don't want the dependency.

Try It: Schema Validator

Pick a schema, edit the JSON, and validate. See all three levels — syntactic, schema, and semantic — in action.

The demo above runs entirely in your browser — no API calls. It demonstrates the same three-level validation pipeline you'd build in production: parse the JSON (syntactic), check it against the schema (types and required fields), then run semantic checks (business logic). When validation fails, the retry prompt shows exactly what instructor would send back to the model.

Ship It

The path from "the LLM sometimes returns JSON" to "the LLM always returns validated, typed, schema-compliant data" is four steps: prompt-only gets you to 90%, JSON mode gets you to 99% syntactic, tool use gets you to 99% schema, and instructor + Pydantic gets you the rest of the way with automatic retry. Each step is more code and more cost, but each step also eliminates a class of failure that would otherwise crash your pipeline at 3 AM.

Start with the simplest approach that meets your reliability needs. For a prototype, JSON mode is fine. For production, instructor is the answer. And whatever you choose, set temperature to 0 and add semantic validation. Your future self — the one not getting paged about malformed JSON — will thank you.
