Structured Output from LLMs: Getting JSON Every Time
The JSON Wall
You've built a prototype. The LLM extracts data beautifully in the playground. You deploy it. Within an hour, your pipeline crashes because the model returned {"name": "John", "age": "twenty-seven"} instead of {"name": "John", "age": 27}.
Welcome to the JSON wall. Every developer building with LLMs hits it.
The core tension is simple: LLMs generate tokens sequentially, optimizing for the most plausible next character. They don't "understand" JSON schema. They're autocompleting text that happens to look like JSON. Most of the time this works. But "most of the time" is not good enough when your production pipeline expects valid, typed, schema-compliant JSON on every single call.
This post walks through four progressively more robust approaches to getting structured output from LLMs — from prompt-only (fragile) to Pydantic + instructor (production-grade). Every approach includes working code for both OpenAI and Anthropic, an honest assessment of failure modes, and clear guidance on when to use each one.
In our receipt parser post, we quietly used response_format=json_object to extract structured data from receipt images. Today we go deep on why that works and what to do when it doesn't.
Why Plain Prompting Breaks
The naive approach looks reasonable. You write a prompt like "Extract the person's name and age as JSON" and it often works. Here's what that looks like:
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": """Extract the person info as JSON with fields: name (string), age (integer).
Text: "John Smith is 27 years old and lives in Portland."
Return ONLY the JSON, no other text.""",
    }],
)
print(response.choices[0].message.content)
```
# Sometimes: {"name": "John Smith", "age": 27}
# Sometimes: something else entirely
The problem is "sometimes." Here are the five failure modes you'll hit, roughly in order of frequency:
````python
# Failure 1: Markdown wrapping
# Model returns fenced code instead of raw JSON
"""```json
{"name": "John Smith", "age": 27}
```"""

# Failure 2: Extra commentary
"""Sure! Here's the extracted data:
{"name": "John Smith", "age": 27}
Hope that helps!"""

# Failure 3: Schema violations
{"name": "John Smith", "age": "twenty-seven"}  # string instead of int
{"name": "John Smith"}  # missing field
{"name": "John Smith", "age": 27, "city": "Portland"}  # extra field

# Failure 4: Partial JSON (token limit hit)
# {"name": "John Smith", "age":   <- truncated mid-value

# Failure 5: Hallucinated fields
{"name": "John Smith", "age": 27, "occupation": "engineer"}  # invented
````
| Failure Mode | Frequency | Severity | Fixable with Regex? |
|---|---|---|---|
| Markdown wrapping | ~30% of calls | Low | Yes — strip ```json...``` |
| Extra commentary | ~15% of calls | Medium | Fragile — find first { to last } |
| Wrong types | ~5-10% of calls | High | No — valid JSON, wrong schema |
| Partial JSON | ~1-2% of calls | Critical | No — incomplete data |
| Hallucinated fields | ~5% of calls | Medium | Partially — can strip unknown keys |
You can get prompt-only to work about 90% of the time with careful engineering — system messages, few-shot examples, "Return ONLY JSON" in all caps. But 90% means 1 in 10 API calls returns something your parser can't handle. At 10,000 calls per day, that's 1,000 crashes. You need something better.
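If you must ship prompt-only, a defensive parsing layer recovers the two most common failures. Here is a minimal sketch of the regex cleanup described in the table above (the function name `coerce_json` is ours); as the table warns, it cannot fix wrong types or truncated output:

```python
import json
import re

def coerce_json(raw: str) -> dict:
    """Best-effort recovery of a JSON object from raw model output.

    Handles markdown fences and surrounding commentary. Cannot fix
    schema violations or truncated JSON.
    """
    # Strip a ```json ... ``` (or bare ```) fence if present
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fenced:
        raw = fenced.group(1)
    # Fall back to the first '{' through the last '}'. Fragile, but
    # it survives "Sure! Here's the data:" style commentary.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start:end + 1])

# Recovers failure modes 1 and 2:
print(coerce_json('```json\n{"name": "John Smith", "age": 27}\n```'))
print(coerce_json('Sure! Here is the data:\n{"name": "John Smith", "age": 27}\nThanks!'))
# Both print: {'name': 'John Smith', 'age': 27}
```

This buys you a few percentage points of reliability, nothing more; the next three sections are about eliminating the failures instead of patching them.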
JSON Mode — Constrained Decoding
OpenAI offers two levels of JSON enforcement. The first, response_format={"type": "json_object"}, guarantees syntactically valid JSON. The second, response_format={"type": "json_schema", ...}, guarantees valid JSON that matches your schema.
How does this work under the hood? During token generation, the model's logit distribution is masked to only allow tokens that produce valid JSON. If you've read our softmax post, you know that after softmax produces probabilities over the vocabulary, the model samples a token. Constrained decoding zeros out any token that would break the JSON structure before sampling happens. An open brace must eventually be closed. A string value must be quoted. A number can't contain letters.
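Providers don't publish their decoder internals, but the masking step can be sketched in a few lines. This toy example assumes a four-token vocabulary and a grammar that only allows digit tokens at the current position:

```python
import math

def masked_softmax(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Set disallowed tokens' logits to -inf, then softmax over what's left."""
    masked = {t: (l if t in allowed else float("-inf")) for t, l in logits.items()}
    exps = {t: math.exp(l) for t, l in masked.items()}  # exp(-inf) == 0.0
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

# After emitting '{"age": ', the grammar only permits digits;
# '"' or a word token would break the number literal.
logits = {"2": 1.0, "7": 0.5, '"': 2.0, "abc": 1.5}
probs = masked_softmax(logits, allowed={"2", "7"})
print(probs['"'])  # 0.0 — can never be sampled
print(round(probs["2"] + probs["7"], 6))  # 1.0 — mass renormalized over legal tokens
```

The key property: illegal tokens get exactly zero probability before sampling, so no amount of randomness can produce malformed JSON.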
Here's basic JSON mode — guarantees valid JSON, but doesn't enforce any particular schema:
```python
from openai import OpenAI
import json

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract person info. Return JSON with: name (string), age (integer)."},
        {"role": "user", "content": "John Smith is 27 years old and lives in Portland."},
    ],
    response_format={"type": "json_object"},  # guarantees valid JSON
)

data = json.loads(response.choices[0].message.content)  # always parses
print(data)
# {"name": "John Smith", "age": 27}
```
And here's the stricter JSON Schema mode — the model's output is guaranteed to match your exact schema:
```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract person info from the text."},
        {"role": "user", "content": "John Smith is 27 years old and lives in Portland."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
                "additionalProperties": False,
            },
        },
    },
)

data = json.loads(response.choices[0].message.content)
# Guaranteed shape: {"name": <string>, "age": <integer>}
```
Anthropic doesn't have native JSON mode in the same way. But you can use the prefill trick — start the assistant's response with an opening brace, and the model will continue from there:
```python
import anthropic
import json

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    messages=[
        {"role": "user", "content": "Extract name (string) and age (integer) as JSON.\nText: John Smith is 27."},
        {"role": "assistant", "content": "{"},  # prefill trick — force JSON start
    ],
)

data = json.loads("{" + response.content[0].text)  # prepend the brace we started with
print(data)
# {"name": "John Smith", "age": 27}
```
The gotchas with JSON mode:
- Valid JSON ≠ correct data. The model can output `{"age": -5}` — syntactically perfect JSON, semantically nonsensical.
- Basic JSON mode doesn't enforce schema. You get valid JSON but might get `{"person_name": "John"}` instead of `{"name": "John"}`.
- Schema mode is newer and not available on all models or providers.
- Slight latency overhead. Constrained decoding adds ~10-20% to generation time because the model checks each token against the grammar.
Key insight: JSON mode solves the syntactic problem (no more markdown wrapping, no more commentary, no more truncated JSON). But it doesn't solve the semantic problem. For that, you need validation — which we'll get to in Section 6.
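To make that split concrete: both payloads below pass `json.loads`, but only a semantic check (a `sanity_check` helper of our own) catches the nonsense value:

```python
import json

def sanity_check(data: dict) -> list[str]:
    """Semantic checks that no amount of JSON-mode enforcement gives you."""
    errors = []
    if not (0 <= data.get("age", -1) <= 150):
        errors.append(f"implausible age: {data.get('age')}")
    if not data.get("name", "").strip():
        errors.append("empty name")
    return errors

good = json.loads('{"name": "John Smith", "age": 27}')
bad = json.loads('{"name": "John Smith", "age": -5}')  # valid JSON, nonsense data

print(sanity_check(good))  # []
print(sanity_check(bad))   # ['implausible age: -5']
```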
Tool Use / Function Calling
Here's a reframing that changes everything: instead of asking the model to "output JSON," you define a function with typed parameters and let the model "call" it. The model fills in the function arguments, and the API returns them as structured data.
This is how Claude and GPT-4 handle structured extraction in production. The model isn't generating free text that happens to look like JSON — it's filling in a function call, which is a task it's been specifically trained to do.
Here's the Anthropic approach using tool use:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=256,
    tools=[{
        "name": "extract_person",
        "description": "Extract person information from text",
        "input_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string", "description": "Full name"},
                "age": {"type": "integer", "description": "Age in years"},
            },
            "required": ["name", "age"],
        },
    }],
    tool_choice={"type": "tool", "name": "extract_person"},  # force this tool
    messages=[
        {"role": "user", "content": "John Smith is 27 years old and lives in Portland."},
    ],
)

# The response contains a tool_use block with structured arguments
tool_input = response.content[0].input
print(tool_input)
# {"name": "John Smith", "age": 27}
```
And the OpenAI equivalent using function calling:
```python
from openai import OpenAI
import json

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "John Smith is 27 years old and lives in Portland."},
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "extract_person",
            "description": "Extract person information from text",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
        },
    }],
    tool_choice={"type": "function", "function": {"name": "extract_person"}},
)

tool_args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
print(tool_args)
# {"name": "John Smith", "age": 27}
```
Why is tool use better than raw JSON mode?
- Stronger schema adherence. The model has been trained specifically on function-calling patterns, so it produces the right types more reliably than free-form JSON generation.
- Complex nested schemas. Tool use handles deeply nested objects, arrays of objects, and optional fields more gracefully.
- Wider availability. Tool use works across almost all major LLM providers. JSON Schema mode is OpenAI-specific.
- Semantic clarity. Descriptions on each parameter give the model more context about what each field should contain.
The tradeoffs:
- Cost. Tool definitions are injected into the prompt as tokens. A complex schema can add 200-500 tokens to every request.
- Still wrong values. The types will be right (`age` will be an integer), but the values can still be wrong (`age: 270`).
- Forced vs. optional. Without `tool_choice`, the model might decide not to call the tool at all. Always force it for extraction tasks.
Pydantic + Instructor — The Production Pattern
The instructor library, by Jason Liu, is the missing piece. It wraps the API client, converts your Pydantic model into the provider's schema format, and adds automatic validation with retry. If the model returns data that fails validation, instructor sends the error message back to the model and asks it to try again.
Start with a Pydantic model that defines your schema with type annotations and validators:
```python
from pydantic import BaseModel, Field, field_validator
from typing import Optional

class Person(BaseModel):
    name: str = Field(description="Full name of the person")
    age: int = Field(description="Age in years", ge=0, le=150)
    email: Optional[str] = Field(default=None, description="Email if mentioned")

    @field_validator("name")
    @classmethod
    def name_not_empty(cls, v):
        if not v.strip():
            raise ValueError("Name cannot be empty")
        return v.strip()
```
Now patch the OpenAI client with instructor and make the call:
```python
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

person = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Person,  # your Pydantic model
    max_retries=3,          # retry on validation failure
    messages=[
        {"role": "user", "content": "John Smith is 27 years old. Email: [email protected]"},
    ],
)

print(person)               # name='John Smith' age=27 email='[email protected]'
print(person.model_dump())  # {"name": "John Smith", "age": 27, "email": "[email protected]"}
```
The same thing works with Anthropic:
```python
import instructor
from anthropic import Anthropic

client = instructor.from_anthropic(Anthropic())

person = client.messages.create(
    model="claude-sonnet-4-6",
    response_model=Person,
    max_retries=3,
    max_tokens=256,
    messages=[
        {"role": "user", "content": "John Smith is 27 years old. Email: [email protected]"},
    ],
)

print(person.name)  # John Smith
print(person.age)   # 27
```
The magic is in the retry. Here's what happens when validation fails:
1. Model returns `{"name": "John", "age": -5, "email": null}`.
2. Pydantic validates and catches: `age: Value must be >= 0`.
3. Instructor sends a new message to the model: "The previous response failed validation. Error: age — Value must be >= 0. Please fix and try again."
4. Model corrects: `{"name": "John", "age": 27, "email": null}`.
5. Pydantic validates — passes. Return the object.
Advanced patterns that instructor handles naturally:
- Nested models — `Person` with a `list[Address]` field, each `Address` being its own Pydantic model.
- Union types — a field that can be `str | int`, letting the model choose the appropriate type.
- Computed fields — fields derived from other fields, validated after extraction.
- Enums — restrict a field to specific values: `status: Literal["active", "inactive"]`.
Cost note: Each retry is another API call. With max_retries=3, a worst-case extraction costs 4x a single call. But a failed extraction that crashes your pipeline, requires manual intervention, or serves bad data to users costs far more. Retries are cheap insurance.
The Validation Layer You Still Need
Even with instructor and Pydantic, you need validation beyond schema compliance. There are three levels of correctness:
- Syntactic — Is it valid JSON? Handled by JSON mode and tool use.
- Schema — Does it match the expected types and structure? Handled by Pydantic.
- Semantic — Does it make sense? This one is on you.
Semantic validation catches the errors that pass every other check — a receipt total that doesn't match its line items, a date of birth in the future, a price that's negative. Here's a validation pipeline that handles all three levels:
```python
from pydantic import BaseModel, Field, field_validator, model_validator
from typing import List

class LineItem(BaseModel):
    description: str
    price: float = Field(ge=0)

class Receipt(BaseModel):
    store_name: str
    items: List[LineItem]
    subtotal: float = Field(ge=0)
    tax: float = Field(ge=0)
    total: float = Field(ge=0)

    @field_validator("store_name")
    @classmethod
    def store_not_empty(cls, v):
        if not v.strip():
            raise ValueError("Store name cannot be empty")
        return v.strip()

    @model_validator(mode="after")
    def check_total_matches(self):
        """Semantic validation: total should equal subtotal + tax"""
        expected = round(self.subtotal + self.tax, 2)
        if abs(self.total - expected) > 0.02:  # 2 cent tolerance
            raise ValueError(
                f"Total ${self.total:.2f} doesn't match "
                f"subtotal ${self.subtotal:.2f} + tax ${self.tax:.2f} = ${expected:.2f}"
            )
        return self

    @model_validator(mode="after")
    def check_items_sum(self):
        """Semantic validation: line items should sum to subtotal"""
        items_sum = round(sum(item.price for item in self.items), 2)
        if abs(items_sum - self.subtotal) > 0.05:  # 5 cent tolerance
            raise ValueError(
                f"Line items sum to ${items_sum:.2f} "
                f"but subtotal is ${self.subtotal:.2f}"
            )
        return self
```
The retry pattern becomes powerful with semantic validation. When the model gets the arithmetic wrong, instructor sends the specific error back:
```python
# What instructor sends to the model on retry:
"""The previous response failed validation:
- Total $45.00 doesn't match subtotal $42.50 + tax $3.40 = $45.90
Please fix the extraction and try again, ensuring the math is correct."""

# The model now has targeted feedback — not just "try again" but
# "your total is wrong, here's what it should be".
# This succeeds on retry ~95% of the time.
```
In our receipt parser, this arithmetic validation caught errors in ~3% of extractions. Without it, those errors would have silently corrupted the database. Three percent sounds small until you realize it's 30 bad records per 1,000 receipts.
Temperature × Structure
If you've read our softmax & temperature post, you know that temperature controls the "sharpness" of the probability distribution over tokens. High temperature = more randomness = more creativity. Low temperature = more deterministic = more predictable.
For structured extraction, this has a direct consequence: higher temperature means more schema violations.
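The mechanism fits in a few lines: divide the logits by a temperature T before softmax, which sharpens the distribution when T < 1 and flattens it when T > 1. The vocabulary and logits here are made up for illustration:

```python
import math

def softmax_with_temperature(logits: dict[str, float], T: float) -> dict[str, float]:
    """softmax(logits / T): low T sharpens the distribution, high T flattens it."""
    exps = {t: math.exp(l / T) for t, l in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

# Plausible next tokens after '"age": ' — 27 is correct, 28 and 270 are
# wrong-but-valid neighbors that constrained decoding cannot rule out.
logits = {"27": 3.0, "28": 1.5, "270": 1.0}
for T in (0.1, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: P('27') = {probs['27']:.3f}")
# As T rises, probability mass leaks from '27' to the wrong-but-valid
# tokens — which is exactly why correct_values degrades at high temperature.
```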
The table below shows simulated results from 100 calls per temperature. With JSON mode enabled, valid JSON stays near 100% at every temperature, but correct values still degrade, because temperature affects which valid tokens get chosen: `"age": 27` vs `"age": 28` vs `"age": 270`.
| Temperature | Valid JSON | Correct Schema | Correct Values |
|---|---|---|---|
| 0.0 | 100% | 99.2% | 97.1% |
| 0.3 | 100% | 98.5% | 94.8% |
| 0.7 | 99.8% | 95.1% | 88.3% |
| 1.0 | 99.2% | 89.7% | 79.6% |
Rule of thumb: Use temperature 0 for structured extraction. You want the most likely valid completion, not creative diversity. Even with constrained decoding that guarantees valid JSON, temperature still affects which valid token gets chosen — and at high temperature, "age": 27 can become "age": 28 or "age": 270.
The exception: when you want diverse extractions (brainstorming categories, generating varied examples), use a moderate temperature (0.3-0.5) with validation to catch outliers.
The Decision Tree
Here's the cheat sheet — a quick reference for choosing an approach when you need structured output from an LLM:
| Approach | Reliability | Latency | Cost | Complexity |
|---|---|---|---|---|
| Prompt-only | ~90% | Baseline | Lowest | None |
| JSON mode | ~99% syntax | +10-20% | Same | Low |
| Tool use | ~99% schema | +10-20% | +schema tokens | Medium |
| Instructor | ~99.9% | +retry cost | +retry calls | Medium |
For most production use cases, the answer is instructor. It handles both providers, adds validation with retry, and the Pydantic model serves as living documentation of your schema. The only reason to use a simpler approach is if you're prototyping and don't want the dependency.
Try It: Schema Validator
Pick a schema, edit the JSON, and validate. See all three levels — syntactic, schema, and semantic — in action.
The demo above runs entirely in your browser — no API calls. It demonstrates the same three-level validation pipeline you'd build in production: parse the JSON (syntactic), check it against the schema (types and required fields), then run semantic checks (business logic). When validation fails, the retry prompt shows exactly what instructor would send back to the model.
Ship It
The path from "the LLM sometimes returns JSON" to "the LLM always returns validated, typed, schema-compliant data" is four steps: prompt-only gets you to 90%, JSON mode gets you to 99% syntactic, tool use gets you to 99% schema, and instructor + Pydantic gets you the rest of the way with automatic retry. Each step is more code and more cost, but each step also eliminates a class of failure that would otherwise crash your pipeline at 3 AM.
Start with the simplest approach that meets your reliability needs. For a prototype, JSON mode is fine. For production, instructor is the answer. And whatever you choose, set temperature to 0 and add semantic validation. Your future self — the one not getting paged about malformed JSON — will thank you.
References & Further Reading
- OpenAI — Structured Outputs — Official documentation for JSON mode and JSON Schema mode
- Anthropic — Tool Use — Claude's approach to structured extraction via function calling
- Jason Liu — Instructor — The Pydantic-based library for validated LLM extraction
- Pydantic Documentation — Data validation and settings management using Python type annotations
- Outlines — Grammar-based constrained decoding for open-source models
- DadOps — Using LLMs to Parse Grocery Receipts — A practical application of structured extraction
- DadOps — Softmax & Temperature from Scratch — How temperature affects token selection