Context Window Strategies: Fitting the World Into Your LLM’s Memory
The Context Window Problem
Context windows have grown 250× in three years. GPT-3.5 launched with 4K tokens. Today Claude offers 200K, Gemini 2.5 Pro handles 1M, and Llama 4 Scout claims 10M. Problem solved, right?
Not even close. Three forces conspire to make context management a permanent engineering challenge:
Cost scales linearly. A 200K-token prompt costs 50× more than a 4K-token prompt. Some providers charge a premium multiplier once you exceed certain thresholds. Stuffing every document into the context window is the “SELECT *” of LLM engineering — it works until the bill arrives.
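The linear scaling is easy to sanity-check. A minimal sketch, using a hypothetical flat input price (real per-token prices vary by provider, tier, and date):

```python
# Illustrative only: per-million-token input prices vary by provider and
# change often -- treat this number as a placeholder, not current pricing.
PRICE_PER_MTOK = 3.00  # USD per million input tokens (hypothetical)

def prompt_cost(prompt_tokens, price_per_mtok=PRICE_PER_MTOK):
    """Input-token cost of a single request, assuming linear pricing."""
    return prompt_tokens / 1_000_000 * price_per_mtok

small = prompt_cost(4_000)     # 4K-token prompt
large = prompt_cost(200_000)   # 200K-token prompt
print(f"4K prompt:   ${small:.4f}")
print(f"200K prompt: ${large:.4f}  ({large / small:.0f}x)")
```

At any flat rate, the 200K prompt costs exactly 50× the 4K prompt; threshold surcharges only make the gap wider.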
The “Lost in the Middle” effect. Liu et al. (2023) demonstrated that LLMs pay strong attention to the beginning and end of the context but suffer 30%+ accuracy drops for information buried in the middle. Recent research suggests this isn’t a bug — it’s an emergent property of how positional encodings like RoPE handle long-range dependencies. Bigger windows don’t fix this; they can make it worse.
Latency grows with context. Time-to-first-token scales with prompt length. KV cache memory grows linearly with sequence length, eventually spilling from fast GPU SRAM to slower DRAM. Your users don’t care about your model’s 1M context window if the response takes 30 seconds to start.
This isn’t a hardware problem that next year’s GPUs will solve — it’s an information architecture problem. The question isn’t “how much can I fit?” but “what should I include?”
In this post, we’ll build five strategies for managing context, each one a step up in sophistication. By the end, you’ll have a decision framework — and working code — for choosing the right approach for any situation.
Try It: Lost in the Middle Explorer
Drag the slider to place a “target fact” at different positions in the context window. Notice how retrieval probability drops in the middle — especially for long contexts.
Strategy 1: Smart Truncation
Truncation gets a bad rap. Most developers think of it as “chop off the end when you hit the limit” — and that is terrible. But priority-based truncation is surprisingly effective.
The idea: score each section of your document by relevance, boost sections near the beginning (context setup) and end (recency), then greedily pack the highest-scoring sections into your token budget. Crucially, we re-sort the kept sections by their original position so the LLM reads them in coherent order.
def smart_truncate(sections, max_tokens=4096, query=None):
"""Keep the most relevant sections within a token budget.
Each section: {"text": str, "position": float 0-1, "priority": float 0-1}
"""
scored = []
for sec in sections:
score = sec.get("priority", 0.5)
pos = sec["position"]
# Boost beginning (context setup) and end (recency)
if pos < 0.1:
score += 0.3
elif pos > 0.9:
score += 0.2
# Boost sections matching the query
if query:
query_words = set(query.lower().split())
text_words = set(sec["text"].lower().split())
overlap = len(query_words & text_words)
score += min(overlap * 0.1, 0.4)
scored.append((score, sec))
# Sort by score descending, greedily fill the budget
scored.sort(key=lambda x: -x[0])
kept, budget = [], max_tokens
for score, sec in scored:
tokens = int(len(sec["text"].split()) * 1.3) # ~1.3 tokens/word
if tokens <= budget:
kept.append(sec)
budget -= tokens
# Restore original order for coherent reading
kept.sort(key=lambda s: s["position"])
return "\n\n".join(s["text"] for s in kept)
The word-count approximation (words * 1.3) works for prototyping. In production, swap it for tiktoken or your provider’s tokenizer for exact counts.
When truncation is enough: single-document QA where the answer is near the beginning or end, chat histories where recent context matters most, and cost-constrained applications where every token counts. It’s fast, cheap, and often gets you 60%+ of the way there.
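To see the scoring in action, here is smart_truncate exercised on a toy four-section document (the function is repeated so the snippet runs standalone). The query boost should rescue the mid-document liability clause while the low-priority boilerplate gets cut:

```python
def smart_truncate(sections, max_tokens=4096, query=None):
    """Same scoring as above: priority + position boosts + query overlap."""
    scored = []
    for sec in sections:
        score = sec.get("priority", 0.5)
        if sec["position"] < 0.1:
            score += 0.3          # beginning: context setup
        elif sec["position"] > 0.9:
            score += 0.2          # end: recency
        if query:
            overlap = len(set(query.lower().split())
                          & set(sec["text"].lower().split()))
            score += min(overlap * 0.1, 0.4)
        scored.append((score, sec))
    scored.sort(key=lambda x: -x[0])
    kept, budget = [], max_tokens
    for score, sec in scored:
        tokens = int(len(sec["text"].split()) * 1.3)
        if tokens <= budget:
            kept.append(sec)
            budget -= tokens
    kept.sort(key=lambda s: s["position"])
    return "\n\n".join(s["text"] for s in kept)

sections = [
    {"text": "Agreement between Acme Corp and Beta LLC",
     "position": 0.0, "priority": 0.5},
    {"text": "Standard boilerplate about notices and headings",
     "position": 0.5, "priority": 0.1},
    {"text": "Liability is capped at total fees paid",
     "position": 0.6, "priority": 0.9},
    {"text": "Signatures and effective date of the agreement",
     "position": 1.0, "priority": 0.5},
]
result = smart_truncate(sections, max_tokens=28, query="liability cap")
print(result)  # boilerplate section is dropped first
```

With a 28-token budget the three highest-scoring sections survive, re-sorted back into reading order; the boilerplate, scoring lowest, is the one that falls out.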
Strategy 2: Chunk-and-Summarize
When you need the full document’s content but it won’t fit — compress it. The chunk-and-summarize pattern splits your document into overlapping windows, summarizes each one, and concatenates the summaries into a compressed context that fits your budget.
The overlap parameter is the key design decision. Too little overlap and you lose information at chunk boundaries. Too much and you waste tokens on redundant content. The sweet spot is 10–15% overlap.
from anthropic import AsyncAnthropic
client = AsyncAnthropic()
def chunk_document(text, chunk_size=2000, overlap=200):
"""Split text into overlapping chunks by word count."""
words = text.split()
chunks, start = [], 0
while start < len(words):
end = min(start + chunk_size, len(words))
chunks.append(" ".join(words[start:end]))
start += chunk_size - overlap
return chunks
async def summarize_chunk(chunk, prior_context=""):
"""Summarize one chunk, with optional context from prior summaries."""
response = await client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=500,
messages=[{
"role": "user",
"content": (
f"Summarize this text concisely. "
f"Preserve key facts, numbers, and names.\n"
f"Prior context: {prior_context[:200]}\n\n{chunk}"
)
}]
)
return response.content[0].text
async def chunk_and_summarize(document, chunk_size=2000, overlap=200):
"""Compress a long document by chunking and summarizing."""
chunks = chunk_document(document, chunk_size, overlap)
summaries = []
for i, chunk in enumerate(chunks):
context = summaries[-1] if summaries else ""
summary = await summarize_chunk(chunk, context)
summaries.append(summary)
compressed = "\n\n".join(summaries)
ratio = len(document.split()) / max(len(compressed.split()), 1)
print(f"Compressed {len(chunks)} chunks: {ratio:.1f}x reduction")
return compressed
Notice we pass the previous chunk’s summary as context when summarizing the next chunk. This creates a “rolling window” of continuity — each summary can reference entities introduced in the previous section without losing track of who “she” or “the defendant” refers to. This mirrors the chunking strategy in RAG from Scratch, but with summarization replacing retrieval.
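Before wiring in API calls, it is worth sanity-checking the chunk arithmetic. The same chunk_document (repeated so this snippet runs standalone) on a synthetic 5,000-word document shows exactly how the boundaries overlap:

```python
def chunk_document(text, chunk_size=2000, overlap=200):
    """Split text into overlapping chunks by word count."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        start += chunk_size - overlap
    return chunks

doc = " ".join(f"w{i}" for i in range(5000))
chunks = chunk_document(doc, chunk_size=2000, overlap=200)
print(len(chunks))  # 3 chunks: words 0-1999, 1800-3799, 3600-4999

# Consecutive chunks share exactly `overlap` words at the boundary
tail = chunks[0].split()[-200:]
head = chunks[1].split()[:200]
print(tail == head)  # True
```

The effective stride is chunk_size - overlap, so a 10% overlap costs roughly 11% more chunks in exchange for no lost boundary context.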
How much does compression help? Here’s a comparison on a simulated 50-page contract QA task:
| Approach | Tokens Used | Answer Accuracy | Why |
|---|---|---|---|
| First-4K truncation | 4,000 | ~45% | Misses clauses in later sections |
| Last-4K truncation | 4,000 | ~40% | Misses context setup and definitions |
| Smart truncation | 4,000 | ~62% | Keeps relevant sections but gaps remain |
| Chunk-and-summarize | 5,000 | ~78% | Full coverage, compressed |
| Full document (200K window) | 65,000 | ~85% | Best accuracy, 13× the cost |
Chunk-and-summarize gets you to 78% accuracy at a fraction of the cost. The 7-point gap to full-document is the price of compression — specific numerical details and exact quotes get smoothed out in summaries.
Strategy 3: Map-Reduce for Cross-Document Reasoning
The first two strategies work on a single document. But what happens when you need to reason across multiple documents — comparing ten earnings reports, finding contradictions between legal filings, or aggregating survey results?
Enter map-reduce, borrowed from distributed computing. The map phase processes each document independently with the same question (and runs in parallel). The reduce phase synthesizes per-document extractions into a final answer.
import asyncio
from anthropic import AsyncAnthropic
client = AsyncAnthropic()
async def map_document(doc, question, doc_id):
"""Map phase: extract relevant info from one document."""
response = await client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1000,
messages=[{
"role": "user",
"content": (
f"Document {doc_id}:\n{doc}\n\n"
f"Question: {question}\n\n"
f"Extract all relevant facts, numbers, and quotes."
)
}]
)
return {"doc_id": doc_id, "extraction": response.content[0].text}
async def reduce_results(extractions, question):
"""Reduce phase: synthesize extractions into a final answer."""
combined = "\n\n".join(
f"--- Document {e['doc_id']} ---\n{e['extraction']}"
for e in extractions
)
response = await client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=2000,
messages=[{
"role": "user",
"content": (
f"Extractions from {len(extractions)} documents:\n\n"
f"{combined}\n\nQuestion: {question}\n\n"
f"Synthesize a comprehensive answer. "
f"Note contradictions. Cite document numbers."
)
}]
)
return response.content[0].text
async def map_reduce(documents, question):
"""Process multiple documents with parallel map, then reduce."""
map_tasks = [
map_document(doc, question, i + 1)
for i, doc in enumerate(documents)
]
extractions = await asyncio.gather(*map_tasks)
return await reduce_results(extractions, question)
The magic is in the parallelism. Ten documents at 50K tokens each would take ~80 seconds sequentially. With asyncio.gather, the map phase runs all ten concurrently, dropping wall-clock time to ~8 seconds. The reduce call adds another ~3 seconds. That’s a 7× speedup for free. See Concurrency Patterns for LLM APIs for more on async parallelization, and Batch Processing with LLMs for scaling this to thousands of documents.
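You can observe the gather effect without spending tokens by swapping the API call for asyncio.sleep (a stand-in for map_document, not the real thing):

```python
import asyncio
import time

async def fake_llm_call(doc_id, delay=0.1):
    """Stand-in for an API call: sleeps instead of hitting a model."""
    await asyncio.sleep(delay)
    return f"extraction-{doc_id}"

async def run_map_phase(n_docs=10):
    """Launch all map calls concurrently and time the wall clock."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *[fake_llm_call(i) for i in range(n_docs)]
    )
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(run_map_phase())
print(f"{len(results)} concurrent calls finished in {elapsed:.2f}s")
```

Ten 0.1-second calls complete in roughly 0.1 seconds of wall-clock time, not 1.0; the same structure is what makes the real map phase scale with the slowest document rather than the sum of all documents.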
Map-reduce shines for: cross-document comparison (“how do these 10 contracts differ on liability?”), contradiction detection between sources, and aggregating statistics from multiple reports.
Strategy 4: Hierarchical Summarization
What if your document is so long that even chunk-and-summarize produces too much output? A 500K-token document split into 2K-word chunks gives you 250 summaries — those summaries alone might total 50K tokens.
Hierarchical summarization applies the compression recursively: summarize the summaries, then summarize those summaries, until the result fits your target budget. Think of it as building a tree where each level compresses the text by a large factor (5–20× in the example below):
| Level | Tokens | Chunks | Description |
|---|---|---|---|
| 0 (raw) | 500,000 | 100 | Original document |
| 1 | 25,000 | 20 | First-pass summaries |
| 2 | 5,000 | 4 | Summary-of-summaries |
| 3 | 1,500 | 1 | Executive summary |
async def hierarchical_summarize(text, target_tokens=1500, level=0):
"""Recursively summarize until text fits the token budget."""
current_tokens = int(len(text.split()) * 1.3)
if current_tokens <= target_tokens:
print(f" Level {level}: {current_tokens} tokens fits budget")
return text
chunks = chunk_document(text, chunk_size=3000, overlap=300)
print(f" Level {level}: {current_tokens} tokens in {len(chunks)} chunks")
# Summarize all chunks in parallel
summaries = await asyncio.gather(*[
summarize_chunk(chunk) for chunk in chunks
])
compressed = "\n\n".join(summaries)
new_tokens = int(len(compressed.split()) * 1.3)
print(f" Level {level}: compressed to {new_tokens} tokens "
f"({current_tokens / max(new_tokens, 1):.1f}x)")
# Recurse until it fits
return await hierarchical_summarize(
compressed, target_tokens, level + 1
)
Each level trades detail for breadth. Level 1 preserves specific facts and numbers. Level 3 preserves only themes and conclusions. Choose your depth based on the task: factual QA needs Level 1; thematic analysis can survive Level 3.
The information loss tradeoff: hierarchical summarization is inherently lossy. Every compression level acts like a low-pass filter — fine-grained details vanish first, broad patterns persist. This makes it excellent for executive overviews and theme extraction across large corpora, but poor for tasks requiring specific facts buried deep in the original text.
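The number of passes follows directly from the per-level compression ratio. A small helper makes the arithmetic explicit (the 5× default is an assumption; real ratios depend on chunk size and the max_tokens given to the summarizer):

```python
import math

def levels_needed(total_tokens, target_tokens, compression_per_level=5):
    """Summarization passes required until the text fits the budget."""
    if total_tokens <= target_tokens:
        return 0  # already fits, no compression needed
    ratio = total_tokens / target_tokens
    return math.ceil(math.log(ratio) / math.log(compression_per_level))

print(levels_needed(500_000, 1_500))  # 4 passes at 5x per level
```

Each extra level is another full sweep of LLM calls over the (shrinking) text, so the cost of hierarchical summarization is dominated by the first pass over the raw document.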
Strategy 5: Agentic Context Management
Every strategy so far preprocesses the entire document before asking the LLM to reason. But what if the LLM could decide what to read?
The agentic approach gives the model two tools — search and read_section — plus the document’s table of contents (always in context). Instead of stuffing everything into the prompt, the agent navigates the document selectively, reading only the sections it needs. This uses 10–20× fewer input tokens than full-document stuffing, at the cost of 3–5 LLM round trips.
from anthropic import Anthropic
client = Anthropic()
TOOLS = [
{
"name": "search",
"description": "Search the document for relevant sections.",
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"}
},
"required": ["query"]
}
},
{
"name": "read_section",
"description": "Read the full text of a section by its heading.",
"input_schema": {
"type": "object",
"properties": {
"heading": {"type": "string", "description": "Section heading"}
},
"required": ["heading"]
}
}
]
def context_agent(doc_index, search_fn, read_fn, question, max_reads=5):
"""Navigate a document with tools to answer a question."""
messages = [{
"role": "user",
"content": (
f"Document outline:\n{doc_index}\n\n"
f"Question: {question}\n\n"
f"Search and read sections to answer this. "
f"You have {max_reads} reads -- be selective."
)
}]
    reads_used = turns = 0
    # Cap total turns so a search-only loop can't spin forever
    while reads_used < max_reads and turns < 2 * max_reads:
        turns += 1
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1500,
tools=TOOLS,
messages=messages
)
# If the model stopped generating (no more tool calls), return
if response.stop_reason == "end_turn":
return response.content[0].text
# Append assistant turn once, then collect all tool results
messages.append({"role": "assistant", "content": response.content})
tool_results = []
for block in response.content:
if block.type == "tool_use":
if block.name == "search":
result = search_fn(block.input["query"])
else:
result = read_fn(block.input["heading"])
reads_used += 1
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": result
})
messages.append({"role": "user", "content": tool_results})
return "Reached read limit without a final answer."
This pattern connects directly to the agent loop from Building AI Agents from Scratch and the tool definitions from LLM Function Calling. The agent reads selectively, accumulates findings in its conversation history, and synthesizes an answer only from the sections it actually read.
When to go agentic: complex reasoning that requires combining information from unpredictable sections, multi-hop questions (“what was the revenue growth in the quarter after the CEO change?”), and open-ended research tasks where you can’t predict which sections matter ahead of time.
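The context_agent above assumes search_fn and read_fn exist. Here is one minimal way to back them with a heading-keyed index and keyword matching (a sketch; a production system would typically use embedding search instead, as in RAG from Scratch):

```python
def build_index(document):
    """Split a markdown-ish document into a {heading: body} map."""
    sections, heading, lines = {}, "Preamble", []
    for line in document.splitlines():
        if line.startswith("#"):
            sections[heading] = "\n".join(lines)
            heading, lines = line.lstrip("# ").strip(), []
        else:
            lines.append(line)
    sections[heading] = "\n".join(lines)
    return sections

def make_tools(sections):
    """Return (search_fn, read_fn) closures over the section map."""
    def search_fn(query):
        words = set(query.lower().split())
        hits = [
            h for h, body in sections.items()
            if words & set((h + " " + body).lower().split())
        ]
        return "Matching sections: " + (", ".join(hits) or "none")

    def read_fn(heading):
        return sections.get(heading, f"No section titled '{heading}'.")

    return search_fn, read_fn

doc = ("# Fees\nLiability is capped at fees paid.\n"
       "# Term\nThe term is two years.")
search_fn, read_fn = make_tools(build_index(doc))
print(search_fn("liability cap"))
print(read_fn("Term"))
```

The document outline passed to context_agent would then just be "\n".join(build_index(doc).keys()), giving the agent a table of contents to navigate from.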
Choosing Your Strategy
Five strategies, each with different tradeoffs. Here’s the decision matrix:
| Strategy | Max Size | Best For | Cost | Latency | Accuracy |
|---|---|---|---|---|---|
| Smart Truncation | <100K | Single-doc, answer near edges | Very Low | Very Fast | Medium |
| Chunk-Summarize | 100K–500K | Full-doc comprehension | Medium | Medium | Good |
| Map-Reduce | Multi-doc | Cross-doc comparison | Medium-High | Fast (parallel) | Good |
| Hierarchical | 500K–5M | Very long docs, overviews | High | Slow | Medium |
| Agentic | Any size | Complex reasoning, research | Low per query | Variable | Highest |
And here are the rules of thumb, encoded as a function you can drop into any project:
def select_strategy(total_tokens, num_documents=1,
task_type="qa", latency_budget_ms=10000):
"""Pick the best context strategy for the situation."""
# Small enough to fit directly? Just send it.
if total_tokens < 50_000 and num_documents == 1:
return "direct", "Fits in context, no processing needed"
# Multiple documents always benefit from map-reduce
if num_documents > 1:
return "map_reduce", f"{num_documents} docs: parallel map then reduce"
# Single very large document
if total_tokens > 500_000:
if task_type == "overview":
return "hierarchical", "Huge doc + overview: recursive summarization"
return "agentic", "Huge doc + specific task: selective reading"
    # Larger single documents (100K-500K)
if total_tokens > 100_000:
if latency_budget_ms < 5000:
return "smart_truncation", "Tight latency: fast truncation"
return "chunk_summarize", "Mid-size doc: chunk and summarize"
# 50K-100K range
if task_type in ("comparison", "analysis"):
return "chunk_summarize", "Analysis needs full coverage"
return "smart_truncation", "Default for moderate single docs"
# Example usage
strategy, reason = select_strategy(
total_tokens=250_000, num_documents=1, task_type="qa"
)
print(f"Strategy: {strategy}")
print(f"Reason: {reason}")
# Output: Strategy: chunk_summarize
# Reason: Mid-size doc: chunk and summarize
For model selection within each strategy, consider using a smaller, faster model (like Claude Haiku or GPT-4o-mini) for the map and summarize phases where you’re doing extraction rather than complex reasoning, and reserving your most capable model for the final synthesis step. See LLM Model Routing for more on this pattern.
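One way to encode that routing is a simple phase-to-model map. The model IDs below are placeholders; substitute whatever your provider currently offers:

```python
# Hypothetical model IDs -- swap in your provider's current names.
PHASE_MODELS = {
    "map": "claude-haiku-latest",        # cheap extraction over many chunks
    "summarize": "claude-haiku-latest",  # compression, not deep reasoning
    "reduce": "claude-sonnet-latest",    # final synthesis gets the big model
    "agent": "claude-sonnet-latest",     # tool use benefits from capability
}

def model_for(phase):
    """Pick a model by pipeline phase, defaulting to the strongest."""
    return PHASE_MODELS.get(phase, "claude-sonnet-latest")
```

Threading model_for("map") into map_document and model_for("reduce") into reduce_results lets you tune the cost/quality split per phase without touching the pipeline logic.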
Try It: Context Window Strategy Visualizer
Select a strategy to see how it processes a document. Each colored block is a section — darker blocks are more relevant to the query. Compare how each strategy selects and compresses.
Putting It All Together
Context window management is less about clever algorithms and more about asking the right question: what does the LLM actually need to see? In the contract QA comparison above, the 65K-token document stuffed verbatim into the prompt scored 85% accuracy, but smart truncation at 4K tokens got 62% at roughly 1/16th the cost, and chunk-and-summarize at 5K tokens reached 78% at 1/13th the cost.
Start simple. Try truncation first. Move to chunk-and-summarize when you need full coverage. Reach for map-reduce when documents multiply. Escalate to hierarchical summarization for truly massive inputs. And deploy agentic context management when the task demands surgical precision.
The best strategy is the one that matches your constraints — and now you have code for all five. The context window is a budget, not a target. Spend it wisely.
Key takeaway: The cost of context isn’t just tokens — it’s attention. More context doesn’t mean better answers. Strategic context means better answers at lower cost. The “lost in the middle” effect proves that what you put in matters more than how much you put in.
References & Further Reading
- Liu et al. — “Lost in the Middle: How Language Models Use Long Contexts” — the foundational study on positional attention bias in LLMs (2023, TACL 2024)
- Found in the Middle — Plug-and-Play Positional Encoding — a positional encoding fix that mitigates the lost-in-the-middle effect (2024)
- Chroma Research — “Context Rot” — how increasing input tokens degrades LLM performance across tasks
- Anthropic — Prompt Caching — reduce costs for repeated context prefixes when using large contexts with Claude
- DadOps — RAG from Scratch — chunking strategies that complement the compression approaches here
- DadOps — KV Cache from Scratch — the hardware foundation for why long context is expensive
- DadOps — Building AI Agents — the agent loop pattern used in Strategy 5