Context Window Strategies: Fitting the World Into Your LLM’s Memory
The Context Window Problem
Context windows have grown 250× in three years. GPT-3.5 launched with 4K tokens. Today Claude offers 200K, Gemini 2.5 Pro handles 1M, and Llama 4 Scout claims 10M. Problem solved, right?
Not even close. Three forces conspire to make context management a permanent engineering challenge:
Cost scales linearly. A 200K-token prompt costs 50× more than a 4K-token prompt. Some providers charge a premium multiplier once you exceed certain thresholds. Stuffing every document into the context window is the “SELECT *” of LLM engineering — it works until the bill arrives.
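The linear scaling is easy to sanity-check. A minimal sketch, using a hypothetical flat input price (real per-token prices vary by provider, tier, and date):

```python
# Illustrative only: per-million-token input prices vary by provider and
# change often -- treat this number as a placeholder, not current pricing.
PRICE_PER_MTOK = 3.00  # USD per million input tokens (hypothetical)

def prompt_cost(prompt_tokens, price_per_mtok=PRICE_PER_MTOK):
    """Input-token cost of a single request, assuming linear pricing."""
    return prompt_tokens / 1_000_000 * price_per_mtok

small = prompt_cost(4_000)     # 4K-token prompt
large = prompt_cost(200_000)   # 200K-token prompt
print(f"4K prompt:   ${small:.4f}")
print(f"200K prompt: ${large:.4f}  ({large / small:.0f}x)")
```

At any flat rate, the 200K prompt costs exactly 50× the 4K prompt; threshold surcharges only make the gap wider.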
The “Lost in the Middle” effect. Liu et al. (2023) demonstrated that LLMs pay strong attention to the beginning and end of the context but suffer 30%+ accuracy drops for information buried in the middle. Recent research suggests this isn’t a bug — it’s an emergent property of how positional encodings like RoPE handle long-range dependencies. Bigger windows don’t fix this; they can make it worse.
Latency grows with context. Time-to-first-token scales with prompt length. KV cache memory grows linearly with sequence length, eventually spilling from fast GPU SRAM to slower DRAM. Your users don’t care about your model’s 1M context window if the response takes 30 seconds to start.
This isn’t a hardware problem that next year’s GPUs will solve — it’s an information architecture problem. The question isn’t “how much can I fit?” but “what should I include?”
In this post, we’ll build five strategies for managing context, each one a step up in sophistication. By the end, you’ll have a decision framework — and working code — for choosing the right approach for any situation.
Try It: Lost in the Middle Explorer
Drag the slider to place a “target fact” at different positions in the context window. Notice how retrieval probability drops in the middle — especially for long contexts.
Strategy 1: Smart Truncation
Truncation gets a bad rap. Most developers think of it as “chop off the end when you hit the limit” — and that is terrible. But priority-based truncation is surprisingly effective.
The idea: score each section of your document by relevance, boost sections near the beginning (context setup) and end (recency), then greedily pack the highest-scoring sections into your token budget. Crucially, we re-sort the kept sections by their original position so the LLM reads them in coherent order.
def smart_truncate(sections, max_tokens=4096, query=None):
"""Keep the most relevant sections within a token budget.
Each section: {"text": str, "position": float 0-1, "priority": float 0-1}
"""
scored = []
for sec in sections:
score = sec.get("priority", 0.5)
pos = sec["position"]
# Boost beginning (context setup) and end (recency)
if pos < 0.1:
score += 0.3
elif pos > 0.9:
score += 0.2
# Boost sections matching the query
if query:
query_words = set(query.lower().split())
text_words = set(sec["text"].lower().split())
overlap = len(query_words & text_words)
score += min(overlap * 0.1, 0.4)
scored.append((score, sec))
# Sort by score descending, greedily fill the budget
scored.sort(key=lambda x: -x[0])
kept, budget = [], max_tokens
for score, sec in scored:
tokens = int(len(sec["text"].split()) * 1.3) # ~1.3 tokens/word
if tokens <= budget:
kept.append(sec)
budget -= tokens
# Restore original order for coherent reading
kept.sort(key=lambda s: s["position"])
return "\n\n".join(s["text"] for s in kept)
The word-count approximation (words * 1.3) works for prototyping. In production, swap it for tiktoken or your provider’s tokenizer for exact counts.
When truncation is enough: single-document QA where the answer is near the beginning or end, chat histories where recent context matters most, and cost-constrained applications where every token counts. It’s fast, cheap, and often gets you 60%+ of the way there.
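To see the scoring in action, here is smart_truncate exercised on a toy four-section document (the function is repeated so the snippet runs standalone). The query boost should rescue the mid-document liability clause while the low-priority boilerplate gets cut:

```python
def smart_truncate(sections, max_tokens=4096, query=None):
    """Same scoring as above: priority + position boosts + query overlap."""
    scored = []
    for sec in sections:
        score = sec.get("priority", 0.5)
        if sec["position"] < 0.1:
            score += 0.3          # beginning: context setup
        elif sec["position"] > 0.9:
            score += 0.2          # end: recency
        if query:
            overlap = len(set(query.lower().split())
                          & set(sec["text"].lower().split()))
            score += min(overlap * 0.1, 0.4)
        scored.append((score, sec))
    scored.sort(key=lambda x: -x[0])
    kept, budget = [], max_tokens
    for score, sec in scored:
        tokens = int(len(sec["text"].split()) * 1.3)
        if tokens <= budget:
            kept.append(sec)
            budget -= tokens
    kept.sort(key=lambda s: s["position"])
    return "\n\n".join(s["text"] for s in kept)

sections = [
    {"text": "Agreement between Acme Corp and Beta LLC",
     "position": 0.0, "priority": 0.5},
    {"text": "Standard boilerplate about notices and headings",
     "position": 0.5, "priority": 0.1},
    {"text": "Liability is capped at total fees paid",
     "position": 0.6, "priority": 0.9},
    {"text": "Signatures and effective date of the agreement",
     "position": 1.0, "priority": 0.5},
]
result = smart_truncate(sections, max_tokens=28, query="liability cap")
print(result)  # boilerplate section is dropped first
```

With a 28-token budget the three highest-scoring sections survive, re-sorted back into reading order; the boilerplate, scoring lowest, is the one that falls out.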
Strategy 2: Chunk-and-Summarize
When you need the full document’s content but it won’t fit — compress it. The chunk-and-summarize pattern splits your document into overlapping windows, summarizes each one, and concatenates the summaries into a compressed context that fits your budget.
The overlap parameter is the key design decision. Too little overlap and you lose information at chunk boundaries. Too much and you waste tokens on redundant content. The sweet spot is 10–15% overlap.
from anthropic import AsyncAnthropic
client = AsyncAnthropic()
def chunk_document(text, chunk_size=2000, overlap=200):
"""Split text into overlapping chunks by word count."""
words = text.split()
chunks, start = [], 0
while start < len(words):
end = min(start + chunk_size, len(words))
chunks.append(" ".join(words[start:end]))
start += chunk_size - overlap
return chunks
async def summarize_chunk(chunk, prior_context=""):
"""Summarize one chunk, with optional context from prior summaries."""
response = await client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=500,
messages=[{
"role": "user",
"content": (
f"Summarize this text concisely. "
f"Preserve key facts, numbers, and names.\n"
f"Prior context: {prior_context[:200]}\n\n{chunk}"
)
}]
)
return response.content[0].text
async def chunk_and_summarize(document, chunk_size=2000, overlap=200):
"""Compress a long document by chunking and summarizing."""
chunks = chunk_document(document, chunk_size, overlap)
summaries = []
for i, chunk in enumerate(chunks):
context = summaries[-1] if summaries else ""
summary = await summarize_chunk(chunk, context)
summaries.append(summary)
compressed = "\n\n".join(summaries)
ratio = len(document.split()) / max(len(compressed.split()), 1)
print(f"Compressed {len(chunks)} chunks: {ratio:.1f}x reduction")
return compressed
Notice we pass the previous chunk’s summary as context when summarizing the next chunk. This creates a “rolling window” of continuity — each summary can reference entities introduced in the previous section without losing track of who “she” or “the defendant” refers to. This mirrors the chunking strategy in RAG from Scratch, but with summarization replacing retrieval.
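Before wiring in API calls, it is worth sanity-checking the chunk arithmetic. The same chunk_document (repeated so this snippet runs standalone) on a synthetic 5,000-word document shows exactly how the boundaries overlap:

```python
def chunk_document(text, chunk_size=2000, overlap=200):
    """Split text into overlapping chunks by word count."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        start += chunk_size - overlap
    return chunks

doc = " ".join(f"w{i}" for i in range(5000))
chunks = chunk_document(doc, chunk_size=2000, overlap=200)
print(len(chunks))  # 3 chunks: words 0-1999, 1800-3799, 3600-4999

# Consecutive chunks share exactly `overlap` words at the boundary
tail = chunks[0].split()[-200:]
head = chunks[1].split()[:200]
print(tail == head)  # True
```

The effective stride is chunk_size - overlap, so a 10% overlap costs roughly 11% more chunks in exchange for no lost boundary context.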
How much does compression help? Here’s a comparison on a simulated 50-page contract QA task:
| Approach | Tokens Used | Answer Accuracy | Why |
|---|---|---|---|
| First-4K truncation | 4,000 | ~45% | Misses clauses in later sections |
| Last-4K truncation | 4,000 | ~40% | Misses context setup and definitions |
| Smart truncation | 4,000 | ~62% | Keeps relevant sections but gaps remain |
| Chunk-and-summarize | 5,000 | ~78% | Full coverage, compressed |
| Full document (200K window) | 65,000 | ~85% | Best accuracy, 13× the cost |
Chunk-and-summarize gets you to 78% accuracy at a fraction of the cost. The 7-point gap to full-document is the price of compression — specific numerical details and exact quotes get smoothed out in summaries.
Strategy 3: Map-Reduce for Cross-Document Reasoning
The first two strategies work on a single document. But what happens when you need to reason across multiple documents — comparing ten earnings reports, finding contradictions between legal filings, or aggregating survey results?
Enter map-reduce, borrowed from distributed computing. The map phase processes each document independently with the same question (and runs in parallel). The reduce phase synthesizes per-document extractions into a final answer.
import asyncio
from anthropic import AsyncAnthropic
client = AsyncAnthropic()
async def map_document(doc, question, doc_id):
"""Map phase: extract relevant info from one document."""
response = await client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1000,
messages=[{
"role": "user",
"content": (
f"Document {doc_id}:\n{doc}\n\n"
f"Question: {question}\n\n"
f"Extract all relevant facts, numbers, and quotes."
)
}]
)
return {"doc_id": doc_id, "extraction": response.content[0].text}
async def reduce_results(extractions, question):
"""Reduce phase: synthesize extractions into a final answer."""
combined = "\n\n".join(
f"--- Document {e['doc_id']} ---\n{e['extraction']}"
for e in extractions
)
response = await client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=2000,
messages=[{
"role": "user",
"content": (
f"Extractions from {len(extractions)} documents:\n\n"
f"{combined}\n\nQuestion: {question}\n\n"
f"Synthesize a comprehensive answer. "
f"Note contradictions. Cite document numbers."
)
}]
)
return response.content[0].text
async def map_reduce(documents, question):
"""Process multiple documents with parallel map, then reduce."""
map_tasks = [
map_document(doc, question, i + 1)
for i, doc in enumerate(documents)
]
extractions = await asyncio.gather(*map_tasks)
return await reduce_results(extractions, question)
The magic is in the parallelism. Ten documents at 50K tokens each would take ~80 seconds sequentially. With asyncio.gather, the map phase runs all ten concurrently, dropping wall-clock time to ~8 seconds. The reduce call adds another ~3 seconds. That’s a 7× speedup for free. See Concurrency Patterns for LLM APIs for more on async parallelization, and Batch Processing with LLMs for scaling this to thousands of documents.
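You can observe the gather effect without spending tokens by swapping the API call for asyncio.sleep (a stand-in for map_document, not the real thing):

```python
import asyncio
import time

async def fake_llm_call(doc_id, delay=0.1):
    """Stand-in for an API call: sleeps instead of hitting a model."""
    await asyncio.sleep(delay)
    return f"extraction-{doc_id}"

async def run_map_phase(n_docs=10):
    """Launch all map calls concurrently and time the wall clock."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *[fake_llm_call(i) for i in range(n_docs)]
    )
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(run_map_phase())
print(f"{len(results)} concurrent calls finished in {elapsed:.2f}s")
```

Ten 0.1-second calls complete in roughly 0.1 seconds of wall-clock time, not 1.0; the same structure is what makes the real map phase scale with the slowest document rather than the sum of all documents.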
Map-reduce shines for: cross-document comparison (“how do these 10 contracts differ on liability?”), contradiction detection between sources, and aggregating statistics from multiple reports.
Strategy 4: Hierarchical Summarization
What if your document is so long that even chunk-and-summarize produces too much output? A 500K-token document split into 2K-word chunks gives you 250 summaries — those summaries alone might total 50K tokens.
Hierarchical summarization applies the compression recursively: summarize the summaries, then summarize those summaries, until the result fits your target budget. Think of it as building a tree where each level compresses the text by a large factor (5–20× in the example below):
| Level | Tokens | Chunks | Description |
|---|---|---|---|
| 0 (raw) | 500,000 | 100 | Original document |
| 1 | 25,000 | 20 | First-pass summaries |
| 2 | 5,000 | 4 | Summary-of-summaries |
| 3 | 1,500 | 1 | Executive summary |
async def hierarchical_summarize(text, target_tokens=1500, level=0):
"""Recursively summarize until text fits the token budget."""
current_tokens = int(len(text.split()) * 1.3)
if current_tokens <= target_tokens:
print(f" Level {level}: {current_tokens} tokens fits budget")
return text
chunks = chunk_document(text, chunk_size=3000, overlap=300)
print(f" Level {level}: {current_tokens} tokens in {len(chunks)} chunks")
# Summarize all chunks in parallel
summaries = await asyncio.gather(*[
summarize_chunk(chunk) for chunk in chunks
])
compressed = "\n\n".join(summaries)
new_tokens = int(len(compressed.split()) * 1.3)
print(f" Level {level}: compressed to {new_tokens} tokens "
f"({current_tokens / max(new_tokens, 1):.1f}x)")
# Recurse until it fits
return await hierarchical_summarize(
compressed, target_tokens, level + 1
)
Each level trades detail for breadth. Level 1 preserves specific facts and numbers. Level 3 preserves only themes and conclusions. Choose your depth based on the task: factual QA needs Level 1; thematic analysis can survive Level 3.
The information loss tradeoff: hierarchical summarization is inherently lossy. Every compression level acts like a low-pass filter — fine-grained details vanish first, broad patterns persist. This makes it excellent for executive overviews and theme extraction across large corpora, but poor for tasks requiring specific facts buried deep in the original text.
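The number of passes follows directly from the per-level compression ratio. A small helper makes the arithmetic explicit (the 5× default is an assumption; real ratios depend on chunk size and the max_tokens given to the summarizer):

```python
import math

def levels_needed(total_tokens, target_tokens, compression_per_level=5):
    """Summarization passes required until the text fits the budget."""
    if total_tokens <= target_tokens:
        return 0  # already fits, no compression needed
    ratio = total_tokens / target_tokens
    return math.ceil(math.log(ratio) / math.log(compression_per_level))

print(levels_needed(500_000, 1_500))  # 4 passes at 5x per level
```

Each extra level is another full sweep of LLM calls over the (shrinking) text, so the cost of hierarchical summarization is dominated by the first pass over the raw document.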
Strategy 5: Agentic Context Management
Every strategy so far preprocesses the entire document before asking the LLM to reason. But what if the LLM could decide what to read?
The agentic approach gives the model two tools — search and read_section — plus the document’s table of contents (always in context). Instead of stuffing everything into the prompt, the agent navigates the document selectively, reading only the sections it needs. This uses 10–20× fewer input tokens than full-document stuffing, at the cost of 3–5 LLM round trips.
from anthropic import Anthropic
client = Anthropic()
TOOLS = [
{
"name": "search",
"description": "Search the document for relevant sections.",
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string", "description": "Search query"}
},
"required": ["query"]
}
},
{
"name": "read_section",
"description": "Read the full text of a section by its heading.",
"input_schema": {
"type": "object",
"properties": {
"heading": {"type": "string", "description": "Section heading"}
},
"required": ["heading"]
}
}
]
def context_agent(doc_index, search_fn, read_fn, question, max_reads=5):
"""Navigate a document with tools to answer a question."""
messages = [{
"role": "user",
"content": (
f"Document outline:\n{doc_index}\n\n"
f"Question: {question}\n\n"
f"Search and read sections to answer this. "
f"You have {max_reads} reads -- be selective."
)
}]
    reads_used = turns = 0
    # Cap total turns so a search-only loop can't spin forever
    while reads_used < max_reads and turns < 2 * max_reads:
        turns += 1
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1500,
tools=TOOLS,
messages=messages
)
# If the model stopped generating (no more tool calls), return
if response.stop_reason == "end_turn":
return response.content[0].text
# Append assistant turn once, then collect all tool results
messages.append({"role": "assistant", "content": response.content})
tool_results = []
for block in response.content:
if block.type == "tool_use":
if block.name == "search":
result = search_fn(block.input["query"])
else:
result = read_fn(block.input["heading"])
reads_used += 1
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": result
})
messages.append({"role": "user", "content": tool_results})
return "Reached read limit without a final answer."
This pattern connects directly to the agent loop from Building AI Agents from Scratch and the tool definitions from LLM Function Calling. The agent reads selectively, accumulates findings in its conversation history, and synthesizes an answer only from the sections it actually read.
When to go agentic: complex reasoning that requires combining information from unpredictable sections, multi-hop questions (“what was the revenue growth in the quarter after the CEO change?”), and open-ended research tasks where you can’t predict which sections matter ahead of time.
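The context_agent above assumes search_fn and read_fn exist. Here is one minimal way to back them with a heading-keyed index and keyword matching (a sketch; a production system would typically use embedding search instead, as in RAG from Scratch):

```python
def build_index(document):
    """Split a markdown-ish document into a {heading: body} map."""
    sections, heading, lines = {}, "Preamble", []
    for line in document.splitlines():
        if line.startswith("#"):
            sections[heading] = "\n".join(lines)
            heading, lines = line.lstrip("# ").strip(), []
        else:
            lines.append(line)
    sections[heading] = "\n".join(lines)
    return sections

def make_tools(sections):
    """Return (search_fn, read_fn) closures over the section map."""
    def search_fn(query):
        words = set(query.lower().split())
        hits = [
            h for h, body in sections.items()
            if words & set((h + " " + body).lower().split())
        ]
        return "Matching sections: " + (", ".join(hits) or "none")

    def read_fn(heading):
        return sections.get(heading, f"No section titled '{heading}'.")

    return search_fn, read_fn

doc = ("# Fees\nLiability is capped at fees paid.\n"
       "# Term\nThe term is two years.")
search_fn, read_fn = make_tools(build_index(doc))
print(search_fn("liability cap"))
print(read_fn("Term"))
```

The document outline passed to context_agent would then just be "\n".join(build_index(doc).keys()), giving the agent a table of contents to navigate from.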
Choosing Your Strategy
Five strategies, each with different tradeoffs. Here’s the decision matrix:
| Strategy | Max Size | Best For | Cost | Latency | Accuracy |
|---|---|---|---|---|---|
| Smart Truncation | <100K | Single-doc, answer near edges | Very Low | Very Fast | Medium |
| Chunk-Summarize | 100K–500K | Full-doc comprehension | Medium | Medium | Good |
| Map-Reduce | Multi-doc | Cross-doc comparison | Medium-High | Fast (parallel) | Good |
| Hierarchical | 500K–5M | Very long docs, overviews | High | Slow | Medium |
| Agentic | Any size | Complex reasoning, research | Low per query | Variable | Highest |
And here are the rules of thumb, encoded as a function you can drop into any project:
def select_strategy(total_tokens, num_documents=1,
task_type="qa", latency_budget_ms=10000):
"""Pick the best context strategy for the situation."""
# Small enough to fit directly? Just send it.
if total_tokens < 50_000 and num_documents == 1:
return "direct", "Fits in context, no processing needed"
# Multiple documents always benefit from map-reduce
if num_documents > 1:
return "map_reduce", f"{num_documents} docs: parallel map then reduce"
# Single very large document
if total_tokens > 500_000:
if task_type == "overview":
return "hierarchical", "Huge doc + overview: recursive summarization"
return "agentic", "Huge doc + specific task: selective reading"
    # Larger single documents (100K-500K)
if total_tokens > 100_000:
if latency_budget_ms < 5000:
return "smart_truncation", "Tight latency: fast truncation"
return "chunk_summarize", "Mid-size doc: chunk and summarize"
# 50K-100K range
if task_type in ("comparison", "analysis"):
return "chunk_summarize", "Analysis needs full coverage"
return "smart_truncation", "Default for moderate single docs"
# Example usage
strategy, reason = select_strategy(
total_tokens=250_000, num_documents=1, task_type="qa"
)
print(f"Strategy: {strategy}")
print(f"Reason: {reason}")
# Output: Strategy: chunk_summarize
# Reason: Mid-size doc: chunk and summarize
For model selection within each strategy, consider using a smaller, faster model (like Claude Haiku or GPT-4o-mini) for the map and summarize phases where you’re doing extraction rather than complex reasoning, and reserving your most capable model for the final synthesis step. See LLM Model Routing for more on this pattern.
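One way to encode that routing is a simple phase-to-model map. The model IDs below are placeholders; substitute whatever your provider currently offers:

```python
# Hypothetical model IDs -- swap in your provider's current names.
PHASE_MODELS = {
    "map": "claude-haiku-latest",        # cheap extraction over many chunks
    "summarize": "claude-haiku-latest",  # compression, not deep reasoning
    "reduce": "claude-sonnet-latest",    # final synthesis gets the big model
    "agent": "claude-sonnet-latest",     # tool use benefits from capability
}

def model_for(phase):
    """Pick a model by pipeline phase, defaulting to the strongest."""
    return PHASE_MODELS.get(phase, "claude-sonnet-latest")
```

Threading model_for("map") into map_document and model_for("reduce") into reduce_results lets you tune the cost/quality split per phase without touching the pipeline logic.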
Try It: Context Window Strategy Visualizer
Select a strategy to see how it processes a document. Each colored block is a section — darker blocks are more relevant to the query. Compare how each strategy selects and compresses.
Putting It All Together
Context window management is less about clever algorithms and more about asking the right question: what does the LLM actually need to see? In the contract QA comparison above, the 65K-token document stuffed verbatim into the prompt scored 85% accuracy, but smart truncation at 4K tokens got 62% at roughly 1/16th the cost, and chunk-and-summarize at 5K tokens reached 78% at 1/13th the cost.
Start simple. Try truncation first. Move to chunk-and-summarize when you need full coverage. Reach for map-reduce when documents multiply. Escalate to hierarchical summarization for truly massive inputs. And deploy agentic context management when the task demands surgical precision.
The best strategy is the one that matches your constraints — and now you have code for all five. The context window is a budget, not a target. Spend it wisely.
Key takeaway: The cost of context isn’t just tokens — it’s attention. More context doesn’t mean better answers. Strategic context means better answers at lower cost. The “lost in the middle” effect proves that what you put in matters more than how much you put in.
References & Further Reading
- Liu et al. — “Lost in the Middle: How Language Models Use Long Contexts” — the foundational study on positional attention bias in LLMs (2023, TACL 2024)
- Found in the Middle — Plug-and-Play Positional Encoding — a positional encoding fix that mitigates the lost-in-the-middle effect (2024)
- Chroma Research — “Context Rot” — how increasing input tokens degrades LLM performance across tasks
- Anthropic — Prompt Caching — reduce costs for repeated context prefixes when using large contexts with Claude
- DadOps — RAG from Scratch — chunking strategies that complement the compression approaches here
- DadOps — KV Cache from Scratch — the hardware foundation for why long context is expensive
- DadOps — Building AI Agents — the agent loop pattern used in Strategy 5