Why bigger context windows don’t fix context management

Build with Claude: From Prompts to Production Agents · Post 2 of 8

May 18, 2026

Earlier this year, Anthropic made the 1 million token context window generally available on Claude Sonnet 4.6 and Opus 4.6. A lot of developers heard the headline and concluded that context management was over. The window got five times bigger. Just dump everything in.

This is the wrong lesson. The right lesson is more interesting.

The 1M window doesn’t eliminate context management. It raises the stakes of doing it badly. When you could fit 200,000 tokens, dumping everything in was inefficient. When you can fit a million, dumping everything in is expensive, slow, and produces worse outputs than a smaller, better-curated context. The skill hasn’t gone away. It’s gotten more important, because the temptation to skip it is now overwhelming.

This post is about the shift from “what should I stuff into the prompt?” to “what should I let the model fetch when it needs to?” You’ll watch a real agent make those decisions, with every choice visible, on a document you already know - the Union Budget 2026-27 speech. The post is built around three experiments. Each one names what’s being tested, shows the numbers, and reads out what it means.

What you’ll take away from this post

If you only have five minutes, here’s what the post argues, with the evidence to follow:

Bigger context windows make the discipline of context management more important, not less. Models still get worse at finding information as the context grows, even when it fits.
Stuffing everything into the prompt produces unreliable outputs in a way you can’t see. The model arrives at slightly different answers each run, and you have no way to tell why.
The Model Context Protocol (MCP) lets the model decide what to fetch instead of you guessing in advance. The model’s decisions become visible, which makes them debuggable.
MCP is not automatically cheaper than stuffing. On focused questions it saves real money. On broad cross-cutting questions it can cost more. It is always more grounded and more traceable.
The architectural shift is from pushing context into the model to letting the model pull what it needs. Every later post in this series builds on this shift.

If you’d rather read the evidence, keep going. If you’d rather skip to the code, the repo is at github.com/gradtensor/build-with-claude and the tag is post-02.

What’s actually in a model’s context

Most developers think of “context” as the document you’re asking the model about. That’s one piece, and usually not the biggest one. A production system’s context is a stack of things, each competing for the same space.

Pull apart the next Claude API call your application makes and you’ll find:

The system prompt, with whatever framework you’re using to structure it (GCAO if you read Post 1). Cached after the first call if you set it up right.

Tool definitions, if the model has tools available. Each tool gets its name, description, and input schema declared up front. Six tools can easily cost you 1,500 tokens.

Few-shot examples, if you’re teaching the model a pattern. Two good examples often run 600 to 1,000 tokens.

Conversation history from prior turns. This is the silent killer. It grows monotonically. After ten turns of a real conversation, it can be the largest single component.

Retrieved or attached documents. The Budget speech alone is about 27,000 tokens. A multi-document corpus can be hundreds of thousands.

The current user message. Usually small. Sometimes the only thing developers think about.

The model’s own output, including any reasoning it writes before answering. Output tokens count against the same window.

Stack these up for a real analyst and you’re easily at 40,000 to 80,000 tokens per turn before you’ve done anything sophisticated. That fits comfortably in the 1M window. The problem isn’t fitting. The problem is what fits well, and what fits badly.
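
To make the stack concrete, here is a rough tally of one turn. It is a sketch, not a measurement: the tool, few-shot, and document figures come from the list above, while the system prompt, history, and user message sizes are hypothetical placeholders I picked for illustration.

```python
# Rough per-turn context budget. Figures marked "source" come from the list
# above; the rest are hypothetical placeholders for illustration only.
CONTEXT_STACK = {
    "system_prompt": 2_000,          # hypothetical
    "tool_definitions": 1_500,       # source: six tools
    "few_shot_examples": 800,        # source: 600 to 1,000 for two examples
    "conversation_history": 12_000,  # hypothetical, grows every turn
    "documents": 27_000,             # source: the Budget speech
    "user_message": 200,             # hypothetical
}


def summarise(stack: dict[str, int]) -> tuple[int, str]:
    """Return the total token count and the name of the largest component."""
    total = sum(stack.values())
    largest = max(stack, key=stack.get)
    return total, largest


total, largest = summarise(CONTEXT_STACK)
print(f"{total:,} tokens per turn; largest component: {largest}")
```

Swap in your own numbers. The useful part is seeing which component dominates, because that is the one worth managing first.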

Context rot is real, and it’s been measured

There’s a phenomenon researchers call “context rot” or “lost in the middle.” As the context gets larger, the model gets worse at finding and using information from inside it, particularly information that sits in the middle of the haystack. This isn’t a Claude-specific quirk. It shows up across long-context models generally.

Anthropic has done real work to mitigate this. Sonnet 4.6 scores 68.4 percent on the GraphWalks benchmark at 1M tokens, which is good. It’s also not 100. The model is meaningfully less reliable at retrieving specific facts from a 900,000-token context than from a 30,000-token one. This is true even when the fact is unambiguous and present in the text.

What this means for production systems: if your application stuffs everything available into every call, the model will probably find what it needs each time. But “probably” is the wrong word for production. You need “consistently,” and you need to be able to defend why the model found what it found. Neither is possible when the model is searching a haystack on every call.

So the question is not “will it fit?” The question is “should it be there?”

What MCP does, and why it’s different from RAG

The Model Context Protocol is Anthropic’s answer to the question “how do you give a model access to information without stuffing it all into the prompt?”

Conceptually, MCP is a client-server protocol. Your application is the MCP client. Things that hold information or expose capabilities are MCP servers. The model, through the client, can list available resources and tools, then call them on demand. The model decides what it needs and pulls it in, instead of you guessing in advance and stuffing it in.

This sounds like RAG. It is similar at the level of “the model gets information that wasn’t in the original prompt.” It is different in three important ways.

The unit of retrieval is semantic, not arbitrary. With RAG, you typically chunk documents into 500-token windows and let an embedding model decide which ones look relevant. With MCP, the unit is whatever the server author decided is meaningful - a Budget speech section, a calendar event, a Salesforce account record, a Jira ticket. The chunks have boundaries that mean something.

The model decides when to fetch. With pre-retrieved RAG, your application fetches before the model sees the prompt and the model is stuck with what you picked. With MCP, the model can issue a fetch as a tool call, which means it can decide it needs more information, or different information, after seeing the question.

The protocol is standardised. Once Claude knows how to speak MCP, any MCP server works. You don’t write new integration code for every data source. The server author writes the server once and every MCP-compliant client gets access.

What this means concretely. An MCP server is a small program - usually a single file - that exposes a set of resources (things to read) and tools (things to do) over a standard protocol. The server runs as a process on your machine or a remote host. The model, through a client your application provides, can list what the server offers and call into it on demand. For the analyst we’re building, we wrote an MCP server that exposes the Budget speech as a set of named sections. Instead of stuffing the entire speech into the prompt, the analyst lists what’s available, decides what’s relevant to the question, and fetches only those sections. The rest of this post is what happens when that runs.

The MCP server in 100 lines

The file at capstone/v2-mcp/server.py in the repo is the working MCP server for the Budget speech. It is about 100 lines of Python. The official MCP SDK does most of the work; the server itself declares what’s available and how to fetch it.

The full code is in the repo. Two pieces are worth seeing inline.

The tool that lists what’s available:

python

@mcp.tool()
def list_budget_sections() -> list[dict]:
    """Return the index of all Budget speech sections."""
    return [
        {"id": s.id, "title": s.title, "summary": s.summary}
        for s in SECTIONS
    ]

The tool that fetches a specific section:

python

@mcp.tool()
def read_budget_section(section_id: str) -> str:
    """Return the full text of a named Budget speech section."""
    section = SECTIONS_BY_ID.get(section_id)
    if not section:
        raise ValueError(f"Unknown section: {section_id}")
    return section.text

That’s the protocol-facing surface. The rest of the file loads the sections index from disk and runs the server over stdio. When the analyst process spawns this server, the two functions become tools the model can call. The model sees their names, their descriptions, and their input schemas, exactly as if you’d hand-written them as Anthropic SDK tools.
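
For comparison, this is roughly what a hand-written equivalent of read_budget_section would look like in the Anthropic tool format, which takes a name, a description, and a JSON Schema under input_schema. Treat it as a sketch: the schema the MCP SDK actually generates from the function signature may differ in its details.

```python
# A hand-written tool definition in the Anthropic Messages API "tools" format.
# The description is copied from the function's docstring; the schema is my
# reading of the signature `section_id: str`, not output captured from the SDK.
READ_BUDGET_SECTION_TOOL = {
    "name": "read_budget_section",
    "description": "Return the full text of a named Budget speech section.",
    "input_schema": {
        "type": "object",
        "properties": {
            "section_id": {"type": "string"},
        },
        "required": ["section_id"],
    },
}

print(READ_BUDGET_SECTION_TOOL["name"])
```

With MCP you never write this dictionary yourself. The decorator and the type hints carry the same information.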

The Budget speech is split into 20 sections. Some cover individual schemes grouped by sector (manufacturing.strategic_sectors covers the seven frontier sectors). Some cover whole topics (agriculture.farmer_welfare, manufacturing.textile_integrated_programme). Two are large annexures of tax amendments. The splits were done by hand because the speech doesn’t have machine-readable section markers - a one-time fixture prep step, not a runtime concern.
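
The two tools above lean on SECTIONS and SECTIONS_BY_ID, which the post doesn’t show. Here is a minimal sketch of how such an index could be loaded. The id, title, and summary fields match what the tools return. Storing the text inline in each entry is my assumption, and the repo’s fixture may well keep it in separate files. The two demo entries are made up.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    id: str
    title: str
    summary: str
    text: str


def load_sections(index_json: str) -> list[Section]:
    """Parse an index document into Section records.

    Assumes every entry has exactly the keys id, title, summary, and text.
    """
    return [Section(**entry) for entry in json.loads(index_json)]


# Hypothetical two-entry index, for illustration only.
EXAMPLE_INDEX = json.dumps([
    {"id": "demo.first", "title": "First", "summary": "A demo section.",
     "text": "Full text one."},
    {"id": "demo.second", "title": "Second", "summary": "Another demo.",
     "text": "Full text two."},
])

SECTIONS = load_sections(EXAMPLE_INDEX)
SECTIONS_BY_ID = {s.id: s for s in SECTIONS}  # O(1) lookup for the read tool

print([s.id for s in SECTIONS])
```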

Why this matters: the model now sees the Budget as 20 named, summarised resources it can choose from, instead of a 27,000-token wall of text it has to scan on every call.

Experiment 1: A focused question, watching the model think

Time for the first experiment. The simplest possible test of MCP: ask a question with a clear sectoral focus, and see what the model decides to read.

The question: How does this Budget affect the textile industry?

The setup. The MCP approach (running with_mcp.py in the repo) gives the model two tools: list sections, read section by ID. The model has no advance knowledge of which sections might be relevant. With the --show-trace flag, the script prints every tool call the model makes, so we can watch the reasoning unfold.

What I expected. One fetch. The model would list sections, find one named after textile, read it, summarise it. Done.

What actually happened:

tool list_budget_sections({})
tool read_budget_section({'section_id': 'manufacturing.textile_integrated_programme'})
tool read_budget_section({'section_id': 'taxes.indirect_overview_exemptions_exports'})
tool read_budget_section({'section_id': 'taxes.indirect_sez_ease_customs_exports'})
tool read_budget_section({'section_id': 'annexure.indirect_tax_amendments'})

Five tool calls. The model fetched the textile section, as expected. Then it fetched three indirect-tax sections, because textile exports are affected by customs policy: extended timelines for export realisation, duty-free inputs for shoe uppers, customs procedure changes. The model recognised that a question about “the textile industry” spans Part A (sectoral programmes) and Part B (tax policy), and chose to read both.
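
A trace like this is cheap to produce. The sketch below shows one way to print tool calls in the same style. The helper names are hypothetical, and the repo’s --show-trace flag may be implemented quite differently.

```python
def format_trace_line(tool_name: str, arguments: dict) -> str:
    """Render one tool call in the style of the trace above."""
    return f"tool {tool_name}({arguments})"


def print_trace(calls: list[tuple[str, dict]]) -> None:
    """Print every (tool name, arguments) pair, one per line."""
    for name, arguments in calls:
        print(format_trace_line(name, arguments))


# The first two calls from the textile run.
calls = [
    ("list_budget_sections", {}),
    ("read_budget_section",
     {"section_id": "manufacturing.textile_integrated_programme"}),
]
print_trace(calls)
```

In a real loop you would call format_trace_line at the point where your client dispatches each tool call, before the result goes back to the model.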

The numbers:

Tool calls made:     5 (1 list + 4 reads)
Input tokens:        16,469
Output tokens:       1,418
Cost:                $0.07
Duration:            36 seconds
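
If you want to check the cost line yourself, the arithmetic is short. The prices below are an assumption on my part ($3 per million input tokens, $15 per million output tokens). They reproduce the reported figure, but substitute the current list price for whichever model you run.

```python
# Assumed list prices in USD per million tokens; not stated in the post.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one run, given its total input and output tokens."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


textile_cost = run_cost(16_469, 1_418)
print(f"${textile_cost:.2f}")
```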

What this shows, in plain language. The model isn’t pattern-matching “textile” to “textile section.” It’s reasoning about what affects the textile industry, then fetching the sections that contain those things. With a stuffed prompt, you can’t see this reasoning at all - the model either uses the customs information or doesn’t, and you have no way to tell which. With MCP, the reasoning is right there in the trace.

Why this matters in production. When a system’s outputs need to be defended - to a manager, an auditor, a customer - being able to point at the source material it consulted is the difference between “trust the AI” and “trust the process.” The textile trace is a tiny example of a large pattern: agentic systems become accountable systems when their decisions are observable.

Experiment 2: A harder question, baseline approach

The textile question is clean because it has a focused answer. The model fetched four sections, summarised them, done. To really separate the two approaches, we need a question where the answer is scattered across the whole document.

The question: What are all the R&D-related allocations in this Budget, across all sectors?

This is harder for three reasons. There’s no single “R&D section” in the Budget - R&D allocations are scattered across biopharma, semiconductors, textiles, rare earths, agriculture, and several others. The right answer requires identifying the relevant allocations across many sections. And reasonable people might disagree on which allocations count - some are clearly R&D (Biopharma SHAKTI’s research outlay), some are tangential (an SME growth fund that might fund R&D), and some are borderline (the CCUS Mission, which is research-heavy but framed as climate policy).

So the test is twofold: does the system find the clearly-relevant allocations consistently, and how does it handle the judgement calls?

The two approaches. I’ll refer to them by short names from here:

The stuffed approach loads the entire 27,000-token Budget speech into the prompt on every call. The model has the whole document in its working memory. This is the script called stuff_everything.py in the repo.
The MCP approach gives the model two tools: list available sections, and fetch a specific section by ID. The model decides which sections to read. This is the script called with_mcp.py in the repo.

Both will run the same question. Both will run three times so we can see whether the outputs are consistent. Experiment 2 covers the stuffed approach. Experiment 3 covers MCP.

What the stuffed approach produced. Three runs, same question, same prompt, same model:

Run 1:  Biopharma SHAKTI (₹10,000 cr), CCUS Mission (₹20,000 cr)

Run 2:  Biopharma SHAKTI (₹10,000 cr), SME Growth Fund (₹2,000 cr),
        CCUS Mission (₹20,000 cr), Safe Harbour threshold (₹300 cr)

Run 3:  Biopharma SHAKTI (₹10,000 cr), CCUS Mission (₹20,000 cr)

The numbers:

Input tokens per run:                       27,505
Input tokens (3 runs total):                82,515
Cost per run:                               ~$0.10
Cost (3 runs total):                        $0.29
Anchor allocations found in all 3 runs:     2 (Biopharma, CCUS)
Allocations in only some runs:              2 (SME Growth Fund, Safe Harbour)

What this shows. The two anchor allocations - Biopharma SHAKTI and CCUS - appear in every run. So far, so good. The variance is in the judgement calls. Run 2 decided that both the SME Growth Fund and the Safe Harbour threshold for international tax count as R&D. Runs 1 and 3 decided neither counts.

These are not unreasonable inclusions. The SME Growth Fund could plausibly fund R&D activities. Safe Harbour rules affect R&D-intensive multinationals. But on the same input, one run earlier and one run later, the model made a different call.

Why this matters in production. For a chat conversation, this variance is invisible and acceptable. For a system producing an analyst’s briefing that goes to a CFO, it is not. If the briefing says “the Budget includes ₹2,300 crore of supporting R&D allocations” on Tuesday and “no other R&D allocations” on Wednesday, the analyst has a hard conversation ahead. And critically: nothing in the system tells you why the runs differed. The model searched the same haystack each time and made different judgement calls. You have no way to inspect those calls, no way to constrain them, no way to debug them.

This is what context rot looks like in real applications. The information is in the prompt. The model is finding most of it, most of the time. The variance is in the long tail of secondary decisions - and it’s opaque.
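
The “anchor” and “only some runs” counts are plain set arithmetic over the three outputs. A small sketch, using the allocation names from the runs above:

```python
# Allocations named in each stuffed run, copied from the results above.
RUNS = [
    {"Biopharma SHAKTI", "CCUS Mission"},
    {"Biopharma SHAKTI", "SME Growth Fund", "CCUS Mission",
     "Safe Harbour threshold"},
    {"Biopharma SHAKTI", "CCUS Mission"},
]


def split_stable_unstable(runs: list[set[str]]) -> tuple[set[str], set[str]]:
    """Return (items present in every run, items present in only some runs)."""
    anchors = set.intersection(*runs)
    seen = set.union(*runs)
    return anchors, seen - anchors


anchors, unstable = split_stable_unstable(RUNS)
print("anchors:", sorted(anchors))
print("unstable:", sorted(unstable))
```

This tells you that the runs disagreed and on what. It cannot tell you why, which is the gap the next experiment addresses.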

Experiment 3: The same hard question, MCP approach

Same question, three runs, but now the model fetches sections on demand instead of receiving the whole document upfront.

What the MCP approach produced:

Run 1:  11 sections fetched
        ₹10K cr, ₹2K cr, ₹20K cr, ₹300 cr, ₹40K cr, ₹70K cr

Run 2:  6 sections fetched
        ₹10K cr, ₹2K cr, ₹20K cr, ₹30K cr, ₹40K cr

Run 3:  6 sections fetched
        ₹10K cr, ₹2K cr, ₹20K cr, ₹300 cr

The numbers:

Input tokens (3 runs total):                84,623
Cost (3 runs total):                        $0.36
Anchor allocations found in all 3 runs:     3 (Biopharma, SME Growth Fund, CCUS)
Total distinct allocations across 3 runs:   6
Sections fetched (across 3 runs):           23 total

What this shows, first. The MCP approach surfaces a richer list. All three runs hit three anchor allocations rather than two. The model is reading actual sections about R&D-relevant topics rather than skimming a wall of text, and it finds more.

What this shows, second, more importantly. The MCP approach still varies. Run 1 fetched 11 sections and identified six allocations. Run 3 fetched 6 sections and identified four. The variance moved from “which allocations to include?” to “how widely to cast the net?”

But - and this is the difference that matters - the variance is now visible in the trace. Running with --show-trace prints every section the model fetched. You can see that Run 1 read the green-hydrogen mission section and included its ₹70K crore figure. You can see that Run 3 didn’t read that section, so naturally didn’t mention it. You can see exactly where the runs diverged and why.
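
Comparing two traces can be automated. The sketch below parses trace lines in the format shown in Experiment 1 and reports which sections one run read that another didn’t. The section IDs are placeholders, since the post doesn’t list the IDs fetched in the R&D runs.

```python
import ast


def fetched_sections(trace_lines: list[str]) -> set[str]:
    """Extract section IDs from lines like: tool read_budget_section({...})."""
    prefix = "tool read_budget_section("
    ids = set()
    for line in trace_lines:
        line = line.strip()
        if line.startswith(prefix) and line.endswith(")"):
            # The arguments are printed as a Python dict literal.
            arguments = ast.literal_eval(line[len(prefix):-1])
            ids.add(arguments["section_id"])
    return ids


# Hypothetical traces; these section IDs are placeholders.
run_a = [
    "tool list_budget_sections({})",
    "tool read_budget_section({'section_id': 'demo.biopharma'})",
    "tool read_budget_section({'section_id': 'demo.green_hydrogen'})",
]
run_b = [
    "tool list_budget_sections({})",
    "tool read_budget_section({'section_id': 'demo.biopharma'})",
]

only_in_a = fetched_sections(run_a) - fetched_sections(run_b)
print(sorted(only_in_a))
```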

Why this matters in production. If you don’t want the model to wander into green hydrogen on an R&D question, you can constrain it in the system prompt: “Limit R&D allocations to direct research outlays, not tangential ecosystem funding.” You can verify the constraint worked by looking at the trace on subsequent runs. With the stuffed approach, you couldn’t do this - there were no fetches to inspect, only mysteriously varying outputs.

This is the production-grade difference. Both approaches vary. Only one of them tells you why.

The two approaches side by side

Across three runs of the R&D question, here’s what the two approaches produced:

                                       Stuffed      MCP
Input tokens (3 runs)                  82,515       84,623
Output tokens (3 runs)                 2,584        7,093
Total cost (3 runs)                    $0.29        $0.36
Anchor allocations (all 3 runs)        2            3
Total distinct allocations             4            6
Section-level traceability             None         Full

A few things to read out of this.

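The MCP approach is more expensive on this question - about 26 percent more. This surprises most readers and it surprised me. Part of the reason: when the model fetches many sections to answer a cross-cutting question, each section’s text accumulates in the context for subsequent tool calls. The model lists, reads section A, reads section B (with A still in context), reads section C (with A and B still in context), and so on. Broad questions multiply context, even with MCP. That is why MCP’s input total (84,623 tokens) came out slightly above the stuffed total (82,515) rather than below it. The rest of the gap is output: the MCP runs produced 7,093 output tokens against 2,584, and output tokens are priced higher than input tokens.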
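
The accumulation effect is easy to underestimate, so here is the arithmetic as a sketch. The sizes are hypothetical, and the model simplifies by ignoring the tokens in the model’s own tool-call messages.

```python
def total_input_tokens(base_tokens: int, tool_result_tokens: list[int]) -> int:
    """Sum the input tokens billed across every model call in one tool loop.

    The first call sees only the base prompt. Each later call sees the base
    prompt plus every tool result returned so far.
    """
    context = base_tokens
    total = context                  # first call, before any tool result
    for result in tool_result_tokens:
        context += result            # the result stays in context from now on
        total += context             # the next call re-sends everything
    return total


# Hypothetical sizes: a 1,000-token base prompt, a 500-token section index,
# then two 2,000-token sections.
results = [500, 2_000, 2_000]
billed = total_input_tokens(1_000, results)
print(f"fetched {sum(results):,} tokens, billed {billed:,} input tokens")
```

In this toy case the model fetched 4,500 tokens of material but was billed for 11,500 input tokens, because earlier results are re-sent on every later call.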

The MCP approach surfaces more allocations - six distinct allocations versus four, and three anchor allocations versus two. In these runs, reading sections directly produced better recall than scanning a stuffed document, plausibly because the model isn’t trying to find needles in a haystack on every pass.

The MCP approach has full traceability. Every fetch is visible. Every decision the model made about what to read can be inspected. The stuffed approach has none of this. If a CFO asks “why did this run mention X but the previous run didn’t?” - under the stuffed approach, you cannot answer. Under the MCP approach, you can.

Note: cost is the only metric where stuffed “wins” on the broad question. On the same question, MCP surfaced 50 percent more allocations and provided full traceability. Cost is real, but it’s one metric of several.
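
The 26 percent figure comes from the unrounded costs, not from dividing $0.36 by $0.29. The sketch below reproduces it from the token counts in the table, assuming prices of $3 per million input tokens and $15 per million output tokens. Those prices are my assumption. They match the reported totals, but check them against the model you use.

```python
INPUT_PRICE_PER_MTOK = 3.00    # assumed, USD per million tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # assumed, USD per million tokens


def cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for the given token totals."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


stuffed = cost(82_515, 2_584)   # three stuffed runs
mcp = cost(84_623, 7_093)       # three MCP runs
premium = mcp / stuffed - 1

# How much of the extra spend comes from output tokens alone.
extra_output = (7_093 - 2_584) * OUTPUT_PRICE_PER_MTOK / 1_000_000
output_share = extra_output / (mcp - stuffed)

print(f"stuffed ${stuffed:.2f}, MCP ${mcp:.2f}, premium {premium:.0%}")
print(f"share of the premium from output tokens: {output_share:.0%}")
```

Under these assumed prices, most of the premium sits on the output side, so a tighter output format is one lever worth trying before you blame the fetches.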

The honest cost picture

Putting Experiment 1 and Experiments 2-3 together gives a clearer cost story:

                                   Stuffed       MCP           Winner
Focused question (textile)         ~$0.09        $0.07         MCP, by ~20%
Broad question (R&D)               ~$0.10        ~$0.12        Stuffed, by ~26%

The headline “MCP saves money” overclaims and ages badly. The accurate framing is:

MCP optimises context shape, not always cost. On focused questions, the cost win is real. On broad questions, you trade more tokens for more grounding and observability. On every question, you can see what the model retrieved and what it ignored. The cost story is conditional. The architectural story is unconditional.

When prompt caching enters in Post 7, the cost picture shifts again. Stable system prompts and tool definitions get cached at 90 percent off. The bits that vary - which sections get fetched - cost full price. MCP’s economics improve at scale in ways that one-shot calls don’t show. We’ll come back to this.
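
As a preview of that arithmetic, here is a sketch of input cost with a cached static prefix. It applies the 90 percent discount mentioned above to the static part and full price to the rest. The token split and the $3 per million price are hypothetical, and the sketch ignores the cost of writing the cache on the first call.

```python
def input_cost_with_cache(
    static_tokens: int,
    dynamic_tokens: int,
    price_per_mtok: float = 3.00,   # assumed USD per million input tokens
    cache_discount: float = 0.90,   # "90 percent off" for cached reads
) -> float:
    """Input cost in USD when the static prefix is read from cache."""
    cached_rate = price_per_mtok * (1 - cache_discount)
    return (static_tokens * cached_rate
            + dynamic_tokens * price_per_mtok) / 1_000_000


# Hypothetical split: 3,000 static tokens (system prompt and tool
# definitions) and 13,000 dynamic tokens (fetched sections, question).
with_cache = input_cost_with_cache(3_000, 13_000)
without_cache = input_cost_with_cache(3_000, 13_000, cache_discount=0.0)
print(f"with cache ${with_cache:.4f}, without ${without_cache:.4f}")
```

The saving grows with the share of the prompt that stays static, which is why the push-versus-pull split matters for caching.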

Summary: the three takeaways

If you stop reading here, take these:

One: MCP is observable in a way that stuffed context is not. Every fetch is visible in the trace. Every decision the model made about what to read can be inspected. When something is wrong, you can see it.

Two: variance is real with both approaches, but MCP moves the variance into the trace. Stuffed context produces variance you cannot see. MCP produces variance you can debug.

Three: cost is conditional, but architecture is not. MCP may or may not save you money on a given question. It always gives you a more grounded, more debuggable system. For production work, that matters more than per-call cost.

Where this leaves the capstone

The analyst at the end of Day 1 could read one document and produce structured output. The analyst at the end of Day 2 can fetch specific sections from a larger document on demand. The model is starting to decide things - what to read, when it has enough information to answer.

It is not yet a full agent. The loop is minimal. There’s no memory between turns. There’s no retry logic, no parallel tool calls, no streaming. We’ve built half an agent. Post 3 builds the rest.

Why this is the foundation for everything that follows

Every later day in this series depends on the shift this post introduces.

Day 3 builds the agent loop properly, with conversation memory and tools the model decides when to call. The MCP server today is one such tool.

Day 4’s orchestration patterns assume the model can fetch what each pattern needs. A routing agent can’t route to specialised sub-agents if every sub-agent has to receive the entire context.

Day 6’s evals can’t evaluate “did the agent retrieve correctly?” if there’s no retrieval step to evaluate.

Day 7’s production economics depend on prompt caching, which depends on most of the prompt being static - which is only true if you’re not stuffing dynamic content into every call.

The shift from pushing context into the model to letting the model pull what it needs matters because every other lesson is built on top of it.

What we’re not using

You’ll have noticed that we didn’t reach for a vector database, an embedding model, or any RAG framework. That’s deliberate. As mentioned in Post 0, this series treats retrieval as a tool the agent calls, not as a step that runs before the model sees the prompt. MCP is one implementation of retrieval-as-tool. Embedding-based RAG is another. They’re not mutually exclusive - an MCP server could absolutely use embeddings under the hood to decide which section to return - but the agentic frame doesn’t require them.

If you finish this series and want to add vector retrieval, the right architecture is: write an MCP server that exposes a search_documents tool, and let that tool use embeddings internally. The agent’s view of the world stays the same.
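
To show the shape of that tool without pulling in an embedding model, here is a sketch of search_documents that scores by plain word overlap. The overlap scoring is a stand-in I’ve swapped in for embedding similarity, and the documents are made up. Only the tool name comes from the paragraph above.

```python
def _tokens(text: str) -> set[str]:
    """Lowercase, whitespace-split bag of words."""
    return set(text.lower().split())


def search_documents(query: str, documents: list[dict],
                     top_k: int = 2) -> list[str]:
    """Return the IDs of the best-matching documents, best first.

    Scoring is word overlap between the query and each title plus summary.
    A real server would replace this with embedding similarity.
    """
    query_tokens = _tokens(query)
    scored = []
    for doc in documents:
        doc_tokens = _tokens(doc["title"] + " " + doc["summary"])
        overlap = len(query_tokens & doc_tokens)
        if overlap:
            scored.append((overlap, doc["id"]))
    # Highest overlap first; ties broken alphabetically by ID.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical documents, for illustration only.
DOCS = [
    {"id": "demo.textile", "title": "Textile programme",
     "summary": "Support for textile exports"},
    {"id": "demo.customs", "title": "Customs changes",
     "summary": "Procedures for exports"},
    {"id": "demo.farm", "title": "Farmer welfare",
     "summary": "Schemes for agriculture"},
]

hits = search_documents("textile exports", DOCS)
print(hits)
```

Whatever sits behind the function, the agent still sees one tool with a query in and a list of IDs out.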

What to do this week

Clone the repo and run the textile trace. It’s the moment that earns the post.

   git clone https://github.com/gradtensor/build-with-claude.git
   cd build-with-claude
   git checkout post-02
   uv sync
   cp .env.example .env

Add your Anthropic API key to .env, then run:

   uv run python day-02-context-mcp/with_mcp.py "How does this Budget affect the textile industry?" --show-trace

Watch which sections the model chose to read. Notice the customs sections it pulled in. Try a different sectoral question - electric vehicles, or rare earths - and see how the model’s fetching behaviour changes.

Run the variance demo on both approaches. Use any cross-cutting question - “what does this Budget do for women?” or “summarise the manufacturing initiatives” - with --runs 3 on each script. Notice where variance shows up and where it doesn’t.
Add a new section to the MCP server. The repo has 20 sections. Take a portion of the speech that was merged with others during the manual split and give it its own section. Add it to fixtures/budget-2026/sections/index.json with an id, title, and summary. No code changes needed. Rerun the analyst with a question that should hit your new section. The MCP server picks it up automatically because it builds its section list from the index file rather than from code.
Optional, for those who want to go deeper. Read the MCP specification at modelcontextprotocol.io. It’s short. The whole protocol fits in your head in 30 minutes.

Up next

Post 3 lands next week. Topic: building agentic applications with the Claude API properly. We turn the half-agent from this post into a full agent loop - with parallel tool calls, error handling, retries, streaming, and conversation memory so the analyst remembers what you asked it three turns ago. The MCP server from today becomes one of the agent’s tools. The Budget speech becomes a document the agent can have a real conversation about.

Repo: github.com/gradtensor/build-with-claude. Tag for this post: post-02. Subscribe for the next one, and star the repo to see commits as each day ships.

See you in Post 3.

Trust and Reason
