Claude's 1M Token Context Window: Promise vs. Practice

What Changed With 1M Context

Claude 3 launched in early 2024 with a 200,000 token context window that was already industry-leading. The Claude models of 2025 extended this to 1 million tokens — a 5x increase that moved the ceiling well beyond the "how do I fit this into context?" problem for almost every practical application.

At 1 million tokens, you can load an entire medium-sized codebase (most production applications are under 500K tokens of source code), a complete legal contract archive, all of a company's customer support tickets from a year, or multiple long technical documents simultaneously. The use cases that were previously impossible without RAG infrastructure become trivially possible with a direct context load.

Anthropic also introduced prompt caching in 2025, which changes the economics significantly. Once a large system prompt or document set is cached on Anthropic's infrastructure, subsequent requests referencing that cache are dramatically cheaper — roughly 90% of the input token cost is eliminated for cached content. This makes the repeated use of large context windows economically viable in ways that straightforward per-token pricing would not.

The Attention Distribution Problem

The headline capability is real. The nuance is in how transformer attention distributes across a million-token window. Empirically, models tend to attend more strongly to content near the beginning and end of the context, with somewhat weaker attention to content in the middle — a pattern researchers call the "lost in the middle" effect, well-documented since 2023.

This doesn't mean middle-of-context information is ignored. It means that for tasks requiring precise recall of specific details from a large document — "what did the contract say about the termination clause in section 14.3?" — stuffing the entire document and trusting the model to find the right passage is less reliable than a targeted extraction approach. The model is reasoning over the full context, but its attention is not uniformly distributed across it.

NIAH (needle-in-a-haystack) benchmarks, which test whether a model can reliably find a specific piece of information hidden in a large context, show Claude performing well at 1M tokens — but "performing well" on a benchmark and "reliably finding the exact clause your lawyer needs" are not the same bar. For retrieval-critical applications, hybrid approaches combining large context with retrieval for specific facts remain more reliable than pure context stuffing.

The Cost and Latency Reality

At Claude's standard pricing, a 1 million token input at $3 per million tokens costs $3 per request before generating a single output token. For applications making hundreds of requests per day, this is a substantial operating cost. Prompt caching reduces this to a fraction for repeated content, but the first load is always full price.

Latency is the other constraint. Processing a 1M token prompt — even with Anthropic's efficient serving infrastructure — adds meaningful time to first token. For interactive applications where response latency is user-facing, this matters. For batch processing applications running overnight or serving async workflows, it often doesn't.

The practical guidance that emerged through 2025 is to treat the large context window as a capability spectrum rather than a binary choice. Use small, focused contexts when you know exactly what information the model needs. Use medium contexts (10K–100K tokens) for document-level tasks where most of the document is relevant. Reserve the full million-token context for tasks where truly comprehensive context genuinely improves the output — and where cost and latency are acceptable for the use case.

Where 1M Context Actually Wins

The use cases where massive context most clearly justifies its cost are those where information synthesis across a genuinely large and interrelated document set is the core task. Legal due diligence — reviewing thousands of pages of contracts and corporate filings to identify specific risk factors and inconsistencies — benefits enormously from loading everything simultaneously rather than doing piecemeal retrieval, because the relevant patterns often span documents that a retrieval step would never relate to each other.

Codebase-wide reasoning is another clear winner. When a developer asks "why does this function behave differently on Tuesdays?" the answer might be in a configuration file, a cron job definition, a middleware component, and a comment in a utility module that haven't been in the same context window at the same time before. Large context makes this kind of cross-file reasoning tractable without building a retrieval layer.

The honest summary is that 1M token context is a powerful capability that solves specific, real problems — and is wasteful for everything else. The engineering discipline required to use it well is knowing when to reach for it and when to use a more targeted approach. That judgment doesn't come from the model spec; it comes from understanding your application's actual information requirements.