The 1M-Token Context Race Is Rewiring AI Agent Design
Million-token prompts just moved from demo to default, and that change is reshaping how teams build AI agents. Learn why stacks are shifting from RAG-heavy pipelines to in-context, tool-rich workflows, plus a practical migration plan you can ship.


The week the prompt became the platform
On August 12, 2025, Anthropic made the million-token context window a default design choice with its Claude Sonnet 4 announcement. The update put entire codebases, long SOPs, and multi-day session state inside a single request, and it shipped pricing mechanics and caching that make the feature practical for production use. See the official note in the Claude Sonnet 4 1M context update.
That release landed in a year when Google’s Gemini 2.5 Pro began shipping with a one million token input limit, clearly stated in its developer documentation. In other words, million-token context is no longer a marketing bullet. It is becoming a platform primitive. Google’s docs list an input limit of 1,048,576 tokens and a large output ceiling for the stable 2.5 Pro model. You can confirm the numbers in the Gemini 2.5 Pro model details.
The result is a shift in agent architecture. Retrieval pipelines that once stitched small windows together are giving way to in-context, tool-rich workflows where the agent carries nearly everything it needs inside the prompt. This piece explains the patterns that are emerging and how to migrate without breaking cost models or accuracy.
Why a million tokens changes agent design
A million tokens is not just a bigger bucket. It is a different unit of work. At this scale you can:
- Keep a full repository in play, not just the current file and a handful of dependencies. Cross-file issues, architectural constraints, and style rules are available at once.
- Load real SOPs, playbooks, and compliance binder material in the same session as tickets and logs. The agent can solve tasks while citing the actual procedure.
- Maintain session state across hundreds of tool calls. Instead of reloading context every turn, you keep the working set resident.
When the working set fits natively in the window, your agent stops spending half its time fetching, chunking, sorting, and re-embedding. Gains show up as lower orchestration overhead, fewer context misses, and simpler code paths. Retrieval is still valuable, but it becomes optional and narrow rather than the backbone.
From RAG-heavy stacks to in-context first
Classic RAG exists to compensate for tiny context windows. It shines when you must query very large corpora on demand, or when content changes minute to minute. But it also creates fragile glue code. Failures creep in through chunking choices, embedding drift, filter thresholds, and tool latency.
With million-token windows, an in-context-first approach becomes viable:
- At startup, the agent packs its working set. That may include the repository tree with trimmed binaries, a policy pack, API schemas, and a compact history of the ticket or conversation.
- Tools are mounted, but they operate against the same in-memory context. Function calls do not need to rehydrate knowledge each time. The model can learn tool affordances by seeing full schemas. For hands-on implications in dev environments, see how agentic coding goes mainstream.
- Retrieval is used as a sidecar, not as a spine. You pull in new items when the working set is insufficient, then keep them resident for the rest of the session.
The net effect is fewer moving parts, fewer places to tune thresholds, and far better determinism once the pack is set.
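To make that flow concrete, here is a minimal sketch of an in-context-first session loop in Python. The model client and retrieval helper (call_model, sidecar_fetch) are hypothetical caller-supplied callables, not any particular vendor's API; the pack layout is an assumption.

```python
def build_pack(repo_files, policy_docs, api_schemas, ticket_history):
    """Assemble the static working set once at session start."""
    sections = [
        "## Repository\n" + "\n".join(repo_files),
        "## Policies\n" + "\n".join(policy_docs),
        "## API schemas\n" + "\n".join(api_schemas),
        "## History\n" + "\n".join(ticket_history),
    ]
    return "\n\n".join(sections)


def run_session(pack, tools, user_turns, call_model, sidecar_fetch):
    """call_model and sidecar_fetch are caller-supplied stand-ins for real clients."""
    resident_extras = []                      # fetched items stay resident afterwards
    for turn in user_turns:
        prompt = "\n\n".join([pack] + resident_extras + ["User: " + turn])
        reply = call_model(prompt, tools)     # one long-context call per turn
        if reply.get("needs_more_context"):   # retrieval as a sidecar, not a spine
            resident_extras.append(sidecar_fetch(reply["query"]))
        yield reply
```

The point of the sketch is the shape: one deterministic pack built up front, tools mounted against it, and retrieval used only to top up a working set that otherwise stays resident.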
Emerging patterns for long-context agents
1) Orchestrator and worker, with packed tool schemas
A common pattern is a thin orchestrator that plans and supervises, and one or more workers that do the bulk of thinking. The orchestrator carries a slim context, often with governance rules and a task graph. The worker holds the heavyweight context pack and the tool schemas.
- Pack the tool catalog inside the worker’s system prompt using concise JSON or BNF-like schema summaries. Include examples of the most common argument combinations. The goal is to make tool selection a first-class reasoning step rather than a blind function call.
- Give tools clear, stable names and version tags. Avoid synonyms. The model learns by pattern, so naming hygiene reduces wrong tool calls.
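As an illustration of what a packed tool catalog can look like, here is a small sketch that renders compact schema summaries with a couple of usage examples into a numbered block for the worker's system prompt. The tool names, arguments, and examples are invented for illustration, not a real API surface.

```python
import json

# Compact tool catalog packed into the worker's system prompt.
TOOL_CATALOG = [
    {
        "name": "search_logs_v1",
        "purpose": "Search service logs for a pattern within a time range.",
        "args": {"pattern": "str", "service": "str", "since_minutes": "int <= 1440"},
        "examples": ['search_logs_v1(pattern="timeout", service="checkout", since_minutes=60)'],
    },
    {
        "name": "open_ticket_v2",
        "purpose": "File a ticket with a summary and linked evidence.",
        "args": {"summary": "str", "severity": "one of: low, medium, high"},
        "examples": ['open_ticket_v2(summary="Checkout timeouts spike", severity="high")'],
    },
]

def render_tool_section(catalog):
    """Render the catalog as a compact, numbered block for the system prompt."""
    lines = ["## Tools (choose by name, cite the tool number in your plan)"]
    for i, tool in enumerate(catalog, start=1):
        lines.append(f"T{i}. {tool['name']}: {tool['purpose']}")
        lines.append(f"    args: {json.dumps(tool['args'])}")
        for example in tool["examples"]:
            lines.append(f"    example: {example}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_tool_section(TOOL_CATALOG))
```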
2) Prompt caching as a memory tier
Prompt caching becomes a new layer in the memory hierarchy. You pin the heavy but mostly static parts of the pack, then reference them across turns.
- Pin policy binders, API references, and code that changes slowly. Let dynamic items like logs and diffs stream in per turn.
- Cache keys should align with natural versioning. For code, use commit hashes. For SOPs, use document versions. Stable keys increase hit rates and avoid subtle desync.
- Treat cache TTL as governance. Short TTL for sensitive data, longer for public docs.
Anthropic’s release explicitly calls out cost and latency benefits when caching large prefixes. Most developers will see the best wins from pinning schemas, policy, and repository indexes while leaving per-task deltas uncached.
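One way to keep cache keys aligned with natural versioning is to derive them from the version identifiers of the pinned segments, so the key changes only when the prefix actually changes. The segment names, versions, and TTL values below are illustrative assumptions, not a specific provider's caching API.

```python
import hashlib
import json

# Pinned segments keyed by natural version identifiers (commit hashes, doc versions).
PINNED_SEGMENTS = {
    "policy_binder": {"version": "SOP-2025.07", "ttl_seconds": 86400},
    "api_reference": {"version": "v3.2.1", "ttl_seconds": 86400},
    "repo_index": {"version": "commit:9f3a1c2", "ttl_seconds": 3600},
}

def cache_key(segments):
    """Stable key: hash of the sorted (name, version) pairs of the pinned segments."""
    canon = json.dumps(sorted((name, seg["version"]) for name, seg in segments.items()))
    return hashlib.sha256(canon.encode()).hexdigest()[:16]

def effective_ttl(segments):
    """Shortest TTL wins, so the most sensitive segment caps the whole prefix's lifetime."""
    return min(seg["ttl_seconds"] for seg in segments.values())

if __name__ == "__main__":
    print("prefix cache key:", cache_key(PINNED_SEGMENTS))
    print("effective TTL:", effective_ttl(PINNED_SEGMENTS), "seconds")
```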
3) Ephemeral versus governed long-term memory
Two memory modes are emerging:
- Ephemeral session memory. Lives in the context window and cache only. It expires with TTL or at session end. Ideal for exploratory work and privacy-sensitive tasks.
- Governed long-term memory. Durable storage for facts, decisions, and summaries that must survive across sessions. It sits outside the prompt in a database and is re-packed when needed. Changes require audit trails, retention policies, and PII scanning. For the controls and risks to watch, review why agent hijacking is the gating risk.
Draw a bright line between the two. Ephemeral memory lowers risk and cost. Governed memory increases reliability and continuity but needs policy and observability.
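A minimal sketch of that bright line, assuming a toy PII check and audit format (a real deployment would use a proper scanner and audit pipeline):

```python
import re
import time

class EphemeralMemory:
    """Lives only for the session; entries expire with a TTL."""
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._items = {}                     # key -> (value, stored_at)

    def put(self, key, value):
        self._items[key] = (value, time.time())

    def get(self, key):
        value, stored_at = self._items.get(key, (None, 0))
        return value if time.time() - stored_at < self.ttl else None


class GovernedMemory:
    """Durable store: every write requires an audit entry and a PII scan."""
    PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # toy SSN-shaped check only

    def __init__(self):
        self._store = {}
        self.audit_log = []

    def promote(self, key, value, actor, reason):
        if self.PII_PATTERN.search(value):
            raise ValueError("PII detected; redact before promotion")
        self._store[key] = value
        self.audit_log.append({"key": key, "actor": actor, "reason": reason,
                               "at": time.time()})
```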
Cost and latency in the million-token era
Bigger windows raise two questions: how to budget tokens and how to keep latency acceptable.
- Token pricing tiers. Anthropic’s published prices for Sonnet 4 shift above 200k input tokens. Past that threshold the per-token price for input roughly doubles and the output price rises as well. These tiers create a strong incentive to keep the hot working set under the lower band when you can.
- Output limits matter. Long prompts with short answers are often cheaper than the reverse. Tighten the requested output format, stream partials, and avoid verbose restatements.
- Cache to cut cold-start costs. Pinning a large static preamble reduces both startup latency and the effective cost per turn since subsequent calls reference the cached prefix.
- Plan for parallelism. Large contexts increase single-call latency. Hide it with parallel worker calls on independent subgoals, then merge the results. Use bounded fan-out to keep cost predictable.
- Use batch modes where available. Batching amortizes overhead when issuing similar prompts with minor variations.
A practical rule of thumb: design the pack to fit a comfortable lower tier first, and only exercise the full million when the task clearly benefits. Track cache hit rates as a first-class KPI, since hit rates often drive both cost and latency.
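A back-of-the-envelope estimator makes the tier and cache effects visible. The rates and the 90 percent cache discount below are placeholders; substitute your provider's current published prices before using anything like this for real budgeting.

```python
TIERS = [
    # (max prompt tokens for this tier, input $ per 1M tokens, output $ per 1M tokens)
    (200_000, 3.00, 15.00),
    (float("inf"), 6.00, 22.50),
]

def estimate_cost(prompt_tokens, output_tokens, cached_prompt_tokens=0, cache_discount=0.9):
    """Pick the tier by total prompt size, then discount the cached prefix."""
    for tier_limit, in_rate, out_rate in TIERS:
        if prompt_tokens <= tier_limit:
            break
    uncached = prompt_tokens - cached_prompt_tokens
    input_cost = (uncached * in_rate
                  + cached_prompt_tokens * in_rate * (1 - cache_discount)) / 1e6
    output_cost = output_tokens * out_rate / 1e6
    return round(input_cost + output_cost, 4)

if __name__ == "__main__":
    # Same task, packed to stay under the lower tier versus spilling over it.
    print(estimate_cost(180_000, 2_000, cached_prompt_tokens=150_000))
    print(estimate_cost(450_000, 2_000, cached_prompt_tokens=150_000))
```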
Failure modes to design around
Million-token context is powerful, but it can fail in new ways.
- Token budgeting collapse. If everything is important, nothing is salient. A bloated pack can bury the crucial lines. Enforce a budget per category. Code references, tools, policy, and history should each have caps.
- Salience drift. The model may anchor on early parts of a long prompt and underweight newer material. Fight drift with recency markers, structured headings, and explicit references like section IDs.
- Hallucinations at scale. When the model has enough text to sound authoritative, mistakes are more convincing. Require citations to specific sections and line ranges from the pack, and penalize answers without grounded references in your evals.
- Tool confusion. With dozens of functions in context, the agent may pick a plausible but wrong tool. Distinct names, versioned schemas, and a tiny number of example calls for each tool reduce confusion.
- Context skew across turns. If you regenerate a pack partially, stale fragments can linger in cache. Use versioned pack manifests and invalidate by key rather than by guesswork.
- Over-trust in in-context data. Long context does not solve recency for fast-changing sources. Keep a retrieval path for live data like incidents, prices, and inventory, then write back only curated summaries. For monetized actions that touch payments, align with the era of paying agents.
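Enforcing a budget per category, as suggested above, can be as simple as a pack check that runs before every call. The category names, caps, and the 4-characters-per-token heuristic are assumptions; swap in your own caps and a real tokenizer.

```python
BUDGETS = {"code": 500_000, "tools": 30_000, "policy": 150_000, "history": 80_000}

def rough_token_count(text):
    """Crude heuristic: ~4 characters per token; replace with a real tokenizer."""
    return len(text) // 4

def check_pack(pack_sections):
    """pack_sections: dict of category -> list of text snippets."""
    report, within_budget = {}, True
    for category, snippets in pack_sections.items():
        used = sum(rough_token_count(s) for s in snippets)
        cap = BUDGETS.get(category, 0)
        report[category] = {"used": used, "cap": cap, "over": used > cap}
        within_budget = within_budget and used <= cap
    return within_budget, report
```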
A migration blueprint for today’s agent apps
You do not need a rewrite. Migrate in phases, measure each step, and keep a rollback.
Phase 1: Inventory and pack the working set
- List the materials the agent needs to be correct: code paths, configs, API specs, SOPs, policy, and the minimal session history.
- Build a packer that produces a deterministic bundle. Prefer text and structured snippets over binary assets. Include a manifest with version hashes, sizes, and source paths.
- Strip noise. Remove duplicate lines, generated files, and trivial logs. Deduplicate near-identical snippets. Every saved kilobyte improves salience.
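Here is a minimal packer sketch along those lines: walk a set of source paths, skip binaries, drop exact duplicates, and emit a manifest with hashes, sizes, and source paths. The skip rules are illustrative assumptions to adapt to your repository.

```python
import hashlib
import json
from pathlib import Path

SKIP_SUFFIXES = (".png", ".jpg", ".pdf", ".lock", ".min.js")

def build_manifest(paths):
    entries, seen_hashes = [], set()
    for path in sorted(Path(p) for p in paths):          # sorted -> deterministic order
        if not path.is_file() or path.name.endswith(SKIP_SUFFIXES):
            continue
        data = path.read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_hashes:                         # drop exact duplicates
            continue
        seen_hashes.add(digest)
        entries.append({"path": str(path), "sha256": digest, "bytes": len(data)})
    return {
        "version": hashlib.sha256(
            json.dumps(entries, sort_keys=True).encode()).hexdigest()[:12],
        "files": entries,
    }
```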
Phase 2: Prompt-packing strategies that age well
- Hierarchical scaffolding. Start with a short overview of the pack layout, then include sections by category with clear headings and boundaries.
- Stable anchors. Number sections and files. Teach the agent to cite Section 3.2 or File A15 to force grounded reasoning.
- Tight formats. Use compact tables or bullet lists instead of prose for reference data. For code, keep only the necessary functions and signatures, plus links or IDs to fetch more on demand.
- Rolling history window. Summarize old turns into structured bullet points with references to pack IDs. Carry the summary, not the raw chat, as you move forward.
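A sketch of the scaffolding and rolling history ideas, assuming a simple "Section N" anchor convention and a bullet-style turn summary (both are conventions you would define for your own packs):

```python
def render_pack(sections):
    """sections: ordered list of (title, [snippet, ...]) pairs."""
    overview = ["## Pack layout"]
    body = []
    for i, (title, snippets) in enumerate(sections, start=1):
        overview.append(f"Section {i}: {title}")
        body.append(f"\n## Section {i}: {title}")
        for j, snippet in enumerate(snippets, start=1):
            body.append(f"[{i}.{j}] {snippet}")      # stable anchor the agent can cite
    return "\n".join(overview + body)

def roll_history(old_summary_bullets, new_turns, max_bullets=30):
    """Carry a bounded bullet summary forward instead of the raw chat."""
    bullets = old_summary_bullets + [f"- {t['role']}: {t['gist']}" for t in new_turns]
    return bullets[-max_bullets:]
```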
Phase 3: Tool schema and embedding hygiene
- Pack tool schemas in a consistent style. Each function gets a name, purpose, arguments with types and constraints, and 2 to 3 crisp usage examples.
- Keep names unique and short. Prefer deploy_app_v2 over deployApplicationNew. The model resolves collisions poorly.
- Version and deprecate. Include a deprecation note in schemas that should not be used, and set an allowlist the orchestrator enforces.
- If you still use embeddings for retrieval, unify the pipeline. One embedding model, consistent preprocessing, and reproducible chunking. Put chunk IDs into the pack so the agent can request exact chunks when it must fetch more.
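An orchestrator-side allowlist gate is one way to enforce the versioning and deprecation rules above. The tool names and metadata format here are assumptions for illustration.

```python
TOOLS = {
    "deploy_app_v2": {"deprecated": False},
    "deploy_app_v1": {"deprecated": True, "note": "use deploy_app_v2"},
}
ALLOWLIST = {name for name, meta in TOOLS.items() if not meta["deprecated"]}

def validate_call(tool_name):
    """Reject unknown or deprecated tools before any execution happens."""
    if tool_name not in TOOLS:
        return False, f"unknown tool: {tool_name}"
    if tool_name not in ALLOWLIST:
        return False, f"deprecated tool: {tool_name} ({TOOLS[tool_name]['note']})"
    return True, "ok"
```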
Phase 4: Streaming and compute plans
- Stream planning tokens separately from final answers. Let the orchestrator receive partial plans, then spawn workers early.
- Multi-stage streaming. Stage 1 triage confirms scope and cites what it will use. Stage 2 produces results with per-step citations. Stage 3 emits artifacts or tool calls.
- Microbatch similar tasks. If you have 20 tickets that share the same pack prefix, run them as a batch to hit caches and amortize startup costs.
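Bounded fan-out is straightforward to sketch with asyncio: spawn workers on independent subgoals, cap concurrency, then merge. The run_worker coroutine below is a stand-in for a long-context model call, not a real client.

```python
import asyncio

async def run_worker(subgoal):
    await asyncio.sleep(0.1)                 # placeholder for a long-context call
    return {"subgoal": subgoal, "result": f"draft answer for {subgoal}"}

async def fan_out(subgoals, max_concurrency=4):
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(subgoal):
        async with semaphore:                # keeps cost and load predictable
            return await run_worker(subgoal)

    return await asyncio.gather(*(bounded(s) for s in subgoals))

if __name__ == "__main__":
    results = asyncio.run(fan_out(["triage ticket 101", "triage ticket 102", "triage ticket 103"]))
    print(results)
```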
Phase 5: Evals and telemetry for long context
- Track pack size, token budget by category, and cache hit rates. Alert on drift.
- Measure tool call precision and recall. Precision means the agent picked the right tool; recall means it used a tool whenever it should have.
- Build groundedness evals. Have the agent return section IDs or line ranges that support its claims, then automatically check those references.
- Add cost and latency SLOs per workflow. For each path, keep a 95th percentile latency and a max token budget. The orchestrator should degrade gracefully when limits are hit.
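A groundedness check can be a simple reference validator: every citation in the answer must point at a section anchor that actually exists in the pack manifest. The "[Section N.M]" citation format is an assumed convention that you would align with your pack layout.

```python
import re

CITATION = re.compile(r"\[Section (\d+(?:\.\d+)?)\]")

def grounded_fraction(answer, known_sections):
    """Fraction of citations that resolve to real pack sections (0.0 if none cited)."""
    cited = CITATION.findall(answer)
    if not cited:
        return 0.0
    valid = sum(1 for section_id in cited if section_id in known_sections)
    return valid / len(cited)

if __name__ == "__main__":
    pack_sections = {"1", "2", "2.1", "3", "3.2"}
    answer = "Timeouts come from the retry policy [Section 3.2], capped at 30s [Section 9.9]."
    print(grounded_fraction(answer, pack_sections))   # one of the two citations is bogus
```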
Phase 6: Governance and safety
- Ephemeral by default. Use ephemeral session memory for exploration. Promote to governed memory only with explicit user consent and an audit entry.
- Retention rules per data class. Different TTL for public docs, customer PII, source code, and vendor contracts. TTL becomes a system control, not just a billing knob.
- Redaction at pack time. Scrub secrets and tokens, and replace them with references to a secure secret manager. Never inline secrets in long context.
- Human-in-the-loop on sensitive changes. Require review for actions that modify production systems, change access rights, or update policy.
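Redaction at pack time can start with a small scrubber that replaces likely secrets with opaque references to a secret manager. The patterns below are simple illustrations; a production system should use a dedicated secret scanner.

```python
import re

SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|token|password)\s*[:=]\s*\S+"),
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),          # AWS-style access key shape
]

def redact(text, vault_prefix="secretref://vault/"):
    """Return scrubbed text plus a count of redactions for the pack manifest."""
    redactions = 0
    for pattern in SECRET_PATTERNS:
        text, n = pattern.subn(vault_prefix + "REDACTED", text)
        redactions += n
    return text, redactions
```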
What still belongs to RAG
Long context reduces the need for constant retrieval, but it does not eliminate retrieval entirely.
- High churn content. Incidents, prices, and inventory change too quickly. Keep a retrieval lane for live data and keep its outputs small and grounded.
- Web grounding and compliance. For claims that require up-to-the-minute sources or formal citations, retrieval remains necessary.
- Cost-sensitive bulk jobs. If you can answer from a small slice of a huge corpus, retrieval can be cheaper than loading and caching a massive pack.
Use RAG as a precise instrument rather than a default. Fetch small, confirm sources, and then promote only the essential pieces into the pack.
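In code, that promotion step can be a single function: fetch a small, sourced result through a caller-supplied live fetcher, then append only a curated, size-capped summary to the resident pack. fetch_live and the crude token cap are assumptions.

```python
def promote_to_pack(pack_extras, query, fetch_live, max_tokens=2_000):
    """fetch_live(query) -> {"source": str, "text": str}; supplied by the caller."""
    hit = fetch_live(query)
    summary = hit["text"][: max_tokens * 4]           # crude 4-chars-per-token cap
    pack_extras.append(f"[live:{hit['source']}] {summary}")
    return pack_extras
```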
A practical 30-60-90 plan
- Days 1 to 30: Ship a pilot on one workflow. Build the packer, define the token budget by category, add caching, and wire in basic telemetry. Target a stable 30 to 40 percent cache hit rate and a 20 percent latency reduction versus your baseline.
- Days 31 to 60: Expand to two more workflows, add groundedness evals with section IDs, and introduce governed memory for one durable data type with audit trails. Track tool precision and recall and fix schema naming issues that show up.
- Days 61 to 90: Optimize cost. Batch where you can, increase cache hit rates with better version keys, and refactor prompts to stay in the lower price tier for most turns. Add red-team tests for hallucinations at scale and enforce SLOs for both cost and latency.
The future is pack-first
The immediate lesson of million-token windows is simple. Reduce glue code, move the working set into the prompt, and let tools operate against a shared, stable context. You will ship faster, debug less, and give users a system that behaves more like a thoughtful partner than a frantic librarian.
With Anthropic and Google both normalizing million-token inputs, the question is not whether to go long context, but how to do it safely, cheaply, and predictably. Start by packing with discipline, cache what you can, govern what you must, and measure everything. The agents you run next quarter can be simpler and smarter at the same time.