MAX_TOKENS = 128_000
BUDGET = {
"system_prompt": 4_000,
"history": 20_000,
"retrieval": 15_000,
"tool_outputs": 8_000,
"response": 4_000
}
# Enforce at orchestration layer before API call
if token_count(history) > BUDGET["history"]:
history = sliding_window(history, BUDGET["history"])
if token_count(retrieval) > BUDGET["retrieval"]:
retrieval = trim_by_similarity(retrieval, BUDGET["retrieval"])
assert sum(BUDGET.values()) <= MAX_TOKENS
Key Terms Used Throughout The Post
-
Token: The basic unit an LLM processes, roughly 0.75 words for English prose, less for code and structured data.
-
Truncation: The removal of tokens when the buffer exceeds capacity; the oldest content is dropped first.
-
Orchestration layer: The framework managing prompt construction and history between your application and the model API.
-
Vector memory: An external store that retrieves semantically relevant content at query time rather than pre-loading it into the prompt.
-
MCP (Model Context Protocol): An open protocol enabling models to pull structured context on demand from external systems during inference.
With traditional software, production failures leave evidence logs, stack traces, error rates. With LLM-powered systems and specifically with context window mismanagement that loop breaks in ways most teams aren't instrumented to catch.
To make this concrete, we'll follow Vurdero, a fictional internal AI code review assistant in a CI/CD pipeline. A stand-in for any LLM feature your team might ship.
Vurdero ran cleanly for weeks. Then it started getting things wrong, not hallucinating, wrong. It cited the right function, then recommended changes that contradicted reasoning it had produced earlier in the same session.
No exception was thrown. No alert fired. The pipeline kept running.
It took a routine PR walkthrough for someone to notice the inconsistency. By then, several reviews had shipped with flawed recommendations.
The root cause wasn't a logic error or a model regression. The context window had filled. The oldest tokens including the architectural constraints established at the start of the session were silently dropped. The model had no idea they were gone.
A context window is the total set of tokens an LLM processes in a single inference call system prompt, conversation history, retrieved documents, tool outputs, and response, all competing for the same finite space.
What Is a Context Window, Really?
"A context window is a fixed-size buffer; once full, oldest content is silently dropped with no warning."
The Mental Model
"Every token competes for the same space: system prompt, history, retrieval, and response all share one finite buffer."
A context window is not memory. It does not persist between calls, learn from previous sessions, or grow as your application accumulates state. It is a fixed-size buffer allocated at inference time, filled during the request, and discarded when the response is returned.
Every token in that buffer competes for the same finite space. In a production LLM system, that typically means:
- System prompt instructions, constraints, tool definitions: ~2,000–5,000 tokens
- Conversation or session history prior turns, agent reasoning steps: ~10,000–50,000 tokens
- Retrieved documents (RAG) chunks pulled from a vector store: ~5,000–20,000 tokens
- Tool outputs API responses, code execution results: ~2,000–10,000 tokens
- Response reserve space allocated for model output: ~2,000–4,000 tokens
On a model with a 128k context window, a moderately complex agentic workflow can consume 60–80% of available space before a single user message is processed.[1] That leaves less headroom than most engineers assume.

What "128k tokens" actually means
128k tokens is not 128k words. Tokenisation varies significantly by content type. As a working ballpark:
- Prose text: ~1 token per 0.75 words
- Source code: ~1 token per 0.4–0.6 words (syntax characters tokenise separately)
- JSON payloads: token-heavy due to brackets, quotes, and key repetition
- Non-Latin scripts (Arabic, Chinese, Japanese, Korean) cost 2–3x more; European languages typically 10–30% more
A 500-line Python file is not 500 tokens. Depending on complexity, it can be 3,000–6,000. A JSON API response that looks small in a terminal window can silently consume thousands of tokens once serialised into a prompt.
The practical implication: by the time you account for tokenisation overhead across all context slots, your effective working budget for actual task content is significantly smaller than the advertised window size. A 128k context window is not 128k tokens of usable space, it is 128k tokens of total capacity, split across every component of your system prompt, history, retrieval, tool outputs, and response. Understanding that distinction before you architect is what separates systems that degrade silently from ones that fail predictably.
Block 2: Token Mechanics
"Token budgets erode faster than expected not from large blocks, but from invisible overhead engineers rarely account for."
Where Tokens Go Silently
Token budgets erode faster than expected not from large content blocks, but from the accumulation of small, invisible costs engineers rarely account for in early design.
Formatting and structure overhead
Markdown, XML tags, JSON keys, and whitespace all tokenise. A well-structured system prompt with headers, numbered lists, and code fences can cost 20–30% more tokens than the same content written as plain prose. Template strings with repeated field names {"role": "user", "content": "..."} across dozens of turns compound this quickly.
Code vs prose
A 200-line function with docstrings, type hints, and inline comments will tokenise to roughly 800–1,500 tokens depending on language and style. The same logic expressed in pseudocode might cost half that. In code review workflows where full file diffs are passed as context, this difference is significant at scale.
Tool output bloat
API responses injected as tool outputs are rarely token-efficient. A paginated REST response with metadata, status codes, and nested keys can consume 3–5x more tokens than the actual payload content an engineer would extract manually. Passing raw tool output without stripping irrelevant fields is one of the fastest ways to saturate a context window unintentionally.
Conversation history growth
In multi-turn agentic workflows, history compounds linearly. Ten turns of agent reasoning each with a tool call, a result, and a response can accumulate 8,000–15,000 tokens before the core task has progressed significantly. Without an explicit strategy for managing history, the buffer fills from the bottom up, and the earliest often most critical context gets evicted first.
Block 3: How the Industry Is Responding
"No solution eliminates the buffer, they only change what goes in, how it's managed, and how much is consumed per call."
Sliding window and hierarchical summarisation
The most common production pattern. Rather than passing full conversation history, older turns are either dropped (sliding window) or compressed into a running summary that replaces raw content. The model sees a condensed representation of past context rather than the original tokens. Straightforward to implement, but lossy summarisation introduces semantic drift over long sessions.
Prompt compression
Token pruning at inference time. Techniques like LLMLingua (Microsoft Research) analyse the prompt and remove tokens that contribute least to model performance redundant phrasing, low-information connectives, repeated context. Can reduce prompt length by 2–5x with acceptable quality loss[2] depending on task type. Model-agnostic and applied at the orchestration layer before the API call is made.
KV cache reuse
At the provider level, repeated prompt prefixes system prompts, static instructions, shared document chunks can have their key-value attention tensors cached and reused across requests rather than recomputed. This does not expand the context window, but reduces latency and cost significantly for workloads with stable prompt prefixes. Supported at the API level by the major frontier model providers.[4]
Vector and graph memory
Rather than fitting all relevant context into a single inference call, external memory systems externalise state entirely. Vector stores enable semantic retrieval of relevant chunks at query time. Graph memory adds relational structure tracking entities and their relationships across sessions. Production systems in this space include MemGPT (now Letta), Mem0, and Zep[5], each with different trade-offs around latency, cost, and memory fidelity.
Model Context Protocol (MCP)
MCP takes a different approach rather than pre-loading context into the prompt, it enables the model to pull structured context on demand from external systems during inference. Tools, resources, and data sources are exposed as MCP servers; the model requests only what it needs, when it needs it. This shifts context responsibility from prompt construction to protocol, reducing upfront token consumption and keeping the buffer available for reasoning rather than retrieval.
Why Do LLM Systems Fail Silently in Production?
"Context mismanagement fails silently — no exceptions, no alerts, just quietly wrong outputs."
Failure Pattern 1:
State Loss in Multi-Turn Agents
The agent remembered everything until it didn't
Multi-turn agentic workflows accumulate context fast. Each reasoning step, tool call, and result gets appended to the buffer. Early in a session this is fine. Twenty steps in, the oldest entries often the task constraints, architectural decisions, or user-defined boundaries established at the start are the first to be evicted.
The model does not know they are gone. It continues reasoning from whatever remains in the buffer, with no signal that its foundational context has been truncated. Outputs remain fluent and confident. The degradation is invisible at the application layer unless you are explicitly instrumenting token usage and context boundaries per turn.
What happens in practice:
An agent tasked with a multi-step code migration begins with clear instructions, target framework, deprecated patterns to avoid, file scope. Thirty tool calls later, those instructions have been pushed out of the window. The agent begins reintroducing the deprecated patterns it was explicitly told to avoid. No errors. No warning. The task continues.
The silent truncation note:
This behaviour is framework-level, not API-level. A direct API call with excess tokens will return a context_length_exceeded error. Silent truncation occurs when the orchestration layer ,the framework sitting between your application and the model API (LangChain, LlamaIndex, custom agent loops) manages history and handles trimming before the API call is made, which is the default behaviour in most production agent frameworks.
Failure Pattern 2:
RAG Context Pollution
Retrieval fills the buffer before reasoning begins
RAG pipelines introduce a specific failure mode that sits upstream of the model entirely. By the time the LLM receives its input, a significant portion of the context window has already been consumed by retrieved chunks regardless of whether those chunks are actually relevant to the query.[3]
What happens in practice:
A retrieval pipeline is configured to return the top 20 chunks from a vector store on every query. Each chunk is 512 tokens. That is 10,240 tokens of retrieved content injected before the system prompt, conversation history, or task instructions are added. On a complex query with a long session history, the model receives a bloated context where retrieved content crowds out the reasoning space it needs to produce a coherent response.
Worse retrieved chunks frequently overlap semantically. Document 3 and Document 7 say the same thing in different words. Document 12 partially contradicts Document 4. The model receives conflicting, redundant information and has no mechanism to resolve it. It synthesises across all of it, producing outputs that look authoritative but are grounded in a polluted context.
Failure Pattern 3:
Cost Explosion at Scale
The bill arrives before the bug does
Context window mismanagement does not just produce wrong outputs it produces expensive ones. In production systems under load, the relationship between token usage and cost is non-linear in ways that catch most teams off guard.
What happens in practice:
A team ships an agentic workflow that passes full conversation history on every API call. In development, sessions are short five to ten turns. Token counts look reasonable. In production, sessions run longer. Power users drive thirty, forty, fifty-turn conversations. Each call carries the entire accumulated history, growing linearly with every turn.
At ten concurrent users this is manageable. At ten thousand, the cost per session has compounded to a point where the unit economics of the feature no longer work. The engineering team built for correctness. Nobody modelled the token cost curve at scale before shipping.
A secondary pattern compounds this: RAG pipelines that retrieve the same static documents on every call. A system prompt that embeds a 10,000-token knowledge base directly rather than retrieving on demand. Long-running agent sessions that re-inject full tool output history regardless of relevance. Each decision is defensible in isolation. Together they produce a context window that fills to capacity on every call at maximum token cost, every time.
Building Context-Aware Systems That Don't Collapse
"Context is a resource to allocate, not a place to put things — treat it like memory from day one."
The failure patterns in the previous section share a common root: context was treated as a byproduct of the application rather than a resource to be managed. The following patterns address that directly not as tutorials, but as design decisions engineers make before the first line of orchestration code is written.
Sliding window with priority scoring Rather than evicting the oldest tokens blindly, a sliding window with priority scoring assigns weight to context components system instructions score higher than mid-session tool outputs, recent turns score higher than older ones. Eviction removes the lowest-priority content first, preserving the context that matters most to the current reasoning step.
Hierarchical summarisation Instead of dropping old turns, compress them. A summarisation layer periodically condenses older conversation history into a running summary, replacing raw turn-by-turn content with a semantically dense representation. The model retains the thread of reasoning without carrying the full token cost of every prior exchange. The trade-off is lossy compression summarisation introduces drift over very long sessions.
Context budgeting Treat token allocation as a first-class engineering constraint, the same way systems engineers treat memory allocation. Define explicit token budgets per context slot system prompt, retrieval, history, tool outputs, response reserve and enforce them at the orchestration layer. Budget overruns become observable failures rather than silent degradation.
A minimal implementation looks like this:
Note : The exact values are workload-specific. The discipline of defining them explicitly is not.
Benchmarking context strategies There is no universally optimal context strategy. Sliding window performs differently on a code review workflow than on a customer support agent. Measure retrieval precision, output quality, and cost per call across representative workloads before committing to a pattern in production. Context strategy is a workload-specific decision.
MCP as an architectural pattern Rather than pre-loading context into the prompt, MCP shifts responsibility to the protocol layer. The model requests structured context on demand tools, resources, data consuming buffer space only for what is actively needed per inference step. For agentic systems with access to many external sources, MCP reduces upfront token consumption and keeps the buffer available for reasoning.
Observability LangSmith and Langfuse Context problems are invisible without instrumentation. LangSmith and Langfuse both provide token-level tracing across agent runs surfacing which context slots are consuming budget, where truncation is occurring, and how cost-per-call evolves over session length. Treat context observability as a production requirement, not a debugging tool.
Conclusion
Engineering Context as a First-Class Concern
"The teams that treat context as a design constraint build systems that hold up — the ones that don't debug silent failures in production."
The context window is not a detail. It is the operating environment your LLM runs in fixed in size, invisible in failure, and expensive when mismanaged at scale. The teams that treat it as an afterthought are the ones debugging silent degradation in production. The teams that treat it as a design constraint build systems that hold up.
Seven Questions Worth Asking Before You Ship
1. Have you defined explicit token budgets for each context slot?
22/05/2026
.png?width=140&height=101&name=glassdoor%20(1).png)
