Context Windows
Goal
Understand how much information a model can use at once and how to manage that limit when building AI agents.
Why It Matters
A context window is the model's working memory for a single request. It is the maximum amount of input and generated output the model can attend to at one time, usually measured in tokens.
For agents, context windows decide how much of the task state, conversation history, tool output, retrieved documents, system instructions, and draft work can fit into the next model call. If the useful information does not fit, the agent may forget requirements, ignore earlier tool results, repeat work, or answer from incomplete evidence.
Study Notes
Core Idea
Models do not remember everything forever. During inference, they only see the tokens included in the current prompt plus any tokens they generate. The context window is the upper bound for that combined sequence.
Context is not only the user's visible message. It can include:
- system and developer instructions
- previous conversation turns
- tool definitions
- tool results
- retrieved RAG chunks
- files or code snippets
- examples used for few-shot prompting
- the model's own planned or generated output
This means a "128k token model" does not give the user 128k tokens of free space. Some of that budget is already spent before the user's content is added.
Tokens, Not Words
Context is measured in tokens, not pages or words. A token can be a word, part of a word, punctuation, whitespace, or another text fragment depending on the model's tokenizer.
A rough estimate for English prose is about 1.3 to 1.5 tokens per word, but the real count varies by model, language, formatting, code, and symbols. Code, tables, JSON, and non-English text can consume tokens differently than plain English paragraphs.
What Happens When Context Is Too Large
When the prompt exceeds the model's limit, the application must reduce it before the call can run. Common strategies include truncating old messages, summarizing history, selecting fewer retrieved chunks, shrinking tool outputs, or splitting the task into smaller calls.
If this is done carelessly, the agent can lose important state:
| Context problem | Agent failure |
|---|---|
| Old requirements are truncated | The final answer ignores a constraint |
| Tool output is pasted in full | The model spends attention on irrelevant rows |
| Too many RAG chunks are included | The answer cites weak or conflicting evidence |
| Long conversation history is kept raw | Latency and cost rise every turn |
| Important facts appear in the middle of a long prompt | The model may underuse them |
Larger Is Useful, But Not Free
Long context windows are valuable because they let a model inspect longer documents, larger code samples, richer tool traces, and more conversation history. They can reduce the need for aggressive summarization.
The tradeoff is that long context usually increases cost and latency. Attention over long sequences is computationally expensive, and the model may still fail to reliably use every relevant detail. A larger window should be treated as more capacity, not as a replacement for good context design.
Long Context vs RAG vs Memory
These techniques solve different problems:
| Technique | Best for | Risk |
|---|---|---|
| Long context | Reading a known large input in one call | Expensive, slower, may distract the model |
| RAG | Pulling the most relevant external knowledge into the prompt | Retrieval can miss or rank badly |
| Summarized memory | Preserving durable task facts across turns | Summary can omit important nuance |
| Structured state | Tracking exact values, decisions, and progress | Requires explicit design and updates |
For agents, combine them. Keep exact task state in structured data, retrieve external knowledge when needed, summarize old conversation only when it is not critical verbatim, and reserve long context for cases where the model truly must compare or reason over a large input.
Context Design Checklist
Before sending an agent prompt, ask:
- What information is required to make the next decision?
- Which old messages can be summarized or removed?
- Which tool outputs should be compressed into tables, IDs, or key facts?
- Which retrieved chunks are directly relevant to the current step?
- How many output tokens must be reserved for the answer?
- What should happen if the request does not fit?
Good agent systems treat context as a budget. They decide what earns a place in the prompt instead of appending everything.
Practice
Build a small script or notebook that compares three prompt strategies for the same task:
- Send a full long document or conversation history.
- Send a short summary plus the current user question.
- Send only the most relevant retrieved sections plus structured task state.
Record:
- input token count
- output token count
- latency
- estimated cost
- answer quality
- missing or ignored facts
Then write a short note explaining which strategy you would use for an agent and why.
Mini Project
Create a simple context manager for an agent loop.
It should keep:
goal: the user's taskconstraints: requirements that must not be forgottenfacts: confirmed information from tool callsrecent_messages: the latest conversation turnsretrieved_context: only the top relevant chunks for the current steptoken_budget: the maximum input tokens allowed
The manager should produce a prompt that fits the budget by prioritizing:
- system instructions
- goal and constraints
- confirmed facts
- current user message
- relevant retrieved context
- recent conversation history
Conversation history should be the first thing shortened unless the task requires exact wording.
Exit Criteria
- You can explain a context window in plain language.
- You know that context is measured in tokens, not words.
- You can list what consumes context in an agent call.
- You can explain why bigger context can increase cost and latency.
- You can choose when to use long context, RAG, summarized memory, or structured state.
- You can design a fallback when the prompt does not fit.