Context windows: why AI agents forget and how to fix it
Learn what context windows are, why AI agents run out of memory during long tasks, the "lost in the middle" problem, and five strategies to manage token limits.

A context window is the maximum amount of text, measured in tokens, that an AI model can process in a single conversation. It's the model's working memory: everything it can "see" at once, including your prompt, the conversation history, tool results, and its own responses.
If you've ever had an AI assistant "forget" something you told it earlier in a long conversation, you've hit the context window limit. The model didn't get confused or ignore you. It literally can't see your earlier messages anymore because they've been pushed out of the window. For AI agents running multi-step workflows with dozens of tool calls, this isn't an edge case. It's the central engineering challenge.
As of early 2026, context windows range from 128,000 tokens to 10 million (Llama 4 Scout), though the usable portion is almost always shorter than the marketing number. That gap between nominal and effective context is where most agent failures happen.
How tokens and context windows work
A token isn't a word. It's a chunk of text that the model processes as a single unit, usually 3-4 characters long. The word "eating" becomes two tokens: "eat" and "ing." A rough conversion: 1,000 words equals about 1,300-1,500 tokens. Code and non-English text tokenize less efficiently, sometimes 2-3x more tokens per word.
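If you want to see these splits for yourself, tokenizer libraries expose the encoding directly. Here's a minimal sketch using OpenAI's tiktoken; other models ship different tokenizers, so exact counts will vary:

```python
# Count tokens with tiktoken. cl100k_base is the encoding used by
# several OpenAI models; other vendors tokenize differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["eating", "Hello world", "1,000 words is about 1,300-1,500 tokens"]:
    ids = enc.encode(text)
    print(f"{text!r} -> {len(ids)} tokens")
```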
Everything inside a conversation shares one context window. Your system prompt, every user message, every assistant response, every tool call and its result, all of it counts toward the same budget.
To put a typical budget in perspective: as of March 2026, a common frontier window is 200,000 tokens. That's roughly 150,000 words. Enough for two full novels. Sounds like you'll never run out. You will.
Why AI agents burn through context fast
An agent isn't a chatbot answering one question. It's running a workflow: reasoning about a task, calling tools, processing results, reasoning again, calling more tools. Every step eats context, and it adds up faster than you'd expect.
Here's what a real agent session looks like in token terms:
A single tool call to CoinPaprika's getTickersById takes about 233 tokens: 9 for the function call, ~224 for the JSON response. Sounds small. But an agent doing a portfolio analysis across 20 coins makes 20+ calls. Add reasoning between each, and you're at 10,000+ tokens for what felt like one task.
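You can sanity-check that arithmetic with a back-of-envelope script. The per-call figure comes from the example above; the reasoning and system prompt numbers are assumptions for illustration:

```python
# Rough context budget for the 20-coin portfolio session described above.
# TOKENS_PER_TOOL_CALL is the article's estimate; the rest are assumptions.
TOKENS_PER_TOOL_CALL = 233   # ~9 for the call + ~224 for the JSON response
REASONING_PER_STEP = 250     # assumed average reasoning between calls
SYSTEM_PROMPT = 1_500        # assumed

calls = 20
total = SYSTEM_PROMPT + calls * (TOKENS_PER_TOOL_CALL + REASONING_PER_STEP)
print(f"~{total:,} tokens")  # ~11,160 tokens for what felt like one task
```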
I hit this wall building an agent that analyzed DeFi pools across multiple chains. Around step 40, the responses started getting weird. Not wrong exactly, but the agent was "forgetting" constraints I'd set in the system prompt. The system prompt was still technically in the window, but buried under 150,000 tokens of tool results and reasoning. That's when I learned the difference between being in the context window and being attended to.
The "lost in the middle" problem
Even before hitting the hard limit, there's a subtler issue that tripped me up. Models don't use long contexts uniformly. Research from Liu et al. (published in TACL 2024) found that LLM performance follows a U-shaped curve: information at the beginning and end of context gets high attention, but information buried in the middle gets missed.
Chroma Research tested 18 frontier models from Anthropic, OpenAI, Google, and Alibaba and found every single one exhibited what they call "context rot." Performance started degrading at just 2,500 words of context. By 5,000-10,000 words, some tasks showed severe accuracy drops. One finding that surprised me: models performed better on shuffled, incoherent text than on logically structured documents. The structural patterns in coherent text apparently interfere with the attention mechanism.
What this means in practice: a model with a 200K context window doesn't give you 200K tokens of reliable working memory. Effective context is always shorter than the number on the marketing page, and there's no standardized benchmark for the gap.
Five strategies for managing context window limits
The good news is that context management is a solved problem if you're deliberate about it. Here are the five approaches that work, with honest tradeoffs.
Tool use instead of data dumping. This is the single biggest win. Instead of pasting a CSV of 100 crypto prices into the prompt (~1,844 tokens of stale data), the agent calls an API and gets exactly what it needs, current to the second. A getTickersById call costs 233 tokens for live data.
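As a concrete sketch, here's the same idea against CoinPaprika's public REST API, the HTTP counterpart of the getTickersById MCP tool. The field access assumes the v1 ticker schema:

```python
# Fetch one live ticker on demand instead of pasting a stale CSV.
import requests

def get_ticker(coin_id: str) -> dict:
    url = f"https://api.coinpaprika.com/v1/tickers/{coin_id}"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json()

btc = get_ticker("btc-bitcoin")
# Only this small, current JSON response enters the context window.
print(btc["quotes"]["USD"]["price"])
```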
RAG (retrieval-augmented generation). Store your knowledge in a vector database, retrieve only relevant chunks at query time. This scales to arbitrarily large knowledge bases without touching context limits. Fair warning though: RAG isn't a five-minute setup. You need an embedding pipeline, a vector store, retrieval tuning, and ongoing quality monitoring. The payoff is worth it, but don't underestimate the operational overhead.
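A minimal retrieval loop, sketched here with Chroma as the vector store (any vector database works; the documents and query are placeholders):

```python
# Store chunks once, retrieve only what's relevant at query time.
import chromadb

client = chromadb.Client()  # in-memory; use a persistent client in production
docs = client.create_collection("knowledge")

docs.add(
    ids=["chunk-1", "chunk-2"],
    documents=[
        "getTickersById returns price, volume, and market cap for one coin.",
        "llms.txt is a lightweight index of a site's documentation.",
    ],
)

# Only the best-matching chunk enters the prompt, not the whole corpus.
hits = docs.query(query_texts=["How do I get one coin's price?"], n_results=1)
print(hits["documents"][0][0])
```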
Summarization and compression. When a conversation nears its limit, compress the history into a summary and continue from there. MemGPT-style systems, where the model manages its own memory, reduced total token consumption by 84% in a 100-turn dialogue test. The tradeoff: summarization is itself an LLM call, adding 1-3 seconds of latency and its own token cost.
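The core of the pattern is a threshold check. In this sketch, llm and count_tokens are hypothetical stand-ins for your model client and tokenizer:

```python
# Compress older history once the window is ~80% full, keeping the
# most recent turns verbatim. `llm` and `count_tokens` are placeholders.
LIMIT = 200_000
COMPRESS_AT = 0.8
KEEP_RECENT = 5

def maybe_compress(history: list[str], llm, count_tokens) -> list[str]:
    used = sum(count_tokens(turn) for turn in history)
    if used < COMPRESS_AT * LIMIT:
        return history
    # Summarization is itself an LLM call: extra latency and token cost.
    summary = llm(
        "Summarize this conversation. Keep every constraint, decision, "
        "and open task:\n" + "\n".join(history[:-KEEP_RECENT])
    )
    return [f"[Summary of earlier turns] {summary}"] + history[-KEEP_RECENT:]
```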
Persistent memory. The agent writes structured notes to external storage and retrieves them as needed. Anthropic demonstrated this with an agent playing Pokemon that maintained precise tallies across thousands of game steps without the full history in context. This is the only approach that enables multi-session coherence.
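A bare-bones version is structured notes on disk. The file path and schema here are illustrative, not a standard:

```python
# The agent writes facts outside the context window and reads them back
# later, even in a different session.
import json
from pathlib import Path

MEMORY = Path("agent_memory.json")

def remember(key: str, value) -> None:
    notes = json.loads(MEMORY.read_text()) if MEMORY.exists() else {}
    notes[key] = value
    MEMORY.write_text(json.dumps(notes, indent=2))

def recall(key: str):
    return json.loads(MEMORY.read_text()).get(key) if MEMORY.exists() else None

remember("pools_scanned", 412)  # survives truncation and restarts
print(recall("pools_scanned"))  # 412
```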
Lightweight indexes instead of full documents. DexPaprika's llms-full.txt contains their complete documentation: ~85,000 tokens, roughly 42% of a 200K context window consumed before the agent does anything. Their llms.txt index: ~3,850 tokens. That's a 95% reduction. The index tells the agent what's available; the agent fetches full details only when needed. This lazy loading pattern is the most context-efficient way to give agents access to large doc sets.
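In code, lazy loading is two steps: hold the small index in context, fetch full pages only on demand. The URL below follows the llms.txt convention, but the exact path is an assumption:

```python
# Keep the ~3,850-token index in context; fetch pages only when needed.
import requests

INDEX_URL = "https://dexpaprika.com/llms.txt"  # assumed location

def load_index() -> str:
    return requests.get(INDEX_URL, timeout=10).text

def fetch_page(url: str) -> str:
    # Called only when the agent decides a listed page is relevant,
    # so the ~85,000-token llms-full.txt never loads all at once.
    return requests.get(url, timeout=10).text

index = load_index()  # this small map is all that stays resident
```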
Context efficiency in practice: the tool use advantage
The argument for tool use and MCP isn't just about token counts. It's three things that compound.
Recency. Crypto prices change every second. Data pasted into a prompt is stale immediately. A tool call to DexPaprika's MCP server returns the current price at the moment the agent needs it.
Selectivity. An agent analyzing a single crypto token doesn't need data on 27 million of them sitting in context. It calls getTokenDetails for exactly what it needs. The other 26,999,999 never enter the window.
Composability. CoinPaprika's MCP server (23 free tools for centralized exchange data) and DexPaprika's (14 tools for DEX data) let an agent query both without either dataset permanently occupying context. The tool definitions load upfront (~1,000 tokens per server), but actual data flows in and out as needed.
The agents.dexpaprika.com hub serves skill files of ~40 lines each, giving agents the minimum viable context to start working. Compare that to loading DexPaprika's full docs into context: 85,000 tokens vs 3,850. That 95% difference is the entire working space for a complex agent task.
Frequently asked questions
Q: What's a token in simple terms?
A: A token is a chunk of text the model processes as one unit, usually 3-4 characters. "Hello world" is 2 tokens. A rough rule: 1,000 English words equals about 1,300-1,500 tokens. Code and non-English text use more tokens per word.
Q: Why does my AI forget what I said earlier?
A: Because your earlier messages were pushed out of the context window. The model can only "see" a fixed number of tokens at once. When the conversation exceeds that limit, old messages are truncated or summarized. The model doesn't choose to forget. It architecturally can't access what's outside the window.
Q: Is a bigger context window always better?
A: No. Chroma Research found all 18 models they tested showed performance degradation as context grew, starting at just 2,500 words. More context also means more cost and more latency. The "lost in the middle" problem means information buried in large contexts gets missed even when it's technically in the window.
Q: How do AI agents manage context during long tasks?
A: Through a combination of tool use (fetching data on demand instead of loading it all), summarization (compressing conversation history), persistent memory (writing notes to external storage), and RAG (retrieving relevant knowledge chunks). The best agent systems use all four approaches together.
Q: Does the Model Context Protocol (MCP) help with context efficiency?
A: Yes. MCP servers let agents call tools on demand instead of loading data into the prompt. A typical tool call (CoinPaprika's getTickersById, for example) costs ~233 tokens for live crypto data. Pasting equivalent data as text costs 5-10x more and goes stale immediately. The tool definitions themselves are compact at ~1,000 tokens per server.
Q: Will context windows keep getting bigger?
A: They already have, from 4K tokens (GPT-3.5 in 2023) to 10 million (Llama 4 Scout) in under three years. But bigger windows don't solve context rot. The research consistently shows models use long contexts less effectively than short ones. Better context management matters more than bigger windows.
What to remember about context windows
Key takeaways
- Every token counts. System prompts, tool definitions, conversation history, and tool results all share one budget. An agent doing 30-50 steps can burn through 200K tokens in a single session, which is why context management isn't optional for agent builders.
- Don't trust the marketing numbers. All 18 frontier models tested by Chroma showed performance degradation starting at just 2,500 words. Effective context is always shorter than nominal context, and the gap is poorly measured.
- Tool use is the highest-leverage context strategy. Call an API for current data instead of pasting stale CSVs. MCP makes this the default pattern for agent architectures.
- For how agents call external tools, see our guide on tool use. For the retrieval approach to context management, see what is RAG.