fundamentals

AI Agent Context Window Management: Why Your Agent Forgets Things, and Four Solutions

30-Second Version · For the impatient

An AI Agent 'forgetting things' isn't a bug — the Context Window is full. Tool return filtering is the highest-ROI solution: trim API data before it enters the LLM, keeping only fields needed for the current decision. Aave market data goes from 15,000 tokens to 200 tokens — Context usage drops 80% immediately.

Chris Vale · June 28, 2026

Full Content +

You ask an AI Agent to execute a complex 50-step DeFi strategy. By step 30, the Agent suddenly 'forgets' a decision it made at step 5, re-queries the same data, or worse, starts contradicting its own step-5 decision.

This isn't a bug, and the LLM hasn't gotten dumber. It's a Context Window management problem — when the Agent's working memory fills up, it must start 'forgetting' things. How you design the strategy for 'what to forget, what to keep' is one of the most core technical problems in Agentic systems engineering.

What Is the Context Window Management Problem

The Context Window is the upper limit of all text an LLM can 'see' in a single inference, measured in tokens. Claude Opus's Context Window is 200,000 tokens, GPT-4o is 128,000, Claude Haiku is 200,000. This sounds large, but for a long-running Agent, this limit is reached faster than you'd expect.

Why does an Agent's Context balloon so quickly? Because a running Agent's Context typically contains: System Prompt (strategy description, tool definitions, safety rules — usually 2,000-10,000 tokens); conversation history (all past rounds of user input and Agent output); tool call records (input parameters + complete API returns for every tool call); and current task working state (what the Agent has done and discovered in this task).

A DeFi strategy Agent that runs inference once per hour, with each inference adding ~2,000 tokens of Context growth (tool returns and decision records). After 200 hours (under 9 days), accumulated Context reaches 400,000 tokens — exceeding most models' limits. In practice, problems appear earlier: when Context exceeds 60-70% capacity, the LLM's 'needle in a haystack' problem starts manifesting — it begins ignoring important information in the middle of Context, only attending to the beginning and end.

The core challenge of Context Window management: you must decide 'when Context is nearly full, what to drop, what to keep' — and this decision directly affects Agent decision quality and behavioral consistency.

How Agent Context Gets Exhausted

Understanding Context consumption patterns helps you predict problems and design solutions in advance. Context has four main consumption sources:

Tool return data accumulation (the biggest Context killer): Each tool call's complete return enters Context. DeFi protocol APIs typically return large amounts of raw data — a single Aave full-market rate query may return 15,000 tokens of JSON (detailed data for dozens of assets). If you put the complete API return into Context every time, a few dozen tool calls can max out the Context. Correct approach: in tool function backend code, filter data before it enters Context — only keep fields truly needed for the current decision. The same Aave market data, if the Agent only needs USDC and ETH supply APY, pass only those two numbers (200 tokens) instead of the complete market data (15,000 tokens).

Linear conversation history growth: Each of the Agent's Thought-Action-Observation cycles leaves a record in Context. Without any management, Context grows linearly until hitting the limit. An Agent executing 48 inference cycles daily could have 50,000-100,000 tokens of conversation history after a week.

Hidden System Prompt cost: The System Prompt is fully included in Context for every inference. A System Prompt describing 10 tools, detailed safety rules, and strategy explanations may run 8,000 tokens. These 8,000 tokens are recalculated every inference — a fixed 'baseline Context cost.'

Error and retry records: When tool calls fail and retry, failed call records also stay in Context. If a tool fails 5 consecutive times, 5 failure records + 5 retry records enter Context. In some frameworks, this significantly amplifies Context size in long-running Agents.

Four Context Management Strategies

No single strategy perfectly solves all scenarios — choose an appropriate combination based on your Agent's task type:

Strategy 1: Sliding Window — simplest, suited to short-task Agents

Set a fixed rule of 'keep the last N rounds of conversation history'; older records are automatically dropped. Example: always keep the last 20 rounds + System Prompt + current tool returns; discard older history. Simple to implement — one line in LangChain (`memory = ConversationBufferWindowMemory(k=20)`). Suitable for: Agents where each task is relatively independent (e.g., each yield optimization is a fresh independent decision not requiring reference to decisions from 2 weeks ago). Limitation: if the Agent needs to reference a step-1 decision at step 21, the sliding window causes it to 'forget' that critical information.

Strategy 2: Summarization — most flexible, suited to long-task Agents

When Context reaches a threshold (e.g., 60% capacity), have another LLM (can be a cheaper small model) summarize and compress the history — replacing thousands of tokens of detailed records with a few hundred tokens of summary. Example: compress 50 rounds of detailed conversation history into 'The Agent executed 47 yield optimizations in the past 48 hours: 38 rebalances to Morpho (APY 5.1%), 9 times stayed in Aave (due to high Gas fees). Cumulative Gas spend: $42, net spread yield: +$89.' This summary is 95% smaller than the original history while preserving key decision patterns. Implementation complexity: medium. Requires designing compression rules for 'what must still be preserved after compression.'

Strategy 3: External Memory (RAG) — best for Agents needing long-term memory

Store all historical operation records in a vector database (Chroma, Pinecone, pgvector). Before each inference, use semantic search to find 'historical records most relevant to the current decision,' placing these relevant records (not the complete history) into Context. Example: Agent deciding whether to rebalance to Compound searches 'all past operations involving Compound,' finds the record '3 weeks ago, operations paused due to Compound smart contract upgrade,' and includes this critical context — even though it's beyond the sliding window. External memory gives the Agent 'selective long-term memory': not remembering everything, but being able to find the most relevant things when needed.

Strategy 4: Structured State Management — most precise, suited to complex multi-step tasks

Instead of keeping the Agent's working state in free-text conversation history, maintain a structured JSON object representing current state — only this state object (not complete history) goes into Context each inference. Example: instead of 'Agent queried Aave at 10:00, queried Morpho at 10:01, decided to rebalance at 10:02, broadcast transaction at 10:03, confirmed at 10:05' all in Context, maintain: {current_position: morpho, current_apy: 5.1, last_rebalance: 10:05, total_gas_spent: 42, operation_count: 38}. This state object has fixed size — no matter how long the Agent runs, the state object size stays essentially constant. LangGraph's design philosophy centers on structured state management.

Context Compression Implementation Methods

Translating the above strategies into executable code implementations:

Tool return filtering (highest priority, immediate results): In each tool function's return value, add a 'filter layer' to trim data before it enters LLM Context. Example: get_defi_rates() raw API return may have complete data for 50 assets, but what's passed to the LLM is only 'USDC and ETH current APY at Aave, Morpho, Compound' — 6 numbers, not 50 assets of complete JSON. This change doesn't require modifying the Agent framework, only the tool function code — the lowest-cost Context optimization with the most immediate impact, often reducing Context usage by 50-80%.

Context monitoring and triggered compression: Before each inference, calculate the current Context's token count (most LLM SDKs provide a count_tokens() function). Set a trigger threshold: when token count exceeds 60% of the model's limit, automatically trigger compression (summarization or dropping old records).

Layered Context design: Divide Context into: 'permanent layer' (System Prompt, never compressed); 'working layer' (current task working state, fixed-size structured JSON); 'history layer' (past conversation records, periodically compressed or dropped per strategy). Different layers have different compression policies, giving precise control over 'what must be kept, what can be compressed, what can be dropped.'

What This Means for Building Your Agent

Context Window management problems rarely appear in early Agent development — when you're first testing, tasks are short and Context is far from its limit. But after deploying to production and letting it run 24/7 for a few days, Context problems appear suddenly: the Agent starts making decisions that contradict its earlier consistency, starts ignoring rules you set in the System Prompt, or the LLM API starts returning 'Context exceeded' errors.

The most practical advice: add Context monitoring during development (log token count for each inference), letting you see Context growth trends before problems appear and design compression strategies in advance. The highest-priority Context optimization is tool return filtering — lowest-cost change, most significant results, often reducing Context usage by 50-80%. This is what any Onchain Agent developer should prioritize first.

Diagram

Feel free to share. Please credit the source.

Ask a Question

Useful Resources

Onchain Data / TVL → Onchain Dashboards → Block Explorer → Prices / Market Data → MCP Servers → LLM Benchmarks → Model Comparison →