name: "context-management" description: 'Manage LLM context windows efficiently. Use when implementing context compaction, conversation summarization, token budget management, sliding window strategies, or optimizing prompt length for cost and quality.' metadata: author: "AgentX" version: "1.0.0" created: "2025-06-15" updated: "2025-06-15" compatibility: frameworks: ["langchain", "microsoft-agent-framework", "openai", "anthropic", "azure-openai"] languages: ["python", "typescript", "csharp"]
Context Management
Purpose: Maximize the effective use of LLM context windows through compaction, summarization, and strategic token allocation.
When to Use This Skill
- Managing long conversations that exceed context window limits
- Implementing context compaction for multi-turn agent interactions
- Designing token budget allocation across system prompt, context, and history
- Building summarization pipelines for conversation history
- Optimizing prompt length for cost efficiency without quality loss
- Managing context in multi-agent handoffs
Prerequisites
- Understanding of target model's context window size
- Token counting library (tiktoken, cl100k_base, or equivalent)
- Access to LLM for summarization (can be same or cheaper model)
Decision Tree
Context getting too long?
+- Single conversation overflow?
| +- Recent messages most important? -> Sliding window
| +- Full history needed? -> Progressive summarization
| +- Mixed importance? -> Hybrid (summary + recent window)
+- Multiple data sources competing for tokens?
| +- Prioritize by relevance -> Dynamic token budgeting
| +- All required? -> Compress each source independently
+- System prompt consuming too many tokens?
| +- Load instructions on demand (progressive disclosure)
| +- Split into core (always) + situational (on-demand)
+- Multi-agent context?
| +- Full context transfer? -> Summarize before handoff
| +- Selective transfer? -> Extract relevant artifacts only
+- Cost optimization?
+- Reduce input tokens -> Compaction + caching
+- Reduce output tokens -> Constrain response format
Context Window Budgeting
Token Budget Template
Total Context Window: N tokens
|
+-- System Prompt: 10-15% (instructions, role, constraints)
+-- Retrieved Context: 30-40% (RAG chunks, documents)
+-- Conversation History: 20-30% (recent messages + summary)
+-- Current User Message: 5-10% (the actual request)
+-- Reserved for Output: 15-20% (model's response tokens)
|
= 100% allocated (MUST NOT exceed window)
Budget by Model
| Model | Context Window | Practical Budget | Notes |
|---|---|---|---|
| GPT-4o | 128K | ~100K input | Reserve 28K for output |
| Claude 3.5/4 | 200K | ~160K input | Reserve 32K for output |
| Llama 3.1 70B | 128K | ~100K input | Quality degrades past 64K |
| Gemini 2.0 | 1M+ | ~800K input | Use selectively; cost scales |
Important: "Lost in the Middle"
Models pay less attention to information in the middle of long contexts.
Placement strategy:
- Put critical info at the START (system prompt, key instructions)
- Put important context at the END (most recent, most relevant)
- Middle is for supporting detail (less critical context)
Compaction Strategies
1. Sliding Window
Keep only the last N messages. Simple but loses early context.
Strategy: Keep last K messages (e.g., K=20)
Pros: Simple, predictable token usage
Cons: Loses early conversation context
Best for: Stateless interactions, customer support
2. Progressive Summarization
Summarize older messages, keep recent ones verbatim.
Conversation Messages:
[M1, M2, M3, M4, ..., M20, M21, M22, M23, M24, M25]
|______________| |________________________|
Summarized into + Kept verbatim (recent window)
1-2 paragraphs
Result: [Summary of M1-M20] + [M21, M22, M23, M24, M25]
Implementation rules:
- MUST preserve key decisions, facts, and user preferences in summaries
- MUST trigger summarization before hitting 80% of token budget
- SHOULD use a cheaper/faster model for summarization
- SHOULD include action items and unresolved questions in summaries
- MAY use hierarchical summarization (summary of summaries)
3. Hierarchical Summarization
For very long sessions, create a tree of summaries.
Level 0: Raw messages (most recent 10)
Level 1: Summary of messages 11-50 (paragraph)
Level 2: Summary of messages 51-200 (sentence)
Level 3: Summary of messages 201+ (key facts only)
Context = Level 3 + Level 2 + Level 1 + Level 0
4. Selective Extraction
Extract only relevant information based on the current query.
Current Query: "What did we decide about the database?"
|
v
[Scan History] -> Extract messages mentioning: database, schema, migration, PostgreSQL
|
v
[Compact Context] = Extracted relevant messages + recent window
5. Entity-Based Compaction
Track entities and their latest state instead of full history.
Entity Store:
user_name: "Alice"
project: "AgentX"
decision_db: "PostgreSQL with pgvector"
decision_auth: "Entra ID + MSAL"
pending_question: "How to handle migration rollbacks?"
Context = Entity state snapshot + recent messages
Multi-Agent Context Transfer
Handoff Compaction
When transferring context between agents:
Agent A (completed work)
|
v
[Context Compactor]
|
+-- Extract: Key decisions, artifacts, requirements
+-- Remove: Internal reasoning, rejected alternatives, debugging
+-- Format: Structured handoff document
|
v
Agent B (receives compressed context)
Handoff Document Template
## Handoff: [Source Agent] -> [Target Agent]
### Context
- Issue: #{issue_number} - {title}
- Status: {current_status}
### Key Decisions
1. {decision_1}
2. {decision_2}
### Artifacts Created
- {file_path_1}: {description}
- {file_path_2}: {description}
### Requirements for Next Agent
- {requirement_1}
- {requirement_2}
### Open Questions
- {question_1}
Token Counting and Monitoring
Implementation Pattern
1. Count tokens in each context component
2. Check against budget allocations
3. If over budget:
a. Compress conversation history first (summarize)
b. Reduce retrieved context (fewer/shorter chunks)
c. Trim system prompt (progressive disclosure)
4. Log token usage for monitoring
5. Alert if consistently near limits
Token Counting Rules
- MUST count tokens before every LLM call
- MUST reserve output tokens (never use 100% for input)
- MUST account for message formatting overhead (~4 tokens per message)
- SHOULD use model-specific tokenizer (tiktoken for OpenAI, etc.)
- SHOULD log token usage per component for optimization
- MAY implement a token budget middleware that auto-compacts
Caching Strategies
| Cache Type | What | When | Benefit |
|---|---|---|---|
| Prompt Caching | System prompt + static context | Same prefix across calls | Reduced latency + cost |
| Semantic Cache | Answers for similar queries | Repeated/similar questions | Major cost savings |
| Summary Cache | Conversation summaries | Between summarization triggers | Avoid re-summarizing |
| Entity Cache | Extracted entity states | Updated incrementally | Fast context reconstruction |
Core Rules
- Budget before sending - Count tokens in every context component and verify total stays within the model window before each LLM call
- Reserve output tokens - Always reserve 15-20% of the context window for the model's response; never use 100% for input
- Critical info at edges - Place essential instructions at the start and most relevant context at the end to avoid the "lost in the middle" problem
- Summarize before overflow - Trigger progressive summarization when context reaches 80% of the token budget, not after
- Preserve decisions in summaries - Summaries MUST retain key decisions, user preferences, and unresolved questions
- Model-specific tokenizer - Use the correct tokenizer for the target model (tiktoken for OpenAI, etc.) and account for message formatting overhead
- Log token usage - Track token counts per component (system prompt, history, RAG context) so you can optimize the largest consumer
Anti-Patterns
| Anti-Pattern | Why It Is Bad | Do Instead |
|---|---|---|
| Stuffing entire history into context | Wastes tokens, degrades quality | Progressive summarization |
| No output token reservation | Response gets truncated | Reserve 15-20% for output |
| Same budget for all queries | Simple queries waste tokens | Dynamic allocation based on query complexity |
| Summarizing too aggressively | Loses critical details | Keep key decisions and entities |
| Ignoring "lost in the middle" | Model misses important info | Place critical info at start/end |
| No token monitoring | Silent quality degradation | Log and alert on token usage |
Scripts
| Script | Purpose | Usage |
|---|---|---|
scaffold-context-manager.py |
Generate context management module | python scaffold-context-manager.py --strategy progressive-summary --model gpt-4o |
Troubleshooting
| Issue | Solution |
|---|---|
| Model ignores early context | Move critical info to start or end; summarize middle |
| Summaries lose important details | Improve summary prompt; include structured extraction |
| Token count mismatch | Use model-specific tokenizer; account for message overhead |
| High latency from summarization | Use faster/cheaper model; cache summaries; batch trigger |
| Context too short after compaction | Increase budget or reduce system prompt; use progressive disclosure |
References
Related: RAG Pipelines for retrieval context | Cognitive Architecture for memory systems | Prompt Engineering for effective prompts