name: context-engineering description: | Context window management, token optimization, and memory patterns for efficient multi-agent systems.
Use when: optimizing token usage in an agentic pipeline, designing memory scope for short / long-term / episodic state, or applying a context-loading strategy (anticipatory / JIT / hybrid). portability: portable phase_relevance: ["design", "review"] archetype_relevance: ["*"]
Context Engineering
Techniques for managing context windows, optimizing token usage, and designing efficient memory systems for agentic applications.
Context Window Fundamentals
Context Window: Maximum tokens an LLM can process in a single request (input + output).
Limits vary by provider and model. Check the active model's documentation for the exact value.
Token Efficiency Matters:
- Cost: Charged per token (input + output)
- Latency: More tokens = slower response
- Quality: Irrelevant context can confuse model
State Management Patterns
| Pattern | Use when | Pros | Cons |
|---|---|---|---|
| Shared | Agents need synchronized view | Consistency, simple coordination | Contention, single point of failure |
| Isolated | Agents operate independently | No contention, parallel execution | Inconsistency possible, harder to coordinate |
| Checkpointed | Long-running processes, need recovery | Fault tolerance, replayability | Storage overhead, consistency complexity |
Token Optimization Techniques
1. Aggressive Summarization
Compress old context into summaries to reduce token usage.
2. Selective Context Loading
Only load relevant context based on the current task.
3. Structured Compression
Use JSON/structured formats instead of prose to reduce tokens.
Example:
- Before: "The user's name is John Smith..." (verbose)
- After:
{"name": "John Smith", ...}(compact)
4. Lazy Loading
Load details only when explicitly needed.
5. Reference Instead of Embedding
Reference external documents instead of embedding full text.
See refs/selective-loading.md and refs/caching-and-optimization.md for code examples and detailed strategies.
Memory Patterns
| Memory | Scope | Size | Retention |
|---|---|---|---|
| Short-term (working) | Current session/task | 1K-10K tokens | Minutes to hours |
| Long-term | Cross-session, permanent | Unbounded (vector DB) | Days to forever |
| Episodic | Historical events | Summaries stored | Varies by importance |
See refs/compression-techniques.md for implementation patterns.
Prompt Engineering for Agents
Role Definition
Be specific about agent's role and boundaries.
Example:
You are a Python code reviewer specializing in security.
Your job is to identify security vulnerabilities.
You do NOT review style or performance.
Task Specification
Clear, actionable instructions with explicit format.
Bad: "Review this code." Good: "Review for security: 1) SQL injection 2) Input validation 3) Secrets. Output: JSON with vulnerabilities."
Format Control
Specify exact output format to reduce tokens.
Few-Shot Examples
Show examples for complex tasks.
See refs/selective-loading.md for detailed prompting patterns.
Context Loading Strategies
| Strategy | Pros | Cons |
|---|---|---|
| Anticipatory | Faster response time (load before needed) | May load unnecessary data |
| Just-in-Time (JIT) | Minimal token usage (load only when needed) | Latency on each request |
| Hybrid | Balanced (core context + JIT for task-specific) | More complex implementation |
Cost Modeling
Token Cost Calculation
Track input and output tokens separately. Rates vary by model (typically $0.003-0.075 per 1K tokens).
Budget Enforcement
Set hard token limits per agent/session to prevent runaway costs.
Multi-Agent Cost Attribution
Track costs per agent to identify expensive components.
See refs/cost-calculation-budget.md and refs/cost-optimization-reporting.md for detailed cost strategies.
Context Window Strategies by Agent Pattern
Sequential Pattern: Pass only output of previous agent, not entire chain.
Hierarchical Pattern: Parent gets summaries from children, children get only relevant task context.
Collaborative Pattern: Shared context (compressed), each agent adds only delta.
Autonomous Pattern: Minimal shared context, isolated context per agent.
Quick Wins
- Compress old messages: Summarize history > 20 messages
- Use structured outputs: JSON instead of prose
- Lazy load details: Only when needed
- Set token budgets: Hard limits per agent/session
- Monitor token usage: Track and optimize high-cost agents
References
refs/compression-techniques.md- Conversation summarization, deduplication, entity compressionrefs/selective-loading.md- Relevance filtering, time decay, token-budgeted retrievalrefs/caching-and-optimization.md- Prompt caching, semantic caching, batching, cost-aware model selectionrefs/cost-calculation-budget.md- Token pricing, cost calculation, budget managementrefs/cost-optimization-reporting.md- Cost estimation, optimization strategies, reporting