prompt-cache-optimizer - SKILL.md Agent Skill

name: prompt-cache-optimizer
description: "Optimize token usage through prompt caching and compression"
trigger: "auto"
priority: 2

Can reduce token costs by 50-90% through semantic caching and prompt compression, depending on how often queries repeat and how compressible the context is.

Query → Embedding → Similarity Search → Cache Hit/Miss
         ↓              ↓
    Vector Store    Return cached or call LLM

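A cache hit skips the LLM call entirely, so it saves 100% of that request's LLM tokens and returns almost instantly; only the embedding and similarity lookup are still paid.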
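The flow above can be sketched in a few lines. This is a minimal illustration, not GPTCache's API: `SemanticCache`, `embed`, `cosine`, and the 0.9 threshold are hypothetical names and values, and the bag-of-words `embed` is only a stand-in for a real embedding model.

```python
import math
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a cached answer when a similar query was seen, else call the LLM."""

    def __init__(self, llm: Callable[[str], str], threshold: float = 0.9):
        self.llm = llm                # assumed: any callable mapping prompt -> answer
        self.threshold = threshold    # assumed cutoff; tune per workload
        self.entries: list[tuple[Counter, str]] = []  # in-memory "vector store"
        self.hits = 0
        self.misses = 0

    def query(self, text: str) -> str:
        vec = embed(text)
        # Linear scan for the nearest stored query; a real store would index this.
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            self.hits += 1
            return best[1]            # cache hit: no LLM tokens spent
        self.misses += 1
        answer = self.llm(text)       # cache miss: pay for the LLM call once
        self.entries.append((vec, answer))
        return answer
```

The threshold trades savings against correctness: a lower value yields more hits but raises the risk of returning an answer to a different question.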

Mitigate the "lost in the middle" problem with a tiered memory hierarchy:

Working Memory (registers)  → Always in context
FIFO Queue (L1/L2 cache)    → Recent exchanges
Archival Memory (disk)      → Semantic search only
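A rough sketch of the three tiers, assuming exchanges are plain strings. `TieredMemory` and its methods are hypothetical names, not MemGPT's interface, and the keyword-overlap search is only a placeholder for real semantic search.

```python
from collections import deque


class TieredMemory:
    """Working memory + bounded FIFO of recent exchanges + searchable archive."""

    def __init__(self, fifo_size: int = 4):
        self.working: list[str] = []   # always placed in context
        self.fifo = deque()            # recent exchanges, oldest first
        self.fifo_size = fifo_size     # assumed capacity, in exchanges
        self.archive: list[str] = []   # reached only through search

    def add_exchange(self, text: str) -> None:
        """Append an exchange; anything pushed out of the FIFO is archived."""
        self.fifo.append(text)
        while len(self.fifo) > self.fifo_size:
            self.archive.append(self.fifo.popleft())

    def search_archive(self, query: str, k: int = 2) -> list[str]:
        """Placeholder for semantic search: rank archived items by shared words."""
        terms = set(query.lower().split())
        scored = [(len(terms & set(doc.lower().split())), doc) for doc in self.archive]
        matches = [s for s in scored if s[0] > 0]
        matches.sort(key=lambda s: s[0], reverse=True)
        return [doc for _, doc in matches[:k]]

    def build_context(self, query: str) -> list[str]:
        """Working memory first, then archive hits, then recent exchanges last."""
        return self.working + self.search_archive(query) + list(self.fifo)
```

Placing working memory at the start and recent exchanges at the end keeps the most important material away from the middle of the prompt, where recall tends to be weakest.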

Context Size	Latency Need	Accuracy Need	Strategy
<10K tokens	Any	Any	No compression
10K-50K	Low	High	Light (2-3x)
10K-50K	High	Medium	Moderate (5-7x)
50K-100K	Any	Medium	Aggressive (10-20x)
>100K	Any	Any	Hierarchical + Aggressive
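The table can be read as a lookup function. The sketch below is one possible encoding; `choose_strategy` is a hypothetical name, the boundary handling (50K falls in the 10K-50K band, 100K in the 50K-100K band) is an assumption, and combinations the table does not list return `None` rather than a guessed strategy.

```python
from typing import Optional


def choose_strategy(tokens: int, latency_need: str, accuracy_need: str) -> Optional[str]:
    """Map (context size, latency need, accuracy need) to a compression strategy.

    latency_need and accuracy_need are lowercase labels such as "low",
    "medium", "high". Returns None for combinations the table leaves open.
    """
    if tokens < 10_000:
        return "No compression"                # any latency, any accuracy
    if tokens > 100_000:
        return "Hierarchical + Aggressive"     # any latency, any accuracy
    if tokens <= 50_000:                       # 10K-50K band (assumed inclusive)
        if latency_need == "low" and accuracy_need == "high":
            return "Light (2-3x)"
        if latency_need == "high" and accuracy_need == "medium":
            return "Moderate (5-7x)"
        return None                            # not covered by the table
    if accuracy_need == "medium":              # 50K-100K band, any latency
        return "Aggressive (10-20x)"
    return None                                # not covered by the table
```

Returning `None` for unlisted rows makes the gaps explicit, so a caller has to pick a fallback deliberately.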

For streaming or long sessions, preserve the first 4 tokens as attention sinks:

[attention_sinks (4 tokens)] + [rolling_window (window - 4)]

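This keeps generation stable over arbitrarily long streams. It does not extend the model's effective context: at any step the model attends only to the sinks and the rolling window.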
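The layout above reduces to a slicing rule. The following sketch applies it to a list of token IDs for illustration; `sink_window` is a hypothetical helper, and real implementations apply the same eviction policy to the KV cache rather than to raw token lists.

```python
def sink_window(tokens: list[int], window: int, n_sinks: int = 4) -> list[int]:
    """Keep the first n_sinks tokens plus the most recent (window - n_sinks)."""
    if window <= n_sinks:
        raise ValueError("window must be larger than the number of sink tokens")
    if len(tokens) <= window:
        return list(tokens)                    # nothing to evict yet
    recent = tokens[-(window - n_sinks):]      # rolling window of newest tokens
    return tokens[:n_sinks] + recent           # sinks stay pinned at the front


# Usage: a 20-token stream squeezed into an 8-token budget.
stream = list(range(20))
kept = sink_window(stream, window=8)
```

Everything between the sinks and the rolling window is dropped, so any information that must survive needs to be stored elsewhere, for example in archival memory.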

Hybrid = Dense (semantic) + Sparse (BM25)
Fusion = Reciprocal Rank Fusion (RRF)

This can achieve a 50-100x reduction in the documents passed to the model while maintaining relevance.
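Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over every ranking it appears in. A minimal sketch, assuming the dense and sparse retrievers each return an ordered list of document IDs; `reciprocal_rank_fusion` is a hypothetical name and k = 60 is a commonly used default, not a value taken from this skill.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):   # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Usage: fuse a dense (semantic) ranking with a sparse (BM25) ranking.
dense = ["a", "b", "c"]
sparse = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense, sparse])
```

Because RRF uses only ranks, the dense and sparse scores never need to be normalized onto a common scale.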

Based on LLMLingua, GPTCache, MemGPT, and StreamingLLM research.