name: clsa-cross-layer-sparse-attention
description: Cross-layer sparse attention sharing routing index across decoder layers for 7.6x decoding speedup and 17.1x throughput improvement at 128K context
version: 1.0.0
category: ai_collection
tags: [deep-learning, transformer, attention, efficiency, long-context]
arxiv: 2606.06467v1
paper_title: "You Only Index Once: Cross-Layer Sparse Attention with Shared Routing"
authors: ["Yutao Sun", "Yanqi Zhang", "Li Dong", "Jianyong Wang", "Furu Wei"]
published: 2026-06-04
activation_keywords: [sparse attention, cross-layer, KV-sharing, routing index, long-context, decoding efficiency, YOCO]
CLSA: Cross-Layer Sparse Attention
Core Innovation
Share the routing index, not just the KV cache, across cross-decoder layers: token-level top-k selection is computed once and then reused by every cross-decoder layer.
Problem Addressed
Existing sparse attention trade-offs:
- Block sparse: Strong acceleration, noticeable quality loss
- Token sparse: More accurate, limited speedup (expensive routing)
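The difference between the two granularities can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's method: the precomputed relevance scores, the mean-pooled block scores, and the names `token_topk` and `block_topk` are all assumptions chosen for clarity.

```python
import numpy as np

def token_topk(scores, k):
    """Token-sparse: keep the k highest-scoring individual positions."""
    return np.sort(np.argsort(scores)[-k:])

def block_topk(scores, k, block_size):
    """Block-sparse: keep whole blocks, ranked by mean score (assumed pooling)."""
    n_blocks = len(scores) // block_size
    block_scores = scores[: n_blocks * block_size].reshape(n_blocks, block_size).mean(axis=1)
    n_keep = max(1, k // block_size)
    keep = np.sort(np.argsort(block_scores)[-n_keep:])
    # Expand each kept block into its token positions
    return (keep[:, None] * block_size + np.arange(block_size)).ravel()

# Hypothetical scores: one very relevant token (position 0) sits in a weak block
scores = np.array([9.0, 0.1, 0.2, 0.3,   # block 0
                   3.0, 3.1, 3.2, 3.3])  # block 1

print(token_topk(scores, k=4))                # [0 5 6 7]
print(block_topk(scores, k=4, block_size=4))  # [4 5 6 7]
```

With the same budget of four tokens, token-level selection keeps position 0 while block-level selection drops it, because the block containing it has a lower average score. The price is that token-level routing has to rank every position individually.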
Methodology
Architecture (built on YOCO)
Cross-decoder layers:
├── Shared KV cache (YOCO base)
├── Shared routing index (CLSA innovation)
│   └── Compute once → reuse across layers
└── Token-level top-k selection
Key Benefits
- Index once, reuse everywhere: Amortize routing overhead
- Fine-grained selectivity: Preserve token-sparse accuracy
- Joint optimization: Pre-filling + KV-cache + decoding
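The amortization argument can be made concrete with a simple count. The sketch below assumes that one routing pass scores every cached token exactly once, which is an illustrative cost model rather than a figure from the paper; `routing_ops` and the layer count are hypothetical.

```python
def routing_ops(context_len, num_layers, shared_index):
    """Scoring operations needed to route one decoded token.

    Assumed cost model: one routing pass scores every cached token once.
    """
    passes = 1 if shared_index else num_layers
    return passes * context_len

# Hypothetical configuration: 128K context, 16 cross-decoder layers
per_layer = routing_ops(128 * 1024, 16, shared_index=False)
shared = routing_ops(128 * 1024, 16, shared_index=True)
print(per_layer // shared)  # 16
```

Under this model the routing cost shrinks by the number of layers that share the index. The ratio covers routing only, so it should not be read as an end-to-end speedup.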
Performance
- Decoding speedup: 7.6x at 128K context
- Throughput: 17.1x improvement at 128K context
- Quality: Maintains accuracy across benchmarks
Implementation Pattern
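import numpy as np

class CrossLayerSparseAttention:
    """Illustrative sketch: one routing index shared by all cross-decoder layers."""

    def __init__(self, keys, values, top_k_ratio=0.1):
        # YOCO-style KV cache shared by every cross-decoder layer: (N, d) arrays
        self.keys, self.values = keys, values
        self.router_index = None  # shared across layers
        self.top_k = max(1, int(top_k_ratio * len(keys)))

    def compute_routing_index(self, query):
        # Compute once for all cross-decoder layers.
        # Assumed scoring rule: query-key dot product; the paper's router may differ.
        scores = self.keys @ query
        self.router_index = np.argsort(scores)[-self.top_k:]
        return self.router_index

    def forward_layer(self, layer_idx, query):
        # The first cross-decoder layer computes the index; later layers reuse it
        if layer_idx == 0 or self.router_index is None:
            self.compute_routing_index(query)
        # Gather only the selected tokens from the shared cache
        k, v = self.keys[self.router_index], self.values[self.router_index]
        # Softmax attention restricted to the selected tokens
        logits = k @ query / np.sqrt(query.shape[-1])
        weights = np.exp(logits - logits.max())
        return (weights / weights.sum()) @ v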
Inference Optimization
Bottleneck Resolution
Traditional sparse attention:
├── Pre-filling: O(N) routing per layer
├── KV-cache: O(N) storage
└── Decoding: O(N) routing overhead
CLSA optimization:
├── Pre-filling: Single routing computation
├── KV-cache: Shared storage (YOCO)
└── Decoding: Index reuse across layers
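To illustrate the KV-cache line, the following back-of-the-envelope sketch compares one cache per layer with a single cache shared by all cross-decoder layers. The model dimensions are hypothetical and the formula ignores any cache kept by the self-decoder, so treat the numbers as illustrative only.

```python
def kv_cache_bytes(context_len, num_kv_heads, head_dim, num_caches, bytes_per_value=2):
    """Bytes needed to store keys and values for `num_caches` independent caches."""
    per_token = 2 * num_kv_heads * head_dim * bytes_per_value  # 2 = keys + values
    return context_len * per_token * num_caches

# Hypothetical model: 8 KV heads of size 128, fp16 values, 128K context, 16 layers
n, layers = 128 * 1024, 16
per_layer = kv_cache_bytes(n, 8, 128, num_caches=layers)
shared = kv_cache_bytes(n, 8, 128, num_caches=1)
print(per_layer / 2**30, shared / 2**30)  # 8.0 0.5 (GiB)
```

In this toy configuration the shared cache needs 0.5 GiB instead of 8 GiB, which is the kind of saving that matters for KV-cache constrained deployment.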
Use Cases
Optimal scenarios:
- Long-context reasoning (128K+ tokens)
- Chain-of-thought generation
- Reasoning-heavy inference
- Multi-turn dialogue systems
- KV-cache constrained deployment
Best suited for:
- Models with cross-decoder architecture
- Applications requiring real-time inference
- Long-context question answering
- Document summarization and analysis
Activation
Trigger when discussing:
- Long-context LLM optimization
- Sparse attention efficiency
- KV-cache reduction strategies
- Cross-layer attention sharing
- Decoding bottleneck mitigation
- Token vs. block sparse trade-offs
Key Insight
Sharing one routing index across layers preserves fine-grained token selectivity while amortizing the routing cost: it is paid once rather than once per layer.
Related Patterns
- YOCO (KV-sharing base architecture)
- Structured block sparse methods
- Token sparse attention
- Streaming attention optimization
References
- Paper: arXiv 2606.06467v1
- Categories: cs.CL, cs.AI, cs.LG
- Base architecture: YOCO
- Key contribution: Cross-layer routing index sharing