spikingbrain2-0-foundation-models - SKILL.md Agent Skill

name: spikingbrain2.0-foundation-models
category: ai_collection
paper: arXiv:2604.22575v1
description: "SpikingBrain2.0 (SpB2.0) - 5B parameter brain-inspired foundation model with Dual-Space Sparse Attention (DSSA) for efficient long-context and cross-platform inference. Features INT8-Spiking coding for neuromorphic execution and FP8 for GPU acceleration. Activation: SpikingBrain2.0, DSSA, sparse attention, neuromorphic foundation model."
date: 2026-04-28

SpikingBrain2.0: Brain-Inspired Foundation Models

Overview

Paper: SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
arXiv: 2604.22575v1
Authors: Yuqi Pan, Jinghao Zhuang, Yupeng Feng, Fangzhi Zhong
Published: 2026-04-24
Categories: cs.LG (Machine Learning)

Key Contributions

Dual-Space Sparse Attention (DSSA) - Hybrid inter-layer sparse attention mechanism
Dual Quantization Paths - INT8-Spiking for neuromorphic, FP8 for GPU
Optimized Training Pipeline - T2H (Transformer-to-Hybrid) conversion
Cross-Platform Compatibility - GPU + neuromorphic hardware support

Architecture

Dual-Space Sparse Attention (DSSA)

┌─────────────────────────────────────────────────────────────────┐
│                    DSSA Architecture                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│   Layer N      Layer N+1     Layer N+2      Layer N+3           │
│   ┌───┐        ┌───┐         ┌───┐          ┌───┐              │
│   │SSA│───────▶│SLA│────────▶│SSA│─────────▶│SLA│              │
│   └───┘        └───┘         └───┘          └───┘              │
│     │            │             │              │                  │
│     └────────────┴─────────────┴──────────────┘                  │
│                  Inter-Layer Hybrid                              │
│                                                                   │
│  SSA = Sparse Softmax Attention (MoBA)                          │
│  SLA = Sparse Linear Attention (SSE)                            │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘
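The alternating layout in the diagram can be written down as a simple layer schedule. The sketch below is illustrative only: the helper name `dssa_layer_types` and the strict one-to-one alternation are assumptions read off the diagram, not a confirmed description of the released model's layer ratio.

```python
def dssa_layer_types(num_layers, pattern=("SSA", "SLA")):
    """Return the attention type assigned to each layer, repeating `pattern`.

    Hypothetical helper: the default pattern mirrors the diagram
    (SSA, SLA, SSA, SLA, ...); the real model may use a different ratio.
    """
    if num_layers < 0:
        raise ValueError("num_layers must be non-negative")
    return [pattern[i % len(pattern)] for i in range(num_layers)]


# Example: the four layers drawn in the diagram.
schedule = dssa_layer_types(4)
print(schedule)  # ['SSA', 'SLA', 'SSA', 'SLA']
```

Passing a different `pattern`, for example `("SSA", "SLA", "SLA")`, would express a schedule with more linear-attention layers than softmax layers.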

Components

1. Sparse Softmax Attention (SSA)

Based on MoBA (Mixture of Block Attention)
Maintains softmax attention benefits
Sparse block-wise computation
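To make "sparse block-wise computation" concrete, here is a minimal single-head NumPy sketch in the spirit of MoBA. It is not the SpB2.0 implementation: the function name, the mean-pooled gating score, and the `block_size` and `top_k` values are assumptions chosen for illustration. Each query always attends to its own block and to at most `top_k - 1` earlier blocks picked by the gate, with ordinary softmax applied over the selected keys.

```python
import numpy as np


def block_sparse_attention(q, k, v, block_size=4, top_k=2):
    """Causal block-sparse softmax attention for one head.

    q, k: arrays of shape (n, d); v: array of shape (n, d_v).
    Illustrative sketch only; all parameter names are hypothetical.
    """
    n, d = q.shape
    n_blocks = (n + block_size - 1) // block_size
    # Gating representation: the mean key of each block.
    block_means = np.stack([
        k[b * block_size:(b + 1) * block_size].mean(axis=0)
        for b in range(n_blocks)
    ])
    out = np.zeros_like(v, dtype=float)
    for t in range(n):
        cur = t // block_size
        selected = {cur}  # the query's own block is always kept
        if cur > 0 and top_k > 1:
            scores = block_means[:cur] @ q[t]  # score strictly earlier blocks
            best = np.argsort(scores)[::-1][:top_k - 1]
            selected.update(int(b) for b in best)
        # Collect key positions from the selected blocks, never past t.
        idx = np.array([
            j
            for b in sorted(selected)
            for j in range(b * block_size, min((b + 1) * block_size, n))
            if j <= t
        ])
        logits = (k[idx] @ q[t]) / np.sqrt(d)
        weights = np.exp(logits - logits.max())  # numerically stable softmax
        weights /= weights.sum()
        out[t] = weights @ v[idx]
    return out
```

When `top_k` is at least the number of blocks, every earlier block is selected and the result matches dense causal attention, which is a convenient sanity check for the sketch.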

2. Sparse Linear Attention (SLA)

Based on SSE (Sparse State Expansion)
Linear complexity O(n)
Efficient for very long sequences
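The O(n) claim follows from replacing the softmax with a kernel feature map, so each token only updates and reads a fixed-size running state. The sketch below shows plain causal linear attention with an `elu(x) + 1` feature map. It deliberately omits the sparse state expansion that SSE adds, and the function names are assumptions made for illustration.

```python
import numpy as np


def feature_map(x):
    """Positive feature map elu(x) + 1, applied elementwise."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def causal_linear_attention(q, k, v):
    """Causal linear attention for one head in O(n) time.

    q, k: arrays of shape (n, d); v: array of shape (n, d_v).
    The state has a fixed size (d x d_v) regardless of sequence length.
    """
    n, d = q.shape
    d_v = v.shape[1]
    state = np.zeros((d, d_v))  # running sum of phi(k_j) v_j^T
    norm = np.zeros(d)          # running sum of phi(k_j)
    out = np.zeros((n, d_v))
    for t in range(n):
        phi_k = feature_map(k[t])
        state += np.outer(phi_k, v[t])
        norm += phi_k
        phi_q = feature_map(q[t])
        out[t] = (phi_q @ state) / (phi_q @ norm)
    return out
```

Each step costs O(d · d_v) and touches no earlier tokens, which is why memory stays constant as the context grows.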

Dual Quantization Strategy

┌──────────────────────────────────────────────────────────────┐
│               Dual Quantization Paths                         │
├──────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────────┐        ┌─────────────────┐              │
│  │   INT8-Spiking  │        │      FP8        │              │
│  │   Coding        │        │    Coding       │              │
│  ├─────────────────┤        ├─────────────────┤              │
│  │ • Event-driven  │        │ • GPU optimized │              │
│  │ • 64.31% sparse │        │ • 2.52x speedup │              │
│  │ • Neuromorphic  │        │ • A100/H100     │              │
│  │ • 70.6% area ↓  │        │ • Tensor cores  │              │
│  │ • 46.5% power ↓ │        │ • bf16/fp16     │              │
│  └────────┬────────┘        └────────┬────────┘              │
│           │                          │                        │
│           ▼                          ▼                        │
│  ┌─────────────────┐        ┌─────────────────┐              │
│  │  Neuromorphic   │        │    GPU          │              │
│  │  Hardware       │        │  (vLLM)         │              │
│  └─────────────────┘        └─────────────────┘              │
│                                                               │
└──────────────────────────────────────────────────────────────┘
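One way to read the sparsity figure in the INT8-Spiking column is as the fraction of activations that quantize to zero and therefore emit no event. The sketch below illustrates that reading with plain symmetric INT8 quantization. It is an assumption-laden illustration, not the paper's coding scheme, and both function names are hypothetical.

```python
import numpy as np


def quantize_int8_symmetric(x):
    """Symmetric per-tensor INT8 quantization; returns (codes, scale)."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return codes, scale


def event_sparsity(codes):
    """Fraction of entries that are zero, i.e. that would emit no event."""
    return float(np.mean(codes == 0))


activations = np.array([0.0, 0.0, 1.27, -1.27])
codes, scale = quantize_int8_symmetric(activations)
print(codes.tolist(), event_sparsity(codes))  # [0, 0, 127, -127] 0.5
```

Under this reading, a higher zero fraction means fewer events to process on event-driven hardware, while a GPU FP8 path processes every value regardless.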

Training Pipeline

Transformer-to-Hybrid (T2H) Conversion

┌──────────────────────────────────────────────────────────────┐
│                  T2H Pipeline                                 │
├──────────────────────────────────────────────────────────────┤
│                                                               │
│  Step 1: Base Model Selection                                 │
│  └── Qwen3-4B or other Transformer base                      │
│                                                               │
│  Step 2: Architecture Conversion                              │
│  ├── Replace full attention with DSSA blocks                 │
│  ├── Initialize SSA/SLA parameters                           │
│  └── Configure quantization paths                            │
│                                                               │
│  Step 3: Continue Pre-training                                │
│  ├── Curated open-source data                                │
│  ├── < 7k A100 GPU hours                                     │
│  └── Dual quantization training                               │
│                                                               │
│  Step 4: Fine-tuning                                          │
│  ├── LLM capabilities                                        │
│  └── VLM capabilities (SpB2.0-VL)                            │
│                                                               │
└──────────────────────────────────────────────────────────────┘

Training Efficiency

Metric	Value
Training GPU Hours	< 7k A100
Base Model Recovery	Recovers most base-model capabilities
Long-Context Support	> 10M tokens
TTFT Speedup (4M ctx)	10.13x

Performance Results

Speedup Comparisons

Context Length    | 250k   | 1M     | 4M     | >10M
──────────────────┼────────┼────────┼────────┼──────
Full Attention    | 1.0x   | OOM    | OOM    | OOM
SpB2.0 (FP8 GPU)  | 2.52x  | 5.8x   | 10.13x | Supported
SpB2.0 (INT8 Spk) | Event-driven, 64.31% sparsity

Neuromorphic Deployment

Metric	Value
Sparsity	64.31%
Area Reduction	70.6%
Power Reduction	46.5%
Frequency	500 MHz

Memory Efficiency

8x A100 (80GB) Configuration:
├── Full Attention: < 1M tokens (OOM beyond)
├── SpB2.0: > 10M tokens supported
└── vLLM integration: Efficient KV cache
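The gap between the two context limits comes down to how memory scales with sequence length: a softmax-attention KV cache grows linearly with the number of tokens, while a linear-attention layer keeps a fixed-size state. The arithmetic sketch below uses hypothetical model dimensions (32 layers, 8 KV heads, head dimension 128, 2 bytes per value); they are not taken from the paper, and the function names are made up for this example.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # The factor 2 accounts for storing both keys and values.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def linear_state_bytes(layers, heads, head_dim, bytes_per_value=2):
    """Bytes for one (head_dim x head_dim) state per head per layer."""
    # Independent of the number of tokens processed.
    return layers * heads * head_dim * head_dim * bytes_per_value


GIB = 1024 ** 3
# Hypothetical dimensions, chosen only to make the scaling visible.
for tokens in (250_000, 1_000_000, 4_000_000):
    cache = kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128)
    print(f"{tokens:>9} tokens: KV cache = {cache / GIB:.1f} GiB")

state = linear_state_bytes(layers=32, heads=8, head_dim=128)
print(f"linear-attention state = {state / GIB:.4f} GiB at any length")
```

In a hybrid such as DSSA, only the linear-attention layers would get the constant-size state; sparse softmax layers in the MoBA style still keep keys and values, so the real saving depends on the layer mix.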

Implementation Guide

Model Variants

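# Assumed import path, mirroring the `spikingbrain` package used for NeuromorphicRuntime.
from spikingbrain import SpikingBrain2_0, SpikingBrain2_0_VL

# SpB2.0-5B (Base LLM)
model = SpikingBrain2_0.from_pretrained(
    "spikingbrain2.0-5b-base",
    quantization="fp8"  # or "int8-spiking"
)

# SpB2.0-VL-5B (Vision-Language); separate name so the base model is not overwritten
vl_model = SpikingBrain2_0_VL.from_pretrained(
    "spikingbrain2.0-vl-5b",
    quantization="fp8"
)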

Inference Configuration

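# Long-context inference
# Assumed import path, mirroring the `spikingbrain` package used under Neuromorphic Execution.
from spikingbrain import SpB2_0Config

config = SpB2_0Config(
    max_position_embeddings=10_000_000,
    attention_type="dssa",  # Dual-Space Sparse Attention
    use_sliding_window=True,
    quantization_mode="fp8"  # or "int8-spiking"
)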

# vLLM integration
from vllm import LLM
llm = LLM(
    model="spikingbrain2.0-5b",
    quantization="fp8",
    max_model_len=10_000_000
)

Neuromorphic Execution

# Neuromorphic hardware deployment
from spikingbrain import NeuromorphicRuntime

runtime = NeuromorphicRuntime(
    model_path="spikingbrain2.0-5b-int8",
    hardware="loihi2"  # or other neuromorphic chips
)

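# Event-driven inference
# `input_tokens` is the tokenized prompt, prepared by the caller.
output = runtime.generate(
    input_tokens,
    event_driven=True,
    sparsity_threshold=0.6431  # matches the reported 64.31% sparsity
)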
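The benefit of event-driven execution can be seen in a plain matrix-vector product: only non-zero activations ("events") need to touch the weights. The NumPy sketch below illustrates the idea on a CPU and does not use or describe the neuromorphic runtime's API; the function name `event_driven_matvec` is an assumption.

```python
import numpy as np


def event_driven_matvec(weights, activations):
    """Compute weights @ activations using only the non-zero activations.

    Returns the result and the number of events actually processed.
    """
    active = np.flatnonzero(activations)  # indices of non-zero inputs
    out = weights[:, active] @ activations[active]
    return out, int(active.size)


weights = np.arange(12.0).reshape(3, 4)
activations = np.array([0.0, 2.0, 0.0, 1.0])  # half of the inputs are silent
result, events = event_driven_matvec(weights, activations)
print(result.tolist(), events)  # [5.0, 17.0, 29.0] 2
```

The output is identical to the dense product, but the work scales with the number of events rather than the input width, which is where sparse activations translate into savings.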

Applications

1. Long-Context Document Processing

Legal document analysis
Scientific literature review
Code repository understanding

2. Multimodal Applications

Video understanding
Multi-image reasoning
Visual question answering

3. Edge Deployment

Low-power devices
Real-time applications
Robotics

4. Research

Brain-inspired AI
Efficient transformer architectures
Cross-platform ML

Comparison with Alternatives

Feature	Full Attention	Ring Attention	SpB2.0
Context	Limited	Very long	Very long
Memory	O(n²)	O(n)	O(n)
Training Cost	High	High	Low (<7k A100 GPU hours)
GPU Support	Yes	Yes	Yes + FP8
Neuromorphic	No	No	Yes

Limitations

5B Parameter Scale - Smaller than frontier models
Base Model Dependency - Requires existing Transformer
Sparse Attention Trade-offs - Minor accuracy loss possible
Hardware Requirements - Neuromorphic deployment needs specialized chips

Future Directions

Scale Up - 10B+ parameter versions
Multi-Modal Expansion - Audio, video integration
Training From Scratch - Remove base model dependency
Broader Hardware Support - More neuromorphic platforms

References

Original paper: arXiv:2604.22575v1
MoBA: Mixture of Block Attention
SSE: Sparse State Expansion
Qwen3: Base model architecture
vLLM: Inference engine

Related Skills

adaptive-spiking-neurons-asn - Adaptive Spiking Neurons
spikingbrain2.0-foundation-models - Previous version
stdp-spiking-transformer-attention - Spiking Transformers
cognisnn-brain-inspired-snn - CogniSNN architecture

Last updated: 2026-04-28