spikingbrain2-0-foundation-models

By hiyenwong, updated 6/3/2026

name: spikingbrain2.0-foundation-models
category: ai_collection
paper: arXiv:2604.22575v1
description: "SpikingBrain2.0 (SpB2.0) - 5B parameter brain-inspired foundation model with Dual-Space Sparse Attention (DSSA) for efficient long-context and cross-platform inference. Features INT8-Spiking coding for neuromorphic execution and FP8 for GPU acceleration. Activation: SpikingBrain2.0, DSSA, sparse attention, neuromorphic foundation model."
date: 2026-04-28

SpikingBrain2.0: Brain-Inspired Foundation Models

Overview

Paper: SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
arXiv: 2604.22575v1
Authors: Yuqi Pan, Jinghao Zhuang, Yupeng Feng, Fangzhi Zhong
Published: 2026-04-24
Categories: cs.LG (Machine Learning)

Key Contributions

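  1. Dual-Space Sparse Attention (DSSA) - Inter-layer hybrid that alternates sparse softmax attention (SSA) and sparse linear attention (SLA) layers
  2. Dual Quantization Paths - INT8-Spiking coding for neuromorphic hardware, FP8 coding for GPUs
  3. Optimized Training Pipeline - T2H (Transformer-to-Hybrid) conversion from an existing Transformer in under 7k A100 GPU hours
  4. Cross-Platform Compatibility - One model deployable on both GPUs and neuromorphic hardware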

Architecture

Dual-Space Sparse Attention (DSSA)

┌─────────────────────────────────────────────────────────────────┐
│                    DSSA Architecture                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│   Layer N      Layer N+1     Layer N+2      Layer N+3           │
│   ┌───┐        ┌───┐         ┌───┐          ┌───┐              │
│   │SSA│───────▶│SLA│────────▶│SSA│─────────▶│SLA│              │
│   └───┘        └───┘         └───┘          └───┘              │
│     │            │             │              │                  │
│     └────────────┴─────────────┴──────────────┘                  │
│                  Inter-Layer Hybrid                              │
│                                                                   │
│  SSA = Sparse Softmax Attention (MoBA)                          │
│  SLA = Sparse Linear Attention (SSE)                            │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘
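
The alternation in the diagram can be written down as a per-layer schedule. The following is a minimal sketch, assuming a strict one-to-one SSA/SLA alternation that starts with SSA, as drawn above. The function name `dssa_layer_schedule` and its `pattern` parameter are illustrative and not part of any released API, and the real model may use a different ratio.

```python
from collections import Counter


def dssa_layer_schedule(num_layers, pattern=("SSA", "SLA")):
    """Return the attention type of each layer, cycling through `pattern`."""
    if num_layers < 0:
        raise ValueError("num_layers must be non-negative")
    return [pattern[i % len(pattern)] for i in range(num_layers)]


schedule = dssa_layer_schedule(6)
print(schedule)           # attention type for layers 0..5
print(Counter(schedule))  # how many layers of each type
```

Passing a longer pattern, for example `("SSA", "SLA", "SLA")`, would describe a different mixing ratio without changing the function.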

Components

1. Sparse Softmax Attention (SSA)

  • Based on MoBA (Mixture of Block Attention)
  • Keeps softmax attention, applied only to the selected key/value blocks
  • Sparse block-wise computation
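
To make the block-wise computation concrete, here is a minimal single-head NumPy sketch in the spirit of MoBA: keys are grouped into fixed-size blocks, each query scores the earlier blocks by their mean-pooled key, and softmax runs only over the query's own block plus the top-scoring earlier blocks. The function name, `block_size`, and `top_k` are illustrative assumptions, and this loop-based version is meant for reading, not as SpB2.0's actual kernel.

```python
import numpy as np


def block_sparse_attention(q, k, v, block_size=4, top_k=2):
    """Causal attention where each query attends to at most `top_k` key blocks."""
    n, d = q.shape
    num_blocks = (n + block_size - 1) // block_size
    # One summary vector per block: the mean of the keys it contains.
    block_means = np.stack(
        [k[b * block_size:(b + 1) * block_size].mean(axis=0) for b in range(num_blocks)]
    )
    out = np.zeros((n, v.shape[1]))
    for t in range(n):
        current = t // block_size
        selected = {current}  # the query's own block is always kept
        if current > 0 and top_k > 1:
            # Score only blocks that lie entirely in the past.
            scores = block_means[:current] @ q[t]
            best = np.argsort(scores)[::-1][: top_k - 1]
            selected.update(int(b) for b in best)
        # Gather key positions from the selected blocks, respecting causality.
        idx = np.array(
            [j for b in sorted(selected)
             for j in range(b * block_size, min((b + 1) * block_size, n))
             if j <= t]
        )
        logits = k[idx] @ q[t] / np.sqrt(d)
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        out[t] = weights @ v[idx]
    return out


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
out = block_sparse_attention(q, k, v, block_size=4, top_k=2)
print(out.shape)
```

When `top_k` is at least the number of blocks, every earlier block is selected and the result matches dense causal softmax attention, which is a convenient correctness check.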

2. Sparse Linear Attention (SLA)

  • Based on SSE (Sparse State Expansion)
  • Linear complexity, O(n) in sequence length
  • Efficient for very long sequences
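
The O(n) claim follows from replacing the attention matrix with a fixed-size running state. The sketch below shows plain causal linear attention with an `elu(x) + 1` feature map, which is a generic textbook formulation and not the SSE mechanism itself. All names are illustrative.

```python
import numpy as np


def feature_map(x):
    """elu(x) + 1: a positive feature map commonly used for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v):
    """Causal linear attention computed as a recurrence over tokens."""
    n, d = q.shape
    dv = v.shape[1]
    state = np.zeros((d, dv))  # running sum of outer(phi(k_j), v_j)
    norm = np.zeros(d)         # running sum of phi(k_j)
    out = np.zeros((n, dv))
    for t in range(n):
        phi_k = feature_map(k[t])
        state += np.outer(phi_k, v[t])
        norm += phi_k
        phi_q = feature_map(q[t])
        out[t] = (phi_q @ state) / (phi_q @ norm)
    return out


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((12, 4)) for _ in range(3))
print(linear_attention(q, k, v).shape)
```

Each step touches only `state` and `norm`, whose sizes do not depend on the sequence length, so time grows linearly with the number of tokens and the memory for the state stays constant.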

Dual Quantization Strategy

┌──────────────────────────────────────────────────────────────┐
│               Dual Quantization Paths                         │
├──────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────────┐        ┌─────────────────┐              │
│  │   INT8-Spiking  │        │      FP8        │              │
│  │   Coding        │        │    Coding       │              │
│  ├─────────────────┤        ├─────────────────┤              │
│  │ • Event-driven  │        │ • GPU optimized │              │
│  │ • 64.31% sparse │        │ • 2.52x speedup │              │
│  │ • Neuromorphic  │        │ • A100/H100     │              │
│  │ • 70.6% area ↓  │        │ • Tensor cores  │              │
│  │ • 46.5% power ↓ │        │ • bf16/fp16     │              │
│  └────────┬────────┘        └────────┬────────┘              │
│           │                          │                        │
│           ▼                          ▼                        │
│  ┌─────────────────┐        ┌─────────────────┐              │
│  │  Neuromorphic   │        │    GPU          │              │
│  │  Hardware       │        │  (vLLM)         │              │
│  └─────────────────┘        └─────────────────┘              │
│                                                               │
└──────────────────────────────────────────────────────────────┘
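
The coding scheme behind the INT8-Spiking path is not spelled out here, so the following is only a rough sketch under stated assumptions: activations are quantized symmetrically to signed 8-bit integers, each integer is read as a signed spike count, and zeros produce no events. The function names and the per-tensor scale are hypothetical choices for illustration.

```python
import numpy as np


def int8_spike_encode(x, scale=None):
    """Quantize activations to signed INT8 values, read here as spike counts."""
    max_abs = float(np.max(np.abs(x)))
    if scale is None:
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
    counts = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return counts, scale


def spike_decode(counts, scale):
    """Map spike counts back to approximate activation values."""
    return counts.astype(np.float32) * scale


def sparsity(counts):
    """Fraction of positions that emit no spikes at all."""
    return float(np.mean(counts == 0))


x = np.array([0.0, 0.003, -0.5, 1.27, 0.0, -0.004])
counts, scale = int8_spike_encode(x)
print(counts, sparsity(counts))
```

Small activations round to zero and drop out entirely, which is where event-driven hardware can save work. A reported sparsity figure such as 64.31% would correspond to the value returned by `sparsity` measured over real activations.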

Training Pipeline

Transformer-to-Hybrid (T2H) Conversion

┌──────────────────────────────────────────────────────────────┐
│                  T2H Pipeline                                 │
├──────────────────────────────────────────────────────────────┤
│                                                               │
│  Step 1: Base Model Selection                                 │
│  └── Qwen3-4B or other Transformer base                      │
│                                                               │
│  Step 2: Architecture Conversion                              │
│  ├── Replace full attention with DSSA blocks                 │
│  ├── Initialize SSA/SLA parameters                           │
│  └── Configure quantization paths                            │
│                                                               │
│  Step 3: Continue Pre-training                                │
│  ├── Curated open-source data                                │
│  ├── < 7k A100 GPU hours                                     │
│  └── Dual quantization training                               │
│                                                               │
│  Step 4: Fine-tuning                                          │
│  ├── LLM capabilities                                        │
│  └── VLM capabilities (SpB2.0-VL)                            │
│                                                               │
└──────────────────────────────────────────────────────────────┘

Training Efficiency

Metric                | Value
──────────────────────┼───────────────────
Training GPU Hours    | < 7k (A100)
Base Model Recovery   | Most capabilities
Long-Context Support  | > 10M tokens
TTFT Speedup (4M ctx) | 10.13x
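
A GPU-hour budget translates into wall-clock time once a cluster size is fixed. The short calculation below assumes perfect scaling across GPUs, which real training runs do not achieve, and the cluster sizes are arbitrary examples rather than the setup used for SpB2.0.

```python
def wall_clock_days(gpu_hours, num_gpus):
    """Days of wall-clock time for a GPU-hour budget, assuming perfect scaling."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return gpu_hours / num_gpus / 24.0


for num_gpus in (8, 64, 256):
    days = wall_clock_days(7_000, num_gpus)
    print(f"{num_gpus:>3} GPUs: at most {days:.1f} days")
```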

Performance Results

Speedup Comparisons

Context Length    | 250k   | 1M     | 4M     | >10M
──────────────────┼────────┼────────┼────────┼──────
Full Attention    | 1.0x   | OOM    | OOM    | OOM
SpB2.0 (FP8 GPU)  | 2.52x  | 5.8x   | 10.13x | Supported
SpB2.0 (INT8 Spk) | Event-driven, 64.31% sparsity

Neuromorphic Deployment

Metric          | Value
────────────────┼────────
Sparsity        | 64.31%
Area Reduction  | 70.6%
Power Reduction | 46.5%
Frequency       | 500 MHz
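
Sparsity matters on event-driven hardware because inactive inputs trigger no computation. The sketch below illustrates the idea with a matrix-vector product that only visits columns whose input carries at least one spike. It is a software illustration with hypothetical names, not a model of any particular chip.

```python
import numpy as np


def event_driven_matvec(weights, spikes):
    """Compute weights @ spikes, visiting only columns with nonzero spike counts."""
    out = np.zeros(weights.shape[0], dtype=np.int64)
    active = np.flatnonzero(spikes)  # indices of inputs that emit events
    for j in active:
        out += weights[:, j].astype(np.int64) * int(spikes[j])
    return out, len(active)


rng = np.random.default_rng(0)
weights = rng.integers(-127, 128, size=(4, 10)).astype(np.int8)
spikes = np.array([0, 3, 0, 0, -1, 0, 0, 2, 0, 0], dtype=np.int8)
y, columns_used = event_driven_matvec(weights, spikes)
print(columns_used, "of", spikes.size, "columns processed")
```

At 64.31% sparsity, a loop like this would visit roughly 35.69% of the columns, since the remaining inputs are zero and generate no events.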

Memory Efficiency

8x A100 (80GB) Configuration:
├── Full Attention: < 1M tokens (OOM beyond)
├── SpB2.0: > 10M tokens supported
└── vLLM integration: Efficient KV cache
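
A back-of-the-envelope calculation shows why a dense key/value cache becomes the bottleneck at these lengths. The layer count, KV head count, and head dimension below are illustrative assumptions, not SpB2.0's published configuration, and the estimate ignores weights and activations.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one key vector and one value vector per token, layer, and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Illustrative dimensions only (assumed, not taken from the paper).
dims = dict(num_layers=36, num_kv_heads=8, head_dim=128)
cluster_bytes = 8 * 80 * 10**9  # 8x A100 with 80 GB each

needed = kv_cache_bytes(10_000_000, **dims)
print(f"dense KV cache at 10M tokens: {needed / 10**9:.0f} GB")
print(f"fits in 8x80 GB: {needed <= cluster_bytes}")
```

Under these assumptions a dense 16-bit cache for 10M tokens exceeds the combined memory of the eight GPUs, so contexts of that length depend on storing or reading far less per token than full attention does.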

Implementation Guide

Model Variants

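# Assumption: the model classes are importable from the `spikingbrain` package
# (the package name used in the Neuromorphic Execution example).
from spikingbrain import SpikingBrain2_0, SpikingBrain2_0_VL

# SpB2.0-5B (Base LLM)
model = SpikingBrain2_0.from_pretrained(
    "spikingbrain2.0-5b-base",
    quantization="fp8",  # or "int8-spiking"
)

# SpB2.0-VL-5B (Vision-Language)
vl_model = SpikingBrain2_0_VL.from_pretrained(
    "spikingbrain2.0-vl-5b",
    quantization="fp8",
)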

Inference Configuration

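# Long-context inference
# Assumption: SpB2_0Config is importable from the `spikingbrain` package.
from spikingbrain import SpB2_0Config

config = SpB2_0Config(
    max_position_embeddings=10_000_000,
    attention_type="dssa",  # Dual-Space Sparse Attention
    use_sliding_window=True,
    quantization_mode="fp8",  # or "int8-spiking"
)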

# vLLM integration
from vllm import LLM
llm = LLM(
    model="spikingbrain2.0-5b",
    quantization="fp8",
    max_model_len=10_000_000
)

Neuromorphic Execution

# Neuromorphic hardware deployment
from spikingbrain import NeuromorphicRuntime

runtime = NeuromorphicRuntime(
    model_path="spikingbrain2.0-5b-int8",
    hardware="loihi2"  # or other neuromorphic chips
)

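# Event-driven inference
# `input_tokens` holds the prompt's token IDs, produced by the model's tokenizer (not shown).
output = runtime.generate(
    input_tokens,
    event_driven=True,
    sparsity_threshold=0.6431,  # equal to the reported 64.31% sparsity
)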

Applications

1. Long-Context Document Processing

  • Legal document analysis
  • Scientific literature review
  • Code repository understanding

2. Multimodal Applications

  • Video understanding
  • Multi-image reasoning
  • Visual question answering

3. Edge Deployment

  • Low-power devices
  • Real-time applications
  • Robotics

4. Research

  • Brain-inspired AI
  • Efficient transformer architectures
  • Cross-platform ML

Comparison with Alternatives

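Feature       | Full Attention | Ring Attention | SpB2.0
──────────────┼────────────────┼────────────────┼──────────────────────────
Context       | Limited        | Very long      | Very long
Memory        | O(n²)          | O(n)           | O(n)
Training Cost | High           | High           | Low (< 7k A100 GPU hours)
GPU Support   | Yes            | Yes            | Yes + FP8
Neuromorphic  | No             | No             | Yes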

Limitations

  1. 5B Parameter Scale - Smaller than frontier models
  2. Base Model Dependency - T2H conversion requires an existing pretrained Transformer (e.g., Qwen3-4B)
  3. Sparse Attention Trade-offs - Some accuracy loss relative to full attention is possible
  4. Hardware Requirements - Neuromorphic deployment needs specialized chips

Future Directions

  1. Scale Up - 10B+ parameter versions
  2. Multi-Modal Expansion - Audio, video integration
  3. Training From Scratch - Remove base model dependency
  4. Broader Hardware Support - More neuromorphic platforms

References

  1. Original paper: arXiv:2604.22575v1
  2. MoBA: Mixture of Block Attention
  3. SSE: Sparse State Expansion
  4. Qwen3: Base model architecture
  5. vLLM: Inference engine

Related Skills

  • adaptive-spiking-neurons-asn - Adaptive Spiking Neurons
  • spikingbrain2.0-foundation-models - Previous version
  • stdp-spiking-transformer-attention - Spiking Transformers
  • cognisnn-brain-inspired-snn - CogniSNN architecture

Last updated: 2026-04-28

Install via CLI
npx skills add https://github.com/hiyenwong/ai_collection --skill spikingbrain2-0-foundation-models