name: mast-aigv-detection-snn
description: "MAST (Multi-channel pseudo-event SNN with Adaptive Spiking Temporal integrators), the first SNN-based detector for AI-generated videos. Converts inter-frame residuals into pseudo-events processed by a spike-driven temporal branch with learnable per-channel time constants, fused with a frozen X-CLIP semantic trajectory encoder. Achieves 93.14% cross-generator accuracy on GenVideo. Activation: AI-generated video detection, SNN video detection, temporal artifact detection, pseudo-event conversion, cross-generator generalization, MAST, spike-driven temporal integration, AIGV detection, boundary-localized firing, SDT-V3."
MAST: Multi-channel pseudo-event SNN with Adaptive Spiking Temporal integrators for AIGV Detection
The first SNN-based detector for AI-generated videos. It leverages the natural alignment between the event-driven, sparse dynamics of SNNs and the sparse, boundary-localized temporal artifacts in AI-generated videos.
Metadata
- Source: arXiv:2605.05895
- Authors: Minsuk Jang, Yujin Yang, Heeseon Kim, Minseok Son, Younghun Kim, Changick Kim
- Published: 2026-05-07
- Category: cs.CV, cs.AI
- Institution: KAIST
Core Problem
Modern AI-generated videos are photorealistic at the single-frame level. Detection must rely on temporal structure — whether motion, change, and semantic evolution remain natural over time. Prior detectors degrade sharply under cross-generator evaluation, where artifact type and timescale vary across generators.
Key Insight
AI-generated videos exhibit two complementary temporal signatures:
Pixel level: Smoother frame-to-frame temporal residuals with late-accumulating residual dynamics
- Spectral centroid f_c is 3× to 13× lower for generators than real videos
- Gap widens in later frames due to error accumulation under conditional generation
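The spectral-centroid comparison above can be made concrete with a small sketch. It assumes f_c is the magnitude-weighted mean temporal frequency of a per-frame residual-energy trace after mean removal; the source does not spell out the exact definition. The function name `spectral_centroid` and the toy traces are illustrative, and a naive DFT in plain Python is used so the example runs without dependencies.

```python
import cmath
import math

def spectral_centroid(signal):
    """Magnitude-weighted mean frequency (cycles per frame) of a 1-D signal.

    Assumed definition: f_c = sum(f_k * |X_k|) / sum(|X_k|) over the
    non-negative DFT bins, computed after removing the mean.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    num = den = 0.0
    for k in range(n // 2 + 1):
        # Naive DFT for bin k
        x_k = sum(c * cmath.exp(-2j * math.pi * k * t / n) for t, c in enumerate(centered))
        mag = abs(x_k)
        num += (k / n) * mag
        den += mag
    return num / den if den > 0 else 0.0

# Toy residual-energy traces over 16 frames (hypothetical values)
smooth = [math.sin(2 * math.pi * t / 16) for t in range(16)]       # 1 cycle per clip
jittery = [math.sin(2 * math.pi * 6 * t / 16) for t in range(16)]  # 6 cycles per clip

fc_smooth = spectral_centroid(smooth)
fc_jittery = spectral_centroid(jittery)
print(f"smooth f_c={fc_smooth:.4f}, jittery f_c={fc_jittery:.4f}")
```

A smoother trace concentrates its energy at low temporal frequencies and so yields a lower centroid, which is the direction of the gap reported for generated videos.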
Semantic level: More compact trajectories in feature space
- X-CLIP trajectory convex-hull volume is 2× to 6× smaller for generators
- Angular curvature θ significantly reduced
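As a rough illustration of the angular-curvature cue, the sketch below assumes θ is the mean angle between consecutive displacement vectors of the per-frame embeddings; the source does not give the exact formula. The function name `mean_turning_angle` and the 2-D toy trajectories are hypothetical, and real X-CLIP embeddings would be much higher-dimensional.

```python
import math

def mean_turning_angle(embeddings):
    """Mean angle (radians) between consecutive displacement vectors."""
    steps = [[b - a for a, b in zip(p, q)] for p, q in zip(embeddings, embeddings[1:])]
    angles = []
    for u, v in zip(steps, steps[1:]):
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.hypot(*u) * math.hypot(*v)
        if norm == 0:
            continue  # skip degenerate (stationary) steps
        cos = max(-1.0, min(1.0, dot / norm))  # guard against rounding outside [-1, 1]
        angles.append(math.acos(cos))
    return sum(angles) / len(angles) if angles else 0.0

# Toy trajectories (hypothetical): a straight path and a path with right-angle turns
straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
zigzag = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]

theta_straight = mean_turning_angle(straight)
theta_zigzag = mean_turning_angle(zigzag)
```

Under this definition a trajectory that barely changes direction has θ near zero, consistent with the reduced curvature described for generated clips.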
Boundary-localized SNN firing: When raw video is fed to an SNN, fake clips elicit firing predominantly at object and motion boundaries, unlike real clips, which suggests that the SNN responds to temporal artifacts localized at edges
- Boundary Fire (BF) rate is at least 1.3× higher for fakes across all generators
- Interior Fire (IF) rate stays close to that of real clips; the additional firing concentrates on the boundary ring
This makes SNNs a natural choice for AIGV detection: their event-driven, sparsely-activated dynamics align with the structure of the residual signal in a way that dense ANN backbones do not.
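The boundary versus interior firing comparison can be sketched as follows. This assumes BF and IF are mean firing rates over pixels inside and outside a binary boundary-ring mask; how the mask is obtained is not described in this document, so the mask, the spike map, and the function name `fire_rates` are all hypothetical.

```python
def fire_rates(spikes, boundary_mask):
    """Return (boundary_rate, interior_rate) for a 2-D binary spike map.

    spikes and boundary_mask are equally sized lists of rows; mask entries
    are 1 on the boundary ring and 0 in the interior.
    """
    b_sum = b_cnt = i_sum = i_cnt = 0
    for s_row, m_row in zip(spikes, boundary_mask):
        for s, m in zip(s_row, m_row):
            if m:
                b_sum += s
                b_cnt += 1
            else:
                i_sum += s
                i_cnt += 1
    boundary_rate = b_sum / b_cnt if b_cnt else 0.0
    interior_rate = i_sum / i_cnt if i_cnt else 0.0
    return boundary_rate, interior_rate

# Toy 4x4 example: the outer ring is treated as the boundary
mask = [[1, 1, 1, 1], [1, 0, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1]]
spikes = [[1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1], [1, 0, 1, 0]]

bf, interior = fire_rates(spikes, mask)
print(f"BF={bf:.2f}, IF={interior:.2f}, ratio={bf / interior:.2f}")
```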
MAST Architecture
Overview
MAST combines two complementary pathways:
- Spike-Driven Temporal Branch (SDTB) — processes multi-channel temporal residuals as pseudo-events with learnable per-channel time constants
- Frozen Semantic Encoder — X-CLIP trajectory encoder for semantic-level temporal coherence
1. Pseudo-Event Front-End
Converts inter-frame residuals into spike-like events:
ΔHF_t = |Lap(Y_t) - Lap(Y_{t-1})| # High-frequency Laplacian residual
ΔSobel_t = |Sobel(Y_t) - Sobel(Y_{t-1})| # Sobel edge residual
ΔAbsDiff_t = |Y_t - Y_{t-1}| # Absolute difference
ΔDiff2_t = |Y_t - 2Y_{t-1} + Y_{t-2}| # Second-order difference
Each channel is converted to pseudo-events via soft thresholding:
E_{t,c} = σ((|ΔF_{t,c}| - c_th) / β)
where c_th = 0.10 (contrast threshold) and β = 0.025 (temperature), matching physical event camera semantics.
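To see how sharply the soft threshold separates small and large residuals, here is a minimal scalar sketch using the c_th and β values quoted above. The function name `pseudo_event` is illustrative, and plain Python is used instead of tensors.

```python
import math

C_TH = 0.10   # contrast threshold from the text
BETA = 0.025  # temperature from the text

def pseudo_event(residual, c_th=C_TH, beta=BETA):
    """Soft-thresholded event intensity in (0, 1) for one residual value."""
    return 1.0 / (1.0 + math.exp(-(abs(residual) - c_th) / beta))

for r in (0.0, 0.05, 0.10, 0.15, 0.30):
    print(f"residual={r:.2f} -> event={pseudo_event(r):.4f}")
```

A residual equal to c_th maps to 0.5, while residuals 0.05 below or above it map to roughly 0.12 and 0.88, so β = 0.025 gives a transition band about a tenth of the intensity range wide.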
Note: The chroma channel was tested but excluded because it provided no additional signal for detection.
2. Spike-Driven Temporal Branch (SDTB)
Architecture:
- PerChannelLIF — per-channel LIF with learnable τ_c and V_{th,c}
- SDT-V3 Gate — Spiking Transformer with spike separable convolutions
- MultiSpike (L=4) — output values in {0, 1, 2, 3, 4} for richer spike communication
- Spike Anomaly Gate — adaptive spike integration with learnable timescales
- Linear attention (O(N D²)) instead of full attention (O(N² D)) for efficiency
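The complexity trade-off in the last bullet can be checked with a quick count. The sketch below compares only the leading-order terms O(N² D) and O(N D²); the token count and feature width are hypothetical values chosen for illustration, not the model's actual sizes.

```python
def attention_costs(n_tokens, dim):
    """Leading-order multiply counts for the two attention forms in the text."""
    full = n_tokens ** 2 * dim    # O(N^2 D): every token attends to every token
    linear = n_tokens * dim ** 2  # O(N D^2): tokens share a D x D key-value summary
    return full, linear

# Hypothetical sizes, chosen only for illustration
full, linear = attention_costs(n_tokens=4096, dim=256)
ratio = full / linear  # equals N / D
print(f"full={full:.3e}, linear={linear:.3e}, ratio={ratio:.1f}")
```

The ratio reduces to N / D, so the linear form is cheaper whenever the number of tokens exceeds the feature dimension.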
PerChannelLIF dynamics:
v_t^{(c)} = (1 - 1/τ_c) * v_{t-1}^{(c)} + x_t^{(c)}
s_t^{(c)} = Spike(v_t^{(c)}, V_{th,c})
v_t^{(c)} ← v_t^{(c)} - s_t^{(c)} * V_{th,c} # soft reset
Parameters:
- τ_c ∈ [0.5, 20.0] — learnable time constant per channel
- V_{th,c} ∈ [0.05, 10.0] — learnable firing threshold per channel
- Both are stored in log space and recovered via exp() to keep them strictly positive
- Initialized at base values (τ_0 = 2.0, V_{th,0} = 1.0)
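The dynamics and parameterization above can be traced by hand on a single channel. The following is a scalar sketch in plain Python, not the batched module: it recovers τ and V_th from log space, clamps them to the stated ranges, and applies the leak, the multi-level spike, and the soft reset. The function name `lif_multispike` and the constant input values are illustrative.

```python
import math

def lif_multispike(inputs, log_tau, log_vth, levels=4):
    """Run one channel of the LIF update from the text on a list of inputs."""
    tau = min(max(math.exp(log_tau), 0.5), 20.0)   # recover and clamp tau
    vth = min(max(math.exp(log_vth), 0.05), 10.0)  # recover and clamp V_th
    v, spikes = 0.0, []
    for x in inputs:
        v = (1.0 - 1.0 / tau) * v + x                    # leaky integration
        s = round(min(max(v / vth, 0.0), levels + 0.5))  # integer spike count in 0..levels
        v -= s * vth                                     # soft reset
        spikes.append(s)
    return spikes

# Base values from the text: tau_0 = 2.0, V_th,0 = 1.0
weak = lif_multispike([0.2] * 8, math.log(2.0), math.log(1.0))
strong = lif_multispike([1.5] * 8, math.log(2.0), math.log(1.0))
print("weak input spikes:  ", weak)
print("strong input spikes:", strong)
```

With τ = 2 the weak input settles near v = 0.4, which never exceeds half the threshold, so the channel stays silent. This is the same condition that makes the branch prone to going silent early in training.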
3. Semantic Trajectory Encoder
- X-CLIP-B/16 (frozen) — text-aligned video encoder
- Extracts per-frame embeddings from video encoder
- Computes trajectory curvature as auxiliary cue
- Cross-frame attention provides temporal coherence at semantic level
Why X-CLIP? Its cross-frame attention mechanism computes each embedding in the temporal context of the entire clip, encoding short-range temporal coherence at the semantic level. This is complementary to the pixel-level residual dynamics captured by the SDTB.
4. Fusion & Classification
final_logit = W_t * z_t + W_s * z_s + b
p = σ(final_logit)
where z_t is the SDTB output, z_s is the semantic trajectory output, and p is the predicted probability.
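A minimal numeric sketch of the fusion rule follows. It assumes z_t and z_s are feature vectors and W_t and W_s are weight vectors producing a single scalar, which the text does not state explicitly; the 2-D features, weights, and the function name `fuse` are hypothetical.

```python
import math

def fuse(z_t, z_s, w_t, w_s, b):
    """Return (logit, probability) for the late-fusion rule in the text."""
    logit = sum(w * z for w, z in zip(w_t, z_t)) + sum(w * z for w, z in zip(w_s, z_s)) + b
    prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid maps the logit to (0, 1)
    return logit, prob

# Hypothetical 2-D features and weights
logit, prob = fuse(z_t=[1.0, -0.5], z_s=[0.2, 0.4], w_t=[0.8, 0.1], w_s=[0.5, -0.25], b=-0.75)
print(f"logit={logit:.4f}, probability={prob:.4f}")
```

Because the two branches only meet in this weighted sum, either branch can be ablated by zeroing its weights, which is one reason late fusion is convenient for comparing the SNN and semantic pathways.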
Training Details
| Parameter | Value |
|---|---|
| Optimizer | AdamW (weight decay 0.01) |
| Epochs | 10 |
| Batch size | 16 per GPU × 4 GPUs = 64 effective |
| X-CLIP LR | 1×10⁻⁵ (frozen) |
| SDTB LR | 3×10⁻⁴ |
| Label smoothing | 0.1 |
| Gradient clipping | L2 norm 1.0 |
| Surrogate | ATan (α = 2.0) |
Auxiliary Objectives
| Loss | Weight | Purpose |
|---|---|---|
| SupCon | λ = 0.3 (contrastive temperature 0.07, unrelated to the LIF time constant τ_c) | Contrastive learning for feature separation |
| SNN-only BCE | λ = 0.2 | Keeps SDTB gradient alive |
| Anomaly BCE | λ = 0.2 (margin 0.5) | Anomaly score supervision |
| Spike-rate penalty | λ = 0.01 (r* = 0.15) | Prevents SDTB silence/deadlock |
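One plausible way to combine these terms is sketched below. It assumes the main BCE has weight 1, the auxiliary losses are added with the λ values from the table, and the spike-rate penalty is the squared deviation of the mean firing rate from r*. The exact form of each loss is not given in this document, so the function names and the penalty form are assumptions.

```python
LOSS_WEIGHTS = {"supcon": 0.3, "snn_bce": 0.2, "anomaly_bce": 0.2, "rate": 0.01}
TARGET_RATE = 0.15  # r* from the table

def rate_penalty(mean_firing_rate, target=TARGET_RATE):
    """Assumed form: squared deviation of the mean firing rate from r*."""
    return (mean_firing_rate - target) ** 2

def total_loss(main_bce, supcon, snn_bce, anomaly_bce, mean_firing_rate):
    """Main BCE plus the weighted auxiliary terms (additive combination assumed)."""
    return (
        main_bce
        + LOSS_WEIGHTS["supcon"] * supcon
        + LOSS_WEIGHTS["snn_bce"] * snn_bce
        + LOSS_WEIGHTS["anomaly_bce"] * anomaly_bce
        + LOSS_WEIGHTS["rate"] * rate_penalty(mean_firing_rate)
    )

silent = rate_penalty(0.0)      # a silent branch is penalized
on_target = rate_penalty(0.15)  # no penalty at r*
example = total_loss(1.0, 1.0, 1.0, 1.0, mean_firing_rate=0.15)
```

With a penalty of this shape, a branch that stops firing receives a non-zero gradient pushing its rate back toward r*, which matches the stated purpose of the regularizer.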
Critical Training Pitfall
Under "Main BCE only" (without auxiliary losses), the SDTB goes silent in the first epoch:
- The gate bias keeps the sigmoid output below 0.15
- The L=4 MultiSpike neuron requires v/V_th > 0.5 to fire
- Xavier-initialized convolutions do not produce membrane potentials that large, so no spikes are emitted
- The leaky integrator keeps accumulating membrane potential → NaN logits → training collapse
- Under DDP with find_unused_parameters=True → extra ALLREDUCE on the NaN tensor → NCCL watchdog deadlock
Solution: The spike-rate regularizer (L_rate) forces non-zero firing rate from the first epoch, keeping SDTB gradient alive.
Results
GenVideo Cross-Generator (Pika-trained)
| Metric | Value |
|---|---|
| mACC | 93.14% (10 unseen generators) |
| mAUC | 94.95% |
This matches or surpasses the strongest ANN-based detectors.
Energy Efficiency
| Component | Ops | Energy |
|---|---|---|
| SDT-V3 (SNN gate) | 1.38 G SOPs | 1.24 mJ |
| CNN-Transformer (ANN gate) | 18.61 G MACs | 85.61 mJ |
| X-CLIP backbone (frozen) | 281.2 G FLOPs | 1293.61 mJ |
The SNN gate adds only 0.10% energy on top of the backbone, whereas the matched ANN gate adds 6.62%. At parameter parity, the SNN gate uses 69× less energy than the ANN gate.
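The figures in the table can be cross-checked with a few lines of arithmetic. The sketch assumes the operation counts are in units of 10^9 and uses per-operation energies of 0.9 pJ per synaptic operation and 4.6 pJ per MAC or FLOP. These are commonly quoted 45 nm estimates, not values stated in this document, but they reproduce the table to within rounding.

```python
E_SOP_PJ = 0.9  # assumed energy per synaptic operation (pJ)
E_MAC_PJ = 4.6  # assumed energy per MAC/FLOP (pJ)

def energy_mj(giga_ops, pj_per_op):
    """Energy in millijoules for a count given in units of 1e9 operations."""
    return giga_ops * 1e9 * pj_per_op * 1e-12 * 1e3

snn_gate = energy_mj(1.38, E_SOP_PJ)
ann_gate = energy_mj(18.61, E_MAC_PJ)
backbone = energy_mj(281.2, E_MAC_PJ)

snn_overhead_pct = 100 * snn_gate / backbone
ann_overhead_pct = 100 * ann_gate / backbone
gate_savings = ann_gate / snn_gate

print(f"SNN gate {snn_gate:.2f} mJ, ANN gate {ann_gate:.2f} mJ, backbone {backbone:.2f} mJ")
print(f"overhead: SNN {snn_overhead_pct:.2f}%, ANN {ann_overhead_pct:.2f}%, savings {gate_savings:.0f}x")
```

The comparison also shows that the frozen backbone dominates total energy, so the savings apply to the gate rather than to the full pipeline.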
GenVidBench Main Task
| Generator | Accuracy |
|---|---|
| CogVideo | 97.80% |
| Mora | 91.18% |
| HD-VG | 97.53% |
| MuseV | 55.55% |
| SVD | 41.99% |
MAST underperforms on flow-based generators (MuseV, SVD); pseudo-event residuals align better with diffusion-style temporal artifacts.
SEINE Configuration
MAST achieves 77.41% mACC when trained on SEINE, well below the Pika-trained result (93.14%), because:
- SEINE is a frame-interpolation generator, so the spike gate adapts to low spike density
- Training and test clips sit at opposite ends of the temporal-smoothness axis
- The temporal subsampling shortcut does not transfer to 24 fps test generators
Implementation Guide
Prerequisites
- PyTorch
- Pre-trained X-CLIP-B/16 (frozen)
- GenVidBench or GenVideo dataset
- 4× NVIDIA RTX 3090 GPUs (recommended)
Pseudo-Event Conversion
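import torch
import torch.nn.functional as F

# 3x3 kernels; the exact filters are an assumption, the text only names the operators
_LAP = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)
_SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
_SOBEL_Y = _SOBEL_X.transpose(2, 3).contiguous()

def luma(rgb):
    """[B, 3, H, W] RGB -> [B, 1, H, W] luma Y (BT.601 weights, assumed)."""
    w = rgb.new_tensor([0.299, 0.587, 0.114]).view(1, 3, 1, 1)
    return (rgb * w).sum(dim=1, keepdim=True)

def laplacian(y):
    return F.conv2d(y, _LAP.to(y), padding=1)

def sobel(y):
    gx = F.conv2d(y, _SOBEL_X.to(y), padding=1)
    gy = F.conv2d(y, _SOBEL_Y.to(y), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)  # gradient magnitude

def compute_pseudo_events(frames, c_th=0.10, beta=0.025):
    """Convert video frames to a pseudo-event tensor.

    frames: [B, T, 3, H, W] RGB, normalized to [0, 1]
    returns: [B, T-1, 4, H, W] pseudo-events (HF, Sobel, AbsDiff, Diff2)
    """
    B, T = frames.shape[:2]
    y = luma(frames.flatten(0, 1)).unflatten(0, (B, T))  # [B, T, 1, H, W]
    events = []
    for t in range(1, T):
        # High-frequency Laplacian residual
        hf = torch.abs(laplacian(y[:, t]) - laplacian(y[:, t - 1]))
        # Sobel edge residual
        sobel_res = torch.abs(sobel(y[:, t]) - sobel(y[:, t - 1]))
        # Absolute difference
        abs_diff = torch.abs(y[:, t] - y[:, t - 1])
        # Second-order difference (falls back to abs_diff at the first step)
        if t >= 2:
            diff2 = torch.abs(y[:, t] - 2 * y[:, t - 1] + y[:, t - 2])
        else:
            diff2 = abs_diff
        # Concatenate the four single-channel residuals: [B, 4, H, W]
        event_t = torch.cat([hf, sobel_res, abs_diff, diff2], dim=1)
        # Soft threshold to pseudo-events
        events.append(torch.sigmoid((event_t - c_th) / beta))
    return torch.stack(events, dim=1)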
PerChannelLIF
import math

import torch
import torch.nn as nn

class PerChannelLIF(nn.Module):
    def __init__(self, n_channels, tau_base=2.0, vth_base=1.0):
        super().__init__()
        # Stored in log space for positivity, initialized at the base values
        self.log_tau = nn.Parameter(torch.full((n_channels,), math.log(tau_base)))
        self.log_vth = nn.Parameter(torch.full((n_channels,), math.log(vth_base)))

    def forward(self, x):
        """x: [B, C, T] input; returns [B, C, T] spike counts in {0, ..., 4}."""
        tau = torch.exp(self.log_tau).clamp(0.5, 20.0).unsqueeze(0)   # [1, C]
        vth = torch.exp(self.log_vth).clamp(0.05, 10.0).unsqueeze(0)  # [1, C]
        B, C, T = x.shape
        v = torch.zeros(B, C, device=x.device, dtype=x.dtype)
        spikes = []
        for t in range(T):
            # LIF dynamics per channel
            v = (1 - 1 / tau) * v + x[:, :, t]
            # MultiSpike (L=4). round() has zero gradient almost everywhere, so
            # training needs a surrogate such as MultiSpikeATan (next section).
            s = torch.clamp(v / vth, 0, 4.5).round()
            spikes.append(s)
            v = v - s * vth  # soft reset
        return torch.stack(spikes, dim=2)
ATan Surrogate Gradient
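import torch

class MultiSpikeATan(torch.autograd.Function):
    """Multi-level spike round(clamp(v / vth, 0, L + 0.5)) with an ATan surrogate gradient."""

    @staticmethod
    def forward(ctx, v, vth, L=4, alpha=2.0):
        vth = torch.as_tensor(vth, dtype=v.dtype, device=v.device)
        ctx.save_for_backward(v, vth)
        ctx.L, ctx.alpha = L, alpha  # plain Python values are kept on ctx, not as saved tensors
        return torch.clamp(v / vth, 0, L + 0.5).round()

    @staticmethod
    def backward(ctx, grad_output):
        v, vth = ctx.saved_tensors
        L, alpha = ctx.L, ctx.alpha
        u = v / vth  # normalized potential; spike levels switch at u = k + 0.5
        # Sum ATan kernels centered on each half-integer threshold
        surrogate = torch.zeros_like(u)
        for k in range(L):
            surrogate = surrogate + alpha / (2 * (1 + ((u - (k + 0.5)) * alpha / 2) ** 2))
        # Clip: zero gradient outside the range 0 < u < L
        surrogate = surrogate * ((u > 0) & (u < L)).to(u.dtype)
        grad_u = grad_output * surrogate
        grad_v = grad_u / vth  # chain rule: du/dv = 1 / vth
        grad_vth = None
        if ctx.needs_input_grad[1]:
            # du/dvth = -v / vth^2, reduced back to the broadcast shape of vth
            grad_vth = (grad_u * (-v / vth ** 2)).sum_to_size(vth.shape)
        return grad_v, grad_vth, None, None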
Applications
- AI-generated video detection: Cross-generator generalization across 10+ unseen generators
- Deepfake detection: Temporal artifact detection in synthetic media
- Content authentication: Verifying video provenance and authenticity
- Neuromorphic deployment: Energy-efficient detection on edge devices
- Temporal anomaly detection: Generalizable to any domain with temporal artifacts
Advantages Over ANN-Based Detectors
| Aspect | ANN Detectors | MAST (SNN) |
|---|---|---|
| Cross-generator mACC | ~85-92% | 93.14% |
| Gate energy | 85.61 mJ/clip | 1.24 mJ/clip (69× less) |
| Pipeline overhead | +6.62% | +0.10% |
| Temporal modeling | Dense backbone | Event-driven, sparse |
| Parameter efficiency | Requires large backbones | Lightweight SDTB (9.3M params) |
Pitfalls
- Training silence/deadlock: Without spike-rate regularization, SDTB goes silent in epoch 1, leading to NaN and NCCL deadlock. Always include L_rate penalty.
- SEINE training configuration: Underperforms (77.41% vs 93.14%) due to temporal-smoothness distribution mismatch. Train on generators with diverse temporal artifacts.
- Flow-based generators: Underperforms on MuseV/SVD where ReStraV is stronger. Pseudo-event residuals align better with diffusion-style artifacts.
- Chroma channel: Tested but excluded — provides no additional detection signal. Don't waste compute on color residuals.
- X-CLIP choice: Other text-aligned encoders (ViCLIP, InternVideo2) and generic video backbones (VideoMamba) underperform. X-CLIP's cross-frame attention is key.
- Firing rate target: r* = 0.15 is critical — too high wastes energy, too low causes silence.
- Soft vs hard reset: Soft reset is default; hard reset available as option but not used in experiments.
Related Skills
- spiking-neural-network-analysis
- snn-learning-survey
- spiking-oscillation-mapping
- edgespike-edge-iot-snn
- snn-performance-analysis
- spike-sparsity-deployment-cost
- quantization-spiking-neural-networks-beyond-accuracy
- sd-tv3-spike-driven-transformer