vs-wno-variable-spiking-wavelet - SKILL.md Agent Skill

name: vs-wno-variable-spiking-wavelet
description: >
  Variable Spiking Wavelet Neural Operator (VS-WNO): a systematic study of spiking sparsity versus real-world deployment cost on edge GPUs. Covers wavelet neural operators augmented with spiking mechanisms, variable sparsity control, hardware-aware model design, and deployment profiling on NVIDIA Jetson Orin Nano 8GB.
triggers:
  - spiking neural network
  - wavelet neural operator
  - neural operator
  - deployment cost
  - Jetson
  - Jetson Orin Nano
  - edge computing
  - edge GPU
  - sparsity
  - neuromorphic
  - spiking sparsity
  - hardware-aware
  - wavelet transform
  - WNO
  - VS-WNO
  - model compression
  - energy efficiency
  - latency profiling
paper: arxiv 2604.17040
categories:
  - cs.LG
  - cs.AR
  - cs.NE

Variable Spiking Wavelet Neural Operator (VS-WNO)

1. Overview

The Sparsity–Deployment Gap

Spiking neural networks (SNNs) are often touted for their theoretical energy-efficiency advantages: event-driven, sparse activations that should translate directly into lower power consumption and faster inference. In practice, theoretical spiking sparsity rarely maps linearly to real-world deployment savings on commodity hardware such as GPUs. The VS-WNO paper (arxiv 2604.17040) provides a systematic, empirical study of this gap by:

Introducing a Variable Spiking Wavelet Neural Operator that combines wavelet transforms (for efficient multi-scale feature extraction) with spiking mechanisms (for controllable activation sparsity).
Varying spiking sparsity across a wide range and measuring the resulting impact on accuracy, latency, throughput, memory, and power on an NVIDIA Jetson Orin Nano 8 GB.
Quantifying the mismatch between theoretical FLOP-reduction from sparsity and the actual wall-clock speedup / energy savings observed on real hardware.

Key finding: spiking sparsity alone is insufficient to guarantee deployment efficiency; hardware-aware design and operator-level optimisation are essential to close the gap.

2. Core Methodology

2.1 Wavelet Neural Operator (WNO) Backbone

The WNO replaces standard Fourier kernels with wavelet bases, providing:

Multi-resolution analysis: capture both fine-grained and coarse features simultaneously through wavelet decomposition levels.
Compact spectral representation: wavelet coefficients are naturally sparse, reducing the parameter count of integral kernels.
Boundary handling: wavelets avoid periodicity artifacts common in Fourier-based neural operators (e.g., FNO).

Typical architecture:

Input → Lift (linear projection) → Wavelet Layers × N → Project (linear) → Output

Each wavelet layer applies:

Forward wavelet transform (DWT) to the feature map.
Learned spectral convolution in wavelet domain.
Inverse wavelet transform (IDWT) to return to spatial domain.
Skip connection with a local linear or convolutional path.
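
The four steps above can be sketched in one dimension with a single-level Haar transform. This is a minimal illustration, not the paper's layer: the per-coefficient weights `w_approx`, `w_detail`, and the scalar `w_skip` are hypothetical stand-ins for the learned spectral kernel and the local skip path, and a real WNO operates on multi-channel feature maps.

```python
import math

def haar_dwt(x):
    """One level of the orthonormal Haar DWT on an even-length list."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuild each sample pair and interleave."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out

def wavelet_layer(x, w_approx, w_detail, w_skip):
    """DWT -> per-coefficient scaling -> IDWT, plus a pointwise skip path.

    The weights are hypothetical placeholders for learned parameters.
    """
    approx, detail = haar_dwt(x)
    approx = [w * a for w, a in zip(w_approx, approx)]
    detail = [w * d for w, d in zip(w_detail, detail)]
    spectral = haar_idwt(approx, detail)
    return [y + w_skip * xi for y, xi in zip(spectral, x)]

signal = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]
a, d = haar_dwt(signal)
roundtrip = haar_idwt(a, d)  # reproduces `signal` up to rounding

# Zeroing the detail weights keeps only the coarse band: each pair is averaged.
out = wavelet_layer(signal, [1.0] * 4, [0.0] * 4, w_skip=0.0)
```

Setting the detail weights to zero acts as a low-pass filter, which is one way to see how the learned weights in the wavelet domain select scales.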

2.2 Spiking Mechanism Integration

Binary or analogue spiking neurons replace standard activation functions (e.g., ReLU/GELU) after each wavelet layer:

Leaky Integrate-and-Fire (LIF) neurons with configurable membrane threshold θ and decay factor τ.
Surrogate gradient training to maintain differentiability during backpropagation through the spike generation step.
Variable threshold control: adjusting θ at each layer to dial spiking sparsity up or down, creating a controllable design knob.
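
A minimal sketch of a discrete-time LIF neuron may make the roles of θ and τ concrete. It assumes τ acts as a multiplicative per-step decay and that the membrane resets by subtraction; the fast-sigmoid surrogate and its `slope` parameter are one common choice, not necessarily the one used in the paper.

```python
def lif_step(v, x, theta=1.0, tau=0.9):
    """One LIF update: leak, integrate, fire, then reset by subtraction."""
    v = tau * v + x
    spike = 1.0 if v >= theta else 0.0
    v = v - spike * theta
    return v, spike

def surrogate_grad(v, theta=1.0, slope=5.0):
    """Fast-sigmoid surrogate for d(spike)/dv, used in place of the
    zero-almost-everywhere derivative of the step function."""
    return 1.0 / (1.0 + slope * abs(v - theta)) ** 2

def run_lif(inputs, theta=1.0, tau=0.9):
    """Drive a single neuron with a list of input currents."""
    v, spikes = 0.0, []
    for x in inputs:
        v, s = lif_step(v, x, theta, tau)
        spikes.append(s)
    return spikes

# Same constant drive, two thresholds: the higher threshold fires less often.
spikes_low = run_lif([0.4] * 10, theta=1.0)
spikes_high = run_lif([0.4] * 10, theta=2.0)
print(sum(spikes_low), sum(spikes_high))
```

Raising θ with the input held fixed lowers the firing rate, which is the design knob the variable threshold control relies on.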

Spiking sparsity is defined as:

sparsity = 1 - (number_of_spikes / total_neurons)   ∈ [0, 1]

Higher sparsity ⇒ fewer spikes ⇒ theoretically fewer synaptic operations and fewer memory reads/writes.
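
As a small worked example of the definition, the helper below computes sparsity from a binary spike raster. It assumes the raster is indexed as `[time_step][neuron]` and counts every neuron at every time-step as one slot; with a single time-step this reduces to the formula above.

```python
def spiking_sparsity(spikes):
    """Return 1 - (spikes fired / total slots) for a nested 0/1 raster."""
    total = sum(len(row) for row in spikes)
    fired = sum(sum(row) for row in spikes)
    return 1.0 - fired / total

raster = [
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]
print(spiking_sparsity(raster))  # 0.75 (3 spikes in 12 slots)
```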

2.3 Variable Sparsity Control

The "Variable" in VS-WNO refers to the ability to tune spiking sparsity per-layer or globally without retraining from scratch. Strategies include:

Strategy	Mechanism	Flexibility
Threshold scaling	Multiply all LIF thresholds by α	Global
Per-layer threshold	Learnable θ_i per layer	Fine-grained
Time-step reduction	Use fewer SNN simulation steps	Moderate
Spike rate regularisation	Add λ·‖spike_rate − target‖² to loss	Training-time

The paper sweeps α across a range (e.g., 0.5× to 4× baseline θ) and measures how much accuracy degrades as deployment cost improves.
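
The threshold-scaling strategy can be sketched without any training loop: hold the input drive fixed and rescale θ by α. The synthetic uniform inputs, the decay `tau=0.9`, and `base_theta` are assumptions made only for illustration; in practice the sweep would run on a trained model and a validation set.

```python
import random

def lif_sparsity(inputs, theta, tau=0.9):
    """Run one LIF neuron per input row and return 1 - spike fraction."""
    fired = total = 0
    for row in inputs:
        v = 0.0
        for x in row:
            v = tau * v + x          # leak and integrate
            if v >= theta:           # fire and reset by subtraction
                fired += 1
                v -= theta
            total += 1
    return 1.0 - fired / total

rng = random.Random(0)
# Synthetic drive: 256 neurons, 8 time-steps each, values in [0, 1).
inputs = [[rng.random() for _ in range(8)] for _ in range(256)]

base_theta = 1.0
sweep = {alpha: lif_sparsity(inputs, alpha * base_theta)
         for alpha in (0.5, 1.0, 1.5, 2.0, 3.0, 4.0)}
for alpha, s in sweep.items():
    print(f"alpha={alpha:.1f}  sparsity={s:.2f}")
```

Because only θ changes, the same trained weights can be reused across the whole sweep, which is what makes α a cheap post-training knob.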

2.4 Hardware-Aware Design Principles

To actually benefit from sparsity on GPU hardware:

Structured spiking sparsity (entire channels or spatial tiles spike together) is preferred over random per-element sparsity — GPU warps cannot efficiently skip individual elements.
Block-sparse matrix formats (BSR) can exploit structured sparsity; random CSR on GPU is often slower than dense GEMM for moderate sizes.
Operator fusion: fuse spike generation + wavelet convolution into a single CUDA kernel to avoid round-trips to global memory.
Mixed precision (FP16/INT8): combine quantisation with spiking for compounded savings; the memory-bandwidth benefit of spiking alone saturates quickly.
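
To illustrate why structured sparsity matters, the sketch below measures how many whole tiles are completely silent under random versus tile-structured spiking at the same target sparsity. The 64 × 64 map, the 8 × 8 tile, and the 80% target are arbitrary illustrative choices; a tile stands in for the unit of work a GPU kernel could actually skip.

```python
import random

def zero_tile_fraction(mask, tile):
    """Fraction of tile x tile blocks in a square 0/1 mask with no spikes."""
    n = len(mask)
    tiles = empty = 0
    for r in range(0, n, tile):
        for c in range(0, n, tile):
            tiles += 1
            if not any(mask[i][j]
                       for i in range(r, r + tile)
                       for j in range(c, c + tile)):
                empty += 1
    return empty / tiles

rng = random.Random(0)
n, tile, sparsity = 64, 8, 0.8

# Unstructured: each element is silent independently with probability `sparsity`.
random_mask = [[int(rng.random() >= sparsity) for _ in range(n)] for _ in range(n)]

# Structured: whole tiles are silent or active together.
side = n // tile
tile_on = [[int(rng.random() >= sparsity) for _ in range(side)] for _ in range(side)]
structured_mask = [[tile_on[i // tile][j // tile] for j in range(n)] for i in range(n)]

random_skip = zero_tile_fraction(random_mask, tile)
structured_skip = zero_tile_fraction(structured_mask, tile)
print(f"skippable tiles: random={random_skip:.2f}  structured={structured_skip:.2f}")
```

With independent per-element spikes, the chance that all 64 elements of a tile are silent is 0.8^64, which is effectively zero, so almost no tile can be skipped even though 80% of the elements are zero.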

3. Implementation Guide: Deploying Spiking Models on Edge GPUs

3.1 Framework Choices

Framework	Spiking Support	Edge Deployment	Notes
Norse + PyTorch	Native LIF, SurrogateGrad	TorchScript → ONNX → TensorRT	Recommended for research
snnTorch	LIF, Synaptic models	TorchScript → TensorRT	Good community
Lava (Intel)	Loihi-native + PyTorch bridge	Loihi hardware only	Not GPU-targeted
Custom CUDA	Full control	Direct on Jetson	Maximum performance

3.2 Step-by-Step Deployment Pipeline

# 1. Train VS-WNO with Norse/snnTorch in PyTorch
python train_vswno.py --config configs/jetson_baseline.yaml

# 2. Export to TorchScript
python export.py --checkpoint best.ckpt --output model.pt

# 3. Convert to ONNX
# (a heredoc avoids the IndentationError that indented code causes under `python -c`)
python - <<'EOF'
import torch

m = torch.jit.load('model.pt').eval().cuda()
dummy = torch.randn(1, 1, 128, 128).cuda()  # NCHW dummy input
torch.onnx.export(m, (dummy,), 'vswno.onnx', opset_version=17,
                  input_names=['input'], output_names=['output'],
                  dynamic_axes={'input': {0: 'batch'}, 'output': {0: 'batch'}})
EOF

# 4. Build TensorRT engine on Jetson
trtexec --onnx=vswno.onnx \
        --saveEngine=vswno.engine \
        --fp16 \
        --minShapes=input:1x1x128x128 \
        --optShapes=input:4x1x128x128 \
        --maxShapes=input:8x1x128x128

# 5. Benchmark on Jetson
trtexec --loadEngine=vswno.engine --iterations=200 --warmUp=50

3.3 Handling Spiking Time-Steps in Deployment

SNNs require multiple forward passes (time-steps) to accumulate membrane potential. Two deployment strategies:

Unroll in batch dimension: treat each time-step as a separate batch element → single TensorRT call → simpler but uses more memory.
Custom loop in CUDA/Triton: loop over time-steps inside the inference kernel → lower memory overhead but requires custom engine.

For the Jetson Orin Nano 8 GB, unrolling is feasible for ≤ 4–6 time-steps with batch size 1 at 128 × 128 resolution.

3.4 Power & Thermal Management on Jetson

# Set performance mode (max clocks)
sudo nvpmodel -m 0          # MAXN 15W mode

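# Enable fan at max speed (this sysfs path varies across JetPack releases;
# `sudo jetson_clocks --fan` is an alternative that also maximises clocks)
sudo sh -c 'echo 255 > /sys/devices/pwm-fan/target_pwm'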

# Monitor during inference
sudo tegrastats --interval 500

4. Hardware Profiling on Jetson Orin Nano 8 GB

4.1 Device Specifications

Spec	Value
SoC	NVIDIA Orin (T234)
GPU	1024 CUDA cores (Ampere architecture)
CPU	6-core Arm Cortex-A78AE
Memory	8 GB LPDDR5, 68 GB/s bandwidth (102 GB/s in Super mode)
TDP	7 W / 15 W modes
Tensor Cores	32 (3rd gen)
CUDA Compute	8.7

4.2 Key Profiling Metrics

Metric	Tool	Command
Latency (ms)	`trtexec`	`trtexec --loadEngine=... --iterations=200`
Throughput (FPS)	`trtexec`	Derived as batch size ÷ latency
GPU Power (W)	`tegrastats`	`sudo tegrastats`
GPU Utilisation (%)	`tegrastats`	Inline in output
Memory usage (MB)	`tegrastats`	RAM + GPU shared memory
Energy per inference (mJ)	Computed	`latency × avg_power`
Temperature (°C)	`tegrastats`	Thermal throttling awareness
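
The two derived metrics in the table are simple arithmetic, sketched below. The function names and the sample numbers are illustrative only; latency is taken to be the time for one engine call on a whole batch.

```python
def throughput_fps(latency_ms, batch_size):
    """Samples per second when one call on `batch_size` samples takes `latency_ms`."""
    return batch_size * 1000.0 / latency_ms

def energy_per_call_mj(latency_ms, avg_power_w):
    """Energy of one call: milliseconds times watts gives millijoules."""
    return latency_ms * avg_power_w

print(throughput_fps(latency_ms=16.0, batch_size=4))          # 250.0
print(energy_per_call_mj(latency_ms=12.0, avg_power_w=8.0))   # 96.0
```

For batched calls, dividing the per-call energy by the batch size gives energy per sample.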

4.3 Profiling Workflow

# Terminal 1: Run inference loop
trtexec --loadEngine=vswno.engine --iterations=500 --warmUp=100 --duration=30

# Terminal 2: Capture power & thermal
sudo tegrastats --interval 100 --logfile tegra_log.txt

# Post-process: extract avg power, peak temp, avg GPU%
python parse_tegrastats.py tegra_log.txt --output profile_summary.json
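
A post-processing script along the lines of `parse_tegrastats.py` might look like the sketch below. The field names it searches for (`RAM`, `GR3D_FREQ`, `gpu@…C`, `VDD_IN`) are assumptions about the tegrastats line format, which differs between Jetson modules and JetPack releases, and the two log lines are made-up samples; check a real log before relying on these patterns.

```python
import json
import re

# Assumed tegrastats fields; adjust the patterns to match the actual log.
PATTERNS = {
    "ram_mb": re.compile(r"RAM (\d+)/\d+MB"),
    "gpu_util_pct": re.compile(r"GR3D_FREQ (\d+)%"),
    "gpu_temp_c": re.compile(r"gpu@([\d.]+)C", re.IGNORECASE),
    "power_mw": re.compile(r"VDD_IN (\d+)mW"),
}

def parse_line(line):
    """Extract the numeric fields that are present in one log line."""
    sample = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            sample[key] = float(match.group(1))
    return sample

def summarise(lines):
    """Average power and GPU utilisation, peak temperature and RAM."""
    samples = [s for s in map(parse_line, lines) if "power_mw" in s]
    if not samples:
        raise ValueError("no power readings found in log")
    n = len(samples)
    return {
        "avg_power_w": sum(s["power_mw"] for s in samples) / n / 1000.0,
        "avg_gpu_util_pct": sum(s.get("gpu_util_pct", 0.0) for s in samples) / n,
        "peak_gpu_temp_c": max(s.get("gpu_temp_c", 0.0) for s in samples),
        "peak_ram_mb": max(s.get("ram_mb", 0.0) for s in samples),
    }

# Made-up sample lines in the assumed format.
log = [
    "RAM 3000/7620MB GR3D_FREQ 80% cpu@48C gpu@47.5C VDD_IN 7000mW/6900mW",
    "RAM 3100/7620MB GR3D_FREQ 90% cpu@49C gpu@48.5C VDD_IN 8000mW/7000mW",
]
summary = summarise(log)
print(json.dumps(summary, indent=2))
```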

4.4 Expected Performance Ranges (Indicative)

Config	Sparsity	Latency (ms)	Power (W)	Energy (mJ)	Notes
Dense WNO	0 %	~12–18	~8–10	~100–180	Baseline
VS-WNO (low sparsity)	~30 %	~11–16	~7.5–9.5	~85–150	Marginal gain
VS-WNO (med sparsity)	~60 %	~10–14	~7–9	~70–125	Best trade-off zone
VS-WNO (high sparsity)	~85 %	~10–15	~7–9	~70–135	Accuracy drops; diminishing returns
VS-WNO (extreme)	> 95 %	~10–18	~7–10	~70–180	Accuracy collapse; no speed gain

Key observation: latency and power plateau despite increasing sparsity because GPU memory access patterns and kernel launch overheads dominate. The operations skipped by silent neurons reduce the FLOP count, but on standard GPU architectures they do not reduce wall-clock time.

5. Sparsity vs Deployment Cost Analysis

5.1 The Fundamental Gap

            Theoretical        Observed on GPU
Sparsity    FLOP Reduction     Latency Reduction    Gap (percentage points)
───────────────────────────────────────────────────────────────────────────
  30 %        ~30 %              ~5–10 %           ~20–25
  60 %        ~60 %              ~10–25 %          ~35–50
  85 %        ~85 %              ~10–20 %          ~65–75
  95 %        ~95 %              ~5–15 %           ~80–90

The gap widens at higher sparsity because:

Memory-bound kernels: Wavelet transforms and spectral convolutions are often memory-bandwidth-limited, not compute-limited. Skipping FLOPs does not reduce memory traffic proportionally.
Irregular access patterns: Sparse spike activation leads to scattered memory reads that defeat GPU coalescing.
Kernel launch overhead: Small, sparse operations are dominated by kernel launch latency (~5–20 µs per kernel on Jetson).
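
A toy roofline-style model shows how these three effects produce a plateau. The budget below (10 ms of dense compute, 8 ms of memory traffic, 2 ms of launch overhead) is hypothetical and chosen only to make the behaviour visible; it is not a measurement.

```python
def modelled_latency_ms(sparsity, dense_compute_ms, memory_ms, launch_ms):
    """Compute time shrinks with sparsity; memory and launch time do not."""
    compute_ms = dense_compute_ms * (1.0 - sparsity)
    return launch_ms + max(compute_ms, memory_ms)

# Hypothetical dense-model budget, for illustration only.
budget = dict(dense_compute_ms=10.0, memory_ms=8.0, launch_ms=2.0)
dense = modelled_latency_ms(0.0, **budget)

for sparsity in (0.30, 0.60, 0.85, 0.95):
    latency = modelled_latency_ms(sparsity, **budget)
    reduction = 1.0 - latency / dense
    gap = sparsity - reduction
    print(f"sparsity={sparsity:.0%}  latency reduction={reduction:.0%}  gap={gap:.0%}")
```

Once the compute term drops below the memory term, further sparsity changes nothing in this model, so the latency reduction stays flat while the gap to the theoretical FLOP reduction keeps growing.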

5.2 Closing the Gap — Practical Strategies

Strategy	Mechanism	Potential Improvement
Structured sparsity (channel/tile)	Entire regions spike together	2–3× better latency scaling
Operator fusion (custom CUDA/Triton)	Single kernel for DWT+spike+IDWT	1.5–2× throughput
Quantisation (INT8/FP16)	Tensor Core utilisation	2–4× throughput, independent of sparsity
Sparse tensor formats (BSR)	Block-sparse matmul	Effective only above ~80% sparsity
Dedicated neuromorphic hardware (Loihi, Akida)	Native event-driven compute	Best match for spiking; not GPU

5.3 Sparsity–Accuracy Pareto

When plotting spiking sparsity (x-axis) vs task accuracy (y-axis), the VS-WNO typically exhibits:

Flat region (0–40% sparsity): accuracy barely drops.
Graceful degradation (40–70%): small accuracy loss for meaningful deployment gains.
Cliff region (> 80%): accuracy collapses rapidly.

The optimal operating point is typically in the 50–70% sparsity range, where deployment cost savings (if structured + fused) are non-trivial and accuracy remains acceptable.

6. Pitfalls

6.1 Spiking Sparsity ≠ Speed

The #1 misconception: "If my spiking network has 80% sparsity, it will be 5× faster."

On GPUs, this is almost never true. Dense matrix multiplication (GEMM) on Tensor Cores is so highly optimised that skipping 80% of the operations through unstructured sparsity often makes inference slower, because:

Sparse kernels have lower arithmetic intensity.
GPU warp schedulers cannot efficiently skip individual elements.
The overhead of sparse data structures (indices, masks) adds memory traffic.

Mitigation: Use structured sparsity, custom fused kernels, or target dedicated neuromorphic hardware.
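
The index overhead can be estimated with a few lines of arithmetic. The sketch assumes FP16 values (2 bytes) and 32-bit CSR indices (4 bytes) on a 1024 × 1024 matrix; other formats and index widths shift the break-even point.

```python
def dense_bytes(rows, cols, value_bytes=2):
    """Storage for a dense matrix."""
    return rows * cols * value_bytes

def csr_bytes(rows, cols, sparsity, value_bytes=2, index_bytes=4):
    """CSR stores a value and a column index per non-zero, plus rows + 1 row pointers."""
    nnz = round(rows * cols * (1.0 - sparsity))
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes

rows = cols = 1024
for sparsity in (0.30, 0.60, 0.80, 0.95):
    ratio = csr_bytes(rows, cols, sparsity) / dense_bytes(rows, cols)
    print(f"sparsity={sparsity:.0%}  CSR/dense bytes = {ratio:.2f}")
```

Under these assumptions each non-zero costs three times as many bytes as a dense element, so CSR only becomes smaller than the dense tensor above roughly two-thirds sparsity, before any cost of irregular access is counted.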

6.2 Memory Bottlenecks Dominate on Edge GPUs

The Jetson Orin Nano shares 8 GB between CPU and GPU, with 68–102 GB/s of memory bandwidth depending on the power mode. For neural operators processing 128 × 128 or larger feature maps:

Activation memory grows with number of time-steps and layers.
Wavelet transform buffers: each 2D DWT level produces four sub-bands (one approximation, three detail) that must stay resident alongside the input feature map until the IDWT, adding to the memory footprint at every level.
Spiking sparsity reduces compute but does not proportionally reduce activation memory — the membrane state and spike mask must still be stored.

Mitigation: Gradient checkpointing (training), activation recomputation, reduce time-steps, use lower-resolution wavelet decomposition.

6.3 Batch Size Effects

On Jetson hardware, latency and throughput are highly sensitive to batch size:

Batch Size	Latency/Batch	Throughput	GPU Utilisation
1	Lowest	Low	~30–50%
4	Moderate	Peak	~80–95%
8	Higher	Similar to 4	~90–100%
16+	OOM risk	N/A	Memory limited

For spiking models with T time-steps unrolled as batch dimension, the effective batch is batch_size × T. A batch of 2 with 4 time-steps already gives an effective batch of 8, and a batch of 4 gives 16, which falls in the OOM-risk row above on 8 GB devices.
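
The effective-batch arithmetic can be checked against the engine's shape profile before launching a run. The default `profile_max_batch=8` mirrors the `--maxShapes=input:8x1x128x128` setting used in Section 3.2; the helper names are hypothetical.

```python
def effective_batch(batch_size, time_steps):
    """Batch dimension seen by the engine when time-steps are unrolled."""
    return batch_size * time_steps

def max_samples_per_call(time_steps, profile_max_batch=8):
    """Largest sample count whose unrolled batch still fits the engine profile."""
    return profile_max_batch // time_steps

print(effective_batch(2, 4))      # 8
print(max_samples_per_call(4))    # 2
print(max_samples_per_call(6))    # 1
```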

6.4 Wavelet Library Compatibility

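PyWavelets (pywt): NumPy-based and outside the PyTorch graph, so it cannot be exported to ONNX/TensorRT. Use it only on the training and analysis side (e.g., to obtain filter coefficients).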
Custom DWT CUDA kernels: Required for TensorRT/ONNX deployment.
Approximation: Replace DWT with strided convolutions + fixed wavelet filters (Haar, DB2) for easier export.
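
The strided-convolution approximation can be sketched for a single-level 2D Haar transform: four fixed 2 × 2 kernels applied with stride 2 yield the four sub-bands. This pure-Python version is only meant to show the arithmetic; in a framework the same kernels would be passed as fixed weights to a standard stride-2 convolution (for example `torch.nn.functional.conv2d` with `stride=2`). Sub-band naming (LH versus HL) varies between libraries.

```python
# Orthonormal 2D Haar analysis filters (outer products of [1, 1]/sqrt(2) and [1, -1]/sqrt(2)).
HAAR_KERNELS = {
    "LL": [[0.5, 0.5], [0.5, 0.5]],     # approximation
    "LH": [[0.5, 0.5], [-0.5, -0.5]],   # difference between rows
    "HL": [[0.5, -0.5], [0.5, -0.5]],   # difference between columns
    "HH": [[0.5, -0.5], [-0.5, 0.5]],   # diagonal detail
}

def conv2d_stride2(image, kernel):
    """2x2 cross-correlation with stride 2 on a nested list with even dimensions."""
    rows, cols = len(image), len(image[0])
    return [
        [sum(kernel[i][j] * image[r + i][c + j] for i in range(2) for j in range(2))
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]

def haar_dwt2(image):
    """Single-level 2D Haar DWT as four fixed stride-2 convolutions."""
    return {name: conv2d_stride2(image, k) for name, k in HAAR_KERNELS.items()}

image = [
    [1.0, 1.0, 2.0, 2.0],
    [1.0, 1.0, 2.0, 2.0],
    [3.0, 5.0, 4.0, 4.0],
    [3.0, 5.0, 4.0, 4.0],
]
bands = haar_dwt2(image)
print(bands["LL"])  # [[2.0, 4.0], [8.0, 8.0]]
print(bands["HL"])  # [[0.0, 0.0], [-2.0, 0.0]]
```

Because the filters are constants, the exported graph contains only ordinary convolutions, which is what makes this form easier to carry through ONNX and TensorRT.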

6.5 Thermal Throttling

The Jetson Orin Nano in the default dev kit can thermally throttle within 2–5 minutes of sustained inference at MAXN mode (15 W). This introduces non-deterministic latency spikes that distort profiling results.

Mitigation: Active cooling, 7 W power mode for sustained workloads, or report results with thermal steady-state measurements.

6.6 Spiking Time-Step Accumulation Overhead

Each additional time-step increases latency roughly linearly but provides diminishing accuracy returns after ~4–6 steps. The paper recommends:

2–4 time-steps for edge deployment.
6–10 time-steps for server-side inference where latency is less critical.

7. Quick-Start Checklist

Train VS-WNO with variable threshold scaling; sweep α ∈ {0.5, 1.0, 1.5, 2.0, 3.0, 4.0}
Evaluate accuracy vs sparsity curve on validation set
Export to ONNX with fixed wavelet filters (Haar/DB2)
Build TensorRT engine with FP16 on Jetson Orin Nano
Profile with trtexec + tegrastats; record latency, power, energy
Identify sweet spot (typically 50–70% sparsity, 2–4 time-steps)
If latency plateau: investigate custom fused kernels or structured sparsity
Validate thermal steady-state (run ≥ 5 min before recording metrics)

8. References

VS-WNO Paper: "Variable Spiking Wavelet Neural Operator: A Systematic Study of Spiking Sparsity vs Real-World Deployment Cost", arXiv 2604.17040. Categories: cs.LG, cs.AR, cs.NE.
Wavelet Neural Operator: Tripura, T., and Chakraborty, S. "Wavelet neural operator: a neural operator for parametric partial differential equations." arXiv preprint, 2022.
Fourier Neural Operator (FNO): Li, Z., et al. "Fourier Neural Operator for Parametric Partial Differential Equations." ICLR, 2021.
Spiking Neural Networks on GPUs: Fang, W., et al. "Incorporating Learnable Membrane Time Constant to Enhance Learning of Spiking Neural Networks." ICCV, 2021. (LIF neurons with learnable time constants, trained with surrogate gradients)
Structured Sparsity on GPUs: NVIDIA Ampere sparse Tensor Core documentation; 2:4 structured sparsity pattern for up to 2× throughput.
Jetson Orin Nano: NVIDIA Jetson Orin Nano Developer Kit User Guide, NVIDIA Developer Documentation, 2024.
Norse (Spiking Library): Pehle, C., et al. "Norse — A library for deep learning with spiking neural networks." GitHub, 2021.
TensorRT: NVIDIA TensorRT Documentation, Developer Zone, 2024.