name: vs-wno-variable-spiking-wavelet
description: >
  Variable Spiking Wavelet Neural Operator (VS-WNO): a systematic study of
  spiking sparsity versus real-world deployment cost on edge GPUs. Covers
  wavelet neural operators augmented with spiking mechanisms, variable
  sparsity control, hardware-aware model design, and deployment profiling
  on NVIDIA Jetson Orin Nano 8GB.
triggers:
  - spiking neural network
  - wavelet neural operator
  - neural operator
  - deployment cost
  - Jetson
  - Jetson Orin Nano
  - edge computing
  - edge GPU
  - sparsity
  - neuromorphic
  - spiking sparsity
  - hardware-aware
  - wavelet transform
  - WNO
  - VS-WNO
  - model compression
  - energy efficiency
  - latency profiling
paper: arxiv 2604.17040
categories:
  - cs.LG
  - cs.AR
  - cs.NE
Variable Spiking Wavelet Neural Operator (VS-WNO)
1. Overview
The Sparsity–Deployment Gap
Spiking neural networks (SNNs) are often touted for their theoretical energy-efficiency advantages: event-driven, sparse activations that should translate directly into lower power consumption and faster inference. In practice, spiking sparsity rarely maps linearly to deployment savings on commodity hardware such as GPUs. The VS-WNO paper (arXiv 2604.17040) provides a systematic, empirical study of this gap by:
- Introducing a Variable Spiking Wavelet Neural Operator that combines wavelet transforms (for efficient multi-scale feature extraction) with spiking mechanisms (for controllable activation sparsity).
- Varying spiking sparsity across a wide range and measuring the resulting impact on accuracy, latency, throughput, memory, and power on an NVIDIA Jetson Orin Nano 8 GB.
- Quantifying the mismatch between theoretical FLOP-reduction from sparsity and the actual wall-clock speedup / energy savings observed on real hardware.
Key finding: spiking sparsity alone is insufficient to guarantee deployment efficiency; hardware-aware design and operator-level optimisation are essential to close the gap.
2. Core Methodology
2.1 Wavelet Neural Operator (WNO) Backbone
The WNO replaces standard Fourier kernels with wavelet bases, providing:
- Multi-resolution analysis: capture both fine-grained and coarse features simultaneously through wavelet decomposition levels.
- Compact spectral representation: wavelet coefficients are naturally sparse, reducing the parameter count of integral kernels.
- Boundary handling: wavelets avoid periodicity artifacts common in Fourier-based neural operators (e.g., FNO).
Typical architecture:
Input → Lift (linear projection) → Wavelet Layers × N → Project (linear) → Output
Each wavelet layer applies:
- Forward wavelet transform (DWT) to the feature map.
- Learned spectral convolution in wavelet domain.
- Inverse wavelet transform (IDWT) to return to spatial domain.
- Skip connection with a local linear or convolutional path.
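The four steps above can be sketched in plain Python for a single-channel 1D signal. This is a minimal illustration, not the paper's layer: the Haar wavelet, the elementwise weights standing in for the learned spectral convolution, and the names `haar_dwt`, `haar_idwt`, and `wavelet_layer` are all assumptions made for the example.

```python
import math

def haar_dwt(x):
    """One level of the orthonormal Haar DWT on an even-length sequence."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuild each sample pair and interleave them."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out

def wavelet_layer(x, w_approx, w_detail, w_skip):
    """DWT -> per-coefficient scaling -> IDWT, plus a pointwise skip path.

    The elementwise weights are a stand-in (assumption) for the learned
    spectral convolution; a real layer would also mix channels.
    """
    approx, detail = haar_dwt(x)
    approx = [a * w for a, w in zip(approx, w_approx)]
    detail = [d * w for d, w in zip(detail, w_detail)]
    spectral = haar_idwt(approx, detail)
    return [s + w_skip * xi for s, xi in zip(spectral, x)]

signal = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]
ones = [1.0] * (len(signal) // 2)
zeros = [0.0] * (len(signal) // 2)

# Unit weights and no skip path: the layer reduces to DWT followed by IDWT.
identity_out = wavelet_layer(signal, ones, ones, w_skip=0.0)
# Zeroing the detail band keeps only the coarse (pairwise-average) content.
smoothed_out = wavelet_layer(signal, ones, zeros, w_skip=0.0)
```

With unit weights the layer reproduces its input, which is a quick sanity check that the forward and inverse transforms are consistent before any learned weights are introduced.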
2.2 Spiking Mechanism Integration
Binary or analogue spiking neurons replace standard activation functions (e.g., ReLU/GELU) after each wavelet layer:
- Leaky Integrate-and-Fire (LIF) neurons with configurable membrane threshold θ and decay factor τ.
- Surrogate gradient training to maintain differentiability during backpropagation through the spike generation step.
- Variable threshold control: adjusting θ at each layer to dial spiking sparsity up or down, creating a controllable design knob.
Spiking sparsity is defined as:
sparsity = 1 - (number_of_spikes / total_neurons) ∈ [0, 1]
Higher sparsity ⇒ fewer spikes ⇒ theoretically fewer memory reads/writes.
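To make the definition concrete, the sketch below simulates a few LIF neurons and measures the resulting sparsity at two thresholds. It assumes a discrete-time update with a multiplicative leak and a hard reset to zero; the function names and the input currents are illustrative only and are not taken from the paper.

```python
def lif_spikes(currents, theta=1.0, decay=0.9):
    """Simulate one LIF neuron over time and return its 0/1 spike train.

    Assumed update rule: v[t] = decay * v[t-1] + I[t];
    the neuron spikes and hard-resets to 0 whenever v >= theta.
    """
    v, spikes = 0.0, []
    for i in currents:
        v = decay * v + i
        if v >= theta:
            spikes.append(1)
            v = 0.0
        else:
            spikes.append(0)
    return spikes

def spiking_sparsity(spike_trains):
    """sparsity = 1 - (number_of_spikes / total neuron-timestep slots)."""
    total = sum(len(t) for t in spike_trains)
    fired = sum(sum(t) for t in spike_trains)
    return 1.0 - fired / total

# Three neurons driven by constant input currents for 10 time-steps.
inputs = [[0.3] * 10, [0.6] * 10, [0.1] * 10]

sparsity_base = spiking_sparsity([lif_spikes(c, theta=1.0) for c in inputs])
sparsity_high = spiking_sparsity([lif_spikes(c, theta=2.0) for c in inputs])
```

Doubling the threshold makes every neuron fire less often, so `sparsity_high` exceeds `sparsity_base`; this is the knob that variable threshold control turns.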
2.3 Variable Sparsity Control
The "Variable" in VS-WNO refers to the ability to tune spiking sparsity per-layer or globally without retraining from scratch. Strategies include:
| Strategy | Mechanism | Flexibility |
|---|---|---|
| Threshold scaling | Multiply all LIF thresholds by α | Global |
| Per-layer threshold | Learnable θ_i per layer | Fine-grained |
| Time-step reduction | Use fewer SNN simulation steps | Moderate |
| Spike rate regularisation | Add λ·‖spike_rate − target‖² to loss | Training-time |
The paper sweeps α across a range (e.g., 0.5× to 4× baseline θ) and measures how much accuracy is lost for each improvement in deployment cost.
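Two of the strategies in the table reduce to a few lines of arithmetic. The sketch below shows global threshold scaling and the spike-rate regularisation term; the helper names, the per-layer rates, and the values of λ and the target rate are placeholders chosen for illustration.

```python
def scale_thresholds(thetas, alpha):
    """Global threshold scaling: multiply every per-layer threshold by alpha."""
    return [alpha * t for t in thetas]

def spike_rate_penalty(layer_rates, target_rate, lam):
    """Regulariser lam * sum_i (rate_i - target)^2 over layers."""
    return lam * sum((r - target_rate) ** 2 for r in layer_rates)

baseline_thetas = [1.0, 1.0, 0.8]          # illustrative per-layer thresholds
swept = {a: scale_thresholds(baseline_thetas, a) for a in (0.5, 1.0, 2.0, 4.0)}

measured_rates = [0.5, 0.4, 0.3]           # illustrative mean spike rate per layer
penalty = spike_rate_penalty(measured_rates, target_rate=0.4, lam=0.1)

task_loss = 0.25                           # placeholder task loss
total_loss = task_loss + penalty
```

Threshold scaling acts at inference time and needs no retraining, whereas the penalty only shapes spike rates while the model is being trained.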
2.4 Hardware-Aware Design Principles
To actually benefit from sparsity on GPU hardware:
- Structured spiking sparsity (entire channels or spatial tiles spike together) is preferred over random per-element sparsity — GPU warps cannot efficiently skip individual elements.
- Block-sparse matrix formats (BSR) can exploit structured sparsity; random CSR on GPU is often slower than dense GEMM for moderate sizes.
- Operator fusion: fuse spike generation + wavelet convolution into a single CUDA kernel to avoid round-trips to global memory.
- Mixed-precision (FP16/INT8): combine quantisation with spiking for compounded savings; on its own, spiking stops reducing memory-bandwidth cost once the kernels are bandwidth-bound.
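The difference between structured and per-element sparsity can be seen by counting how many fixed-size tiles of a spike mask are entirely silent, since only those tiles can be skipped as a unit. The sketch below is a toy model under that assumption; the tile size of 8 and the two masks are made up for the example.

```python
def skippable_tile_fraction(mask, tile):
    """Fraction of length-`tile` blocks in a 0/1 mask that contain no spikes."""
    tiles = [mask[i:i + tile] for i in range(0, len(mask), tile)]
    return sum(1 for t in tiles if not any(t)) / len(tiles)

n, tile = 64, 8

# Both masks are exactly 50 % sparse (32 spikes out of 64 positions).
scattered = [i % 2 for i in range(n)]                          # spikes in every tile
structured = [1 if (i // tile) % 2 else 0 for i in range(n)]   # whole tiles silent

scattered_skip = skippable_tile_fraction(scattered, tile)
structured_skip = skippable_tile_fraction(structured, tile)
```

At identical sparsity, the scattered mask leaves no tile skippable while the structured mask frees half of them, which is why the same sparsity figure can produce very different latency.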
3. Implementation Guide: Deploying Spiking Models on Edge GPUs
3.1 Framework Choices
| Framework | Spiking Support | Edge Deployment | Notes |
|---|---|---|---|
| Norse + PyTorch | Native LIF, SurrogateGrad | TorchScript → ONNX → TensorRT | Recommended for research |
| snnTorch | LIF, Synaptic models | TorchScript → TensorRT | Good community |
| Lava (Intel) | Loihi-native + PyTorch bridge | Loihi hardware only | Not GPU-targeted |
| Custom CUDA | Full control | Direct on Jetson | Maximum performance |
3.2 Step-by-Step Deployment Pipeline
# 1. Train VS-WNO with Norse/snnTorch in PyTorch
python train_vswno.py --config configs/jetson_baseline.yaml
# 2. Export to TorchScript
python export.py --checkpoint best.ckpt --output model.pt
# 3. Convert to ONNX
python -c "
import torch
m = torch.jit.load('model.pt').eval().cuda()
dummy = torch.randn(1, 1, 128, 128).cuda()
torch.onnx.export(m, (dummy,), 'vswno.onnx', opset_version=17,
input_names=['input'], output_names=['output'],
dynamic_axes={'input':{0:'batch'}, 'output':{0:'batch'}})
"
# 4. Build TensorRT engine on Jetson
trtexec --onnx=vswno.onnx \
--saveEngine=vswno.engine \
--fp16 \
--minShapes=input:1x1x128x128 \
--optShapes=input:4x1x128x128 \
--maxShapes=input:8x1x128x128
# 5. Benchmark on Jetson
trtexec --loadEngine=vswno.engine --iterations=200 --warmUp=50
3.3 Handling Spiking Time-Steps in Deployment
SNNs require multiple forward passes (time-steps) to accumulate membrane potential. Two deployment strategies:
- Unroll in batch dimension: treat each time-step as a separate batch element → single TensorRT call → simpler but uses more memory.
- Custom loop in CUDA/Triton: loop over time-steps inside the inference kernel → lower memory overhead but requires custom engine.
For the Jetson Orin Nano 8 GB, unrolling is feasible for ≤ 4–6 time-steps with batch size 1 at 128 × 128 resolution.
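A rough lower bound on activation memory for the unrolled strategy can be estimated as below. This is a back-of-the-envelope sketch: the channel count, layer count, and FP16 storage are assumptions, and the estimate ignores wavelet coefficient buffers, membrane state, and TensorRT workspace, all of which add to the real footprint.

```python
def unrolled_activation_mib(batch, time_steps, channels, height, width,
                            layers, bytes_per_value=2):
    """Lower-bound activation memory (MiB) when time-steps are folded into batch.

    Counts one stored feature map per layer per effective batch element.
    """
    effective_batch = batch * time_steps
    values = effective_batch * channels * height * width * layers
    return values * bytes_per_value / 2**20

# Hypothetical model: 64 channels, 4 wavelet layers, 128 x 128 inputs, FP16.
mib_t4 = unrolled_activation_mib(1, 4, 64, 128, 128, layers=4)
mib_t8 = unrolled_activation_mib(1, 8, 64, 128, 128, layers=4)
```

The estimate scales linearly with the number of time-steps, so doubling T doubles this part of the memory budget.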
3.4 Power & Thermal Management on Jetson
# Set performance mode (max clocks)
sudo nvpmodel -m 0 # MAXN 15W mode
# Enable fan at max speed (also locks clocks to maximum; the legacy
# /sys/devices/pwm-fan/target_pwm node is not present on recent JetPack releases)
sudo jetson_clocks --fan
# Monitor during inference
sudo tegrastats --interval 500
4. Hardware Profiling on Jetson Orin Nano 8 GB
4.1 Device Specifications
| Spec | Value |
|---|---|
| SoC | NVIDIA Orin (T234) |
| GPU | 1024-core Ampere (Ampere SM) |
| CPU | 6-core Arm Cortex-A78AE |
| Memory | 8 GB LPDDR5, 68 GB/s bandwidth at stock clocks (102 GB/s with the Super-mode software update) |
| TDP | 7 W / 15 W modes |
| Tensor Cores | 32 (3rd gen) |
| CUDA Compute | 8.7 |
4.2 Key Profiling Metrics
| Metric | Tool | Command / derivation |
|---|---|---|
| Latency (ms) | trtexec | trtexec --loadEngine=... --iterations=200 |
| Throughput (FPS) | trtexec | Derived: batch size ÷ latency |
| GPU Power (W) | tegrastats | sudo tegrastats |
| GPU Utilisation (%) | tegrastats | Inline in output |
| Memory usage (MB) | tegrastats | RAM + GPU shared memory |
| Energy per inference (mJ) | Computed | latency (ms) × avg_power (W) |
| Temperature (°C) | tegrastats | Thermal throttling awareness |
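The two derived metrics are simple unit conversions. A minimal sketch, with helper names chosen for this example:

```python
def throughput_fps(latency_ms, batch_size):
    """Samples per second: batch size divided by per-batch latency in seconds."""
    return batch_size * 1000.0 / latency_ms

def energy_mj(latency_ms, avg_power_w):
    """Energy per inference in millijoules: ms x W = mJ."""
    return latency_ms * avg_power_w

fps = throughput_fps(latency_ms=16.0, batch_size=4)
energy = energy_mj(latency_ms=12.0, avg_power_w=8.0)
```

Use the average power measured over the same window as the latency measurement; mixing idle and loaded samples skews the energy figure.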
4.3 Profiling Workflow
# Terminal 1: Run inference loop
trtexec --loadEngine=vswno.engine --iterations=500 --warmUp=100 --duration=30
# Terminal 2: Capture power & thermal
sudo tegrastats --interval 100 --logfile tegra_log.txt
# Post-process: extract avg power, peak temp, avg GPU%
python parse_tegrastats.py tegra_log.txt --output profile_summary.json
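The post-processing script is not shown in this document. One possible shape for it is sketched below, assuming each log line contains `VDD_IN <n>mW/...`, `GR3D_FREQ <n>%`, and a `gpu@<t>C` field. The exact tegrastats format varies between JetPack releases, and the sample lines here are hand-written rather than captured from a device, so check the patterns against a real log.

```python
import re

# Patterns are assumptions about the tegrastats line format; adjust as needed.
POWER_RE = re.compile(r"VDD_IN (\d+)mW")                 # instantaneous board power
GPU_RE = re.compile(r"GR3D_FREQ (\d+)%")                 # GPU utilisation
TEMP_RE = re.compile(r"gpu@([\d.]+)C", re.IGNORECASE)    # GPU temperature

def summarise(lines):
    """Return average power (W), peak GPU temperature (C), and mean GPU load (%)."""
    power, util, temp = [], [], []
    for line in lines:
        m = POWER_RE.search(line)
        if m:
            power.append(int(m.group(1)) / 1000.0)
        m = GPU_RE.search(line)
        if m:
            util.append(float(m.group(1)))
        m = TEMP_RE.search(line)
        if m:
            temp.append(float(m.group(1)))
    mean = lambda xs: sum(xs) / len(xs) if xs else None
    return {
        "avg_power_w": mean(power),
        "peak_gpu_temp_c": max(temp) if temp else None,
        "avg_gpu_util_pct": mean(util),
    }

# Hand-written sample lines in the assumed format (not real measurements).
sample_log = [
    "RAM 2448/7620MB CPU [12%@1510,8%@1510] GR3D_FREQ 80%@[624] "
    "cpu@48.5C gpu@47C VDD_IN 8000mW/7900mW VDD_CPU_GPU_CV 3500mW/3400mW",
    "RAM 2450/7620MB CPU [15%@1510,9%@1510] GR3D_FREQ 90%@[624] "
    "cpu@50C gpu@49.5C VDD_IN 9000mW/8500mW VDD_CPU_GPU_CV 3900mW/3600mW",
]
summary = summarise(sample_log)
```

Missing fields yield `None` rather than raising, so a log captured with a different format fails visibly in the summary instead of crashing the script.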
4.4 Expected Performance Ranges (Indicative)
| Config | Sparsity | Latency (ms) | Power (W) | Energy (mJ) | Notes |
|---|---|---|---|---|---|
| Dense WNO | 0 % | ~12–18 | ~8–10 | ~100–180 | Baseline |
| VS-WNO (low sparsity) | ~30 % | ~11–16 | ~7.5–9.5 | ~85–150 | Marginal gain |
| VS-WNO (med sparsity) | ~60 % | ~10–14 | ~7–9 | ~70–125 | Best trade-off zone |
| VS-WNO (high sparsity) | ~85 % | ~10–15 | ~7–9 | ~70–135 | Accuracy drops; diminishing returns |
| VS-WNO (extreme) | > 95 % | ~10–18 | ~7–10 | ~70–180 | Accuracy collapse; no speed gain |
Key observation: latency and power plateau despite increasing sparsity because GPU memory access patterns and kernel launch overheads dominate; the zero-operations skipped by sparse spikes are "free" in FLOP count but not in wall-clock time on standard GPU architectures.
5. Sparsity vs Deployment Cost Analysis
5.1 The Fundamental Gap
| Sparsity | Theoretical FLOP reduction | Observed latency reduction on GPU | Gap |
|---|---|---|---|
| 30 % | ~30 % | ~5–10 % | ~20 % |
| 60 % | ~60 % | ~10–25 % | ~35 % |
| 85 % | ~85 % | ~10–20 % | ~65 % |
| 95 % | ~95 % | ~5–15 % | ~80 % |
The gap widens at higher sparsity because:
- Memory-bound kernels: Wavelet transforms and spectral convolutions are often memory-bandwidth-limited, not compute-limited. Skipping FLOPs does not reduce memory traffic proportionally.
- Irregular access patterns: Sparse spike activation leads to scattered memory reads that defeat GPU coalescing.
- Kernel launch overhead: Small, sparse operations are dominated by kernel launch latency (~5–20 µs per kernel on Jetson).
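A toy additive latency model shows why the gap grows. It assumes latency is the sum of a fixed launch overhead, a memory-traffic term that sparsity does not touch, and a compute term that shrinks with sparsity. The millisecond values are invented for illustration and are not measurements from the paper.

```python
def modelled_latency_ms(sparsity, overhead_ms=2.0, memory_ms=8.0, compute_ms=4.0):
    """Only the compute term scales with (1 - sparsity); the rest is fixed."""
    return overhead_ms + memory_ms + compute_ms * (1.0 - sparsity)

def latency_reduction(sparsity, **kw):
    """Fractional latency saving relative to the dense (sparsity = 0) model."""
    dense = modelled_latency_ms(0.0, **kw)
    return 1.0 - modelled_latency_ms(sparsity, **kw) / dense

for s in (0.30, 0.60, 0.85, 0.95):
    saved = latency_reduction(s)
    print(f"sparsity {s:.0%}: latency -{saved:.1%}, gap {s - saved:.1%}")

# Even at 100 % sparsity the saving is capped by the compute share of latency.
ceiling = latency_reduction(1.0)
```

Under these assumptions the saving can never exceed the compute fraction of dense latency, so the distance to the theoretical FLOP reduction keeps growing as sparsity rises.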
5.2 Closing the Gap — Practical Strategies
| Strategy | Mechanism | Potential Improvement |
|---|---|---|
| Structured sparsity (channel/tile) | Entire regions spike together | 2–3× better latency scaling |
| Operator fusion (custom CUDA/Triton) | Single kernel for DWT+spike+IDWT | 1.5–2× throughput |
| Quantisation (INT8/FP16) | Tensor Core utilisation | 2–4× throughput, independent of sparsity |
| Sparse tensor formats (BSR) | Block-sparse matmul | Effective only above ~80% sparsity |
| Dedicated neuromorphic hardware (Loihi, Akida) | Native event-driven compute | Best match for spiking; not GPU |
5.3 Sparsity–Accuracy Pareto
When plotting spiking sparsity (x-axis) vs task accuracy (y-axis), the VS-WNO typically exhibits:
- Flat region (0–40% sparsity): accuracy barely drops.
- Graceful degradation (40–70%): small accuracy loss for meaningful deployment gains.
- Cliff region (> 80%): accuracy collapses rapidly.
The optimal operating point is typically in the 50–70% sparsity range, where deployment cost savings (if structured + fused) are non-trivial and accuracy remains acceptable.
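Picking an operating point from a measured sweep can be automated as the highest sparsity whose accuracy stays within a chosen tolerance of the best point. The sketch below uses placeholder (sparsity, accuracy) pairs that only mimic the flat, graceful, and cliff regions; they are not results from the paper.

```python
def pick_operating_point(points, max_accuracy_drop):
    """Return the highest-sparsity (sparsity, accuracy) pair within tolerance.

    `points` is a list of (sparsity, accuracy) tuples from a threshold sweep.
    """
    best_accuracy = max(acc for _, acc in points)
    admissible = [p for p in points if best_accuracy - p[1] <= max_accuracy_drop]
    return max(admissible, key=lambda p: p[0])

# Placeholder sweep results (illustrative only).
sweep = [(0.00, 0.950), (0.30, 0.948), (0.60, 0.935), (0.85, 0.800), (0.95, 0.400)]

chosen = pick_operating_point(sweep, max_accuracy_drop=0.02)
strict = pick_operating_point(sweep, max_accuracy_drop=0.005)
```

The tolerance encodes how much accuracy the application can give up; tightening it pushes the chosen point back towards the flat region.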
6. Pitfalls
6.1 Spiking Sparsity ≠ Speed
The #1 misconception: "If my spiking network has 80% sparsity, it will be 5× faster."
On GPUs, this is almost never true. Dense matrix multiplication (GEMM) on Tensor Cores is so highly optimised that skipping 80% of the operations through unstructured sparsity often makes inference slower, because:
- Sparse kernels have lower arithmetic intensity.
- GPU warp schedulers cannot efficiently skip individual elements.
- The overhead of sparse data structures (indices, masks) adds memory traffic.
Mitigation: Use structured sparsity, custom fused kernels, or target dedicated neuromorphic hardware.
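The index overhead mentioned above can be quantified by comparing storage for a dense FP16 matrix against the same matrix in CSR form. The sketch assumes 2-byte values and 4-byte indices; the 1024 × 1024 size is arbitrary.

```python
def dense_bytes(rows, cols, value_bytes=2):
    """Storage for a dense matrix."""
    return rows * cols * value_bytes

def csr_bytes(rows, cols, sparsity, value_bytes=2, index_bytes=4):
    """Storage for CSR: one value and one column index per non-zero,
    plus a row-pointer array of length rows + 1."""
    nnz = round(rows * cols * (1.0 - sparsity))
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes

rows = cols = 1024
dense = dense_bytes(rows, cols)
csr_at_50 = csr_bytes(rows, cols, sparsity=0.5)
csr_at_80 = csr_bytes(rows, cols, sparsity=0.8)
```

With these byte widths each stored non-zero costs three times a dense element, so CSR only moves fewer bytes once sparsity passes roughly two thirds; below that it increases memory traffic.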
6.2 Memory Bottlenecks Dominate on Edge GPUs
The Jetson Orin Nano shares 8 GB between CPU and GPU, with ~68 GB/s of memory bandwidth at stock clocks (~102 GB/s in Super mode). For neural operators processing 128 × 128 or larger feature maps:
- Activation memory grows with number of time-steps and layers.
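- Wavelet transform buffers: each 2D DWT level produces four sub-bands (one approximation, three detail) that must be held alongside the input feature map. An undecimated transform multiplies the footprint by ~4× per level; a decimated one keeps the coefficient count close to the input size but still adds buffers at every level.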
- Spiking sparsity reduces compute but does not proportionally reduce activation memory — the membrane state and spike mask must still be stored.
Mitigation: Gradient checkpointing (training), activation recomputation, reduce time-steps, use lower-resolution wavelet decomposition.
6.3 Batch Size Effects
On Jetson hardware, latency and throughput are highly sensitive to batch size:
| Batch Size | Latency/Sample | Throughput | GPU Utilisation |
|---|---|---|---|
| 1 | Lowest | Low | ~30–50% |
| 4 | Moderate | Peak | ~80–95% |
| 8 | Higher | Similar to 4 | ~90–100% |
| 16+ | OOM risk | N/A | Memory limited |
For spiking models with T time-steps unrolled along the batch dimension, the effective batch is batch_size × T. A batch of 2 with 4 time-steps gives an effective batch of 8, which can exceed memory on 8 GB devices.
6.4 Wavelet Library Compatibility
- PyWavelets (pywt): NumPy/CPU-based and not TensorRT compatible. Use only during training.
- Custom DWT CUDA kernels: Required for TensorRT/ONNX deployment unless the approximation below is used.
- Approximation: Replace DWT with strided convolutions + fixed wavelet filters (Haar, DB2) for easier export.
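The strided-convolution approximation is exact for the Haar wavelet: one 2D DWT level equals four fixed 2 × 2 filters applied with stride 2. The plain-Python sketch below illustrates this. The sub-band names follow one common convention (others swap LH and HL), and in a framework the same filters would be loaded as frozen convolution weights.

```python
# Orthonormal 2D Haar analysis filters (outer products of [1, 1]/sqrt(2)
# and [1, -1]/sqrt(2)); the naming convention is an assumption.
HAAR_2D = {
    "LL": [[0.5, 0.5], [0.5, 0.5]],
    "LH": [[0.5, 0.5], [-0.5, -0.5]],
    "HL": [[0.5, -0.5], [0.5, -0.5]],
    "HH": [[0.5, -0.5], [-0.5, 0.5]],
}

def strided_conv2x2(image, kernel):
    """2x2 correlation with stride 2 and no padding (even-sized input)."""
    h, w = len(image), len(image[0])
    return [
        [sum(kernel[a][b] * image[i + a][j + b] for a in range(2) for b in range(2))
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]

def haar_dwt2(image):
    """One DWT level: four sub-bands, each at half resolution per axis."""
    return {name: strided_conv2x2(image, k) for name, k in HAAR_2D.items()}

flat = [[3.0] * 4 for _ in range(4)]      # constant 4x4 test image
bands = haar_dwt2(flat)
coeff_count = sum(len(b) * len(b[0]) for b in bands.values())
```

A constant image puts all of its energy in the LL band and leaves the three detail bands at zero, which is a convenient check that the filters were wired up with the right signs.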
6.5 Thermal Throttling
The Jetson Orin Nano in the default dev kit can thermally throttle within 2–5 minutes of sustained inference at MAXN mode (15 W). This introduces non-deterministic latency spikes that distort profiling results.
Mitigation: Active cooling, 7 W power mode for sustained workloads, or report results with thermal steady-state measurements.
6.6 Spiking Time-Step Accumulation Overhead
Each additional time-step increases latency roughly linearly but provides diminishing accuracy returns after ~4–6 steps. The paper recommends:
- 2–4 time-steps for edge deployment.
- 6–10 time-steps for server-side inference where latency is less critical.
7. Quick-Start Checklist
- Train VS-WNO with variable threshold scaling; sweep α ∈ {0.5, 1.0, 1.5, 2.0, 3.0, 4.0}
- Evaluate accuracy vs sparsity curve on validation set
- Export to ONNX with fixed wavelet filters (Haar/DB2)
- Build TensorRT engine with FP16 on Jetson Orin Nano
- Profile with trtexec + tegrastats; record latency, power, energy
- Identify sweet spot (typically 50–70% sparsity, 2–4 time-steps)
- If latency plateau: investigate custom fused kernels or structured sparsity
- Validate thermal steady-state (run ≥ 5 min before recording metrics)
8. References
VS-WNO Paper: "Variable Spiking Wavelet Neural Operator: A Systematic Study of Spiking Sparsity vs Real-World Deployment Cost", arXiv 2604.17040. Categories: cs.LG, cs.AR, cs.NE.
Wavelet Neural Operator: Tripura, T., and Chakraborty, S. "Wavelet neural operator: a neural operator for parametric partial differential equations." arXiv preprint, 2022.
Fourier Neural Operator (FNO): Li, Z., et al. "Fourier Neural Operator for Parametric Partial Differential Equations." ICLR, 2021.
Spiking Neural Networks on GPUs: Fang, W., et al. "Incorporating Learnable Membrane Time Constants to Enhance Learning of Spiking Neural Networks." ICCV, 2021. (learnable-τ LIF neurons trained with surrogate gradients)
Structured Sparsity on GPUs: NVIDIA Ampere sparse Tensor Core documentation; 2:4 structured sparsity pattern for 2× throughput.
Jetson Orin Nano: NVIDIA Jetson Orin Nano Developer Kit User Guide, NVIDIA Developer Documentation, 2024.
Norse (Spiking Library): Pehle, C., et al. "Norse — A library for deep learning with spiking neural networks." GitHub, 2021.
TensorRT: NVIDIA TensorRT Documentation, Developer Zone, 2024.