lvsa-tuning

star 17

Tune LVSA for quality vs speed. Use when adjusting sparsity_scale, choosing window_size and n_first_frames, deciding when --rotate-keyframes pays off, composing LVSA with RIFLEx, or hitting a quality regression and needing to back off sparsity.

JiusiServe By JiusiServe schedule Updated 6/1/2026

name: lvsa-tuning description: Tune LVSA for quality vs speed. Use when adjusting sparsity_scale, choosing window_size and n_first_frames, deciding when --rotate-keyframes pays off, composing LVSA with RIFLEx, or hitting a quality regression and needing to back off sparsity.

LVSA Tuning

The three primary knobs

Knob What it controls When to touch it
reference_latent_frames Per-query attention budget anchor Set once per model (Wan=21, HV=33, Cog=13). Don't change at runtime.
sparsity_scale Multiplier on the budget The runtime quality/speed dial. Default 1.0; lower = sparser.
window_size, n_first_frames Local-window geometry Usually leave at defaults (12 frames / 4 frames). Only touch if you want a tighter floor.

sparsity_scale — the headline dial

LVSA_SPARSITY_SCALE (env var) or --sparsity-scale (CLI). Scales the auto-keyframe scheduler's per-query budget.

scaled_ref     = max(n_first + 1, int(reference_frames × sparsity_scale))
target_attended = min(scaled_ref, T_lat)

Empirical results on HunyuanVideo at 129 frames (training reference, single-prompt "dog"):

sparsity_scale Per-step Speedup vs dense VQeval composite VQeval loop
(dense baseline) 44.0 s 57.6 32.6
0.5 (aggressive) 18.3 s 2.40× 65.2 (+7.6) 73.6 (+41.0)
1.0 (default) 22.5 s 1.96× 61.3 (+3.7) 63.0 (+30.4)

Rule of thumb

Goal sparsity_scale Why
Match dense quality at training reference, take implementation speedup 1.0 At T_lat ≤ ref this collapses to kfi=1 (fully dense). Speedup comes from bypassing native attention overhead.
Maximum speedup at training reference 0.5 Engages pattern-driven sparsity even at T_lat=ref. Big loop-quality gains; ~5pt drop on dynamic_quality.
Aggressive extrapolation, OOM-prevention at 3×+ horizon 0.5 Shrinks compact-K buffer, helps fit on 80 GB.
Conservative quality at extrapolation 0.75 Reduces sparsity gradient; less speedup but keeps motion intact.

Important nuances

  • At T_lat ≤ reference, any sparsity_scale ≥ 1.0 collapses to kfi=1 (fully dense). The visible speedup is implementation efficiency only.
  • sparsity_scale = 2.0 is equivalent to 1.0 at T_lat ≤ reference (both give kfi=1). The conservative knob is meaningful only at extrapolation lengths.
  • sparsity_scale = 0.5 activates real pattern sparsity even at training reference: HV's budget shrinks from 33 to 16 latents at 1×, giving ~52% coverage.
  • The large loop_quality gain at s=0.5 comes from --rotate-keyframes dithering the attention pattern each step. Disable rotation and the loop gain disappears.

window_size + n_first_frames

Defaults:

  • window_size = 12 video frames = 3 latent frames (W=3)
  • n_first_frames = 4 video frames = 1 latent frame (n_first=1)

Floor of attended frames per query: 2W+1 + n_first = 8 latent frames.

When to reduce W: never, unless your reference_latent_frames is below the floor. The defaults are tuned for current models.

When to increase W (e.g. W=4):

  • Motion-heavy prompts losing dynamic_quality at extension — bigger window = more long-range mixing inside each query's attended set.
  • Costs ~10% wall time per W += 1.

Should I use --rotate-keyframes?

At length Without rotation With rotation
T_lat ≤ reference No effect (kfi=1 means every frame is a global anyway) No effect
Slight extension (T ≈ 1.5×) Static keyframes can introduce period artifacts Smoother
Heavy extension (T ≥ 3×) Output starts to loop / freeze Strongly preferred — this is the mechanism that prevents the "frozen video" failure mode

Default --rotate-keyframes on whenever you're extending. Off at training horizon adds nothing.

Composing with RIFLEx

RIFLEx rescales the RoPE frequencies to extrapolate beyond the training horizon. It's orthogonal to LVSA (RoPE-only, no attention compute change) and stacks cleanly:

python examples/wan_generate.py \
    --model /path/to/Wan2.1-T2V-1.3B-Diffusers \
    --prompt "..." \
    --num-frames 321 \
    --lvsa --flashinfer --rotate-keyframes --auto-keyframes \
    --riflex --riflex-s 4.0

At extension lengths RIFLEx + LVSA-FI is the recommended recipe. On the SotA grid (Wan 1.3B, 5 prompts):

Horizon LVSA-FI alone LVSA-FI + RIFLEx
1.43× faster than Dense ~same speed, slight quality bump
2.41× faster than Dense ~same speed, +1 VQeval

RIFLEx adds zero measurable wall-time overhead (verified: 0.99–1.00× Dense).

Verifying engagement

After every run, the [LVSA] log line tells you exactly what the scheduler did:

[LVSA] kfi=6 global_count=14 attended_per_frame=21/81
  • kfi=6 — every 6th frame is a periodic global anchor (auto-derived)
  • global_count=14 — total global frames in the pattern (n_first + periodic)
  • attended_per_frame=21/81 — each query attends to 21 frames out of 81 → 74% sparsity

For non-default geometry, use the inline helper in docs/tuning.md to compute the budget yourself.

Diagnostics

Symptom Likely root cause Fix
No quality improvement vs Dense sparsity_scale too high at training horizon Drop to 0.5
Motion quality regressed Window too small for fast-motion prompt Try --window-size 16 (W=4)
Video loops at extension --rotate-keyframes not set Add the flag
attended_per_frame=N/T shows N==T at extension reference_latent_frames too high Verify per-model value

See lvsa-troubleshooting for the full failure-mode catalog.

Install via CLI
npx skills add https://github.com/JiusiServe/LongVideoSparseAttention --skill lvsa-tuning
Repository Details
star Stars 17
call_split Forks 3
navigation Branch main
article Path SKILL.md
More from Creator