lvsa-quickstart - SKILL.md Agent Skill

name: lvsa-quickstart description: Install LVSA and generate your first long video. Use when setting up LVSA from scratch, picking SDPA vs FlashInfer backend, configuring LVSA_REFERENCE_LATENT_FRAMES for a model, or verifying the sparse path engaged via [LVSA] log lines.

LVSA Quickstart

Overview

LVSA (Long Video Sparse Attention) is a training-free block-sparse attention engine for video diffusion transformers (Wan 2.x, HunyuanVideo 1.5, Cosmos 3.0 (experimental), CogVideoX). It accelerates long-video generation 1.4–3.8× and enables generation beyond the training horizon where dense attention OOMs on 80 GB GPUs. Code: https://github.com/JiusiServe/LongVideoSparseAttention.

This skill covers: installation, choosing a backend, setting the per-model reference, running a first generation, and verifying that LVSA is actually engaged (not silently falling back to dense).

Install

git clone https://github.com/JiusiServe/LongVideoSparseAttention
cd LVSA

# Use uv (or any venv tool)
uv venv --python 3.12
source .venv/bin/activate

# Core library
uv pip install -e ".[diffusers,hunyuan,dev]"

# FlashInfer backend (optional but recommended on CUDA — fastest at long sequences)
uv pip install -e ".[flashinfer]"
# Requires nvcc + CUDA toolkit for JIT compilation.

# vllm-omni plugin (optional — only for serving; use a separate .venv-vllm).
# vllm-omni 0.22.0 is a stable release built from the git tag to match vllm
# 0.22.0. For the older pip-installable 0.18.0 pair, use the release/v0.18.x branch.
uv pip install -e lvsa-vllm-omni/
uv pip install "vllm==0.22.0"
uv pip install --no-build-isolation \
  "vllm-omni @ git+https://github.com/vllm-project/vllm-omni.git@v0.22.0"

# VQeval (optional — for quality benchmarking)
uv pip install -e vqeval/

Verify:

pytest tests/ lvsa-vllm-omni/tests/ -v
# Expect 280+ tests passing (CPU-only, no GPU needed)

Pick a backend

Backend	When to use	Hardware
SDPA (default)	Always available; reasonable at training horizon	CUDA + Ascend NPU
FlashInfer	Fastest at extension (T_lat ≥ 49)	CUDA only (needs nvcc)

The --flashinfer flag enables FlashInfer; otherwise SDPA. If FlashInfer install fails on your box (no nvcc / wrong CUDA), the dispatcher will refuse to fall back silently — it'll raise a clear error and you can re-run without --flashinfer.

Pick the right reference per model

This is the single most common LVSA configuration mistake. Always set explicitly.

Model	`LVSA_REFERENCE_LATENT_FRAMES`	Video frames at 1×
Wan 2.1 / 2.2 (1.3B, 14B, A14B)	`21`	81
Wan 2.2 TI2V-5B (high-compression VAE)	`31`	121
HunyuanVideo 1.5	`33`	129
Cosmos 3.0 (Nano, experimental)	`48`	189 (@720p)
CogVideoX 5B	`13`	49

For the standalone example scripts the value is wired automatically by lvsa/adapters/<model>.py::reference_latent_frames() (or, for Cosmos 3.0, by the COSMOS3_REFERENCE_LATENT_FRAMES=48 default in lvsa/cosmos3.py). Override only if your fork uses non-default geometry. Cosmos 3.0 standalone runs via examples/cosmos_generate.py (single-GPU, SDPA, needs diffusers main); the plugin path (lvsa-vllm-omni skill) additionally gives Cosmos FlashInfer + multi-GPU (TP/CFG/PP/HSDP).

Run a first generation

Single GPU, training horizon (no extension)

python examples/wan_generate.py \
    --model /path/to/Wan2.1-T2V-1.3B-Diffusers \
    --prompt "A dog running in the forest." \
    --num-frames 81 \
    --lvsa --auto-keyframes \
    --output-name dog_1x.mp4

At T_lat ≤ reference, the auto-scheduler returns kfi=1 → fully-dense attention via the LVSA path. You get the implementation-bypass speedup (~1.5–2×) but no pattern-driven sparsity.

Single GPU, 4× horizon (the headline case)

python examples/wan_generate.py \
    --model /path/to/Wan2.1-T2V-1.3B-Diffusers \
    --prompt "A dog running in the forest." \
    --num-frames 321 \
    --lvsa --flashinfer --rotate-keyframes --auto-keyframes \
    --output-name dog_4x.mp4

Add --riflex --riflex-s 4.0 to stack RIFLEx RoPE rescaling on top.

Multi-GPU (Ulysses context parallel)

torchrun --nproc_per_node=2 examples/wan_generate.py \
    --model /path/to/Wan2.1-T2V-1.3B-Diffusers \
    --prompt "A dog running in the forest." \
    --num-frames 481 \
    --lvsa --flashinfer --rotate-keyframes --auto-keyframes \
    --output-name dog_6x.mp4

Constraint: seq_len = T_lat × patches_per_frame must be divisible by nproc_per_node.

Verify LVSA engaged

After the run, look for these [LVSA] lines (the LVSA prints prefix with [LVSA] for standalone, [LVSA] for vllm-omni):

[LVSA] --rotate-keyframes: computed key_frame_interval=6 (latent frames)
[LVSA] total_lat_frames=81 local_seq=126360 rank0_frames~[0,80] window=3 n_first=1 kfi=6 global_count=14 attended_per_frame=21/81
[LVSA] installed on 30 blocks num_patches=1560 total_lat_frames=81 backend=FlashInfer CSR: MB=81 nnz=1701 density=25.9% block_size=1560 compact=81/81frames

Read it as:

attended_per_frame=N/T — N out of T frames are in each query's attention set. Sparsity = 1 − N/T. At training reference, N==T (dense).
backend=FlashInfer|SDPA — confirms the requested backend was selected.
installed on N blocks — number of attention layers LVSA wrapped.

If you see no [LVSA] lines, the --lvsa flag wasn't passed. If attended_per_frame=N/T shows N == T at extension lengths, your --num-frames is below the training reference (no real sparsity engaged) — or the geometry detection failed.

Adding NPU support

On Ascend NPU, the SDPA path works automatically through torch_npu. lvsa/device.py detects NPU and routes memory probes through torch.npu.*. FlashInfer is CUDA-only and won't run on NPU.

No NPU-specific custom kernels ship in v1.0 — only the SDPA path is exercised on NPU.

Common first-run gotchas

Symptom	Fix
"ModuleNotFoundError: lvsa"	`pip install -e .` from repo root
FlashInfer JIT fails ("Could not find nvcc")	Install CUDA toolkit (nvcc) or drop `--flashinfer`
Generated mp4 missing in Docker	Pass relative paths; bind-mount the repo; see `lvsa-troubleshooting` skill
`attended_per_frame=N/T` shows N=T at 4×	`LVSA_REFERENCE_LATENT_FRAMES` is too high for the model — verify against the table above

See lvsa-troubleshooting for the full failure-mode catalog.