name: seed-ssim-references
description: Seed HF reference artefacts for a single newly-added SSIM test (pixel .mp4 for run_text_to_video_similarity_test-style tests, or latent .pt for run_text_to_latent_similarity_test-style tests). Runs the test on Modal L40S, downloads the generated artefacts via modal volume get, pauses for the user to verify (visual eyeball for mp4, numerics dump for pt), then uploads only that test's files to FastVideo/ssim-reference-videos. Use when a new fastvideo/tests/ssim/test_*_similarity.py has just been added and has no references on HF yet.
Seed SSIM Reference Artefacts (mp4 or pt)
Purpose
A brand-new SSIM test in fastvideo/tests/ssim/ fails forever until its
reference artefacts exist on the HF dataset
(FastVideo/ssim-reference-videos). The dataset hosts two kinds of artefacts
side-by-side per (model_id, backend, prompt):
.mp4— pixel ground-truth for tests that callrun_text_to_video_similarity_test/run_image_to_video_similarity_testininference_similarity_utils.py. Compared via SSIM..pt— pre-VAE latent bundle (fp16 full latent + fp32 slice + metadata +slice_spec+format_version) for tests that callrun_text_to_latent_similarity_testinlatent_similarity_utils.py. Compared via cosine distance on the slice and the full tensor.
This skill:
- Detects which artefact type the test produces (pixel vs latent).
- Runs the test on Modal's L40S pool to generate the artefacts.
- Downloads them to the local repo via
modal volume get. - Pauses so the user can verify quality:
- mp4: visual eyeball in a video player.
- pt: numerics dump (shape, slice stats, NaN/Inf check, metadata).
- Uploads only the new test's files to HF, with a guard that refuses to overwrite anything already present.
The skill is run manually, once per new test. Before invoking it, the user
has already sanity-tested the new test locally — it launches VideoGenerator
and writes an artefact without crashing (the missing-reference assertion at
the end is expected). The skill does not re-test locally; it goes straight
to Modal L40S (which is what CI uses).
When to use
- A new
test_*_similarity.pyfile has been added infastvideo/tests/ssim/and the HF dataset has noreference_videos/default/L40S_reference_videos/<model_id>/subtree for it yet.
When not to use
- Regular CI runs — once refs exist,
pytest fastvideo/tests/ssim/downloads them automatically. - Re-seeding an existing test. That requires
--forceon the upload step, and is out of scope here; treat as a separate, deliberate operation.
Inputs
The skill has one required input: the path to the new SSIM test file. Prompt the user for it if they didn't supply it.
| Parameter | Required | Description |
|---|---|---|
test_file |
Yes | e.g. fastvideo/tests/ssim/test_ltx2_similarity.py. The skill's first action is to ask for this if missing. |
Everything else is fixed:
- Modal runner GPU: L40S (hardcoded in
fastvideo/tests/modal/ssim_test.py). - Device folder:
L40S_reference_videos. - Quality tier:
default(the tier CI runs). Thefull_qualitytier is not seeded by this skill. - HF repo:
FastVideo/ssim-reference-videos(dataset). - Multi-model test files: all model ids in
*_MODEL_TO_PARAMSare seeded together; the Modal run produces one mp4 per (model, prompt, backend) and the upload scopes by--model-id, looping if there is more than one.
Prerequisites
The user has confirmed:
modalCLI authenticated.HF_API_KEY(orHUGGINGFACE_HUB_TOKEN/HF_TOKEN) exported with write access toFastVideo/ssim-reference-videos.- The test file runs locally end-to-end (generates an mp4; SSIM assertion failure due to missing reference is expected and fine).
Fail fast if the token env var is missing.
Steps
1. Ask for the test file, then detect artefact type
If the user didn't name one, ask: "Which SSIM test file do you want to seed
references for? (e.g. fastvideo/tests/ssim/test_ltx2_similarity.py)".
Validate:
- Path exists and matches
fastvideo/tests/ssim/test_*_similarity.py. - File defines a
*_MODEL_TO_PARAMSdict — grep it to extract the set of model ids. Those ids drive step 5.
Detect artefact type by inspecting the file's imports / helper call:
- latent (
.pt) — file importsrun_text_to_latent_similarity_testfromfastvideo.tests.ssim.latent_similarity_utils(or any other helper that ends with_latent_similarity_test). - pixel (
.mp4) — file importsrun_text_to_video_similarity_test/run_image_to_video_similarity_testfromfastvideo.tests.ssim.inference_similarity_utils, OR uses the legacy custom-inline helper pattern (seetest_gamecraft,test_longcat, etc.). Default to pixel when both heuristics fail.
Record ARTEFACT_TYPE ∈ {pixel, latent} for use in step 4. Steps 2, 3, 5,
and 6 are artefact-type-agnostic — _iter_reference_files,
copy_generated_to_reference, and upload_reference_videos already walk
both .mp4 and .pt (see reference_videos_cli.py).
If either check fails, stop and tell the user what's wrong.
2. Run the test on Modal L40S
Pick a subdir name so repeated runs don't collide:
SHORT_COMMIT=$(git rev-parse --short=12 HEAD)
TIMESTAMP=$(date -u +%Y%m%d_%H%M%S)
SUBDIR="${TIMESTAMP}_${SHORT_COMMIT}"
Then launch the Modal run. The IMAGE_VERSION and BUILDKITE_* env-prefix
must match what CI exports in .buildkite/scripts/pr_test.sh, otherwise
fastvideo/tests/modal/ssim_test.py resolves a different GHCR image tag
(default is latest, CI is py3.12-latest) and bakes different values into
the image's frozen env block (ssim_test.py:17-18, 38-46). Mismatched image
or env produces SSIM drift that doesn't show up until the same commit runs
in CI.
IMAGE_VERSION="py3.12-latest" \
BUILDKITE_REPO="$(git config --get remote.origin.url)" \
BUILDKITE_COMMIT="$(git rev-parse HEAD)" \
BUILDKITE_PULL_REQUEST="${BUILDKITE_PULL_REQUEST:-false}" \
modal run fastvideo/tests/modal/ssim_test.py \
--git-repo="$(git config --get remote.origin.url)" \
--git-commit="$(git rev-parse HEAD)" \
--hf-api-key="$HF_API_KEY" \
--test-files="<test_file>" \
--sync-generated-to-volume \
--generated-volume-subdir="$SUBDIR" \
--skip-reference-download \
--no-fail-fast
Env prefix rationale (parity with CI; see .buildkite/pipeline.yml:1-3 and
.buildkite/scripts/pr_test.sh:62-83):
IMAGE_VERSION=py3.12-latest: pins the Modal image tag to the same one CI uses. Without this,ssim_test.py:17falls back tolatest, which on GHCR is built fromDockerfile.python3.10— different Python, torch, and flash-attn wheel than CI'spy3.12-latest(infra-build-image.yml:51-67,_template-build-image.yml:65-101).BUILDKITE_REPO/BUILDKITE_COMMIT/BUILDKITE_PULL_REQUEST: mirror what Buildkite exports.ssim_test.py:38-46bakes these into the image's.env(...)block; mismatched values can perturb in-container code paths that branch on PR-vs-non-PR.falseforBUILDKITE_PULL_REQUESTmatches Buildkite's "non-PR build" sentinel.
Flag rationale:
--skip-reference-download: no refs exist yet, so conftest must not try to pull them.--no-fail-fast: lets the test finish generation before_assert_similarityraisesFileNotFoundError: Reference video folder does not exist. The expected failure is what we want — the mp4 has already been written.--sync-generated-to-volume+--generated-volume-subdir: copies the generated mp4s to thehf-model-weightsModal volume underssim_generated_videos/default/<SUBDIR>/generated_videos/so we can pull them locally.
The Modal run will end with a nonzero exit (expected) and print a
modal volume get hf-model-weights ssim_generated_videos/default/<SUBDIR>/generated_videos ./generated_videos_modal/default
command. Capture that <SUBDIR> — you need it for step 3.
3. Download generated videos locally
modal volume get --force hf-model-weights \
ssim_generated_videos/default/"$SUBDIR"/generated_videos \
./generated_videos_modal/default
--force is required when the parent ./generated_videos_modal/default
already exists; without it, modal volume get errors with [Errno 21] Is a directory. Safe to pass on the first run too.
After this, the mp4s live at
./generated_videos_modal/default/generated_videos/L40S_reference_videos/<model_id>/<backend>/<prompt>.mp4.
The extra generated_videos/ level comes from the volume layout in
_sync_generated_videos_to_volume (ssim_test.py) — the command copies
<repo>/fastvideo/tests/ssim/generated_videos/<tier> to
ssim_generated_videos/<tier>/<SUBDIR>/generated_videos/, and modal volume get preserves that trailing generated_videos/ segment.
4. PAUSE — user reviews quality
Type-aware verification.
For ARTEFACT_TYPE = pixel — list the downloaded mp4s and ask the user to
open them in a video player:
"Generated videos downloaded to
./generated_videos_modal/default/generated_videos/L40S_reference_videos/. Please open them and confirm the quality looks correct. Replyuploadto continue, or anything else to abort."
For ARTEFACT_TYPE = latent — .pt files are not human-watchable. Print
a numerics dump for each .pt so the user can sanity-check shape, distribution,
and metadata:
import torch
from pathlib import Path
ROOT = Path("./generated_videos_modal/default/generated_videos/L40S_reference_videos")
for p in sorted(ROOT.rglob("*.pt")):
d = torch.load(p, map_location="cpu", weights_only=False)
s = d["expected_slice"]
L = d["latent"].float()
print(f"=== {p.relative_to(ROOT)} ===")
print(f" format_version: {d['format_version']}")
print(f" shape: {d['shape']}")
print(f" dtype_original: {d['dtype_original']}")
print(f" slice_spec: {d['slice_spec']}")
print(f" slice shape={tuple(s.shape)} mean={s.mean():+.4f} std={s.std():.4f} min={s.min():+.4f} max={s.max():+.4f}")
print(f" latent shape={tuple(L.shape)} mean={L.mean():+.4f} std={L.std():.4f} min={L.min():+.4f} max={L.max():+.4f}")
print(f" finite: latent NaN={torch.isnan(L).any().item()} Inf={torch.isinf(L).any().item()}; "
f"slice NaN={torch.isnan(s).any().item()} Inf={torch.isinf(s).any().item()}")
print(f" metadata: {d['metadata']}\n")
Sanity criteria:
format_version == 1(matchesLATENT_REFERENCE_FORMAT_VERSION).shapematches what the model produces (e.g. LTX-2 distilled =[1, 128, T_lat, H_lat, W_lat]; Stable Audio Open 1.0 =[1, 64, 1024]).slice_spec.kindmatches a registered kind (corner_3x3_first_framefor video,audio_first_8_timestepsfor audio).- No
NaN/Inf.mean ≈ 0,std ≈ 1(denoised latents stay close to the initial Gaussian distribution; very wide deviations suggest numerical drift). metadata.promptmatches the test's prompt.
Then ask:
"Numerics look right? Reply
uploadto continue, or anything else to abort."
Do not proceed until the user explicitly says upload. If they abort, leave
everything on disk so they can inspect further — no cleanup.
5. Copy into the local reference layout
Scoped copy — only the new test's artefacts. Single command works for both
artefact types because _iter_reference_files walks .mp4 and .pt:
python fastvideo/tests/ssim/reference_videos_cli.py copy-local \
--quality-tier default \
--device-folder L40S_reference_videos \
--generated-dir ./generated_videos_modal/default/generated_videos/L40S_reference_videos
(The --generated-dir points at the device-folder root inside the
downloaded tree; copy-local walks all <model>/<backend>/*.{mp4,pt}
underneath it. Since the Modal run was scoped to a single test file via
--test-files, only that test's model(s) are present — so the copy is
implicitly per-test.)
Result for pixel: fastvideo/tests/ssim/reference_videos/default/L40S_reference_videos/<model_id>/<backend>/<prompt>.mp4.
Result for latent: same path with .pt extension.
6. Upload to HF — scoped per model_id, with overwrite guard
For each <model_id>:
python fastvideo/tests/ssim/reference_videos_cli.py upload \
--quality-tier default \
--device-folder L40S_reference_videos \
--model-id "<model_id>"
The upload command:
- Uploads only
reference_videos/default/L40S_reference_videos/<model_id>/. - Refuses if any file already exists at that path on HF (this is the
guard — seeding a new test should never clobber existing refs). To override,
the user must re-run with
--force. If the guard fires, stop and report exactly which files exist; do not silently--force.
Reads the HF token from HF_API_KEY / HUGGINGFACE_HUB_TOKEN / HF_TOKEN.
7. Report success
List what was uploaded (paths in repo) and remind the user to push any
related code changes. Do not auto-verify by re-running Modal — the user
can run pytest fastvideo/tests/ssim/<test_file> later to confirm end-to-end;
it will auto-download the refs they just uploaded.
Failure modes and how to handle them
HF_API_KEYunset. Stop before step 2. The Modal run needs it (passed via--hf-api-key), and step 6 needs it for upload. If the user ranhf auth logininstead of exporting an env var, read the cached token viahuggingface_hub.get_token()and forward it to Modal as--hf-api-key="$CACHED_TOKEN".- Modal run fails before generation. No artefacts on the volume — nothing
to download. Fix the test locally (
pytest fastvideo/tests/ssim/<test_file>) and retry from step 2. ./generated_videos_modal/default/L40S_reference_videos/missing aftermodal volume get. The run didn't produce artefacts (most likely the test crashed before writing, orREQUIRED_GPUSexceeded the partition capacity — see Modal logs).- Latent test crashed with FSDP / inference_mode error
(
RuntimeError: Inference tensors do not track version counter). The test must passinit_kwargs_override={"use_fsdp_inference": False}whensp_size == 1— seetest_stable_audio_similarity.pyfor the pattern. Fix in the test, push, retry. - Upload guard fires (files already exist). The test name / model id
collides with something already on HF. Verify the user actually wants to
replace existing refs; if so, re-run the upload with
--force. If not, rename the model id in*_MODEL_TO_PARAMSand re-seed. - Quality looks wrong in step 4. Abort. The artefacts stay on disk for
inspection. The fix is usually in the test's params (resolution, steps,
seed) — edit the test, then re-run the skill.
- For latent: also check
slice_spec.kindmatches the latent rank (corner_3x3_first_framerequires 5-D,audio_first_8_timestepsrequires 3-D); a rank/kind mismatch raises in_extract_expected_slice.
- For latent: also check
Design notes (for future skill maintainers)
- The skill deliberately runs on Modal, not locally, because the CI runner is L40S. Seeding from a different GPU SKU produces refs that CI's L40S runs can't match (pixel SSIM drifts across SKUs; latent cosine has tighter cross-SKU bf16 drift but the configured tolerances assume same-SKU seed → same-SKU verify).
- The skill is default-tier only.
full_qualityrefs are seeded by a separate, deliberate operation — they double runtime and aren't what CI gates on. - The overwrite guard in
reference_videos_cli.py uploadis default-on specifically because this skill exists. Re-seeding is a distinct operation that requires explicit--force. - Both artefact types share the same Modal flow: the orchestrator sets
--skip-reference-download+--no-fail-fast, runs pytest, the test's helper writes the artefact (.mp4viaimageiofor pixel,save_latent_reference→torch.savefor latent) BEFORE the missing-reference assertion raises._sync_generated_videos_to_volumeinssim_test.pydoes ashutil.copytreeof the wholegenerated_videos/tree, picking up.mp4,.pt, and the*_ssim.json/*_latent.jsonmetric files alongside.
References
fastvideo/tests/modal/ssim_test.py— Modal orchestrator; see--sync-generated-to-volume,--generated-volume-subdir,--skip-reference-download,--no-fail-fast.fastvideo/tests/ssim/reference_videos_cli.py—copy-local,upload(with--model-id,--force),download,ensuresubcommands. Extension allowlist isREFERENCE_EXTENSIONS = VIDEO_EXTENSIONS + LATENT_EXTENSIONS(.pt).fastvideo/tests/ssim/README.md— reference layout, HF repo conventions.fastvideo/tests/ssim/inference_similarity_utils.py— pixel helpers (run_text_to_video_similarity_test,run_image_to_video_similarity_test,build_init_kwargs).fastvideo/tests/ssim/latent_similarity_utils.py— latent helper (run_text_to_latent_similarity_test), slice spec dispatch (_extract_expected_slice), reference schema (save_latent_reference/load_latent_reference),LATENT_REFERENCE_FORMAT_VERSION.
Changelog
| Date | Change |
|---|---|
| 2026-04-17 | Initial version (Modal sync-to-volume flow). |
| 2026-04-21 | Rewrite: single-test scope, explicit user-review pause, per-model_id upload, HF overwrite guard. Dropped scripts/seed_ssim.sh. |
| 2026-04-21 | Post-first-run fixes: modal volume get needs --force when parent exists; download tree has an extra generated_videos/ level so --generated-dir must reflect it. |
| 2026-05-01 | Latent (*.pt) artefact support: artefact-type detection in step 1, type-aware verification (visual eyeball for mp4, numerics dump for pt) in step 4, FSDP+inference_mode failure-mode added, design notes for the unified Modal flow. Triggered by PR #1253 (LTX-2 latent migration + Stable Audio latent test). |