evaluate-video-quality - SKILL.md Agent Skill

name: evaluate-video-quality
description: Evaluate generated video quality using available metrics (SSIM, loss trajectory, caption consistency)

Evaluate Video Quality

Purpose

Assess the quality of videos generated by a training run by combining several signals into one overall assessment. This skill is evolving; new metrics will be added as they are developed.

Prerequisites

Generated videos available locally or via W&B artifacts.
For SSIM: reference videos from official implementations.
For caption consistency (optional): LLM access. This metric is currently a stub.

Inputs

| Parameter | Required | Description |
|-----------|----------|-------------|
| `video_paths` | Yes | List of paths to generated videos |
| `reference_paths` | No | Paths to reference videos (for SSIM) |
| `prompts` | No | Prompts used to generate videos (for caption check) |
| `loss_summary` | No | Path to W&B summary JSON (for loss trajectory) |
| `metrics` | No | Which metrics to run (default: all available) |
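
Since each optional input enables one metric, the "all available" default can be resolved from whichever inputs were supplied. The following is a minimal sketch of that idea; the function name `select_metrics` and the metric identifiers are hypothetical and are not defined by this skill.

```python
from typing import List, Optional, Sequence

# Hypothetical metric identifiers; the skill does not define canonical names.
ALL_METRICS = ("ssim", "loss_trajectory", "caption_consistency")


def select_metrics(
    video_paths: Sequence[str],
    reference_paths: Optional[Sequence[str]] = None,
    prompts: Optional[Sequence[str]] = None,
    loss_summary: Optional[str] = None,
    metrics: Optional[Sequence[str]] = None,
) -> List[str]:
    """Return the requested metrics whose optional inputs were supplied."""
    if not video_paths:
        raise ValueError("video_paths is required and must not be empty")

    # A metric is runnable only when the input it depends on is present.
    runnable = {
        "ssim": bool(reference_paths),
        "loss_trajectory": bool(loss_summary),
        "caption_consistency": bool(prompts),
    }
    requested = ALL_METRICS if metrics is None else tuple(metrics)
    unknown = [m for m in requested if m not in runnable]
    if unknown:
        raise ValueError(f"unknown metrics: {unknown}")
    return [m for m in requested if runnable[m]]
```

Metrics that were requested but lack their input are silently skipped here; a stricter variant could raise instead.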

Available Metrics

Check .agents/memory/evaluation-registry/README.md for the current catalog.

SSIM (Active)

Leverages the existing infrastructure in fastvideo/tests/ssim/.

pytest fastvideo/tests/ssim/ -vs --video-path <generated> --reference-path <reference>

Or use the SSIM utility directly:

from fastvideo.tests.ssim.ssim_utils import compute_ssim
score = compute_ssim(generated_video, reference_video)
# Scores above 0.80 are typically acceptable (see the interpretation table below)

Interpretation:

| SSIM Range | Quality |
|------------|---------|
| > 0.90 | Excellent: very close to reference |
| 0.80–0.90 | Good: acceptable for most uses |
| 0.70–0.80 | Fair: noticeable differences |
| < 0.70 | Poor: significant quality issues |
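
The bands above can be applied programmatically when scoring many videos. This is a small sketch; the function name `ssim_quality` is hypothetical, and the handling of exact boundary values (0.90 and 0.80 count as Good, 0.70 as Fair) is an assumption, since the table does not say which band a boundary belongs to.

```python
def ssim_quality(score: float) -> str:
    """Map an SSIM score to the quality bands in the interpretation table."""
    # Boundary handling is an assumption: 0.90 and 0.80 fall in "Good",
    # 0.70 falls in "Fair".
    if score > 0.90:
        return "Excellent"
    if score >= 0.80:
        return "Good"
    if score >= 0.70:
        return "Fair"
    return "Poor"
```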

Loss Trajectory (Active)

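Read the final logged values from the W&B summary JSON. By default the summary holds only the most recent value of each metric, so it works as a final-state check rather than a view of the full curve: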

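import json

# loss_summary_path is the `loss_summary` input (path to the W&B summary JSON)
with open(loss_summary_path) as f:
    summary = json.load(f)

final_loss = summary["train_loss"]  # last logged training loss
runtime = summary["_runtime"]       # run duration in seconds
steps = summary["_step"]            # last logged step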

Early-stage heuristics (first 500 steps):

Loss should be decreasing (even slightly).
Grad norm should be stable (no wild oscillations).
If loss is flat or increasing, flag for review.
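
The heuristics above could be automated roughly as follows. This is a sketch under stated assumptions: the per-step loss and grad-norm values for the first 500 steps have already been exported as plain lists, and the function names and the `rel_tol` and `max_cv` thresholds are hypothetical defaults, not calibrated values.

```python
from statistics import mean, pstdev
from typing import Sequence


def loss_trend(losses: Sequence[float], rel_tol: float = 0.01) -> str:
    """Classify a loss series as 'decreasing', 'flat', or 'increasing'.

    Compares the mean of the first quarter with the mean of the last
    quarter; a relative change within rel_tol counts as flat.
    """
    if len(losses) < 4:
        raise ValueError("need at least 4 loss values")
    quarter = len(losses) // 4
    head = mean(losses[:quarter])
    tail = mean(losses[-quarter:])
    change = (tail - head) / abs(head) if head != 0 else tail - head
    if change < -rel_tol:
        return "decreasing"
    if change > rel_tol:
        return "increasing"
    return "flat"


def grad_norm_is_stable(grad_norms: Sequence[float], max_cv: float = 0.5) -> bool:
    """True when the coefficient of variation (std / mean) is at most max_cv."""
    if not grad_norms:
        raise ValueError("grad_norms must not be empty")
    avg = mean(grad_norms)
    if avg == 0:
        return False  # all-zero gradients: flag for review
    return pstdev(grad_norms) / avg <= max_cv
```

A result of `"flat"` or `"increasing"`, or an unstable grad norm, corresponds to the "flag for review" case.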

Caption Consistency (Draft — Not Yet Calibrated)

Use an LLM to evaluate whether the video content matches the input prompt.

Prompt: "A golden retriever playing in the snow"
Video: <path>

Score the video on:
1. Object presence (is there a golden retriever?)
2. Action accuracy (is it playing?)
3. Environment match (is there snow?)
4. Overall coherence (does it look natural?)

Each 1-5, total /20.

⚠️ This metric is in draft status. Results should not be treated as ground truth until calibrated against human judgments.
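
Once the LLM returns the four rubric scores, they need to be validated and summed. The sketch below covers only that aggregation step and makes no LLM call; the criterion keys and the function name are hypothetical.

```python
from typing import Mapping

# Hypothetical keys for the four rubric criteria listed above.
RUBRIC = (
    "object_presence",
    "action_accuracy",
    "environment_match",
    "overall_coherence",
)


def total_caption_score(scores: Mapping[str, int]) -> int:
    """Sum four 1-5 rubric scores into a total out of 20."""
    if set(scores) != set(RUBRIC):
        raise ValueError(f"expected exactly these criteria: {RUBRIC}")
    for name, value in scores.items():
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
    return sum(scores.values())
```

Rejecting out-of-range or missing scores early keeps malformed LLM output from silently skewing the total.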

Steps

1. Identify available metrics: check .agents/memory/evaluation-registry/README.md.
2. Run each metric: collect scores.
3. Aggregate: produce a combined quality report.
4. Log: update the experiment journal with quality results.

Outputs

## Video Quality Report: <experiment_name>

| Metric | Score | Threshold | Status |
|--------|-------|-----------|--------|
| SSIM (avg) | 0.87 | > 0.80 | ✅ Pass |
| Loss trajectory | decreasing | decreasing | ✅ Pass |
| Caption consistency | 16/20 | > 14/20 | ✅ Pass |

### Per-Video Scores
| Video | SSIM | Caption |
|-------|------|---------|
| video_001.mp4 | 0.89 | 17/20 |
| video_002.mp4 | 0.85 | 15/20 |
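
A report in the layout above could be rendered from per-video scores along these lines. This is an illustrative sketch: `render_report` is a hypothetical helper, the thresholds are copied from the sample report, and it assumes both SSIM and caption scores exist for every video.

```python
from statistics import mean
from typing import Dict, Tuple


def _status(ok: bool) -> str:
    return "✅ Pass" if ok else "❌ Fail"


def render_report(
    experiment_name: str,
    per_video: Dict[str, Tuple[float, int]],
    loss_trend: str,
) -> str:
    """Render the report; per_video maps file name to (SSIM, caption total)."""
    avg_ssim = mean(s for s, _ in per_video.values())
    avg_caption = mean(c for _, c in per_video.values())
    lines = [
        f"## Video Quality Report: {experiment_name}",
        "",
        "| Metric | Score | Threshold | Status |",
        "|--------|-------|-----------|--------|",
        f"| SSIM (avg) | {avg_ssim:.2f} | > 0.80 | {_status(avg_ssim > 0.80)} |",
        f"| Loss trajectory | {loss_trend} | decreasing | "
        f"{_status(loss_trend == 'decreasing')} |",
        f"| Caption consistency | {avg_caption:g}/20 | > 14/20 | "
        f"{_status(avg_caption > 14)} |",
        "",
        "### Per-Video Scores",
        "| Video | SSIM | Caption |",
        "|-------|------|---------|",
    ]
    # One row per video, in insertion order.
    lines += [f"| {name} | {s:.2f} | {c}/20 |" for name, (s, c) in per_video.items()]
    return "\n".join(lines)
```

Calling it with the two sample videos and a `"decreasing"` loss trend yields the summary and per-video rows shown in the example report.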

References

fastvideo/tests/ssim/ — SSIM test infrastructure
fastvideo/tests/training/Vanilla/test_training_loss.py — loss comparison
.agents/memory/evaluation-registry/README.md — metric catalog

Changelog

| Date | Change |
|------|--------|
| 2026-03-02 | Initial version with SSIM, loss trajectory, caption consistency stub |