t2m-eval - SKILL.md Agent Skill

name: t2m-eval description: > The text-to-motion evaluation recipe — FID, R-precision (top-1/2/3), Diversity, MultiModality, MM-Dist with the fixed Guo et al. matcher, plus the streaming latency/memory benchmark. Use when building the eval harness, scoring a checkpoint, or adjudicating the architecture debate.

Text-to-motion evaluation

The prior project never computed FID and flew blind. The harness exists before the first real training run, and every checkpoint gets a number.

The fixed matcher (reuse, never retrain)

Use Guo et al.'s text_mot_match (port from donor data\t2m\text_mot_match): a co-trained motion encoder + text encoder. All metrics are computed in its embedding space — retraining it makes numbers incomparable to MoMask/T2M-GPT. Load it frozen.

Metrics (HumanML3D test split)

FID — Fréchet distance between matcher-embeddings of generated vs GT motions. Primary number.
R-precision (top-1/2/3) — retrieval accuracy of matching generated motion to its text among distractors. The headline "does it follow the prompt" metric.
MM-Dist — mean text↔motion embedding distance.
Diversity — average pairwise distance across many generations.
MultiModality — average pairwise distance across generations for the same prompt.

Protocol (match the literature so numbers are citable)

Generate with the inference sampler you'll ship (non-greedy: temperature/top-p, CFG scale), not greedy — greedy collapses and reports misleadingly.
Repeat the eval over several seeds; report mean ± 95% CI (the field reports ±).
Evaluate the EMA weights, on the test split only. Guard against train/test leakage.
For the controlled twin (ADR 0001): identical matcher, split, sampler, and seeds for both transformer and SSM. Tabulate side by side.

Streaming benchmark (the thesis efficiency claim)

Peak GPU memory vs sequence length T (e.g. T = 100…4000); expect SSM ≈ flat, transformer ≈ growing.
Latency per generated chunk; time-to-first-frame. Plot both. These figures back "real-time / bounded memory".

Reporting

Always show published baselines (MoMask 0.045, T2M-GPT 0.116) as a reference row, with the honest caveat about their compute/data. Phase 2 (SMPL-X whole-body) has no standard FID — report streaming

reconstruction + qualitative and say so.