name: audio-judge
description: >-
  Judge generated Kokoro TTS audio clips with Gemini through llm-workflows. Use
  when the user asks whether synthesized speech sounds good, intelligible,
  whispery, corrupt, or acceptable, or when comparing PyTorch reference clips
  against Core ML / Swift pipeline output.
Audio Judge (Kokoro TTS)
Purpose
Fast perceptual gate for Kokoro TTS WAV clips: Gemini listens and returns a
structured verdict. Objective waveform probes run first; listening is for
needs_listening candidates and A/B comparisons.
Use When
- The user asks whether TTS output sounds acceptable, intelligible, whispery, or corrupt.
- Comparing PyTorch reference clips against Core ML / Swift pipeline output.
- A listening pack or bakeoff clip needs a smoke gate before human review.
Do Not Use When
- The task is numeric parity only: use coreml-validate first.
- The task is waveform health only with no listening need: use scripts/audio_quality_probe.py alone.
- The FFmpeg worker is healthy and you only need timings: use bakeoff.
Ground Rules
- Read CLAUDE.md. This skill is for kokoro-coreml (Kokoro TTS), not Crossfade music generation.
- Do not judge TTS by correlation/SNR alone. Perceptual listening is the gate when metrics and ears disagree.
- Keep the boundary intact:
  - FFmpeg owns media prep and Gemini file upload on the primary path.
  - llm-workflows owns prompt execution and primary artifacts.
  - kokoro-coreml owns generation, context, and the TTS-native fallback script. The fallback is only for worker-down recovery; never replace listening with metrics.
- Kokoro rubric (from scripts/audio_quality_probe.py and bakeoff practice): intelligible English speech, natural prosody, no whisper/static/clicks/dropouts, spoken text matches the input prompt when provided.
Primary-path caveat: audio_judge_v1 was written for Crossfade music.
Compensate with strong TTS framing in --prompt, --expected-style, and
--context-file. Use the primary path whenever the worker is healthy.
Preflight (before the first run)
1. Objective gate (cheap rejects):
   uv run --no-sync python scripts/audio_quality_probe.py /path/to/clip.wav
   If quality_decision is reject_without_listening, fix generation before Gemini. Only send needs_listening or reference clips.
2. Primary path only: invoke with node, not pnpm run judge:audio -- (the -- is forwarded and breaks the script).
3. Primary path only: convert float32 WAVs to 16-bit PCM before upload, because the FFmpeg worker probing stage rejects float32:
   afconvert -f WAVE -d LEI16 in.wav out.wav
   The fallback script auto-converts WAVs; step 3 is not required there.
4. Env: WORKFLOW_RUNTIME_TOKEN and FFMPEG_CLIENT_TOKEN in llm-workflows/.env (or set LLM_WORKFLOWS_ENV to that file). GEMINI_API_KEY is required for the fallback (same .env or env var). Optional: WORKFLOW_RUNTIME_BASE_URL, FFMPEG_BASE_URL.
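afconvert ships with macOS only. Where it is unavailable, the float32 to 16-bit PCM conversion could be sketched with the Python standard library. This is a minimal sketch, not a script from this repo: read_wav_chunks and float32_wav_to_pcm16 are hypothetical helper names, and the code assumes a plain RIFF header with format tag 3 (IEEE float) and 32-bit samples. WAVE_FORMAT_EXTENSIBLE files are not handled.

```python
import struct
import wave
from array import array

WAVE_FORMAT_IEEE_FLOAT = 3  # format tag for float samples in the fmt chunk


def read_wav_chunks(path):
    """Return (format_tag, channels, sample_rate, bits_per_sample, data_bytes)."""
    with open(path, "rb") as f:
        blob = f.read()
    if blob[:4] != b"RIFF" or blob[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos, fmt, data = 12, None, None
    while pos + 8 <= len(blob):
        chunk_id, size = struct.unpack_from("<4sI", blob, pos)
        body = blob[pos + 8 : pos + 8 + size]
        if chunk_id == b"fmt ":
            fmt = struct.unpack_from("<HHIIHH", body, 0)
        elif chunk_id == b"data":
            data = body
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
    if fmt is None or data is None:
        raise ValueError("missing fmt or data chunk")
    tag, channels, rate, _byte_rate, _block_align, bits = fmt
    return tag, channels, rate, bits, data


def float32_wav_to_pcm16(src, dst):
    """Rewrite a 32-bit float WAV as 16-bit PCM, clipping to [-1.0, 1.0]."""
    tag, channels, rate, bits, data = read_wav_chunks(src)
    if tag != WAVE_FORMAT_IEEE_FLOAT or bits != 32:
        raise ValueError("expected a 32-bit IEEE float WAV")
    count = len(data) // 4
    floats = struct.unpack(f"<{count}f", data[: count * 4])
    pcm = array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in floats))
    with wave.open(dst, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        out.setframerate(rate)
        out.writeframes(pcm.tobytes())
```

Checking the format tag first also explains the probing failure: the standard wave module itself refuses format tag 3, so a float32 clip has to be parsed at the chunk level before it can be rewritten.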
Typical clip sources
| Label | Role | Common path |
|---|---|---|
| pytorch / pytorch_3s | Known-good PyTorch reference | outputs/audio-parity/references/pytorch_{3s,7s,15s,30s}.wav |
| coreml / config_f_3s | Swift + Core ML candidate | outputs/bakeoff/listen/config_f_{3s,7s,15s,30s}.wav |
| candidate | F0-source or parity experiment | outputs/f0_source_listening/**/wav/*_candidate.wav |
Render fresh Config F clips:
uv run --no-sync python scripts/bakeoff_listen.py --keys 3s
Bakeoff text: scripts/bakeoff_harness.py (BAKEOFF_INPUTS). Pass matching
text as --prompt.
Command (primary path)
From the llm-workflows checkout (sibling of this repo by default):
node scripts/run-audio-judge.mjs \
--clip pytorch=/absolute/path/to/pytorch_3s.wav \
--clip coreml=/absolute/path/to/config_f_3s.wav \
--baseline-label pytorch \
--prompt "The quick brown fox jumps over the dog." \
--expected-style "clear intelligible English speech, natural prosody, no whispering or static, no clicks or dropouts, 24 kHz mono" \
--context-file /absolute/path/to/kokoro_context.md
Options: --context-file (neutral facts only), --no-wait, --resume.
Context-file safety: the entire file is sent to Gemini. Never point it at
.env, credentials, or unrelated secrets.
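When the primary-path command is assembled by a script rather than typed by hand, the same rules can be enforced before anything is uploaded. The sketch below is an illustration only: build_judge_argv and the SECRET_NAMES deny-list are hypothetical, and the absolute-path requirement is inferred from the example paths above rather than documented behavior. Only flags shown in this section are used.

```python
from pathlib import PurePosixPath

# Illustrative deny-list; extend it to match your own secret file names.
SECRET_NAMES = {"credentials", "credentials.json"}


def build_judge_argv(clips, baseline_label, prompt, expected_style, context_file=None):
    """Build argv for run-audio-judge.mjs; clips maps label -> WAV path."""
    if baseline_label not in clips:
        raise ValueError(f"baseline label {baseline_label!r} has no clip")
    # Invoke node directly; going through pnpm would forward a stray "--".
    argv = ["node", "scripts/run-audio-judge.mjs"]
    for label, path in clips.items():
        path = PurePosixPath(path)
        if not path.is_absolute():
            raise ValueError(f"clip path must be absolute: {path}")
        argv += ["--clip", f"{label}={path}"]
    argv += [
        "--baseline-label", baseline_label,
        "--prompt", prompt,
        "--expected-style", expected_style,
    ]
    if context_file is not None:
        ctx = PurePosixPath(context_file)
        # The whole context file is sent to Gemini, so refuse obvious secrets.
        if ctx.name.startswith(".env") or ctx.name in SECRET_NAMES:
            raise ValueError(f"refusing to send {ctx.name} as context")
        argv += ["--context-file", str(ctx)]
    return argv
```

The returned list can be passed to subprocess.run with cwd set to the llm-workflows checkout. A name-based check is only a first line of defense; it does not inspect file contents.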
Context file template
# Kokoro TTS audio judge context
- Task: perceptual TTS quality gate (not music).
- Sample rate: 24000 Hz mono 16-bit PCM.
- Voice: af_heart, speed: 1.0.
- Input key: 3s bakeoff sentence.
- Candidate: Config F (Swift + Core ML, compute_units=ALL).
- Baseline: PyTorch eager reference from audio-parity exports.
- Objective probe: needs_listening; rms=4600, active32=78.8%, zcr=9.1%.
- Question: Is Core ML output acceptably close to PyTorch for shipping?
Fallback (worker down only)
uv run --no-sync python scripts/gemini_audio_judge_direct.py \
--clip pytorch=/absolute/path/pytorch_3s.wav \
--clip coreml=/absolute/path/config_f_3s.wav \
--baseline-label pytorch \
--prompt "The quick brown fox jumps over the dog." \
--probe-first \
--context-file /path/to/kokoro_context.md \
--output outputs/audio-judge-fallback/<slug>.json
Fallback behavior:
- Reads GEMINI_API_KEY from env or LLM_WORKFLOWS_ENV (default sibling ../llm-workflows/.env).
- Auto-converts WAVs to 16-bit PCM; gain-matches non-baseline clips to baseline RMS (disable with --no-gain-match).
- --probe-first runs audio_quality_probe before Gemini.
- Clips must be .wav/.mp3 audio files under size limits, not arbitrary local files.
- --output must be inside this repo; writes a structured envelope with verdict, clip paths, and schema hints.
Clips upload directly to Gemini (no FFmpeg/R2). Mark fallback reports in notes. Switch back to the primary path once the worker is healthy.
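Gain matching keeps a loudness difference from being heard as a quality difference. The idea can be sketched as follows, assuming 16-bit PCM samples already loaded into sequences of ints. The rms and gain_match helpers are hypothetical and are not taken from gemini_audio_judge_direct.py, whose actual implementation may differ.

```python
import math
from array import array


def rms(samples):
    """Root-mean-square level of a sequence of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def gain_match(candidate, baseline):
    """Scale candidate samples so their RMS matches the baseline's RMS."""
    cand_rms = rms(candidate)
    if cand_rms == 0.0:
        return array("h", candidate)  # digital silence: nothing to scale
    gain = rms(baseline) / cand_rms
    # Clamp to the int16 range so loud peaks saturate instead of wrapping.
    return array("h", (max(-32768, min(32767, round(s * gain))) for s in candidate))
```

For example, gain_match([500, -500], [1000, -1000]) doubles the candidate. Clamping changes the resulting RMS slightly when peaks saturate, which is one reason to inspect clips that need a large gain.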
Known failure modes
| Symptom | Cause | Fix |
|---|---|---|
| Unknown argument: -- | pnpm run judge:audio -- | node scripts/run-audio-judge.mjs |
| FFMPEG_CLIENT_TOKEN is required | missing from .env | Preflight 4 |
| job fails at probing | float32 WAV | afconvert -d LEI16, retry |
| 409 idempotency_conflict | poisoned job key | new clip labels (coreml_v2=...) |
| Container start failed with HTTP 500 | worker down | fallback script |
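For the 409 idempotency_conflict row, relabeling can be automated when retries are scripted. A minimal sketch: next_clip_label is a hypothetical helper, and counting past _v2 is an assumption extrapolated from the coreml_v2 example in the table.

```python
import re


def next_clip_label(label):
    """Return a fresh label: coreml -> coreml_v2, coreml_v2 -> coreml_v3."""
    match = re.fullmatch(r"(.+)_v(\d+)", label)
    if match:
        return f"{match.group(1)}_v{int(match.group(2)) + 1}"
    return f"{label}_v2"
```

If the baseline clip is relabeled too, pass the new label to --baseline-label so the comparison still resolves.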
Output
Primary: llm-workflows/outputs/audio-judge/<slug>/<timestamp>/ —
result.json, report.md, checkpoints.
Fallback: --output JSON envelope under outputs/audio-judge-fallback/.
Capture conclusions with write-notes (link artifacts, clip paths, config, primary vs fallback).
Judging protocol
- Neutral context only — facts, never expected outcome.
- Always include controls: PyTorch reference and a known-bad clip when available. Discard runs that misrank controls.
- One run = one vote; require 2–3 agreeing lineups for ship/no-ship.
- Cross-check audio_quality_probe, waveform alignment, bakeoff parity JSON.
- No ASR/WER gate unless the user explicitly asks.
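The control and voting rules above can be sketched as a small aggregation step. The Run record and ship_decision function are hypothetical, and treating any disagreement among valid runs as inconclusive is an assumption; the protocol only states that 2–3 agreeing lineups are required.

```python
from dataclasses import dataclass


@dataclass
class Run:
    controls_ranked_correctly: bool  # reference judged better than the known-bad clip
    candidate_passed: bool           # this run's verdict on the candidate


def ship_decision(runs, required_agreeing=2):
    """Combine per-run votes into ship, no-ship, or inconclusive."""
    # A run that misranks its controls is discarded, not counted as a vote.
    valid = [r for r in runs if r.controls_ranked_correctly]
    passes = sum(r.candidate_passed for r in valid)
    fails = len(valid) - passes
    if passes >= required_agreeing and fails == 0:
        return "ship"
    if fails >= required_agreeing and passes == 0:
        return "no-ship"
    return "inconclusive"
```

An inconclusive result is a prompt to run another lineup or escalate to human listening, not a verdict in itself.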
Interpreting results
- overallVerdict: "pass" is a smoke gate, not bakeoff readiness.
- Failed Core ML + passing PyTorch → Core ML/runtime/glue, not input text.
- Failed PyTorch baseline → fix reference path before ANE debugging.
Primary path (result.json): use
comparison.iphoneAcceptablyCloseToMlx as “Core ML acceptably close to
PyTorch” when labels are coreml vs pytorch.
Fallback path (verdict envelope): use
verdict.comparisons["<label>_vs_<baseline>"].same_quality_class — not
iphoneAcceptablyCloseToMlx. The envelope documents both field names.
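Because the two paths use different field names, a small reader helps keep notes consistent. This is a sketch under assumptions: close_to_baseline is a hypothetical helper, the nesting is inferred from the dotted paths above, and the value types of both fields are not specified here, so the raw value is returned unchanged.

```python
def close_to_baseline(result, label="coreml", baseline="pytorch"):
    """Return the closeness field from either artifact shape, or None if absent."""
    if "comparison" in result:
        # Primary path: result.json written by llm-workflows.
        return result["comparison"].get("iphoneAcceptablyCloseToMlx")
    # Fallback path: verdict envelope written by the direct Gemini script.
    comparisons = result.get("verdict", {}).get("comparisons", {})
    entry = comparisons.get(f"{label}_vs_{baseline}")
    return None if entry is None else entry.get("same_quality_class")
```

Load the artifact with json.load and pass the resulting dict. A None return means the expected field was missing, which should be treated as a failed read and not as a failing verdict.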
Related skills
- bakeoff — render Config F clips and timings.
- coreml-validate — numeric parity before listening.
- write-notes — institutional memory after verdicts.