name: lai-diarize
description: Identify speakers ("who said what") in aligned captions via pyannote.audio. Real speaker names come from the agent's own reasoning over transcript + context (default), with a CLI-LLM fallback for headless runs. Trigger on multi-speaker content (podcasts, interviews, meetings) or phrases like "diarize", "speaker detection", "说话人识别", "区分说话人", "label the speakers". Requires aligned captions — run /lai-align first.
Speaker Diarization
Preferred model: Claude Sonnet (cost-efficient for agent-driven naming). This skill runs on whatever model is active in the parent session — any Claude model works; no hard switch. Sonnet has no 1M-context variant, so if the parent session is Opus[1M], continuing on Opus is normal (avoids a no-op model swap).
Adds speaker labels to aligned captions. Speaker detection (who speaks when) is always CLI-based via pyannote.audio; speaker naming (who each one is) is agent-driven by default.
Basic Command
<base> = source media stem (e.g. podcast from podcast.mp3) or YouTube ID. Files all land in the current directory:
lai diarize run podcast.mp3 podcast.aligned.json podcast.diarized.json
# shortcut:
lai-diarize podcast.mp3 podcast.aligned.json podcast.diarized.json
Output labels detected speakers as SPEAKER_00, SPEAKER_01, …
Speaker count is auto-detected. Override only when auto-detection is clearly wrong:
- diarization.num_speakers=3 sets the exact count (when known)
- diarization.min_speakers=N / diarization.max_speakers=N bound the search
Giving Speakers Real Names
Agent-driven (default)
After the basic command finishes, the agent reads the diarized output (the file you wrote with output_caption=…) together with any available context, and writes the named result. You may write the named version back into the same path (in-place edit) or to a separate path — depends on your project's convention.
Two-file convention (preferred when state matters, e.g. CI pipelines): emit the acoustic-only output as diarized.raw.json and let the agent write the named result to diarized.json. This keeps "acoustic segmentation done" distinct from "named, ready for publish," and lets downstream stages hard-fail when the agent hasn't run yet. The ai-podcast-pipeline repo follows this convention (see its CLAUDE.md).
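A downstream stage under the two-file convention might guard itself roughly as follows. This is a minimal sketch, not part of the lai CLI: the helper name load_named_captions and the working-directory layout are assumptions made for illustration; only the two file names come from the convention above.

```python
import json
from pathlib import Path


def load_named_captions(workdir: Path):
    """Load the agent-named captions, failing loudly if naming has not run.

    Hypothetical helper: only the two file names come from the convention.
    """
    raw = workdir / "diarized.raw.json"  # acoustic-only output from the CLI
    named = workdir / "diarized.json"    # written by the agent after naming
    if not named.exists():
        state = "naming has not run yet" if raw.exists() else "diarization has not run yet"
        raise FileNotFoundError(f"{named} is missing: {state}")
    return json.loads(named.read_text(encoding="utf-8"))
```

Checking for the file rather than scanning for leftover SPEAKER_XX labels is deliberate: a named file may legitimately keep a placeholder for a speaker the agent could not identify.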
Signals the agent uses:
- Explicit context the user provides in the conversation
- meta.md beside the source (YAML frontmatter, format below)
- Transcript evidence: self-introductions ("I'm Alice…"), mutual addressing ("thanks, Bob"), host/guest dynamics, topical expertise
- Existing inline labels in the source text ([Alice], >> Bob:, SPEAKER_01:), preserved by the CLI and matched by the agent
- Speaker-change markers in supervision.custom (see Forward Search below)
Process:
- Read diarized.json: collect unique SPEAKER_XX ids and sample 3–5 segments per speaker
- Gather context (inline hints, meta.md, transcript clues)
- Map each SPEAKER_XX → real name with a confidence note. If unsure, keep SPEAKER_XX rather than guessing
- Resolve ghost tiers via forward search (next subsection) before finalizing
- Rewrite the speaker field across all segments; do not touch text, start, end, or segment order
- Show the user the mapping before finalizing if any mapping is uncertain
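The mechanical parts of this process (sampling per speaker, then rewriting only the speaker field) can be sketched as below. It assumes the diarized file has already been loaded into a flat list of supervision dicts shaped like the example in the Output section; the real file may wrap them differently, and both function names are hypothetical.

```python
import re
from collections import defaultdict

PLACEHOLDER = re.compile(r"^SPEAKER_\d+$")


def sample_by_speaker(segments, per_speaker=5):
    """Collect up to `per_speaker` text samples for each SPEAKER_XX id."""
    samples = defaultdict(list)
    for seg in segments:
        speaker = seg.get("speaker") or ""
        if PLACEHOLDER.match(speaker) and len(samples[speaker]) < per_speaker:
            samples[speaker].append(seg["text"])
    return dict(samples)


def apply_mapping(segments, mapping):
    """Return copies with `speaker` renamed; everything else is left untouched.

    Ids missing from `mapping` keep their SPEAKER_XX label (no guessing).
    """
    renamed = []
    for seg in segments:
        new_seg = dict(seg)
        if new_seg.get("speaker") in mapping:
            new_seg["speaker"] = mapping[new_seg["speaker"]]
        renamed.append(new_seg)
    return renamed
```

Leaving an uncertain speaker out of the mapping is how "keep SPEAKER_XX rather than guessing" is expressed here.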
Forward Search via >> / speaker-change markers
VTT and SRT broadcast captions encode speaker turns with markers like >>
(usually escaped as &gt;&gt; in raw VTT), <v Speaker>, [Speaker], or all-caps
lead-ins. LattifAI preserves whatever marker it found in supervision.custom:
"custom": {
  "original_speaker": ">>",
  "speaker_change": true
}
Key insight: >> alone (no trailing name) is still a strong signal — the
captioner asserts a new speaker starts here. When the resolved speaker for
such a segment is still SPEAKER_XX / Unknown / empty (typically a 1–3 segment
"ghost tier" that pyannote couldn't merge into a main cluster), don't leave it
unnamed. Run forward search:
- Walk forward from this segment through same-SPEAKER_XX neighbors and onwards through later segments after >> boundaries.
- Stop at the first identity anchor, in priority order:
  - Self-introduction: "I'm X", "My name is X", "I'm a Y at Z" (match Z against meta.md affiliation fields)
  - Cross-address: an adjacent speaker says "Thanks, X", "X, what do you think?", "Let me hand it to X"
  - Topic ownership: domain reference that pins exactly one speaker in meta.md (e.g. "in my RNA work…" → host with affiliation: "Atomic AI")
- Backfill the anchor's real name into the originating >> segment.
- Fallback: if no anchor is found before the speaker turn ends (next >> or end of file), keep the segment as SPEAKER_XX rather than guessing.
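One way to read the forward-search steps as code is sketched below. It is an illustration under assumptions, not the agent's actual procedure: segments are taken to be dicts with text and an optional custom.speaker_change flag, aliases is a hypothetical alias-to-full-name dict built from meta.md, and only the first two anchors (self-introduction, then a "thanks, X" cross-address from the next turn) are covered. Topic ownership is left out because it needs judgment rather than pattern matching.

```python
import re

# Illustrative patterns only; real transcripts need broader phrasing.
INTRO = r"\b(?:I'm|I am|my name is)\s+{name}\b"
ADDRESS = r"\b(?:thanks|thank you),?\s+{name}\b"


def _find_name(pattern, text, aliases):
    """Return the full name whose alias fills `pattern` in `text`, else None."""
    for alias, full_name in aliases.items():
        if re.search(pattern.format(name=re.escape(alias)), text, re.IGNORECASE):
            return full_name
    return None


def _starts_turn(seg):
    return bool((seg.get("custom") or {}).get("speaker_change"))


def forward_search(segments, start, aliases):
    """Name the turn that begins at index `start`, or return None (keep SPEAKER_XX)."""
    end = start + 1
    while end < len(segments) and not _starts_turn(segments[end]):
        end += 1
    # Priority 1: self-introduction anywhere inside the turn.
    for seg in segments[start:end]:
        name = _find_name(INTRO, seg["text"], aliases)
        if name:
            return name
    # Priority 2: the speaker of the next turn addresses this one by name.
    if end < len(segments):
        return _find_name(ADDRESS, segments[end]["text"], aliases)
    return None
```

Matching only against known aliases keeps the search from inventing a name out of phrases like "I'm a researcher".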
Dominant-neighbor merge (when >> is absent): tiers with ≤3 segments and no
speaker-change marker are usually pyannote boundary artifacts. If such a
segment is sandwiched between two segments of the same real speaker, attribute
it to that speaker — short interjections ("Yes.", "Yeah.", "Right.") don't
carry identity, and the acoustic edge is more likely segmentation noise than a
third party.
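The merge rule can be sketched as follows, again assuming a flat list of segment dicts with a speaker field and an optional custom.speaker_change flag. Two choices here are assumptions: a neighbor counts as a "real speaker" only when its own tier is larger than the threshold, and only single sandwiched segments are handled, so a run of two adjacent ghost segments is left alone.

```python
from collections import Counter


def dominant_neighbor_merge(segments, max_tier_size=3):
    """Fold tiny unmarked tiers into the speaker on both sides of them."""
    tier_sizes = Counter(seg["speaker"] for seg in segments)
    merged = [dict(seg) for seg in segments]  # never mutate the input
    for i in range(1, len(merged) - 1):
        seg = merged[i]
        has_marker = bool((seg.get("custom") or {}).get("speaker_change"))
        if tier_sizes[seg["speaker"]] > max_tier_size or has_marker:
            continue  # a main cluster, or the captioner marked a real turn
        before, after = merged[i - 1]["speaker"], merged[i + 1]["speaker"]
        sandwiched = before == after and before != seg["speaker"]
        if sandwiched and tier_sizes[before] > max_tier_size:
            seg["speaker"] = before
    return merged
```

Segments carrying a speaker-change marker are skipped on purpose; those go through forward search instead.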
meta.md (optional but strong signal — drives both num_speakers and forward-search topic anchors). All fields below are parsed by both the agent-driven path and the CLI-LLM fallback (lai diarize naming / diarize run):
---
title: "Deep Dive into LLMs"
speakers:
  - name: Alice Chen
    role: host
    affiliation: "Anthropic (research engineer)"  # self-introduction & topic-ownership anchor
    aliases: ["Alice"]                            # short forms LLM should fold back to full name
    bio: "Host of the show. Background in distributed systems."
  - name: Bob Smith
    role: guest
    affiliation: "Stanford AI Lab"
    aliases: ["Bob", "Bobby"]
    bio: "PhD candidate working on RLHF and scaling laws."
topics: ["RLHF", "scaling laws", "alignment"]     # episode-level keyword hints
prior_episodes:
  - "Episode 42: pretraining — same guest, covers scaling laws"
---
Keep name clean (no ", OpenAI" suffix) — put organizations in affiliation so
the agent can match self-introductions ("I'm a researcher at Stanford" → Bob) to
exactly one speaker, and downstream slug resolvers don't break on commas. aliases
let the LLM map cross-references like "thanks, Swyx" back to the full legal name
instead of inventing a third speaker; bio and topics give the LLM
episode-specific expertise to anchor topical references against.
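To illustrate why clean names and aliases matter, the sketch below builds an alias lookup from the speakers list and rejects names that carry an organization suffix. It assumes the frontmatter has already been parsed into Python dicts by a YAML library of your choice; build_alias_index is a hypothetical helper, not part of the lai CLI.

```python
def build_alias_index(speakers):
    """Map every name and alias (lower-cased) to the speaker's full name."""
    index = {}
    for speaker in speakers:
        name = speaker["name"]
        if "," in name:
            # A comma in `name` breaks slug resolution downstream.
            raise ValueError(f"move the organization into `affiliation`: {name!r}")
        for label in [name, *speaker.get("aliases", [])]:
            index[label.lower()] = name
    return index


speakers = [
    {"name": "Alice Chen", "aliases": ["Alice"]},
    {"name": "Bob Smith", "aliases": ["Bob", "Bobby"]},
]
alias_index = build_alias_index(speakers)
```

With this index, a cross-reference such as "thanks, Bobby" resolves to Bob Smith instead of becoming a third speaker.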
CLI-LLM fallback (headless / automated runs)
When the agent is not in the loop (batch pipelines, CI, unattended scripts), let the CLI do name inference with its own LLM backend:
lai config set diarization.llm.model_name gemini-3-flash-preview # one-time
# Gemini key: see /lai-transcribe
lai diarize run --direct -Y \
podcast.mp3 podcast.aligned.json podcast.diarized.json \
diarization.infer_speakers=true \
diarization.llm.reasoning=true
- diarization.infer_speakers=true enables CLI-side name inference (requires the LLM config above)
- diarization.llm.reasoning=true asks the LLM to show its reasoning before committing to a name; trades latency for accuracy on ambiguous speakers
You can also pass hints at invocation time without any LLM:
lai diarize run podcast.mp3 podcast.aligned.json podcast.diarized.json \
context="Host: Alice Chen (tech journalist), Guest: Bob Smith (AI researcher)"
# or point at a meta.md (first positional `context` arg also accepts a file path):
lai diarize run podcast.mp3 podcast.aligned.json podcast.diarized.json context=podcast.meta.md
Output
Each supervision gains a speaker field:
{ "text": "Welcome to the show.", "start": 0.0, "end": 2.5, "speaker": "Alice Chen" }
Common Issues
| Problem | Fix |
|---|---|
| No aligned segments | Run /lai-align first |
| Too many speakers detected (ghost tiers) | Pre-empt: pass diarization.num_speakers=N from meta.md. Post-hoc: dominant-neighbor merge (see Forward Search section) |
| Tiny tier (1–3 segments) of short interjections | Pyannote boundary noise — dominant-neighbor merge into the surrounding speaker, don't treat as a real third party |
| >> segment left as SPEAKER_XX | Run forward search (see above); only keep SPEAKER_XX if no anchor exists within the speaker turn |
| Agent can't confidently name a speaker | Keep SPEAKER_XX and ask the user — don't guess |
| name field contains org (e.g. "Alex Lupsasca, OpenAI") | Split into name: "Alex Lupsasca" + affiliation: "OpenAI"; a comma in name breaks slug resolution downstream |
| Headless run, no LLM configured | lai config set diarization.llm.model_name gemini-3-flash-preview |
Related Skills
- /lai-align: produce the aligned input (required)
- /lai-transcribe: transcript from scratch
- /lai-translate, /lai-summarize: run on diarized output for speaker-aware results