speaker-recognizer - SKILL.md Agent Skill

name: speaker-recognizer description: Self-enroll a voice for speaker recognition. Activate ONLY when (a) a mic `Unknown Speaker:` turn carries a clear self-introduction ("I'm X", "my name is X", "tôi là X"...), (b) prior same-tag `[voice:voice_N]` turns can be combined with a now-named turn, (c) a Telegram voice-note carries an intro, OR (d) the user asks to list / forget / identify voices. Do NOT activate on bare `Unknown Speaker:` turns with no name and no prior same-tag history — those need only a short ack. Self-enrollment only.

Speaker Recognizer

Each mic transcript is prefixed Speaker - Name: when recognized, or Unknown Speaker: [voice:voice_N] ... (audio save[d] at <path>...) otherwise. The audio path is the WAV of whoever spoke this turn — use it (with paths from prior same-tag turns when needed) to enroll on POST /speaker/enroll.

Self-enrollment only — never enroll one person's voice under another person's name.

Decision matrix — pick ONE action per turn

Signals in current turn	Prior same-tag turns?	Action
`Unknown Speaker:` + path + name + ≥25 words	—	Enroll now with current path only.
`Unknown Speaker:` + path + name + <25 words	≥1 prior path same `[voice:N]`	Enroll now with all same-tag paths (oldest→newest).
`Unknown Speaker:` + path + name + <25 words	none	Ask one follow-up: "say your name + ~25–30 words".
`Unknown Speaker:` + path + NO name + <25 words	none	Ask one follow-up.
`Unknown Speaker:` + path + NO name	≥1 prior path same `[voice:N]` (still no name)	Reply with a SHORT ack ("Mm, nghe rồi" / "Got it"). NEVER NO_REPLY. Don't re-ask.
`Speaker - <Name>:`	—	Already identified — skill not needed.
"who do you know?" / "list voices"	—	`GET /speaker/list`.
"forget my voice" / "remove Alex"	—	`POST /speaker/remove`.
Telegram voice note + intro	—	Convert to WAV + enroll with Telegram fields.
Telegram voice note + "who is this?"	—	`POST /speaker/recognize`.

When in doubt → see reference/enroll-flows.md. All curl + error handling → reference/api.md.

Quick enroll (mic)

curl -s -X POST http://127.0.0.1:5001/speaker/enroll \
  -H "Content-Type: application/json" \
  -d '{"name": "darren", "wav_paths": ["<path1>", "<path2>"]}'

Confirm AFTER the API returns ok: "Nice to meet you, !".

Hard rules

Self-enrollment only — "this is my friend Bob" → refuse politely; Bob must speak himself.
Lowercase normalized name — same name as face-enroll for the same person (/root/local/users/<name>/ is shared).
Minimum voice for one-turn enroll: ~25 words (aim 25–30) OR combine with prior same-tag turns to ~5–10s total.
Cluster claim is automatic — pass any path inside voice_<N>/ and the server pulls every sibling WAV. One path is enough.
Two-turn path mapping — <pathA> = turn BEFORE follow-up, <pathB> = turn AFTER. Never swap.
Telegram audio must be 16 kHz mono WAV before enroll — convert with ffmpeg -ar 16000 -ac 1; same folder as source. Skip if already .wav.
/speaker/identity (not re-enroll) when only linking Telegram info to an existing mic profile.
Don't spam "who are you?" — at most once per cluster, and include the "25–30 words" guidance in the same message.
Never go silent on Unknown Speaker fragments — when no name and you've already asked, emit a short ack. NO_REPLY is forbidden.
Confirm every enroll AFTER the API returns ok.
Don't narrate technical details — no "base64", "ffmpeg", "POST /speaker/enroll".
Never write files directly — always use the HTTP API.