dialogue-audio - SKILL.md Agent Skill

name: dialogue-audio
description: "Multi-speaker dialogue guide. Script format, voice assignment, pacing for elevenlabs-dialogue."
allowed-tools: []
disable-model-invocation: false

Multi-Speaker Dialogue Audio Guide

Use /pocket-knife:elevenlabs-dialogue to produce audio with multiple distinct voices in a single request. It supports up to 10 unique voices per generation and returns a single MP3 with natural conversation flow.

How Dialogue Generation Differs from TTS

Feature	TTS (`elevenlabs-tts`)	Dialogue (`elevenlabs-dialogue`)
Speakers	1	Up to 10
Natural turn-taking	No	Yes — model handles transitions
Interruptions/overlap	No	Partial support
Use case	Narration, announcements	Conversations, interviews, podcasts

The dialogue endpoint produces more natural pauses and prosodic transitions between speakers than stitching multiple TTS files together manually.

Available Voices

Name	Voice ID	Accent	Gender
George	JBFqnCBsd6RMkjVDRZzb	British	Male
Rachel	21m00Tcm4TlvDq8ikWAM	American	Female
Aria	9BWtsMINqrJLrRacOk9x	American	Female
Sarah	EXAVITQu4vr4xnSDxMaL	American	Female
Charlie	IKne3meq5aSn9XLyUdCD	Australian	Male

Choose voices that contrast clearly — avoid assigning two similar-sounding voices (e.g., Rachel and Aria) to characters the listener needs to distinguish quickly.

Script Formatting for Dialogue

When providing your script, use the `Speaker: Line` format, one line per turn:

Alice: Have you tried the new product yet?
Bob: I tested it last week, actually. It exceeded my expectations.
Alice: Really? What stood out to you?
Bob: The speed. It processed everything in under two seconds.

Each line becomes one segment in the `inputs` array, tagged with the `voice_id` assigned to that speaker.
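To make the mapping from script lines to segments concrete, here is a minimal Python sketch. The function name `script_to_inputs`, the `cast` dictionary, and the `"text"` key are assumptions made for illustration; only the `inputs` array and `voice_id` are named in this guide, and the voice IDs are copied from the Available Voices table.

```python
from typing import Dict, List

# Voice IDs copied from the "Available Voices" table in this guide.
VOICE_IDS: Dict[str, str] = {
    "George": "JBFqnCBsd6RMkjVDRZzb",
    "Rachel": "21m00Tcm4TlvDq8ikWAM",
}


def script_to_inputs(script: str, cast: Dict[str, str]) -> List[dict]:
    """Turn 'Speaker: Line' text into a list of segment dicts.

    `cast` maps each speaker name used in the script to a voice ID.
    The "text" key is an assumed field name; "voice_id" comes from the guide.
    """
    inputs = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between turns
        # Split on the first colon only, so colons inside the line survive.
        speaker, sep, text = line.partition(":")
        if not sep or not text.strip():
            raise ValueError(f"Expected 'Speaker: Line', got: {line!r}")
        speaker = speaker.strip()
        if speaker not in cast:
            raise KeyError(f"No voice assigned to speaker {speaker!r}")
        inputs.append({"text": text.strip(), "voice_id": cast[speaker]})
    return inputs


example_script = """
Alice: Have you tried the new product yet?
Bob: I tested it last week, actually. It exceeded my expectations.
"""
example_cast = {"Alice": VOICE_IDS["Rachel"], "Bob": VOICE_IDS["George"]}
example_inputs = script_to_inputs(example_script, example_cast)
```

Failing loudly on an unknown speaker or a malformed line is a deliberate choice here: a typo in a speaker name is easier to fix before audio is generated than after.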

Voice Assignment Strategy

2-speaker conversation (interview/podcast):

Host: George (authoritative, structured)
Guest: Charlie (relaxed, conversational)

2-speaker conversation (educational):

Teacher: Rachel (clear, measured)
Student: Aria (curious, natural)

3-speaker debate:

Moderator: George (neutral, formal)
Speaker A: Rachel (confident)
Speaker B: Charlie (casual counterpoint)

Customer service training:

Customer: Aria (frustrated or confused tone, conveyed through the writing)
Agent: Sarah (helpful, calm)
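A cast like the ones above can be sanity-checked before any audio is generated. The sketch below is one possible check, not part of the skill itself: `validate_cast` is a hypothetical helper, and treating a voice shared by two speakers as an error is an assumption that follows from the advice to keep characters distinguishable.

```python
from typing import Dict

MAX_VOICES = 10  # per-generation limit stated in this guide


def validate_cast(cast: Dict[str, str]) -> None:
    """Raise ValueError if the speaker-to-voice mapping looks unusable."""
    voices = list(cast.values())
    unique_count = len(set(voices))
    if unique_count > MAX_VOICES:
        raise ValueError(f"{unique_count} voices used; the limit is {MAX_VOICES}")
    # Assumption: two speakers sharing one voice is treated as a mistake.
    shared = sorted({v for v in voices if voices.count(v) > 1})
    if shared:
        raise ValueError(f"Voices assigned to more than one speaker: {shared}")


# The 3-speaker debate cast from this section, keyed by role.
debate_cast = {"Moderator": "George", "Speaker A": "Rachel", "Speaker B": "Charlie"}
validate_cast(debate_cast)  # returns None when the cast is acceptable
```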

Pacing and Timing Through Text

The model infers pacing from punctuation and sentence structure. Use these techniques:

Short lines = faster pace:

Bob: Agreed.
Alice: Let's move on.
Bob: Good idea.

Longer lines = measured, thoughtful delivery:

Alice: When we reviewed the data last quarter, the trends pointed clearly toward a significant shift in user behavior that we hadn't anticipated in our initial projections.

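Emotional cues through word choice: write dialogue that signals the emotion. The parenthetical notes in the example are annotations for this guide, not text to include in the script: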

Bob: I... I can't believe this happened. (pause through ellipsis)
Alice: Take your time. (short line = patient, calm)

Practical Use Cases

Use case	Speakers	Suggested voices
Podcast episode	2	George + Charlie
E-learning scenario	2	Rachel + Aria
Product demo narration	2	Sarah + George
Audio drama (3 chars)	3	George + Rachel + Charlie
Customer support training	2	Aria + Sarah
Explainer with Q&A	2	George + Rachel

Line Length Recommendations

Speaker turn	Recommended word count
Short reply	5–20 words
Standard reply	20–60 words
Long explanation	60–120 words

Avoid single segments over 150 words — break long explanations into two turns for more natural pacing.
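The 150-word ceiling can be enforced mechanically. The following is a rough sketch, assuming that breaking at sentence-ending punctuation is acceptable; `split_long_turn` is a hypothetical helper, and the resulting chunks would be sent as consecutive turns for the same speaker.

```python
import re
from typing import List

MAX_WORDS = 150  # per-segment ceiling from this guide


def split_long_turn(text: str, max_words: int = MAX_WORDS) -> List[str]:
    """Split one turn into chunks of at most `max_words` words.

    Breaks happen only between sentences, so a single sentence longer
    than `max_words` is kept whole rather than cut mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        if not sentence:
            continue  # empty input produces one empty string
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A turn that is already under the limit comes back as a single-element list, so the function can be applied to every turn without special cases.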

Structuring a Full Episode

For a 5-minute podcast segment (approx. 750 words of dialogue):

Opening hook — 2–3 short exchanges to establish character voices
Main content — alternating turns, each 30–80 words
Key insight moment — one speaker delivers a 100–120 word uninterrupted point
Reaction and discussion — 4–6 short exchanges
Closing — 2 lines each, clear wrap-up
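The 750-words-for-5-minutes figure implies a speaking rate of about 150 words per minute. A small estimator built on that rate, as a sketch only (`estimate_minutes` is a hypothetical helper, and actual duration will vary with voice and punctuation):

```python
from typing import List

WORDS_PER_MINUTE = 150  # implied by this guide: 750 words for about 5 minutes


def estimate_minutes(lines: List[str]) -> float:
    """Estimate spoken duration in minutes from the dialogue text alone."""
    total_words = sum(len(line.split()) for line in lines)
    return total_words / WORDS_PER_MINUTE
```

Pass only the spoken text of each turn, without the speaker labels, so the labels do not inflate the word count.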

Keep total segments under 25 per request for reliability. For longer content, split into multiple requests and concatenate the MP3 files.
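Splitting a long script into requests of fewer than 25 segments can be sketched as below. `batch_segments` is a hypothetical helper, and the segment dicts are whatever the request body uses; joining the resulting MP3 files is left to an audio tool and is not shown.

```python
from typing import Iterator, List

MAX_SEGMENTS = 24  # "under 25 per request", per this guide


def batch_segments(inputs: List[dict], size: int = MAX_SEGMENTS) -> Iterator[List[dict]]:
    """Yield consecutive slices of `inputs`, each with at most `size` segments."""
    for start in range(0, len(inputs), size):
        yield inputs[start:start + size]
```

Batches preserve the original turn order, so the audio files can be concatenated in the order the batches were produced.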

Workflow with Other Skills

Generate transcript with an LLM → produce audio with /pocket-knife:elevenlabs-dialogue
Add background music: generate with /pocket-knife:elevenlabs-music → layer in an audio editor
Transcribe an existing conversation with /pocket-knife:elevenlabs-stt → adapt as a script → re-voice