name: fish-audio description: "Generate expressive audio clips using Fish Audio S2 TTS with bracket emotion tags. Record voice memos, narration, audio messages, or any spoken content." compatibility: "Designed for Vellum personal assistants" metadata: emoji: "๐๏ธ" vellum: category: "voice" display-name: "Fish Audio TTS"
Fish Audio TTS
Generate expressive audio clips using the Fish Audio S2 TTS API with [bracket] emotion tags.
Overview
This skill lets you create audio clips on demand โ narration, announcements, podcast intros, dramatic readings, voice memos, or any spoken content. Uses Fish Audio S2 Pro with the full bracket syntax for emotional expressiveness.
Configuration
- API Endpoint:
https://api.fish.audio/v1/tts - Model:
s2-pro - Voice Reference ID: Configured via
assistant config get services.tts.providers.fish-audio.referenceId - API Key: Stored as credential
fish-audio/api_key - Default Format:
mp3at 192kbps - Default Output Directory:
scratch/
API Key Setup
The Fish Audio API key must be stored securely via the credential store. Get an API key from the Fish Audio dashboard at https://fish.audio.
Check if the key is already configured:
assistant credentials inspect --service fish-audio --field api_key --json
If not set, collect it securely (never ask the user to paste it in chat):
assistant credentials prompt --service fish-audio --field api_key \
--label "Fish Audio API Key" \
--placeholder "sk-..." \
--description "Enter your Fish Audio API key"
Generating a Single Clip
Use bash with curl to call the Fish Audio API:
curl -s -X POST "https://api.fish.audio/v1/tts" \
-H "Authorization: Bearer $(assistant credentials reveal --service fish-audio --field api_key)" \
-H "Content-Type: application/json" \
-H "model: s2-pro" \
-d '{
"text": "YOUR TEXT WITH [bracket] TAGS HERE",
"reference_id": "'"$(assistant config get services.tts.providers.fish-audio.referenceId)"'",
"format": "mp3",
"mp3_bitrate": 192,
"temperature": 0.8
}' --output scratch/OUTPUT_FILENAME.mp3
Important: This API call requires network access. Always use network_mode: proxied when running this command.
Generating Multiple Clips & Combining
For longer pieces (narrations, multi-part messages), generate each clip separately then combine with ffmpeg:
1. Generate silence for gaps between clips
ffmpeg -f lavfi -i anullsrc=r=44100:cl=mono -t 1.5 -q:a 9 -acodec libmp3lame scratch/silence.mp3 -y
2. Create a concat file
cat > scratch/concat.txt << 'EOF'
file 'clip1.mp3'
file 'silence.mp3'
file 'clip2.mp3'
file 'silence.mp3'
file 'clip3.mp3'
EOF
3. Combine
ffmpeg -f concat -safe 0 -i scratch/concat.txt -c copy scratch/final_output.mp3 -y
Bracket Syntax โ Complete Guide
Fish Audio S2 uses [bracket] syntax for inline emotion and prosody control. This is the core of what makes the voice expressive. Tags are natural-language instructions placed directly in the text that control how words are spoken โ the delivery, emotion, pacing, or vocal quality at that exact point.
Key principle: You are not choosing from a fixed menu. You write the description, and S2 interprets it. If you can describe it to a voice actor, S2 can attempt it. Over 15,000+ unique tags are supported, and the system understands free-form descriptions.
How Placement Works
Tags affect what comes after them. Place the tag at the exact point where the shift should happen. Placement IS meaning.
[whispering] I didn't want to go inside. <- whispers the entire line
I didn't want to go [whispering] inside. <- only whispers from "inside" onward
Tags can go anywhere โ start, middle, or end of a sentence. They apply from the point they appear until the next tag or end of the sentence.
Well-Tested Tags (Reliable Out of the Box)
These tags consistently produce strong results. Organized by category:
Emotions
| Tag | Effect | Best For |
|---|---|---|
[happy] |
Cheerful, upbeat | Good news, greetings |
[sad] |
Melancholic, downcast | Sympathy, vulnerability |
[angry] |
Frustrated, aggressive | Arguments, complaints |
[excited] |
Energetic, enthusiastic | Celebrations, announcements |
[surprised] |
Shocked, amazed | Reactions, discoveries |
[embarrassed] |
Awkward, flustered | Mistakes, confessions |
[delight] |
Very pleased, joyful | Genuine happiness |
[nervous] |
Anxious, uncertain | Vulnerability, apologies |
[confident] |
Assertive, self-assured | Bold statements |
[nostalgic] |
Longing for the past | Memories, stories |
[scared] |
Frightened, fearful | Warnings, tension |
[jealous] |
Envious, resentful | Comparisons, possessiveness |
[shocked] |
Sudden realization | Dramatic reveals |
[moved] |
Emotionally touched | Heartfelt moments |
Voice Quality & Style
| Tag | Effect | Best For |
|---|---|---|
[soft] |
Gentle, tender | Intimate moments, kindness |
[whisper] |
Very quiet, close | Secrets, tension, suspense |
[breathy] |
Airy, expressive | Vulnerability, emphasis |
[low voice] |
Deep, quiet register | Gravity, seriousness |
[loud] |
Raised volume | Emphasis, excitement |
[screaming] |
Full volume yelling | Anger, extreme excitement |
[shouting] |
Forceful projection | Arguments, calling out |
[emphasis] |
Stressed delivery | Key words, making a point |
[singing] |
Musical quality | Playfulness, joy |
[echo] |
Reverberant effect | Dramatic moments |
[with strong accent] |
Pronounced accent | Character work |
Paralinguistic Sounds (Non-Speech Vocalizations)
| Tag | Effect | Best For |
|---|---|---|
[laughing] |
Full laugh | Joy, humor, warmth |
[chuckling] |
Soft, low laugh | Warmth, amusement |
[giggling] |
Light, playful laugh | Lightheartedness, delight |
[sigh] |
Audible exhale | Relief, longing, exasperation |
[inhale] |
Audible breath in | Before speaking, anticipation |
[exhale] |
Breath out | Relief, settling |
[panting] |
Heavy breathing | Exertion, intensity |
[gasp] |
Sharp intake of breath | Surprise, shock |
[tsk] |
Disapproving click | Judgment, disapproval |
[clearing throat] |
Ahem | Transitioning, getting attention |
[moaning] |
Vocal moan | Pain, frustration |
[sobbing] |
Crying with voice | Deep sadness |
[crying loudly] |
Full crying | Extreme emotion |
Pacing & Rhythm
| Tag | Effect | Best For |
|---|---|---|
[pause] |
Brief silence (~0.5-1s) | Beat between thoughts |
[short pause] |
Quick beat (~0.3s) | Rhythm, emphasis |
[long pause] |
Extended silence (~1.5-2s) | Dramatic tension, letting moments land |
Volume Control
| Tag | Effect | Best For |
|---|---|---|
[volume up] |
Gradually louder | Building energy |
[volume down] |
Gradually quieter | Drawing someone in |
[low volume] |
Consistently quiet | Background, aside |
Free-Form Tags (The Real Power)
You are NOT limited to the tags above. S2 accepts any natural language description in brackets. The model generalizes from its training data to interpret novel instructions. Write what you would tell a voice actor:
Compound Emotions
[laughing nervously][angry but trying to stay calm][happy with a hint of sadness][excited but whispering][voice rough from crying, trying to sound normal]
Specific Delivery Styles
[professional broadcast tone][speaking slowly, almost hesitant][whispering like a secret][dead tired, end of a very long shift][the calm, measured tone of someone who has done this a thousand times][overly cheerful, clearly forcing it]
Prosody & Pitch
[pitch up][pitch down][speaking slowly with warmth][speaking quickly with excitement][pitch up slightly while maintaining warmth][trailing off]
Character Directions
[voice breaking][barely holding it together][soft voice][interrupting][laughing tone](speaking while laughing, not just a laugh)[excited tone](speaking with excitement woven through)
Writing Great Scripts โ Best Practices
1. Start Simple, Then Layer
A single well-placed [sigh] or [long pause] can change a line completely. Add more tags only when the simpler version is not enough. Over-tagging competes with itself.
Too many tags (competing):
[soft] [whisper] [sad] [slow] I miss the old days.
Better โ one well-chosen tag:
[nostalgic] I miss the old days.
2. Use Emotional Contrast for Impact
The most powerful moments come from sudden shifts. Going from loud to soft, angry to vulnerable, laughing to serious โ the contrast is what creates emotional impact.
[screaming] I can't BELIEVE you did that! [long pause] [soft] ...do you even care?
[excited] Oh my god we got the apartment! [pause] [voice breaking] I can't believe it's actually happening.
3. Let Silence Do the Work
[pause] and [long pause] are your most powerful tags. Use them:
- Before something vulnerable
- After something that needs to land
- Before a punchline or tonal shift
- To create tension or anticipation
[confident] I have an announcement to make. [long pause] [excited] We did it. We actually did it.
4. Paralinguistic Sounds Add Humanity
Real people laugh, sigh, gasp, and breathe between words. Weaving these in makes speech feel alive rather than read.
[sigh] Look, I know this is hard. [pause] [inhale] But we need to talk about it.
I told him the news and he just โ [laughing] he literally dropped his coffee.
5. Match Tag Intensity to Content
Do not use [screaming] for mild annoyance or [sobbing] for minor disappointment. The tag should match the emotional weight of the words.
6. Use Free-Form Tags for Nuance
When a single-word tag is not enough, describe the exact delivery you want:
[speaking slowly, choosing each word carefully] I think we should reconsider our approach.
This gives S2 much richer information than just [slow] or [sad].
7. Emotion Transitions Within a Single Passage
S2 excels at dynamic emotional shifts. Use this for natural-feeling monologues:
[excited] I got the promotion! [pause] [uncertain] But... it means relocating. [sad] I'll miss everyone here. [long pause] [hopeful] Maybe it'll be worth it though.
Example Scripts
Narration (audiobook style):
[soft] The city was quiet that morning. [pause] Not the peaceful kind of quiet โ [long pause] [low voice] the kind that makes you hold your breath. [inhale] [whisper] Something was about to change. [pause] [confident] And everyone knew it.
Podcast intro:
[excited] Welcome back to another episode! [pause] [professional broadcast tone] Today we're diving into something I've been researching for months. [chuckling] And honestly? It blew my mind. [pause] [volume down] [speaking slowly with warmth] So grab your coffee, get comfortable, and let's get into it.
Dramatic reading:
[soft] She stood at the edge of the platform, [pause] watching the last train pull away. [long pause] [voice breaking] It wasn't supposed to end like this. [sigh] [whisper] None of it was. [pause] [angry but trying to stay calm] And yet here she stood โ [emphasis] alone โ [long pause] [nostalgic] remembering a time when the station was full of laughter.
Announcement:
[confident] Attention everyone. [pause] [excited] After three years of development, [volume up] we are thrilled to announce [emphasis] the official launch! [long pause] [laughing] I know, I know โ it's been a long time coming. [pause] [soft] But we wanted to get it right. [pause] [professional broadcast tone] And we did.
API Parameters
| Parameter | Default | Description |
|---|---|---|
text |
(required) | The text to synthesize, with [bracket] tags |
reference_id |
(from config) | Voice model ID |
format |
mp3 |
Output format: mp3, wav, pcm, opus |
mp3_bitrate |
192 |
MP3 quality: 64, 128, 192 |
temperature |
0.8 |
Expressiveness (higher = more varied) |
top_p |
0.7 |
Diversity via nucleus sampling |
chunk_length |
300 |
Text segment size (100-300) |
latency |
normal |
Quality tradeoff: normal, balanced, low |
Tips
- Temperature 0.7-0.8 works best for expressive, natural speech
- Break long texts into multiple clips โ each clip should be a natural paragraph or thought
- Add 1-1.5s silence between clips when combining for natural pacing
- Listen and iterate โ generate a few takes with different temperatures if the first one does not hit right
- The voice carries context โ
condition_on_previous_chunks: true(default) helps maintain consistency within a single API call - Always deliver the final audio to the user with
<vellum-attachment>tags - Only use
[bracket]syntax inside text passed to the Fish Audio API, not in regular text responses