fish-audio - SKILL.md Agent Skill

name: fish-audio description: "Generate expressive audio clips using Fish Audio S2 TTS with bracket emotion tags. Record voice memos, narration, audio messages, or any spoken content." compatibility: "Designed for Vellum personal assistants" metadata: emoji: "🎙️" vellum: category: "voice" display-name: "Fish Audio TTS"

Fish Audio TTS

Generate expressive audio clips using the Fish Audio S2 TTS API with [bracket] emotion tags.

Overview

This skill lets you create audio clips on demand — narration, announcements, podcast intros, dramatic readings, voice memos, or any spoken content. Uses Fish Audio S2 Pro with the full bracket syntax for emotional expressiveness.

Configuration

API Endpoint: https://api.fish.audio/v1/tts
Model: s2-pro
Voice Reference ID: Configured via assistant config get services.tts.providers.fish-audio.referenceId
API Key: Stored as credential fish-audio/api_key
Default Format: mp3 at 192kbps
Default Output Directory: scratch/

API Key Setup

The Fish Audio API key must be stored securely via the credential store. Get an API key from the Fish Audio dashboard at https://fish.audio.

Check if the key is already configured:

assistant credentials inspect --service fish-audio --field api_key --json

If not set, collect it securely (never ask the user to paste it in chat):

assistant credentials prompt --service fish-audio --field api_key \
  --label "Fish Audio API Key" \
  --placeholder "sk-..." \
  --description "Enter your Fish Audio API key"

Generating a Single Clip

Use bash with curl to call the Fish Audio API:

curl -s -X POST "https://api.fish.audio/v1/tts" \
  -H "Authorization: Bearer $(assistant credentials reveal --service fish-audio --field api_key)" \
  -H "Content-Type: application/json" \
  -H "model: s2-pro" \
  -d '{
    "text": "YOUR TEXT WITH [bracket] TAGS HERE",
    "reference_id": "'"$(assistant config get services.tts.providers.fish-audio.referenceId)"'",
    "format": "mp3",
    "mp3_bitrate": 192,
    "temperature": 0.8
  }' --output scratch/OUTPUT_FILENAME.mp3

Important: This API call requires network access. Always use network_mode: proxied when running this command.

Generating Multiple Clips & Combining

For longer pieces (narrations, multi-part messages), generate each clip separately then combine with ffmpeg:

1. Generate silence for gaps between clips

ffmpeg -f lavfi -i anullsrc=r=44100:cl=mono -t 1.5 -q:a 9 -acodec libmp3lame scratch/silence.mp3 -y

2. Create a concat file

cat > scratch/concat.txt << 'EOF'
file 'clip1.mp3'
file 'silence.mp3'
file 'clip2.mp3'
file 'silence.mp3'
file 'clip3.mp3'
EOF

3. Combine

ffmpeg -f concat -safe 0 -i scratch/concat.txt -c copy scratch/final_output.mp3 -y

Bracket Syntax — Complete Guide

Fish Audio S2 uses [bracket] syntax for inline emotion and prosody control. This is the core of what makes the voice expressive. Tags are natural-language instructions placed directly in the text that control how words are spoken — the delivery, emotion, pacing, or vocal quality at that exact point.

Key principle: You are not choosing from a fixed menu. You write the description, and S2 interprets it. If you can describe it to a voice actor, S2 can attempt it. Over 15,000+ unique tags are supported, and the system understands free-form descriptions.

How Placement Works

Tags affect what comes after them. Place the tag at the exact point where the shift should happen. Placement IS meaning.

[whispering] I didn't want to go inside.     <- whispers the entire line
I didn't want to go [whispering] inside.     <- only whispers from "inside" onward

Tags can go anywhere — start, middle, or end of a sentence. They apply from the point they appear until the next tag or end of the sentence.

Well-Tested Tags (Reliable Out of the Box)

These tags consistently produce strong results. Organized by category:

Emotions

Tag	Effect	Best For
`[happy]`	Cheerful, upbeat	Good news, greetings
`[sad]`	Melancholic, downcast	Sympathy, vulnerability
`[angry]`	Frustrated, aggressive	Arguments, complaints
`[excited]`	Energetic, enthusiastic	Celebrations, announcements
`[surprised]`	Shocked, amazed	Reactions, discoveries
`[embarrassed]`	Awkward, flustered	Mistakes, confessions
`[delight]`	Very pleased, joyful	Genuine happiness
`[nervous]`	Anxious, uncertain	Vulnerability, apologies
`[confident]`	Assertive, self-assured	Bold statements
`[nostalgic]`	Longing for the past	Memories, stories
`[scared]`	Frightened, fearful	Warnings, tension
`[jealous]`	Envious, resentful	Comparisons, possessiveness
`[shocked]`	Sudden realization	Dramatic reveals
`[moved]`	Emotionally touched	Heartfelt moments

Voice Quality & Style

Tag	Effect	Best For
`[soft]`	Gentle, tender	Intimate moments, kindness
`[whisper]`	Very quiet, close	Secrets, tension, suspense
`[breathy]`	Airy, expressive	Vulnerability, emphasis
`[low voice]`	Deep, quiet register	Gravity, seriousness
`[loud]`	Raised volume	Emphasis, excitement
`[screaming]`	Full volume yelling	Anger, extreme excitement
`[shouting]`	Forceful projection	Arguments, calling out
`[emphasis]`	Stressed delivery	Key words, making a point
`[singing]`	Musical quality	Playfulness, joy
`[echo]`	Reverberant effect	Dramatic moments
`[with strong accent]`	Pronounced accent	Character work

Paralinguistic Sounds (Non-Speech Vocalizations)

Tag	Effect	Best For
`[laughing]`	Full laugh	Joy, humor, warmth
`[chuckling]`	Soft, low laugh	Warmth, amusement
`[giggling]`	Light, playful laugh	Lightheartedness, delight
`[sigh]`	Audible exhale	Relief, longing, exasperation
`[inhale]`	Audible breath in	Before speaking, anticipation
`[exhale]`	Breath out	Relief, settling
`[panting]`	Heavy breathing	Exertion, intensity
`[gasp]`	Sharp intake of breath	Surprise, shock
`[tsk]`	Disapproving click	Judgment, disapproval
`[clearing throat]`	Ahem	Transitioning, getting attention
`[moaning]`	Vocal moan	Pain, frustration
`[sobbing]`	Crying with voice	Deep sadness
`[crying loudly]`	Full crying	Extreme emotion

Pacing & Rhythm

Tag	Effect	Best For
`[pause]`	Brief silence (~0.5-1s)	Beat between thoughts
`[short pause]`	Quick beat (~0.3s)	Rhythm, emphasis
`[long pause]`	Extended silence (~1.5-2s)	Dramatic tension, letting moments land

Volume Control

Tag	Effect	Best For
`[volume up]`	Gradually louder	Building energy
`[volume down]`	Gradually quieter	Drawing someone in
`[low volume]`	Consistently quiet	Background, aside

Free-Form Tags (The Real Power)

You are NOT limited to the tags above. S2 accepts any natural language description in brackets. The model generalizes from its training data to interpret novel instructions. Write what you would tell a voice actor:

Compound Emotions

[laughing nervously]
[angry but trying to stay calm]
[happy with a hint of sadness]
[excited but whispering]
[voice rough from crying, trying to sound normal]

Specific Delivery Styles

[professional broadcast tone]
[speaking slowly, almost hesitant]
[whispering like a secret]
[dead tired, end of a very long shift]
[the calm, measured tone of someone who has done this a thousand times]
[overly cheerful, clearly forcing it]

Prosody & Pitch

[pitch up]
[pitch down]
[speaking slowly with warmth]
[speaking quickly with excitement]
[pitch up slightly while maintaining warmth]
[trailing off]

Character Directions

[voice breaking]
[barely holding it together]
[soft voice]
[interrupting]
[laughing tone] (speaking while laughing, not just a laugh)
[excited tone] (speaking with excitement woven through)

Writing Great Scripts — Best Practices

1. Start Simple, Then Layer

A single well-placed [sigh] or [long pause] can change a line completely. Add more tags only when the simpler version is not enough. Over-tagging competes with itself.

Too many tags (competing):

[soft] [whisper] [sad] [slow] I miss the old days.

Better — one well-chosen tag:

[nostalgic] I miss the old days.

2. Use Emotional Contrast for Impact

The most powerful moments come from sudden shifts. Going from loud to soft, angry to vulnerable, laughing to serious — the contrast is what creates emotional impact.

[screaming] I can't BELIEVE you did that! [long pause] [soft] ...do you even care?

[excited] Oh my god we got the apartment! [pause] [voice breaking] I can't believe it's actually happening.

3. Let Silence Do the Work

[pause] and [long pause] are your most powerful tags. Use them:

Before something vulnerable
After something that needs to land
Before a punchline or tonal shift
To create tension or anticipation

[confident] I have an announcement to make. [long pause] [excited] We did it. We actually did it.

4. Paralinguistic Sounds Add Humanity

Real people laugh, sigh, gasp, and breathe between words. Weaving these in makes speech feel alive rather than read.

[sigh] Look, I know this is hard. [pause] [inhale] But we need to talk about it.

I told him the news and he just — [laughing] he literally dropped his coffee.

5. Match Tag Intensity to Content

Do not use [screaming] for mild annoyance or [sobbing] for minor disappointment. The tag should match the emotional weight of the words.

6. Use Free-Form Tags for Nuance

When a single-word tag is not enough, describe the exact delivery you want:

[speaking slowly, choosing each word carefully] I think we should reconsider our approach.

This gives S2 much richer information than just [slow] or [sad].

7. Emotion Transitions Within a Single Passage

S2 excels at dynamic emotional shifts. Use this for natural-feeling monologues:

[excited] I got the promotion! [pause] [uncertain] But... it means relocating. [sad] I'll miss everyone here. [long pause] [hopeful] Maybe it'll be worth it though.

Example Scripts

Narration (audiobook style):

[soft] The city was quiet that morning. [pause] Not the peaceful kind of quiet — [long pause] [low voice] the kind that makes you hold your breath. [inhale] [whisper] Something was about to change. [pause] [confident] And everyone knew it.

Podcast intro:

[excited] Welcome back to another episode! [pause] [professional broadcast tone] Today we're diving into something I've been researching for months. [chuckling] And honestly? It blew my mind. [pause] [volume down] [speaking slowly with warmth] So grab your coffee, get comfortable, and let's get into it.

Dramatic reading:

[soft] She stood at the edge of the platform, [pause] watching the last train pull away. [long pause] [voice breaking] It wasn't supposed to end like this. [sigh] [whisper] None of it was. [pause] [angry but trying to stay calm] And yet here she stood — [emphasis] alone — [long pause] [nostalgic] remembering a time when the station was full of laughter.

Announcement:

[confident] Attention everyone. [pause] [excited] After three years of development, [volume up] we are thrilled to announce [emphasis] the official launch! [long pause] [laughing] I know, I know — it's been a long time coming. [pause] [soft] But we wanted to get it right. [pause] [professional broadcast tone] And we did.

API Parameters

Parameter	Default	Description
`text`	(required)	The text to synthesize, with [bracket] tags
`reference_id`	(from config)	Voice model ID
`format`	`mp3`	Output format: `mp3`, `wav`, `pcm`, `opus`
`mp3_bitrate`	`192`	MP3 quality: `64`, `128`, `192`
`temperature`	`0.8`	Expressiveness (higher = more varied)
`top_p`	`0.7`	Diversity via nucleus sampling
`chunk_length`	`300`	Text segment size (100-300)
`latency`	`normal`	Quality tradeoff: `normal`, `balanced`, `low`

Tips

Temperature 0.7-0.8 works best for expressive, natural speech
Break long texts into multiple clips — each clip should be a natural paragraph or thought
Add 1-1.5s silence between clips when combining for natural pacing
Listen and iterate — generate a few takes with different temperatures if the first one does not hit right
The voice carries context — condition_on_previous_chunks: true (default) helps maintain consistency within a single API call
Always deliver the final audio to the user with <vellum-attachment> tags
Only use [bracket] syntax inside text passed to the Fish Audio API, not in regular text responses