doubao-tts - SKILL.md Agent Skill

name: doubao-tts description: Generate high-quality speech audio using Doubao (豆包/Volcengine) TTS API. Use this skill when the user asks to generate audio, podcasts, voiceovers, or text-to-speech output.

Doubao TTS — 豆包语音合成

Generate high-quality speech audio from text using Volcengine's Doubao TTS API. Supports short-form (real-time) and long-form (async, up to 100K characters) synthesis.

When to Use

User asks to generate audio, podcasts, voiceovers, or narration
User wants text-to-speech for any content
User asks to "read this aloud" or "make an audio version"

Quick Usage

Use the doubao-tts CLI tool (installed at bin/doubao-tts):

# Short text (real-time, < 300 chars)
bin/doubao-tts "你好世界" -o output.mp3

# Long text from file (async mode, up to 100K chars)
bin/doubao-tts -f article.txt -o podcast.mp3

# Pipe content
echo "Hello world" | bin/doubao-tts -o hello.mp3

# Choose voice
bin/doubao-tts "你好" -v zh_male_aojiaobazong_moon_bigtts -o output.mp3

# Adjust speed/volume/pitch
bin/doubao-tts "你好" --speed 1.2 --volume 1.5 -o output.mp3

Available Voices (已验证可用)

Chinese Female

Voice ID	Description
`zh_female_sajiaonvyou_moon_bigtts`	撒娇女友 (default)
`zh_female_gaolengyujie_moon_bigtts`	高冷御姐
`zh_female_tianmeixiaoyuan_moon_bigtts`	甜美校园
`zh_female_yuanqinvyou_moon_bigtts`	元气女友
`zh_female_wanwanxiaohe_moon_bigtts`	弯弯小何
`zh_female_linjianvhai_moon_bigtts`	邻家女孩

Chinese Male

Voice ID	Description
`zh_male_aojiaobazong_moon_bigtts`	傲娇霸总
`zh_male_jingqiangkanye_moon_bigtts`	京腔侃爷
`zh_male_wennuanahu_moon_bigtts`	温暖阿虎
`zh_male_yangguangqingnian_moon_bigtts`	阳光青年

Note: 其他音色 (BV系列, mars后缀) 需要不同的 resource ID。如需更多音色，请在火山引擎控制台开通对应资源。

API Details

Environment Variables (already configured in MetaBot .env)

VOLCENGINE_TTS_APPID=<app_id>
VOLCENGINE_TTS_ACCESS_KEY=<access_key>
VOLCENGINE_TTS_RESOURCE_ID=volc.service_type.10029  (optional)

Short-form API (real-time, < 300 chars)

Endpoint: https://openspeech.bytedance.com/api/v3/tts/unidirectional
Response: chunked JSON with base64 audio in data field
Latency: < 1 second

Long-form API (async, up to 100K chars)

Submit: POST https://openspeech.bytedance.com/api/v1/tts_async/submit
Query: GET https://openspeech.bytedance.com/api/v1/tts_async/query?appid=X&task_id=Y
Response: audio_url (valid for 1 hour)
Latency: seconds to minutes depending on text length

Workflow for Podcasts

Write the script — Create the podcast script as markdown or plain text
Generate audio — Use bin/doubao-tts -f script.txt -v zh_male_aojiaobazong_moon_bigtts -o podcast.mp3
Copy to outputs — cp podcast.mp3 /tmp/metabot-outputs/<chatId>/ to send to user
For multi-voice podcasts, generate each speaker's segments separately, then concatenate with ffmpeg

Multi-Voice Podcast Example

# Generate segments for different speakers
bin/doubao-tts -f host_lines.txt -v zh_male_aojiaobazong_moon_bigtts -o host.mp3
bin/doubao-tts -f guest_lines.txt -v zh_female_gaolengyujie_moon_bigtts -o guest.mp3

# Concatenate (requires ffmpeg)
echo "file 'host.mp3'" > list.txt
echo "file 'guest.mp3'" >> list.txt
ffmpeg -f concat -safe 0 -i list.txt -c copy podcast.mp3

Raw curl (if CLI not available)

# Short-form
curl -X POST "https://openspeech.bytedance.com/api/v3/tts/unidirectional" \
  -H "Content-Type: application/json" \
  -H "X-Api-App-Id: $VOLCENGINE_TTS_APPID" \
  -H "X-Api-Access-Key: $VOLCENGINE_TTS_ACCESS_KEY" \
  -H "X-Api-Resource-Id: volc.service_type.10029" \
  -H "X-Api-Request-Id: $(uuidgen)" \
  -d '{
    "req_params": {
      "text": "你好世界",
      "speaker": "zh_female_sajiaonvyou_moon_bigtts",
      "audio_params": {"format": "mp3", "sample_rate": 24000}
    }
  }' | python3 -c "
import sys, json, base64
chunks = []
for line in sys.stdin:
    line = line.strip()
    if not line: continue
    try:
        d = json.loads(line)
        if d.get('data'): chunks.append(base64.b64decode(d['data']))
    except: pass
sys.stdout.buffer.write(b''.join(chunks))
" > output.mp3