name: whisper-transcribe description: "Transcribe audio and video files to text using the Whisper speech-to-text API at {{WHISPER_HOST}}:{{WHISPER_PORT}}." metadata: openclaw: emoji: "🎙️"
Whisper Transcribe Skill
Whisper speech-to-text service is available at http://{{WHISPER_HOST}}:{{WHISPER_PORT}} within the Docker network.
Basic Transcription
Transcribe an audio file to text:
curl -X POST "http://{{WHISPER_HOST}}:{{WHISPER_PORT}}/asr" \
-F "audio_file=@/data/input/recording.wav" \
-F "output=json"
Transcription with Options
Specify language, task, and output format:
# Transcribe with language hint
curl -X POST "http://{{WHISPER_HOST}}:{{WHISPER_PORT}}/asr" \
-F "audio_file=@/data/input/recording.wav" \
-F "language=en" \
-F "output=json"
# Translate non-English audio to English
curl -X POST "http://{{WHISPER_HOST}}:{{WHISPER_PORT}}/asr" \
-F "audio_file=@/data/input/french_audio.mp3" \
-F "task=translate" \
-F "output=json"
Output Formats
Choose between different output formats:
# Plain text output
curl -X POST "http://{{WHISPER_HOST}}:{{WHISPER_PORT}}/asr" \
-F "audio_file=@/data/input/recording.wav" \
-F "output=txt"
# JSON with timestamps
curl -X POST "http://{{WHISPER_HOST}}:{{WHISPER_PORT}}/asr" \
-F "audio_file=@/data/input/recording.wav" \
-F "output=json"
# SRT subtitles
curl -X POST "http://{{WHISPER_HOST}}:{{WHISPER_PORT}}/asr" \
-F "audio_file=@/data/input/recording.wav" \
-F "output=srt"
# VTT subtitles
curl -X POST "http://{{WHISPER_HOST}}:{{WHISPER_PORT}}/asr" \
-F "audio_file=@/data/input/recording.wav" \
-F "output=vtt"
# Verbose JSON with word-level timestamps
curl -X POST "http://{{WHISPER_HOST}}:{{WHISPER_PORT}}/asr" \
-F "audio_file=@/data/input/recording.wav" \
-F "output=verbose_json" \
-F "word_timestamps=true"
Preparing Audio with FFmpeg
For best results, convert audio to the optimal format before sending to Whisper:
# Convert any media to 16kHz mono WAV (optimal for Whisper)
ffmpeg -i /data/input/video.mp4 \
-vn -ar 16000 -ac 1 -c:a pcm_s16le \
/data/input/audio_for_whisper.wav
# Extract and convert specific time range
ffmpeg -i /data/input/long_recording.mp3 \
-ss 00:05:00 -to 00:10:00 \
-vn -ar 16000 -ac 1 -c:a pcm_s16le \
/data/input/segment.wav
Saving Transcription Output
# Save transcription to file
curl -X POST "http://{{WHISPER_HOST}}:{{WHISPER_PORT}}/asr" \
-F "audio_file=@/data/input/recording.wav" \
-F "output=json" \
-o /data/output/transcription.json
# Save subtitles
curl -X POST "http://{{WHISPER_HOST}}:{{WHISPER_PORT}}/asr" \
-F "audio_file=@/data/input/recording.wav" \
-F "output=srt" \
-o /data/output/subtitles.srt
Response Structure (JSON)
{
"text": "The full transcription text goes here.",
"segments": [
{
"id": 0,
"start": 0.0,
"end": 4.5,
"text": "The full transcription",
"avg_logprob": -0.25,
"no_speech_prob": 0.01
},
{
"id": 1,
"start": 4.5,
"end": 7.2,
"text": " text goes here.",
"avg_logprob": -0.18,
"no_speech_prob": 0.02
}
],
"language": "en"
}
Supported Audio Formats
- WAV (recommended for best quality)
- MP3
- FLAC
- OGG
- M4A
- WebM
Tips for AI Agents
- Always convert input audio to 16kHz mono WAV for the best accuracy and fastest processing.
- Use
languageparameter when you know the source language to improve accuracy and speed. - Use
task=translateto translate any language directly to English text. - For long audio (>30 minutes), split into segments with FFmpeg first, then transcribe each segment.
- Check
no_speech_probin segments to filter out silence or noise sections. - Use
output=verbose_jsonwithword_timestamps=truewhen you need precise timing for subtitle sync. - SRT/VTT output formats are ready to use as video subtitles without additional processing.
- Check Whisper service health at
http://{{WHISPER_HOST}}:{{WHISPER_PORT}}/or/health.