name: audio-transcription
description: Transcribe voice messages and audio files using Gemini's multimodal API. Automatically handles OGG/Opus format, extracts text content, and provides actionable summaries.
metadata:
  openclaw:
    emoji: 🎙️
    requires:
      envVars:
        - GEMINI_API_KEY
Audio Transcription Skill
Transcribes voice messages and audio files using Gemini's multimodal API.
When to Use
- User sends a voice message (.ogg file)
- User asks to transcribe an audio file
- User wants to know what was said in a voice memo
Quick Start
# Transcribe an audio file
~/.openclaw/skills/audio-transcription/bin/transcribe_audio /path/to/audio.ogg
# Get JSON output with metadata
~/.openclaw/skills/audio-transcription/bin/transcribe_audio --json /path/to/audio.ogg
How It Works
- Receives audio file (e.g., .ogg voice message)
- Base64-encodes the audio and embeds it inline in the request body
- Sends to Gemini 2.5 Flash API
- Returns transcription with optional summary
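The first two steps above can be sketched as follows. This is a minimal illustration, not the skill's actual code: the function name `build_payload` and the prompt text are assumptions, and the JSON layout follows the public `generateContent` request shape. The base64 output is streamed into a file and never passed as a command-line argument.

```shell
# Sketch only: build_payload and the prompt wording are hypothetical.
# Usage: build_payload <audio-file> <mime-type> <output-json-file>
build_payload() {
  {
    printf '{"contents":[{"parts":['
    printf '{"text":"Transcribe this audio."},'
    printf '{"inline_data":{"mime_type":"%s","data":"' "$2"
    # Stream the base64 text straight into the file; strip newlines because
    # a JSON string cannot contain raw line breaks.
    base64 < "$1" | tr -d '\n'
    printf '"}}]}]}\n'
  } > "$3"
}
```

Calling `build_payload voice.ogg audio/ogg payload.json` would leave a request body in `payload.json` that can be sent with `curl --data-binary @payload.json`.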
Supported Formats
| Format | MIME Type | Notes |
|---|---|---|
| OGG/Opus | audio/ogg | Voice messages (Discord, Telegram, etc.) |
| MP3 | audio/mpeg | Standard audio |
| WAV | audio/wav | Uncompressed audio |
| FLAC | audio/flac | Lossless audio |
| M4A | audio/mp4 | Apple audio format |
| WEBM | audio/webm | Web audio |
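The table above maps naturally onto a lookup by file extension. The helper below is a hypothetical sketch (the name `mime_for` is not part of the skill); it only covers the extensions listed in the table and fails for anything else.

```shell
# Usage: mime_for <filename>
# Prints the MIME type, or returns 1 for an extension not in the table.
mime_for() {
  # Lower-case the extension so VOICE.OGG and voice.ogg behave the same.
  _ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$_ext" in
    ogg)  echo "audio/ogg" ;;
    mp3)  echo "audio/mpeg" ;;
    wav)  echo "audio/wav" ;;
    flac) echo "audio/flac" ;;
    m4a)  echo "audio/mp4" ;;
    webm) echo "audio/webm" ;;
    *)    echo "unsupported audio extension: $_ext" >&2; return 1 ;;
  esac
}
```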
Usage in Conversations
When you receive a voice message:
- The audio file is saved to ~/.openclaw/media/inbound/
- Run the transcription script: ~/.openclaw/skills/audio-transcription/bin/transcribe_audio <path>
- Parse the transcription and respond to the user's request
- Take action on their behalf if requested
Example
# Transcribe a voice message
TRANSCRIPT=$(~/.openclaw/skills/audio-transcription/bin/transcribe_audio \
~/.openclaw/media/inbound/abc123.ogg)
echo "User said: $TRANSCRIPT"
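When the exact file name is not known, one option is to pick the most recently modified voice message in the inbound directory. The sketch below assumes simple file names without newlines; `transcribe_latest` and the `TRANSCRIBE_BIN` override are hypothetical names introduced for illustration.

```shell
# Path to the script; can be overridden, for example in tests.
TRANSCRIBE_BIN="${TRANSCRIBE_BIN:-$HOME/.openclaw/skills/audio-transcription/bin/transcribe_audio}"

# Usage: transcribe_latest <inbound-dir>
# Transcribes the newest .ogg file in the directory, or returns 1 if none exists.
transcribe_latest() {
  # ls -t sorts by modification time, newest first.
  _newest=$(ls -t "$1"/*.ogg 2>/dev/null | head -n 1)
  if [ -z "$_newest" ]; then
    echo "no voice messages found in $1" >&2
    return 1
  fi
  "$TRANSCRIBE_BIN" "$_newest"
}
```

A call such as `transcribe_latest ~/.openclaw/media/inbound` would then print the transcript of the latest message.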
Response Format
The script returns the raw transcription text. For structured output:
~/.openclaw/skills/audio-transcription/bin/transcribe_audio --json <path>
Returns:
{
"transcription": "...",
"model": "gemini-2.5-flash",
"tokenCount": 697
}
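The JSON output can be picked apart with a JSON-aware tool rather than text matching. The sketch below assumes `jq` is installed, which the skill itself does not state; the helper names are hypothetical.

```shell
# Usage: transcription_of <json-file>
# Prints the "transcription" field as plain text.
transcription_of() {
  jq -r '.transcription' "$1"
}

# Usage: token_count_of <json-file>
# Prints the "tokenCount" field.
token_count_of() {
  jq -r '.tokenCount' "$1"
}
```

For example, redirect the `--json` output into `result.json` and then call `transcription_of result.json`.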
API Details
- Model: gemini-2.5-flash
- API: Google Generative AI (generativelanguage.googleapis.com)
- Auth: API key from GEMINI_API_KEY environment variable
- Limits: See https://ai.google.dev/pricing
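Put together, these details suggest a request along the following lines. This is a sketch under assumptions: the `v1beta` path and the `x-goog-api-key` header follow Google's public REST examples, and the helper names `gemini_url` and `send_payload` are hypothetical; the script may do this differently.

```shell
GEMINI_MODEL="gemini-2.5-flash"

# Usage: gemini_url <model>
# Prints the generateContent endpoint for the given model (API version assumed).
gemini_url() {
  printf 'https://generativelanguage.googleapis.com/v1beta/models/%s:generateContent\n' "$1"
}

# Usage: send_payload <payload-json-file>
# Posts a prepared request body; the body is read from the file, not from argv.
send_payload() {
  curl -sS -X POST "$(gemini_url "$GEMINI_MODEL")" \
    -H "x-goog-api-key: $GEMINI_API_KEY" \
    -H "Content-Type: application/json" \
    --data-binary "@$1"
}
```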
Troubleshooting
"API key not valid"
- Ensure GEMINI_API_KEY is set in your environment
- Check the key hasn't expired
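A quick pre-flight check can rule out the first cause. The hypothetical helper below only detects a missing or empty variable; it cannot tell whether a key that is set has expired or been revoked.

```shell
# Usage: require_api_key
# Returns 1 with a message on stderr when GEMINI_API_KEY is unset or empty.
require_api_key() {
  if [ -z "${GEMINI_API_KEY:-}" ]; then
    echo "GEMINI_API_KEY is not set; export it before running transcribe_audio" >&2
    return 1
  fi
}
```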
"Audio format not supported"
- Gemini supports OGG/Opus natively (common format for voice messages)
- For other formats, ensure the MIME type matches the file (see Supported Formats)
"Argument list too long" (exit 126)
- Fixed in current version: JSON is now built via temp files to avoid the 128KB kernel MAX_ARG_STRLEN per-argument limit. Should not occur for files under 20MB.
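To see why even modest audio files hit this limit, it helps to estimate the base64 size: every 3 input bytes become 4 output characters. The sketch below uses hypothetical helper names and assumes the Linux default of 131072 bytes per argument, including the terminating NUL.

```shell
# Linux limit for a single argv string with 4 KiB pages (assumed default).
MAX_ARG_STRLEN=131072

# Usage: b64_len <byte-count>
# Prints the padded base64 length for that many input bytes.
b64_len() {
  echo $(( ($1 + 2) / 3 * 4 ))
}

# Usage: needs_temp_file <audio-file>
# Succeeds when the encoded audio would not fit in one command-line argument.
needs_temp_file() {
  _size=$(( $(wc -c < "$1") ))
  [ "$(b64_len "$_size")" -ge "$MAX_ARG_STRLEN" ]
}
```

By this estimate, any file of roughly 96 KB or more must be passed through a file or a pipe rather than as an argument.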
"File too large"
- Gemini has a 20MB limit for inline audio
- For larger files, upload the audio through the Gemini Files API first, then reference the uploaded file in the request
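The limit can be checked before any request is made. The sketch below compares the raw file size against 20 MB; treating the limit as 20 × 1024 × 1024 bytes is an assumption, and `fits_inline` is a hypothetical helper. Because base64 encoding grows the data by about a third, the encoded request is larger than the file itself, so a stricter threshold may be appropriate.

```shell
# 20 MB in bytes (assumed to mean MiB here).
INLINE_LIMIT_BYTES=$((20 * 1024 * 1024))

# Usage: fits_inline <audio-file> [limit-bytes]
# Succeeds when the file is no larger than the limit.
fits_inline() {
  _size=$(( $(wc -c < "$1") ))
  [ "$_size" -le "${2:-$INLINE_LIMIT_BYTES}" ]
}
```

A caller could run `fits_inline "$file" || echo "File too large for inline upload" >&2` before transcribing.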
Files
- bin/transcribe_audio - Main transcription script
- SKILL.md - This documentation