name: transcribe
description: "Transcribe video and audio files via Gemini API. Use when the user asks to transcribe a recording, generate a meeting summary, extract speech from video or audio, or convert speech to text. Supports mp4, mkv, webm, avi, mov, mp3, wav, ogg, m4a, flac."
transcribe — video and audio transcription
Transcribes audio and video files via the Gemini 2.5 Flash API.
Modes
Generic (default)
Verbatim speech transcription with timecodes [MM:SS] and speaker identification.
Output files:
- <name> - transcript.md: verbatim speech with timecodes
- <name> - summary.md: short summary (with the --with-summary flag)
UI analysis (--analyze-ui, video only)
Detailed video analysis with breakdown of on-screen interface, navigation, and actions, plus screenshots.
Output files:
- <name> - summary.md: short overview (topic, participants, decisions)
- <name> - detailed.md: step-by-step chronological analysis with screen content and screenshots
- <name> - transcript.md: verbatim speech with timecodes
- screenshots/: PNG frames of key moments
Usage
/transcribe <FilePath> [--output-dir DIR] [--analyze-ui] [--with-summary] [--format md|txt]
| Parameter | Required | Default | Description |
|---|---|---|---|
| FilePath | yes | none | Path to audio / video file |
| --output-dir | no | <file_folder>/Transcript/<name>/ | Result directory |
| --analyze-ui | no | off | UI-analysis mode (video only) |
| --with-summary | no | off | Add a summary (for generic mode) |
| --format | no | md | Output format: md or txt |
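The parameter table can be mirrored with a small `argparse` setup. This is only a sketch of how such flags might be parsed, not the actual contents of `transcribe.py`; the function names (`build_parser`, `default_output_dir`, `parse_args`) are hypothetical, and `<name>` is assumed to be the file name without its extension.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Declare the positional path and the four optional flags from the table."""
    parser = argparse.ArgumentParser(prog="transcribe.py")
    parser.add_argument("file_path", type=Path, help="Path to audio / video file")
    parser.add_argument("--output-dir", type=Path, default=None, help="Result directory")
    parser.add_argument("--analyze-ui", action="store_true", help="UI-analysis mode (video only)")
    parser.add_argument("--with-summary", action="store_true", help="Add a summary (generic mode)")
    parser.add_argument("--format", choices=["md", "txt"], default="md", help="Output format")
    return parser


def default_output_dir(file_path: Path) -> Path:
    # <file_folder>/Transcript/<name>/ ; <name> is ASSUMED to be the file stem.
    return file_path.parent / "Transcript" / file_path.stem


def parse_args(argv):
    """Parse argv and fill in the default output directory when none is given."""
    args = build_parser().parse_args(argv)
    if args.output_dir is None:
        args.output_dir = default_output_dir(args.file_path)
    return args
```

For example, `parse_args(["/rec/demo.mp4", "--with-summary"])` would leave `format` at `md` and resolve the output directory under `/rec/Transcript/`.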
Supported formats
- Video: mp4, mkv, webm, avi, mov
- Audio: mp3, wav, ogg, m4a, flac, aac, wma
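A minimal sketch of how the extension lists above could drive mode selection. The helper names (`media_kind`, `resolve_mode`) and the mode labels are hypothetical; the audio fallback follows the behavior stated under Limitations, and how the real script detects file type (extension vs. `ffprobe`) is not specified here.

```python
from pathlib import Path

# Extension sets copied from the "Supported formats" list.
VIDEO_EXTS = {"mp4", "mkv", "webm", "avi", "mov"}
AUDIO_EXTS = {"mp3", "wav", "ogg", "m4a", "flac", "aac", "wma"}


def media_kind(path: str) -> str:
    """Return 'video' or 'audio' based on the (case-insensitive) extension."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext in VIDEO_EXTS:
        return "video"
    if ext in AUDIO_EXTS:
        return "audio"
    raise ValueError(f"Unsupported file extension: {ext!r}")


def resolve_mode(path: str, analyze_ui: bool, with_summary: bool):
    """Return (mode, with_summary); mode labels here are illustrative only."""
    kind = media_kind(path)
    if analyze_ui and kind == "video":
        return ("analyze-ui", True)  # UI analysis always writes a summary
    if analyze_ui and kind == "audio":
        return ("generic", True)     # documented fallback: generic + summary
    return ("generic", with_summary)
```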
Dependencies
- Python packages: google-genai, python-dotenv
- System: ffmpeg, ffprobe in PATH
- API key: ~/.cursor/skills/transcribe/.env (or your tool's skill folder) with GEMINI_API_KEY=...
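Before a long run it can help to verify these prerequisites up front. The sketch below uses only the standard library; `missing_prerequisites` is a hypothetical helper, and it assumes `GEMINI_API_KEY` has already been loaded into the environment (for example by python-dotenv reading the `.env` file).

```python
import os
import shutil


def missing_prerequisites(env=None):
    """Return a list of human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = []
    # ffmpeg and ffprobe must be resolvable through PATH.
    for tool in ("ffmpeg", "ffprobe"):
        if shutil.which(tool, path=env.get("PATH")) is None:
            problems.append(f"{tool} not found in PATH")
    # ASSUMPTION: the key from .env is already present in the environment mapping.
    if not env.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set")
    return problems
```

Calling `missing_prerequisites()` with no argument checks the current process environment.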
Procedure
Determine FilePath and optional flags from the user's arguments.
Run the script:
PYTHONUNBUFFERED=1 python <skill-dir>/scripts/transcribe.py "<FilePath>" [--output-dir "<OutputDir>"] [--analyze-ui] [--with-summary] [--format md|txt]
IMPORTANT: PYTHONUNBUFFERED=1 is mandatory; otherwise stdout is buffered and progress is not displayed.
The script takes a while to run (5–15 minutes, depending on the length of the recording). Files longer than 1 hour are split automatically.
After completion, report the result paths to the user.
Read the summary (if any) or the start of the transcript and show it to the user.
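If the script is launched from Python rather than a shell, the procedure above might look like the following sketch. `build_command` and `run_transcribe` are hypothetical helpers; the flags are the ones documented under Usage, and `PYTHONUNBUFFERED=1` is set in the child environment so progress lines arrive as they are printed.

```python
import os
import subprocess
import sys


def build_command(script, file_path, output_dir=None, analyze_ui=False,
                  with_summary=False, fmt="md"):
    """Assemble the argument list for transcribe.py from the documented flags."""
    cmd = [sys.executable, script, file_path]
    if output_dir:
        cmd += ["--output-dir", output_dir]
    if analyze_ui:
        cmd.append("--analyze-ui")
    if with_summary:
        cmd.append("--with-summary")
    cmd += ["--format", fmt]
    return cmd


def run_transcribe(cmd):
    """Run the command with unbuffered output and stream its progress lines."""
    env = dict(os.environ, PYTHONUNBUFFERED="1")
    proc = subprocess.Popen(cmd, env=env, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    for line in proc.stdout:
        print(line, end="", flush=True)  # relay progress immediately
    return proc.wait()                   # exit code of the script
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.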
Cost
~$0.10 per 1 hour of recording (Gemini 2.5 Flash). Long files cost proportionally.
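As a rough worked example of the proportional estimate, assuming the ~$0.10 per hour figure above holds linearly (actual billing depends on token counts and current Gemini pricing, so treat this as an approximation only):

```python
COST_PER_HOUR_USD = 0.10  # approximate figure quoted above, not an official price


def estimate_cost_usd(duration_seconds: float) -> float:
    """Linear estimate: duration in hours times the approximate hourly rate."""
    return round(duration_seconds / 3600 * COST_PER_HOUR_USD, 4)
```

Under this assumption a 90-minute recording (5400 seconds) comes to about $0.15.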
Limitations
- Maximum ~1 hour per Gemini request (the script splits automatically)
- Cyrillic file names: the script automatically copies the file to a temp location with an ASCII name
- Screenshot quality depends on source video quality
- Timecodes are accurate to within a few seconds
- --analyze-ui with an audio file automatically falls back to generic mode with a summary
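The automatic splitting mentioned in the first limitation could be planned along these lines. This is a sketch under the assumption of fixed-length, back-to-back chunks of at most one hour; `plan_chunks` is a hypothetical helper, and the real script may choose boundaries differently (for example at pauses in speech).

```python
def plan_chunks(duration_s: float, max_chunk_s: float = 3600.0):
    """Return (start, end) offsets in seconds, each spanning at most max_chunk_s."""
    if duration_s <= 0:
        return []
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)  # last chunk may be shorter
        chunks.append((start, end))
        start = end
    return chunks
```

Each pair could then be passed to ffmpeg as a start offset and duration, and the chunk's start offset added back to its timecodes when the transcripts are merged.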