name: mk:multimodal
description: Process images, video, audio, PDFs with Gemini API. Generate images (Nano Banana 2), videos (Veo 3), speech (MiniMax TTS), music (MiniMax). Convert documents to Markdown. Multi-provider fallback (Gemini → MiniMax → OpenRouter). Activate when task references media files, asks to analyze/describe/transcribe/extract/generate/convert, or involves non-text binary files.
version: 2.1.0
argument-hint: '[file-path] [task]'
phase: on-demand
source: local
trust_level: kit-authored
injection_risk: low
requires_env:
- name: MEOWKIT_GEMINI_API_KEY
  required: true
  description: Google Gemini API key (falls back to GEMINI_API_KEY)
  setup: Get from https://aistudio.google.com/apikey
  fallback: Analysis requires this key. Generation falls back to MiniMax/OpenRouter if available.
- name: MEOWKIT_MINIMAX_API_KEY
  required: false
  description: MiniMax API key for image/video/TTS/music generation
  setup: Get from https://platform.minimax.io/
keywords:
- multimodal
- gemini
- image-gen
- video-gen
- tts
- transcription
- document-conversion
- vision
- image
- video
- audio
- minimax
when_to_use: Use when processing images, video, audio, PDFs or generating media via Gemini/MiniMax/OpenRouter. Activate when task references media files.
user-invocable: true
allowed-tools:
- Bash
- Read
- Write
- Edit
owner: utility
criticality: medium
status: active
runtime: claude-code
requires_external_service: ["gemini", "minimax"]
default_enabled: false
Multimodal Analysis & Generation
Path convention: Commands below assume cwd is $CLAUDE_PROJECT_DIR (project root). Prefix paths with "$CLAUDE_PROJECT_DIR/" when invoking from subdirectories.
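A minimal sketch of the path convention, assuming a hypothetical helper named resolve_skill_path (not part of this skill) and assuming a fallback to the current directory when CLAUDE_PROJECT_DIR is unset:

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_skill_path(relative: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Anchor a project-relative path at $CLAUDE_PROJECT_DIR (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: fall back to the current directory when the variable is unset.
    root = Path(env.get("CLAUDE_PROJECT_DIR") or os.getcwd())
    return root / relative
```

Passing the result to a subprocess call lets the same command work from any subdirectory.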
Overview
Analyze images, video, audio, and documents — or generate images, videos, speech, and music — using Gemini and MiniMax APIs. Multi-provider fallback: Gemini → MiniMax → OpenRouter. All routing and models configurable via .env.
Models & Limits
Model IDs (verified 2026-04-22; provider model rosters change, so re-verify via provider dashboards before relying on specific preview IDs):
- Analysis: gemini-2.5-flash (stable default)
- Image generation: gemini-3.1-flash-image-preview (Nano Banana 2, preview-channel)
- Video generation: veo-3.1-generate-preview (Gemini, preview-channel) or MiniMax-Hailuo-2.3
- TTS: speech-2.8-hd
- Music: music-2.6
Preview-channel IDs may rotate without notice. Full live table with pricing: references/models-and-pricing.md.
MiniMax models require MEOWKIT_MINIMAX_API_KEY: image-01 (images), MiniMax-Hailuo-2.3 (video), speech-2.8-hd (TTS), music-2.6 (music).
Scripts
scripts/gemini_analyze.py — Analyze media files
.claude/skills/.venv/bin/python3 .claude/skills/multimodal/scripts/gemini_analyze.py \
--files <path> --task <analyze|transcribe|extract> [--resolution low-res] [--json] [--verbose]
scripts/gemini_generate.py — Generate images/videos (Gemini, with --provider for fallback)
.claude/skills/.venv/bin/python3 .claude/skills/multimodal/scripts/gemini_generate.py \
--task <generate-image|generate-video> --prompt "description" [--provider minimax] [--json]
scripts/minimax_generate.py — MiniMax generation (image, video, TTS, music)
.claude/skills/.venv/bin/python3 .claude/skills/multimodal/scripts/minimax_generate.py \
--task <generate-image|generate-video|generate-speech|generate-music> --prompt "..." [--json]
scripts/document_converter.py — Convert documents to Markdown
.claude/skills/.venv/bin/python3 .claude/skills/multimodal/scripts/document_converter.py \
--files doc.pdf --output ./docs/ [--json] [--verbose]
scripts/check_setup.py — Verify API keys and dependencies
.claude/skills/.venv/bin/python3 .claude/skills/multimodal/scripts/check_setup.py
Prints PASS/FAIL status for each API key (MEOWKIT_GEMINI_API_KEY, MEOWKIT_MINIMAX_API_KEY, optional MEOWKIT_OPENROUTER_API_KEY) and core Python dependencies. Run before first use to surface missing env vars.
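When driving the analysis script from Python rather than the shell, the argument list can be assembled as below. This is a sketch only: build_analyze_command is a hypothetical wrapper, and it uses just the flags shown in the usage lines above.

```python
from typing import List

VENV_PYTHON = ".claude/skills/.venv/bin/python3"
ANALYZE_SCRIPT = ".claude/skills/multimodal/scripts/gemini_analyze.py"
ANALYZE_TASKS = ("analyze", "transcribe", "extract")


def build_analyze_command(file_path: str, task: str,
                          low_res: bool = False, as_json: bool = True) -> List[str]:
    """Build the argv for gemini_analyze.py (hypothetical wrapper, paths relative to project root)."""
    if task not in ANALYZE_TASKS:
        raise ValueError(f"task must be one of {ANALYZE_TASKS}, got {task!r}")
    cmd = [VENV_PYTHON, ANALYZE_SCRIPT, "--files", file_path, "--task", task]
    if low_res:
        # --resolution low-res applies to video only.
        cmd += ["--resolution", "low-res"]
    if as_json:
        cmd.append("--json")
    return cmd
```

The list can be handed to subprocess.run(cmd, check=True, capture_output=True, text=True) with the project root as the working directory.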
References
Load only when executing the corresponding step.
| Reference | When to load | Content |
|---|---|---|
| vision-understanding.md | Image analysis | Detection, segmentation, OCR |
| audio-processing.md | Audio/video transcription | Formats, splitting, timestamps |
| models-and-pricing.md | Model selection, cost | Full model table, pricing |
| image-generation.md | Image generation | Nano Banana 2, OpenRouter fallback |
| video-generation.md | Video generation | Veo 3.1, async polling |
| video-analysis.md | Video analysis | Resolution modes, cost math |
| minimax-generation.md | MiniMax generation | Image, video, TTS, music |
| document-conversion.md | Document conversion | Formats, batch mode |
When to Use
Auto-activate on these patterns:
- Task references image (.png, .jpg, .webp), video (.mp4, .mov), audio (.mp3, .wav), or document (.pdf, .docx) files
- Task asks to "analyze", "describe", "transcribe", "extract", "OCR"
- Task asks to "generate image", "generate video", "create image"
- Task asks to "generate speech", "text to speech", "TTS"
- Task asks to "generate music", "create music"
- Task asks to "convert document", "convert PDF to markdown"
- User references a non-text binary file for processing
Do NOT invoke for text-only files (use Read), Gemini API documentation lookups (use mk:docs-finder), or an image that is already described in context.
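The file-extension trigger can be sketched as a lookup table. The helper name media_kind is hypothetical, the table contains only the extensions listed above (the real trigger set may be larger), and case-insensitive matching is an assumption.

```python
from pathlib import Path
from typing import Optional

# Extensions named in the activation list; not an exhaustive set.
MEDIA_KINDS = {
    ".png": "image", ".jpg": "image", ".webp": "image",
    ".mp4": "video", ".mov": "video",
    ".mp3": "audio", ".wav": "audio",
    ".pdf": "document", ".docx": "document",
}


def media_kind(path: str) -> Optional[str]:
    """Return the media category for a path, or None for files this skill should skip."""
    # Assumption: extensions are matched case-insensitively.
    return MEDIA_KINDS.get(Path(path).suffix.lower())
```

A None result corresponds to the text-only case, where Read is the better tool.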
API Key Check
Check: is MEOWKIT_GEMINI_API_KEY (or legacy GEMINI_API_KEY) set?
- If yes → proceed with Gemini
- If no → check MEOWKIT_MINIMAX_API_KEY for generation tasks
- If neither → STOP with setup instructions
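The check above can be read as the following decision function. It is a sketch under stated assumptions: check_api_keys and MissingApiKey are hypothetical names, and the environment is passed in as a mapping so the logic can be exercised without touching real keys.

```python
from typing import Mapping


class MissingApiKey(RuntimeError):
    """Raised when no usable key is configured (the STOP case)."""


def check_api_keys(task_kind: str, env: Mapping[str, str]) -> str:
    """Return the provider to start with; task_kind is 'analysis' or 'generation'."""
    # Gemini key first, including the legacy variable name.
    if env.get("MEOWKIT_GEMINI_API_KEY") or env.get("GEMINI_API_KEY"):
        return "gemini"
    # MiniMax can only stand in for generation tasks; analysis needs Gemini.
    if task_kind == "generation" and env.get("MEOWKIT_MINIMAX_API_KEY"):
        return "minimax"
    raise MissingApiKey(
        "Set MEOWKIT_GEMINI_API_KEY (get one from https://aistudio.google.com/apikey)"
    )
```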
Analysis Process
Detect media format → select model (default: gemini-2.5-flash) → run script. For >20MB files, use File API. Use --resolution low-res for video (62% savings, video only). Output capped at ~3000 tokens; prefer structured JSON/Markdown.
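The size rule can be sketched as below. The function name and return labels are hypothetical, and treating 20MB as 20 * 1024 * 1024 bytes is an assumption, since the text does not say whether the limit is decimal or binary.

```python
# Assumption: "20MB" is interpreted as 20 MiB.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def choose_upload_mode(size_bytes: int) -> str:
    """Return 'file-api' for files over the inline limit, otherwise 'inline'."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    return "file-api" if size_bytes > INLINE_LIMIT_BYTES else "inline"
```

In practice size_bytes would come from os.path.getsize(path) before the analysis script is invoked.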
Generation Process
Provider router auto-selects best available provider by checking API keys:
- Image: Gemini → MiniMax → OpenRouter
- Video: Gemini → MiniMax
- TTS: MiniMax only
- Music: MiniMax only
Force a provider: --provider gemini|minimax|openrouter. Override chain order via MEOWKIT_IMAGE_PROVIDER_CHAIN etc.
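The routing rules above might look like this in outline. This is not the code in provider_router.py: select_provider and has_key are hypothetical names, the comma-separated format for MEOWKIT_IMAGE_PROVIDER_CHAIN is an assumption, and only the image override is modeled because the other variable names are not given here.

```python
from typing import Dict, List, Mapping, Optional, Tuple

DEFAULT_CHAINS: Dict[str, List[str]] = {
    "image": ["gemini", "minimax", "openrouter"],
    "video": ["gemini", "minimax"],
    "tts": ["minimax"],
    "music": ["minimax"],
}

KEY_VARS: Dict[str, Tuple[str, ...]] = {
    "gemini": ("MEOWKIT_GEMINI_API_KEY", "GEMINI_API_KEY"),
    "minimax": ("MEOWKIT_MINIMAX_API_KEY",),
    "openrouter": ("MEOWKIT_OPENROUTER_API_KEY",),
}


def has_key(provider: str, env: Mapping[str, str]) -> bool:
    """True when any key variable for the provider is set and non-empty."""
    return any(env.get(name) for name in KEY_VARS.get(provider, ()))


def select_provider(kind: str, env: Mapping[str, str],
                    forced: Optional[str] = None) -> Optional[str]:
    """Return the first provider in the chain that has a key, or None."""
    if forced is not None:
        # Mirrors --provider: no fallback when a provider is forced.
        return forced if has_key(forced, env) else None
    chain = DEFAULT_CHAINS[kind]
    override = env.get("MEOWKIT_IMAGE_PROVIDER_CHAIN", "") if kind == "image" else ""
    if override.strip():
        # Assumption: the override is a comma-separated list of provider names.
        chain = [p.strip().lower() for p in override.split(",") if p.strip()]
    for provider in chain:
        if has_key(provider, env):
            return provider
    return None
```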
Files in This Skill
mk:multimodal/
├── SKILL.md
├── references/ - per-capability reference files (8 files)
├── workflows/ - step-by-step workflow guides
│   ├── audio-transcription.md
│   ├── document-extraction.md
│   └── image-analysis.md
└── scripts/ - Python modules (run via .claude/skills/.venv/bin/python3)
    ├── gemini_analyze.py - primary analysis driver (images, video, audio, PDFs)
    ├── gemini_generate.py - Gemini image/video generation with provider fallback
    ├── minimax_generate.py - MiniMax image/video/TTS/music generation
    ├── document_converter.py - PDF/docx → Markdown conversion
    ├── check_setup.py - API key and dependency verification
    ├── provider_router.py - multi-provider selection logic
    ├── api_key_rotator.py - Gemini key rotation on rate limits
    ├── minimax_api_client.py - MiniMax API client wrapper
    ├── openrouter_fallback.py - OpenRouter fallback for image gen
    ├── media_optimizer.py - pre-processing and chunking for large files
    ├── video_generator.py - async video polling + download helper
    ├── analyze_constants.py - shared model/endpoint constants
    ├── analyze_core.py - shared analysis logic reused by analyze scripts
    └── env_utils.py - env var loading and validation helpers
Gotchas
- Python venv required: if you get python3: command not found or import errors, run .claude/scripts/bin/setup-workflow once from the project root.
- Audio >15 min: Gemini truncates silently. Split first.
- PDF >100 pages: Quality degrades. Process in 20-page chunks.
- Video cost: ~263 tokens/sec at default resolution. Use --verbose for cost estimate.
- Image gen requires billing: Free tier = no gen. Use MiniMax/OpenRouter as fallback.
- MiniMax video timeout: Hailuo takes 60-180s. Max 600s.
- TTS voices: 300+ system voices (see provider catalog for live list). Default: Wise_Woman. See minimax-generation.md.
- Temperature: Keep Gemini at 1.0. Lowering causes degraded output.
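The chunking and cost figures in these gotchas reduce to simple arithmetic, sketched here with hypothetical helper names. Treating the 62% low-res savings as a reduction of the ~263 tokens/sec rate is an assumption; the --verbose estimate from the script is the authoritative number.

```python
from typing import List, Tuple


def page_chunks(total_pages: int, chunk_size: int = 20) -> List[Tuple[int, int]]:
    """Inclusive 1-based page ranges of at most chunk_size pages."""
    return [(start, min(start + chunk_size - 1, total_pages))
            for start in range(1, total_pages + 1, chunk_size)]


def audio_segments(duration_s: int, max_segment_s: int = 15 * 60) -> List[Tuple[int, int]]:
    """Half-open (start, end) offsets in seconds, each at most 15 minutes long."""
    return [(start, min(start + max_segment_s, duration_s))
            for start in range(0, duration_s, max_segment_s)]


def estimate_video_tokens(duration_s: float, low_res: bool = False) -> int:
    """Rough token estimate at ~263 tokens/sec (approximate rate from the gotcha above)."""
    tokens = 263 * duration_s
    if low_res:
        # Assumption: the 62% savings applies directly to the token count.
        tokens *= 1 - 0.62
    return round(tokens)
```

For example, a 45-page PDF splits into three ranges and a 60-second clip comes to roughly 15,780 tokens at default resolution.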
Failure Handling
- Missing API key → STOP + setup instructions
- Invalid key (401) → STOP + re-check key
- Rate limit (429) → key rotation (MEOWKIT_GEMINI_API_KEY_2/3/4), else wait 60s
- Billing required → provider router tries MiniMax/OpenRouter fallback
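The rate-limit path can be sketched as a rotate-then-wait loop. This is not the code in api_key_rotator.py: RateLimited is a hypothetical exception standing in for an HTTP 429, the helper names are illustrative, and reading MEOWKIT_GEMINI_API_KEY_2/3/4 as three variables with the suffixes _2, _3, and _4 is an assumption.

```python
import time
from typing import Callable, List, Mapping, Sequence


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 response."""


def gemini_keys(env: Mapping[str, str]) -> List[str]:
    """Collect the primary key plus any rotation keys that are set."""
    # Assumption: rotation keys use the suffixes _2, _3, _4.
    names = ["MEOWKIT_GEMINI_API_KEY"] + [f"MEOWKIT_GEMINI_API_KEY_{i}" for i in (2, 3, 4)]
    return [env[name] for name in names if env.get(name)]


def call_with_rotation(call: Callable[[str], str], keys: Sequence[str],
                       wait_s: int = 60,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Try each key in turn on 429; if all are limited, wait once and retry the first."""
    if not keys:
        raise RuntimeError("Missing API key: set MEOWKIT_GEMINI_API_KEY")
    for key in keys:
        try:
            return call(key)
        except RateLimited:
            continue  # rotate to the next key
    sleep(wait_s)
    return call(keys[0])
```

The sleep parameter is injectable so the wait can be skipped in tests.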
Setup
Set MEOWKIT_GEMINI_API_KEY in env or .env file. Legacy GEMINI_API_KEY also works.
Optional: MEOWKIT_MINIMAX_API_KEY for TTS/music and alternative generation.
Budget rule: ≤3000 tokens inline. If response exceeds budget, summarize key findings before returning.
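A rough way to apply the budget rule before returning output is sketched below. The helper names are hypothetical, and the four-characters-per-token ratio is only a common rule of thumb, not a real tokenizer count.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about 4 characters per token (assumed heuristic)."""
    return (len(text) + 3) // 4


def within_budget(text: str, budget: int = 3000) -> bool:
    """True when the estimated token count fits the inline budget."""
    return estimate_tokens(text) <= budget
```

When within_budget returns False, summarize the key findings instead of returning the full response.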