mk-multimodal - SKILL.md Agent Skill

name: mk:multimodal description: Process images, video, audio, PDFs with Gemini API. Generate images (Nano Banana 2), videos (Veo 3), speech (MiniMax TTS), music (MiniMax). Convert documents to Markdown. Multi-provider fallback (Gemini → MiniMax → OpenRouter). Activate when task references media files, asks to analyze/describe/transcribe/extract/generate/convert, or involves non-text binary files. version: 2.1.0 argument-hint: '[file-path] [task]' phase: on-demand source: local trust_level: kit-authored injection_risk: low requires_env:

name: MEOWKIT_GEMINI_API_KEY required: true description: Google Gemini API key (falls back to GEMINI_API_KEY) setup: Get from https://aistudio.google.com/apikey fallback: Analysis requires this key. Generation falls back to MiniMax/OpenRouter if available.
name: MEOWKIT_MINIMAX_API_KEY required: false description: MiniMax API key for image/video/TTS/music generation setup: Get from https://platform.minimax.io/ keywords:
multimodal
gemini
image-gen
video-gen
tts
transcription
document-conversion
vision
image
video
audio
minimax when_to_use: Use when processing images, video, audio, PDFs or generating media via Gemini/MiniMax/OpenRouter. Activate when task references media files. user-invocable: true allowed-tools:
Bash
Read
Write
Edit owner: utility criticality: medium status: active runtime: claude-code requires_external_service: ["gemini", "minimax"] default_enabled: false

Multimodal Analysis & Generation

Path convention: Commands below assume cwd is $CLAUDE_PROJECT_DIR (project root). Prefix paths with "$CLAUDE_PROJECT_DIR/" when invoking from subdirectories.

Overview

Analyze images, video, audio, and documents — or generate images, videos, speech, and music — using Gemini and MiniMax APIs. Multi-provider fallback: Gemini → MiniMax → OpenRouter. All routing and models configurable via .env.

Models & Limits

Model IDs (verified 2026-04-22; provider model roster changes — re-verify via provider dashboards before relying on specific preview IDs):

Analysis: gemini-2.5-flash (stable default)
Image generation: gemini-3.1-flash-image-preview (Nano Banana 2, preview-channel)
Video generation: veo-3.1-generate-preview (Gemini, preview-channel) or MiniMax-Hailuo-2.3
TTS: speech-2.8-hd
Music: music-2.6

Preview-channel IDs may rotate without notice. Full live table with pricing: references/models-and-pricing.md.

MiniMax models require MEOWKIT_MINIMAX_API_KEY: image-01 (images), MiniMax-Hailuo-2.3 (video), speech-2.8-hd (TTS), music-2.6 (music).

Scripts

scripts/gemini_analyze.py — Analyze media files

.claude/skills/.venv/bin/python3 .claude/skills/multimodal/scripts/gemini_analyze.py \
  --files <path> --task <analyze|transcribe|extract> [--resolution low-res] [--json] [--verbose]

scripts/gemini_generate.py — Generate images/videos (Gemini, with --provider for fallback)

.claude/skills/.venv/bin/python3 .claude/skills/multimodal/scripts/gemini_generate.py \
  --task <generate-image|generate-video> --prompt "description" [--provider minimax] [--json]

scripts/minimax_generate.py — MiniMax generation (image, video, TTS, music)

.claude/skills/.venv/bin/python3 .claude/skills/multimodal/scripts/minimax_generate.py \
  --task <generate-image|generate-video|generate-speech|generate-music> --prompt "..." [--json]

scripts/document_converter.py — Convert documents to Markdown

.claude/skills/.venv/bin/python3 .claude/skills/multimodal/scripts/document_converter.py \
  --files doc.pdf --output ./docs/ [--json] [--verbose]

scripts/check_setup.py — Verify API keys and dependencies

.claude/skills/.venv/bin/python3 .claude/skills/multimodal/scripts/check_setup.py

Prints PASS/FAIL status for each API key (MEOWKIT_GEMINI_API_KEY, MEOWKIT_MINIMAX_API_KEY, optional MEOWKIT_OPENROUTER_API_KEY) and core Python dependencies. Run before first use to surface missing env vars.

References

Load only when executing the corresponding step.

Reference	When to load	Content
vision-understanding.md	Image analysis	Detection, segmentation, OCR
audio-processing.md	Audio/video transcription	Formats, splitting, timestamps
models-and-pricing.md	Model selection, cost	Full model table, pricing
image-generation.md	Image generation	Nano Banana 2, OpenRouter fallback
video-generation.md	Video generation	Veo 3.1, async polling
video-analysis.md	Video analysis	Resolution modes, cost math
minimax-generation.md	MiniMax generation	Image, video, TTS, music
document-conversion.md	Document conversion	Formats, batch mode

When to Use

Auto-activate on these patterns:

Task references image (.png, .jpg, .webp), video (.mp4, .mov), audio (.mp3, .wav), or document (.pdf, .docx) files
Task asks to "analyze", "describe", "transcribe", "extract", "OCR"
Task asks to "generate image", "generate video", "create image"
Task asks to "generate speech", "text to speech", "TTS"
Task asks to "generate music", "create music"
Task asks to "convert document", "convert PDF to markdown"
User references a non-text binary file for processing

Do NOT invoke when: text-only files (use Read), Gemini API docs (use mk:docs-finder), already-described image in context.

API Key Check

Check: is MEOWKIT_GEMINI_API_KEY (or legacy GEMINI_API_KEY) set?

If yes → proceed with Gemini
If no → check MEOWKIT_MINIMAX_API_KEY for generation tasks
If neither → STOP with setup instructions

Analysis Process

Detect media format → select model (default: gemini-2.5-flash) → run script. For >20MB files, use File API. Use --resolution low-res for video (62% savings, video only). Output capped at ~3000 tokens; prefer structured JSON/Markdown.

Generation Process

Provider router auto-selects best available provider by checking API keys:

Image: Gemini → MiniMax → OpenRouter
Video: Gemini → MiniMax
TTS: MiniMax only
Music: MiniMax only

Force a provider: --provider gemini|minimax|openrouter. Override chain order via MEOWKIT_IMAGE_PROVIDER_CHAIN etc.

Files in This Skill

mk:multimodal/
├── SKILL.md
├── references/               — per-capability reference files (8 files)
├── workflows/                — step-by-step workflow guides
│   ├── audio-transcription.md
│   ├── document-extraction.md
│   └── image-analysis.md
└── scripts/                  — Python modules (run via .claude/skills/.venv/bin/python3)
    ├── gemini_analyze.py     — primary analysis driver (images, video, audio, PDFs)
    ├── gemini_generate.py    — Gemini image/video generation with provider fallback
    ├── minimax_generate.py   — MiniMax image/video/TTS/music generation
    ├── document_converter.py — PDF/docx → Markdown conversion
    ├── check_setup.py        — API key and dependency verification
    ├── provider_router.py    — multi-provider selection logic
    ├── api_key_rotator.py    — Gemini key rotation on rate limits
    ├── minimax_api_client.py — MiniMax API client wrapper
    ├── openrouter_fallback.py — OpenRouter fallback for image gen
    ├── media_optimizer.py    — pre-processing and chunking for large files
    ├── video_generator.py    — async video polling + download helper
    ├── analyze_constants.py  — shared model/endpoint constants
    ├── analyze_core.py       — shared analysis logic reused by analyze scripts
    └── env_utils.py          — env var loading and validation helpers

Gotchas

Python venv required: if you get python3: command not found or import errors, run .claude/scripts/bin/setup-workflow once from the project root.
Audio >15 min: Gemini truncates silently. Split first.
PDF >100 pages: Quality degrades. Process in 20-page chunks.
Video cost: ~263 tokens/sec at default resolution. Use --verbose for cost estimate.
Image gen requires billing: Free tier = no gen. Use MiniMax/OpenRouter as fallback.
MiniMax video timeout: Hailuo takes 60-180s. Max 600s.
TTS voices: 300+ system voices (see provider catalog for live list). Default: Wise_Woman. See minimax-generation.md.
Temperature: Keep Gemini at 1.0. Lowering causes degraded output.

Failure Handling

Missing API key → STOP + setup instructions
Invalid key (401) → STOP + re-check key
Rate limit (429) → key rotation (MEOWKIT_GEMINI_API_KEY_2/3/4), else wait 60s
Billing required → provider router tries MiniMax/OpenRouter fallback

Setup

Set MEOWKIT_GEMINI_API_KEY in env or .env file. Legacy GEMINI_API_KEY also works. Optional: MEOWKIT_MINIMAX_API_KEY for TTS/music and alternative generation.

Budget rule: ≤3000 tokens inline. If response exceeds budget, summarize key findings before returning.