mk-multimodal

star 15

Process images, video, audio, PDFs with Gemini API. Generate images (Nano Banana 2), videos (Veo 3), speech (MiniMax TTS), music (MiniMax). Convert documents to Markdown. Multi-provider fallback (Gemini → MiniMax → OpenRouter). Activate when task references media files, asks to analyze/describe/transcribe/extract/generate/convert, or involves non-text binary files.

ngocsangyem By ngocsangyem schedule Updated 6/11/2026

name: mk:multimodal description: Process images, video, audio, PDFs with Gemini API. Generate images (Nano Banana 2), videos (Veo 3), speech (MiniMax TTS), music (MiniMax). Convert documents to Markdown. Multi-provider fallback (Gemini → MiniMax → OpenRouter). Activate when task references media files, asks to analyze/describe/transcribe/extract/generate/convert, or involves non-text binary files. version: 2.1.0 argument-hint: '[file-path] [task]' phase: on-demand source: local trust_level: kit-authored injection_risk: low requires_env:

  • name: MEOWKIT_GEMINI_API_KEY required: true description: Google Gemini API key (falls back to GEMINI_API_KEY) setup: Get from https://aistudio.google.com/apikey fallback: Analysis requires this key. Generation falls back to MiniMax/OpenRouter if available.
  • name: MEOWKIT_MINIMAX_API_KEY required: false description: MiniMax API key for image/video/TTS/music generation setup: Get from https://platform.minimax.io/ keywords:
  • multimodal
  • gemini
  • image-gen
  • video-gen
  • tts
  • transcription
  • document-conversion
  • vision
  • image
  • video
  • audio
  • minimax when_to_use: Use when processing images, video, audio, PDFs or generating media via Gemini/MiniMax/OpenRouter. Activate when task references media files. user-invocable: true allowed-tools:
  • Bash
  • Read
  • Write
  • Edit owner: utility criticality: medium status: active runtime: claude-code requires_external_service: ["gemini", "minimax"] default_enabled: false

Multimodal Analysis & Generation

Path convention: Commands below assume cwd is $CLAUDE_PROJECT_DIR (project root). Prefix paths with "$CLAUDE_PROJECT_DIR/" when invoking from subdirectories.

Overview

Analyze images, video, audio, and documents — or generate images, videos, speech, and music — using Gemini and MiniMax APIs. Multi-provider fallback: Gemini → MiniMax → OpenRouter. All routing and models configurable via .env.

Models & Limits

Model IDs (verified 2026-04-22; provider model roster changes — re-verify via provider dashboards before relying on specific preview IDs):

  • Analysis: gemini-2.5-flash (stable default)
  • Image generation: gemini-3.1-flash-image-preview (Nano Banana 2, preview-channel)
  • Video generation: veo-3.1-generate-preview (Gemini, preview-channel) or MiniMax-Hailuo-2.3
  • TTS: speech-2.8-hd
  • Music: music-2.6

Preview-channel IDs may rotate without notice. Full live table with pricing: references/models-and-pricing.md.

MiniMax models require MEOWKIT_MINIMAX_API_KEY: image-01 (images), MiniMax-Hailuo-2.3 (video), speech-2.8-hd (TTS), music-2.6 (music).

Scripts

scripts/gemini_analyze.py — Analyze media files

.claude/skills/.venv/bin/python3 .claude/skills/multimodal/scripts/gemini_analyze.py \
  --files <path> --task <analyze|transcribe|extract> [--resolution low-res] [--json] [--verbose]

scripts/gemini_generate.py — Generate images/videos (Gemini, with --provider for fallback)

.claude/skills/.venv/bin/python3 .claude/skills/multimodal/scripts/gemini_generate.py \
  --task <generate-image|generate-video> --prompt "description" [--provider minimax] [--json]

scripts/minimax_generate.py — MiniMax generation (image, video, TTS, music)

.claude/skills/.venv/bin/python3 .claude/skills/multimodal/scripts/minimax_generate.py \
  --task <generate-image|generate-video|generate-speech|generate-music> --prompt "..." [--json]

scripts/document_converter.py — Convert documents to Markdown

.claude/skills/.venv/bin/python3 .claude/skills/multimodal/scripts/document_converter.py \
  --files doc.pdf --output ./docs/ [--json] [--verbose]

scripts/check_setup.py — Verify API keys and dependencies

.claude/skills/.venv/bin/python3 .claude/skills/multimodal/scripts/check_setup.py

Prints PASS/FAIL status for each API key (MEOWKIT_GEMINI_API_KEY, MEOWKIT_MINIMAX_API_KEY, optional MEOWKIT_OPENROUTER_API_KEY) and core Python dependencies. Run before first use to surface missing env vars.

References

Load only when executing the corresponding step.

Reference When to load Content
vision-understanding.md Image analysis Detection, segmentation, OCR
audio-processing.md Audio/video transcription Formats, splitting, timestamps
models-and-pricing.md Model selection, cost Full model table, pricing
image-generation.md Image generation Nano Banana 2, OpenRouter fallback
video-generation.md Video generation Veo 3.1, async polling
video-analysis.md Video analysis Resolution modes, cost math
minimax-generation.md MiniMax generation Image, video, TTS, music
document-conversion.md Document conversion Formats, batch mode

When to Use

Auto-activate on these patterns:

  • Task references image (.png, .jpg, .webp), video (.mp4, .mov), audio (.mp3, .wav), or document (.pdf, .docx) files
  • Task asks to "analyze", "describe", "transcribe", "extract", "OCR"
  • Task asks to "generate image", "generate video", "create image"
  • Task asks to "generate speech", "text to speech", "TTS"
  • Task asks to "generate music", "create music"
  • Task asks to "convert document", "convert PDF to markdown"
  • User references a non-text binary file for processing

Do NOT invoke when: text-only files (use Read), Gemini API docs (use mk:docs-finder), already-described image in context.

API Key Check

Check: is MEOWKIT_GEMINI_API_KEY (or legacy GEMINI_API_KEY) set?

  • If yes → proceed with Gemini
  • If no → check MEOWKIT_MINIMAX_API_KEY for generation tasks
  • If neither → STOP with setup instructions

Analysis Process

Detect media format → select model (default: gemini-2.5-flash) → run script. For >20MB files, use File API. Use --resolution low-res for video (62% savings, video only). Output capped at ~3000 tokens; prefer structured JSON/Markdown.

Generation Process

Provider router auto-selects best available provider by checking API keys:

  • Image: Gemini → MiniMax → OpenRouter
  • Video: Gemini → MiniMax
  • TTS: MiniMax only
  • Music: MiniMax only

Force a provider: --provider gemini|minimax|openrouter. Override chain order via MEOWKIT_IMAGE_PROVIDER_CHAIN etc.

Files in This Skill

mk:multimodal/
├── SKILL.md
├── references/               — per-capability reference files (8 files)
├── workflows/                — step-by-step workflow guides
│   ├── audio-transcription.md
│   ├── document-extraction.md
│   └── image-analysis.md
└── scripts/                  — Python modules (run via .claude/skills/.venv/bin/python3)
    ├── gemini_analyze.py     — primary analysis driver (images, video, audio, PDFs)
    ├── gemini_generate.py    — Gemini image/video generation with provider fallback
    ├── minimax_generate.py   — MiniMax image/video/TTS/music generation
    ├── document_converter.py — PDF/docx → Markdown conversion
    ├── check_setup.py        — API key and dependency verification
    ├── provider_router.py    — multi-provider selection logic
    ├── api_key_rotator.py    — Gemini key rotation on rate limits
    ├── minimax_api_client.py — MiniMax API client wrapper
    ├── openrouter_fallback.py — OpenRouter fallback for image gen
    ├── media_optimizer.py    — pre-processing and chunking for large files
    ├── video_generator.py    — async video polling + download helper
    ├── analyze_constants.py  — shared model/endpoint constants
    ├── analyze_core.py       — shared analysis logic reused by analyze scripts
    └── env_utils.py          — env var loading and validation helpers

Gotchas

  • Python venv required: if you get python3: command not found or import errors, run .claude/scripts/bin/setup-workflow once from the project root.
  • Audio >15 min: Gemini truncates silently. Split first.
  • PDF >100 pages: Quality degrades. Process in 20-page chunks.
  • Video cost: ~263 tokens/sec at default resolution. Use --verbose for cost estimate.
  • Image gen requires billing: Free tier = no gen. Use MiniMax/OpenRouter as fallback.
  • MiniMax video timeout: Hailuo takes 60-180s. Max 600s.
  • TTS voices: 300+ system voices (see provider catalog for live list). Default: Wise_Woman. See minimax-generation.md.
  • Temperature: Keep Gemini at 1.0. Lowering causes degraded output.

Failure Handling

  • Missing API key → STOP + setup instructions
  • Invalid key (401) → STOP + re-check key
  • Rate limit (429) → key rotation (MEOWKIT_GEMINI_API_KEY_2/3/4), else wait 60s
  • Billing required → provider router tries MiniMax/OpenRouter fallback

Setup

Set MEOWKIT_GEMINI_API_KEY in env or .env file. Legacy GEMINI_API_KEY also works. Optional: MEOWKIT_MINIMAX_API_KEY for TTS/music and alternative generation.

Budget rule: ≤3000 tokens inline. If response exceeds budget, summarize key findings before returning.

Install via CLI
npx skills add https://github.com/ngocsangyem/MeowKit --skill mk-multimodal
Repository Details
star Stars 15
call_split Forks 2
navigation Branch main
article Path SKILL.md
More from Creator