name: vision
description: >
Nested swiss-knife reference for image understanding — a decision tree that
routes between three paths depending on what the agent has access to: (1) the
built-in vision tool if the LLM provider supports image input, (2) the
sibling minimax-cli reference if a usable MiniMax preset/key slot is
available, or (3) a local Hugging Face VLM via the bundled
scripts/describe.py if neither — offline, free, slow. Read this when you
need to describe, OCR, or critique an image and you're not sure which path
applies.
version: 1.0.0
tags: [vision, image, ocr, vlm, decision-tree, nested-skill]
vision (router)
Nested swiss-knife reference for image understanding. Three paths — pick the cheapest available.
Decision Tree
Is `vision` tool in your tool list?
├── YES → use it directly, done.
└── NO → does the user have a MiniMax preset/key slot?
(read the sibling `../minimax-cli/SKILL.md`; scan ~/.lingtai-tui/presets/** without printing keys)
├── YES → use the `minimax-cli` reference for `mmx vision …` from the shell, or an already-registered MCP tool
└── NO → fall back to local VLM:
python3 <skill-path>/scripts/describe.py <image>
See reference/local-models.md.
Path 1 — Built-in vision Tool
If your LLM provider supports image input (MiniMax, Gemini, Anthropic, OpenAI, Zhipu), the kernel exposes a vision tool directly. Cheapest, lowest latency, no extra setup.
vision not in your tool list? Your LLM provider doesn't support image input — fall through to Path 2 or 3.
Path 2 — MiniMax via the sibling minimax-cli reference
For text-only LLMs (DeepSeek, OpenRouter text-only, Codex) with a usable MiniMax preset/key slot. Two routes can exist:
- Shell —
mmx vision …via the official CLI. No MCP registration needed; just install + key. Best for ad-hoc one-shots in bash. - In-tool — an already-registered MiniMax MCP vision tool. Best when the agent needs vision as a tool call inside a longer reasoning loop. MCP server registration is owned by
mcp-manual(kernelmcpcapability).
Read the sibling minimax-cli reference (../minimax-cli/SKILL.md) before using the shell route. It owns the canonical MiniMax credential discovery flow: scan TUI presets recursively, pick the declared slot, export it without printing the key, and match the preset region/base URL. Do not assume a bare MINIMAX_API_KEY is the only valid slot.
Path 3 — Local VLM (offline, unlimited)
For agents that need image analysis without a vision-capable LLM and without a MiniMax key. Or for batch jobs / privacy-sensitive content / unlimited quota.
Run a Hugging Face vision-language model locally:
python3 <skill-path>/scripts/describe.py <image-path> [--prompt PROMPT] [--model qwen2-vl-2b|moondream2|qwen2-vl-7b]
First call downloads weights (2–15 GB depending on model), subsequent calls reuse the cache. The script auto-installs transformers + torch + model-specific deps via lingtai.venv_resolve.ensure_package.
Output is JSON on stdout: {image, backend, prompt, response, elapsed_seconds}. Errors → stderr, non-zero exit.
See reference/local-models.md for model selection, hardware tradeoffs, prompt templates per model, batch patterns (load model once, loop many images), and failure modes.
When NOT to use this skill
- Your LLM already has
visionin its tool list — Path 1, no decision needed. - You need to generate an image — use the sibling
minimax-clireference (../minimax-cli/SKILL.md,mmx image …). - You need to describe video frames — extract frames with
ffmpegfirst, then loop this skill over them.
Found a bug or issue? If you encounter any problems with this skill, load the
lingtai-issue-reportskill and follow its instructions to report it.