name: vhscli
description: Use the vhscli CLI to analyze images/video/pdfs with a prompt, or generate images/videos. Use when the user asks about local media, wants AI images/videos, or mentions vhscli, vhs, seedream, seedance, nano-banana, or gpt-image.
vhscli
vhscli is a command-line tool for multimodal AI: chat about
text/images/video/pdfs, or generate images and videos from prompts. It's a thin
client — auth, uploads, and model execution all happen server-side, so users
don't store any provider API keys locally.
Run vhscli --help or vhscli <command> --help to see current help — the CLI
is the source of truth.
Invocation
Always run via npx @getvhs/vhscli@latest so you pick up the newest models,
flags, and fixes. Don't pin a version, and don't call a bare vhscli binary
even if one is on PATH — it may be stale.
npx @getvhs/vhscli@latest <command> ...
Throughout this doc, commands are written as vhscli ... for readability —
substitute npx @getvhs/vhscli@latest ... when running.
Requires Node.js ≥ 22.
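One way to follow the invocation rule without retyping the `npx` prefix is a small shell function. This is a sketch, not part of the CLI: the `vhscli` function and the `node_version_ok` helper are hypothetical names, and the version check only assumes that `node --version` prints a string like `v22.3.0`.

```shell
# Hypothetical wrapper: a shell function shadows any vhscli binary on PATH,
# so every call goes through the latest published package.
vhscli() {
  npx @getvhs/vhscli@latest "$@"
}

# Hypothetical helper: succeeds if a version string such as "v22.3.0"
# has a major version of 22 or higher.
node_version_ok() {
  v=${1#v}          # drop the leading "v"
  major=${v%%.*}    # keep everything before the first dot
  case "$major" in ''|*[!0-9]*) return 1 ;; esac
  [ "$major" -ge 22 ]
}
```

With the function defined in your shell profile, `vhscli models` expands to `npx @getvhs/vhscli@latest models`, and `node_version_ok "$(node --version)"` can guard a script before it calls the CLI.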
Top-level
vhscli [-v|--version] [-h|--help]
vhscli <command> [options] ...
- `-v`, `--version`: print version (only when no command is given)
- `-h`, `--help`: show help (works on root and every subcommand)
Commands:
- `login`: log in with Google (opens browser; saves session to `~/.vhs/session.json`)
- `logout`: log out and delete local access tokens
- `whoami`: print the logged-in user's email
- `models`: list available models
- `generate <model> <prompt> [-o <path>]`: generate an image or video, wait, and save it
- `submit <model> <prompt> [-o <path>]`: submit the same task as `generate` but exit immediately (writes a `<output>.vhs_task` sidecar to resume later)
- `chat <prompt>`: chat with seed-2.0 (text, image, video, or pdf input)
- `resume <files...>`: finish one or more aborted generations from their `.vhs_task` sidecar files
Auth
Assume auth is already configured. If a command fails with an auth error, run
vhscli login to open a browser for Google OAuth. Do NOT run vhscli login
preemptively — it requires interactive browser login.
Models
- Chat / understand (text / image / video / pdf): `seed-2.0`, under `vhscli chat`
- Generate images: `seedream-5` (default), `seedream-4-5`, `nano-banana-2`, `nano-banana-pro`, `gpt-image-2`, under `vhscli generate`
- Generate video: `seedance-2`, under `vhscli generate`
Prompt guides
Before you invoke vhscli generate (or do non-trivial understanding with
vhscli chat), Read the matching prompt guide first and shape the prompt
around it. The guides are concise, model-specific references distilled from
each provider's docs — formulas, what to lead with, what works, what fails.
Wording that's great for one model often underperforms on another, so don't
skip this.
| Model(s) | Guide file (Read before prompting) |
|---|---|
| seed-2.0 (used by vhscli chat) | prompt_guide/seed-2.txt |
| seedream-5, seedream-4-5 | prompt_guide/seedream.txt |
| nano-banana-2, nano-banana-pro | prompt_guide/nano-banana.txt |
| seedance-2 | prompt_guide/seedance-2.txt |
| gpt-image-2 | prompt_guide/gpt-image-2.txt |
Trigger: any time the user asks for output from one of these models, Read its
guide before building the prompt. For trivial chat (plain text Q&A with no
media) you can skip seed-2.txt.
Stdin prompts
Every command that takes a prompt also accepts - as the prompt, meaning
"read from stdin":
cat my_prompt.txt | vhscli generate nano-banana-pro -
echo "what is this?" | vhscli chat - -i photo.jpg
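For longer prompts, a quoted here-document avoids shell-quoting problems with `$`, backticks, and apostrophes. The sketch below is illustrative only: `receipt_prompt` is a hypothetical helper name and the prompt text is an example, not a recommended template.

```shell
# Hypothetical helper: prints a multi-line prompt on stdout.
# Quoting the delimiter ('EOF') stops the shell from expanding $ or backticks.
receipt_prompt() {
  cat <<'EOF'
extract the receipt as json {merchant, date_iso, total, currency}.
don't guess: use null for any field that isn't visible.
EOF
}

# Usage (needs vhscli and a logged-in session):
#   receipt_prompt | vhscli chat - -i receipt.jpg
```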
vhscli chat — chat about text, images, video, or pdfs
vhscli chat <prompt> [-i <image>...] [-f <pdf>...] [-v <video>] [--fps <n>]
Mode is picked from your flags:
- prompt only → text chat
- `-i` → ask about images (repeatable)
- `-f` → ask about pdf documents (repeatable)
- `-v` → ask about a single video
Options:
- `-i <path>`: image to ask about (repeat `-i` for more)
- `-f <path>`: pdf document to ask about (repeat `-f` for more)
- `-v <path>`: single video to ask about
- `--fps <n>`: frames/sec sampled from the video, 0.2–5 (default: 1)
One-shot — each call is independent, no memory of previous calls. Output goes to stdout, nothing is saved to disk. Audio inside a video is not understood.
Examples:
vhscli chat "explain how to make sourdough in 5 steps"
vhscli chat "describe the scene. return json with objects, setting, mood." -i photo.jpg
vhscli chat "transcribe all visible text verbatim, preserving line breaks." -i receipt.jpg
vhscli chat "compare image 1 and image 2 in 3 bullets." -i a.jpg -i b.jpg
vhscli chat "summarize this paper in 5 bullets; include a page number per bullet." -f paper.pdf
vhscli chat "list key events with start_time and end_time in HH:mm:ss as json." -v clip.mp4 --fps 2
vhscli generate seedream-5 — generate an image (default choice)
vhscli generate seedream-5 <prompt> [-o <path>] [-i <image>...] [--size <size>]
Options:
- `-o`, `--output <path>`: output file path (default: `vhscli-seedream-5-<timestamp>.jpg`)
- `-i <path>`: reference image, max 14 (repeat `-i` for more)
- `--size <size>`: `2K`, `3K`, or `WxH` like `2048x3072` (default: 2K)
  - WxH pixel count must be in [3,686,400, 10,404,496]
  - WxH aspect ratio must be in [1:16, 16:1]
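The WxH limits above are simple integer arithmetic, so a custom size can be checked locally before submitting. A minimal sketch, assuming the limits are exactly as listed; `seedream5_size_ok` is a hypothetical helper and not part of the CLI, which may validate differently.

```shell
# Hypothetical helper: seedream5_size_ok WxH
# Succeeds if the size satisfies the documented seedream-5 limits.
seedream5_size_ok() {
  case "$1" in *x*) ;; *) return 1 ;; esac     # must look like WxH
  w=${1%x*}; h=${1#*x}
  case "$w" in ''|*[!0-9]*) return 1 ;; esac   # width: digits only
  case "$h" in ''|*[!0-9]*) return 1 ;; esac   # height: digits only
  px=$((w * h))
  # pixel count in [3,686,400, 10,404,496]
  [ "$px" -ge 3686400 ] && [ "$px" -le 10404496 ] || return 1
  # aspect ratio in [1:16, 16:1]
  [ "$w" -le $((16 * h)) ] && [ "$h" -le $((16 * w)) ]
}
```

For example, `seedream5_size_ok 2048x3072 && echo ok` prints `ok`, while `4096x4096` is rejected because it exceeds the pixel ceiling for this model.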
Output format follows the -o extension (.png, .jpg/.jpeg, .webp); the
CLI converts if needed.
Examples:
vhscli generate seedream-5 "a red fox in a snowy forest" -o fox.jpg
vhscli generate seedream-5 "swap the outfit" -o out.png -i person.jpg -i outfit.jpg --size 3K
vhscli generate seedream-4-5 — generate an image (larger size range)
vhscli generate seedream-4-5 <prompt> [-o <path>] [-i <image>...] [--size <size>]
Options:
- `-o`, `--output <path>`: output file path (default: `vhscli-seedream-4-5-<timestamp>.jpg`)
- `-i <path>`: reference image, max 14 (repeat `-i` for more)
- `--size <size>`: `2K`, `4K`, or `WxH` (default: 2K)
  - WxH pixel count must be in [3,686,400, 16,777,216]
  - WxH aspect ratio must be in [1:16, 16:1]
Example:
vhscli generate seedream-4-5 "a mountain at sunrise" -o mountain.jpg --size 4K
vhscli generate nano-banana-2 — generate an image (Google)
vhscli generate nano-banana-2 <prompt> [-o <path>] [-i <image>...] [--size <size>]
Options:
- `-o`, `--output <path>`: output file path (default: `vhscli-nano-banana-2-<timestamp>.png`)
- `-i <path>`: reference image, max 14 (repeat `-i` for more)
- `--size <size>`: `512`, `1K`, `2K`, or `4K` (default: 1K)
Output is always square (1:1). Describe the framing you want in the prompt if you need a tall or wide composition.
Examples:
vhscli generate nano-banana-2 "remove the man from the photo, keep everything else" -i photo.jpg
vhscli generate nano-banana-2 "90s skateboarder poster, vertical composition" -o poster.png --size 2K
vhscli generate nano-banana-2 "a glossy candle in a bell jar on a marble counter, soft light"
vhscli generate nano-banana-pro — generate an image (Google, premium)
vhscli generate nano-banana-pro <prompt> [-o <path>] [-i <image>...] [--size <size>]
Options:
- `-o`, `--output <path>`: output file path (default: `vhscli-nano-banana-pro-<timestamp>.png`)
- `-i <path>`: reference image, max 14 (repeat `-i` for more)
- `--size <size>`: `1K`, `2K`, or `4K` (default: 1K)
Output is always square (1:1). Higher-quality sibling of nano-banana-2 — better text rendering and richer textures.
Examples:
vhscli generate nano-banana-pro "studio portrait, cinematic lighting, three-quarter framing" -o portrait.jpg --size 2K
vhscli generate nano-banana-pro "a sun-drenched minimalist living room with a 3d armchair from this sketch" -i sketch.jpg
vhscli generate gpt-image-2 — generate or edit an image (OpenAI)
vhscli generate gpt-image-2 <prompt> [-o <path>] [-i <image>...] [--size <size>]
Options:
- `-o`, `--output <path>`: output file path (default: `vhscli-gpt-image-2-<timestamp>.png`)
- `-i <path>`: reference image for edits (repeat `-i` for more)
- `--size <size>`: preset (`1024x1024`, `1536x1024`, `1024x1536`, `2048x2048`, `2048x1152`, `3840x2160`) or `WxH` (default: 1024x1024)
  - both sides must be multiples of 16, max edge 3840
  - total pixels in [655,360, 8,294,400]
  - aspect ratio in [1:3, 3:1]
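The custom-size rules for gpt-image-2 combine four checks (multiple of 16, max edge, pixel count, aspect ratio). A minimal sketch of checking them locally, assuming the limits are exactly as listed above; `gpt_image2_size_ok` is a hypothetical helper, not something the CLI provides.

```shell
# Hypothetical helper: gpt_image2_size_ok WxH
# Succeeds if the size satisfies the documented gpt-image-2 limits.
gpt_image2_size_ok() {
  case "$1" in *x*) ;; *) return 1 ;; esac     # must look like WxH
  w=${1%x*}; h=${1#*x}
  case "$w" in ''|*[!0-9]*) return 1 ;; esac
  case "$h" in ''|*[!0-9]*) return 1 ;; esac
  # both sides multiples of 16
  [ $((w % 16)) -eq 0 ] && [ $((h % 16)) -eq 0 ] || return 1
  # max edge 3840
  [ "$w" -le 3840 ] && [ "$h" -le 3840 ] || return 1
  # total pixels in [655,360, 8,294,400]
  px=$((w * h))
  [ "$px" -ge 655360 ] && [ "$px" -le 8294400 ] || return 1
  # aspect ratio in [1:3, 3:1]
  [ "$w" -le $((3 * h)) ] && [ "$h" -le $((3 * w)) ]
}
```

All six presets pass this check; a size such as `1000x1000` fails on the multiple-of-16 rule and `3840x1024` fails on the aspect ratio.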
Output format follows the -o extension (.png, .jpg/.jpeg, .webp); the
CLI converts if needed. Use png or webp when you need transparency.
Examples:
vhscli generate gpt-image-2 "a children's book drawing of a veterinarian examining a cat"
vhscli generate gpt-image-2 "replace the background with a starry night, keep the subject unchanged" -i photo.jpg
vhscli generate gpt-image-2 "ultra-wide landscape of the swiss alps at golden hour" --size 3840x2160 -o alps.jpg
vhscli generate seedance-2 — generate a video
vhscli generate seedance-2 <prompt> [-o <path>]
[--first-frame <image>] [--last-frame <image>]
[-i <image>...] [-v <video>...] [-a <audio>...]
[--ratio <r>] [--resolution <res>] [--duration <n>]
[--no-audio]
Mode is picked from your flags:
- prompt only → text-to-video
- `--first-frame` → animate from that frame (optionally `--last-frame` too)
- `-i` / `-v` / `-a` → use as references
Options:
- `-o`, `--output <path>`: output file path (default: `vhscli-seedance-2-<timestamp>.mp4`)
- `--first-frame <image>`: use as the first frame
- `--last-frame <image>`: use as the last frame (requires `--first-frame`)
- `-i <path>`: reference image, max 9 (repeat `-i`). Conflicts with `--first-frame`
- `-v <path>`: reference video, max 3 (repeat `-v`)
- `-a <path>`: reference audio, max 3 (repeat `-a`). Requires `-i` or `-v`
- `--ratio <r>`: aspect ratio (default: 16:9). One of: `16:9`, `4:3`, `1:1`, `3:4`, `9:16`, `21:9`
- `--resolution <res>`: `480p`, `720p`, or `1080p` (default: 720p)
- `--duration <n>`: length in seconds, 4–15 (default: 5)
- `--audio` / `--no-audio`: toggle the audio track (default: `--audio`). Pass `--no-audio` for a silent video
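The flag rules above (what requires what, what conflicts, and the count and duration limits) can be pre-checked before a long job is sent. The sketch below is an assumption-heavy illustration: `check_seedance_flags` is a hypothetical helper, it only understands the space-separated form (`--duration 6`, not `--duration=6`), it assumes whole-second durations, and the CLI's own validation remains authoritative.

```shell
# Hypothetical helper: pass it the same arguments you would give to
# `vhscli generate seedance-2`; it reports rule violations on stderr.
check_seedance_flags() {
  first=0; last=0; imgs=0; vids=0; auds=0; dur=5; err=0
  while [ $# -gt 0 ]; do
    flag=$1; shift
    case "$flag" in
      --first-frame) first=1 ;;
      --last-frame)  last=1 ;;
      -i) imgs=$((imgs + 1)) ;;
      -v) vids=$((vids + 1)) ;;
      -a) auds=$((auds + 1)) ;;
      --duration) dur=${1:-} ;;
      *) continue ;;          # prompt, -o, --ratio, ... are not checked here
    esac
    [ $# -gt 0 ] && shift     # skip the value that belongs to the flag
  done
  if [ "$last" -eq 1 ] && [ "$first" -eq 0 ]; then
    echo "--last-frame requires --first-frame" >&2; err=1
  fi
  if [ "$imgs" -gt 0 ] && [ "$first" -eq 1 ]; then
    echo "-i conflicts with --first-frame" >&2; err=1
  fi
  if [ "$auds" -gt 0 ] && [ "$imgs" -eq 0 ] && [ "$vids" -eq 0 ]; then
    echo "-a requires -i or -v" >&2; err=1
  fi
  if [ "$imgs" -gt 9 ] || [ "$vids" -gt 3 ] || [ "$auds" -gt 3 ]; then
    echo "too many references (max 9 images, 3 videos, 3 audio)" >&2; err=1
  fi
  case "$dur" in
    ''|*[!0-9]*) echo "--duration must be a whole number (assumed)" >&2; err=1 ;;
    *) if [ "$dur" -lt 4 ] || [ "$dur" -gt 15 ]; then
         echo "--duration must be 4-15 seconds" >&2; err=1
       fi ;;
  esac
  [ "$err" -eq 0 ]
}
```

A typical use is `check_seedance_flags "$@" && vhscli generate seedance-2 "$@"` inside a wrapper script.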
Defaults to 5s @ 720p, 16:9, with audio. Jobs run in the cloud and can take
minutes — the CLI polls automatically. If you don't want to block, use
vhscli submit seedance-2 ... (same flags) to detach immediately, then
vhscli resume <output>.vhs_task later. If a vhscli generate is interrupted
mid-poll, the sidecar it wrote at start (<output>.vhs_task) is what you pass
to resume.
Examples:
# text-to-video
vhscli generate seedance-2 "a cat jumping off a couch" -o cat.mp4 --duration 6 --ratio 16:9
# animate a still image
vhscli generate seedance-2 "camera pans right" -o pan.mp4 --first-frame start.jpg
# with a first and last frame
vhscli generate seedance-2 "morph between these" -o morph.mp4 --first-frame a.jpg --last-frame b.jpg
# reference-based with audio
vhscli generate seedance-2 "lip sync the words" -o out.mp4 -i face.jpg -a voice.mp3
The .vhs_task sidecar — what generate, submit, and resume share
As soon as vhscli generate or vhscli submit has a task id from the
backend, it writes a tiny sidecar next to the intended output:
<output>.vhs_task # JSON: {"id": "<uuid>"}
# e.g. clip.mp4.vhs_task, fox.jpg.vhs_task
- `generate` keeps polling and, on success, saves the media to `<output>` and removes the sidecar. On a task error it also removes the sidecar and exits non-zero.
- `submit` writes the sidecar and exits immediately, leaving the backend task running.
- `resume <files...>` re-attaches to one or more sidecars: waits if the task is still running, saves the media to the path implied by the sidecar filename (`clip.mp4.vhs_task` → `clip.mp4`), and removes the sidecar.
vhscli chat does not use this sidecar — chat is fast and prints to stdout.
If -o was not passed, the sidecar is named after the auto-generated default
output (vhscli-<model>-<timestamp>.<ext>.vhs_task in the current folder).
For long jobs (seedance-2 especially), pass -o so the sidecar has a
predictable name you can resume.
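Because the sidecar is just a filename convention plus a one-field JSON file, scripts can inspect it without calling the CLI. A minimal sketch, assuming the file has the `{"id": "<uuid>"}` shape shown above; `sidecar_output` and `sidecar_id` are hypothetical helper names.

```shell
# Hypothetical helper: map a sidecar path to the media path it implies
# by stripping the trailing .vhs_task suffix.
sidecar_output() {
  printf '%s\n' "${1%.vhs_task}"
}

# Hypothetical helper: print the task id stored in a sidecar file.
# Assumes the single-object {"id": "<uuid>"} layout; no JSON parser needed.
sidecar_id() {
  sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}
```

For instance, `sidecar_output clip.mp4.vhs_task` prints `clip.mp4`, which is where `resume` will save the result.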
vhscli submit — submit a task and exit (don't wait)
vhscli submit <model> <prompt> [-o <path>] [...same flags as `vhscli generate <model>`]
submit takes the same models and the same options as generate
(seedance-2, seedream-5, seedream-4-5, nano-banana-2,
nano-banana-pro, gpt-image-2). The only difference is that after creating
the task and writing <output>.vhs_task, it exits without polling.
Use it when:
- The job is long (e.g. seedance video) and you don't want to keep the terminal blocked.
- You want to fan out several tasks in parallel and pull results later.
Pair it with vhscli resume <output>.vhs_task to fetch the result.
Examples:
# kick off a video, get the terminal back, finish later
vhscli submit seedance-2 "a robot dancing in tokyo at night" -o robot.mp4
# ... do other work ...
vhscli resume robot.mp4.vhs_task
# fan out several image jobs, then collect them all
vhscli submit seedream-5 "a red fox in a snowy forest" -o fox.jpg
vhscli submit seedream-5 "a blue jay on a branch" -o jay.jpg
vhscli submit seedream-5 "an orca breaching" -o orca.jpg
vhscli resume fox.jpg.vhs_task jay.jpg.vhs_task orca.jpg.vhs_task
vhscli resume — finish aborted generations from sidecar files
vhscli resume <files...>
Takes one or more .vhs_task sidecar files (any mix of models). For each
sidecar, resume:
- Reads the task id from the sidecar.
- Derives the output path by stripping the trailing `.vhs_task` (`clip.mp4.vhs_task` → `clip.mp4`). The extension on that path sets the saved format; the CLI converts if needed.
- Waits for the task to finish, saves the media, and removes the sidecar on success (or on a non-recoverable task error).
- Processes files sequentially; exits non-zero on the first failure (later sidecars stay on disk and can be resumed again).
When to use resume:
- You ran `vhscli submit ...` and now want the result.
- Your `vhscli generate ...` was interrupted (ctrl-c, crash, closed terminal, lost network); the sidecar it wrote at the start is still on disk.
You cannot resume by raw task id any more; if you only have an id, recreate
the sidecar manually: echo '{"id":"<uuid>"}' > out.mp4.vhs_task.
Examples:
vhscli resume clip.mp4.vhs_task
vhscli resume a.jpg.vhs_task b.jpg.vhs_task c.jpg.vhs_task
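Since a multi-file `resume` stops at the first failure, one option is to resume sidecars one per call so a single failed task does not hold back the rest. This is a sketch under assumptions: `resume_each` is a hypothetical helper, and it expects a `vhscli` command or shell function to be available in the calling shell.

```shell
# Hypothetical helper: resume each sidecar in its own call.
# Assumes `vhscli` resolves to the CLI (binary or wrapper function).
resume_each() {
  failed=0
  for sidecar in "$@"; do
    if vhscli resume "$sidecar"; then
      echo "saved ${sidecar%.vhs_task}"
    else
      echo "not finished: $sidecar" >&2
      failed=$((failed + 1))
    fi
  done
  # succeed only if every sidecar was resumed
  [ "$failed" -eq 0 ]
}
```

A call such as `resume_each *.vhs_task` then sweeps up every pending sidecar in the current folder and still exits non-zero if any of them failed.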
Understanding local images, video, and pdfs
Do NOT use the Read tool, or any built-in file-reading capability, to "look
at" images, video, or pdfs. That path either fails or gives you a garbled
snippet. The only correct way to understand local visual or document content
is vhscli chat with -i / -v / -f.
vhscli chat "what's happening?" -i photo.jpg
vhscli chat "describe what happens, shot by shot" -v clip.mp4 --fps 2
vhscli chat "summarize this paper" -f paper.pdf
Prompt patterns for visual / document understanding
vhscli chat understands images, pdfs, and video frames, but not audio
inside videos. Ask for structured JSON output when you'll parse the
answer, and name every field you want. Be explicit about formats
(timestamp style, units, language).
Image — describe / classify:
vhscli chat "describe the scene. return json {objects:[{label,bbox?}], setting, mood, dominant_colors:[]}." -i photo.jpg
vhscli chat "classify the image into one of: cat, dog, bird, other. return json {label, confidence_0_1, reasoning}." -i pic.jpg
Image — OCR / text extraction:
vhscli chat "transcribe all visible text verbatim, preserving line breaks and reading order. do not paraphrase." -i receipt.jpg
vhscli chat "extract the receipt as json {merchant, date_iso, items:[{name, qty, unit_price, line_total}], subtotal, tax, total, currency}." -i receipt.jpg
Image — comparison (number them in the prompt):
vhscli chat "compare image 1 and image 2. return json {same_subject:bool, differences:[], which_is_better, why}." -i a.jpg -i b.jpg
vhscli chat "image 1 is the original, image 2 is an edit. list every visible change as json {changes:[{region, before, after}]}." -i orig.png -i edit.png
PDF — summarize / outline (always ask for page anchors):
vhscli chat "summarize this paper in 5 bullets. each bullet must include the source page as {page:int, point:string}. return json {bullets:[...]}." -f paper.pdf
vhscli chat "extract the outline as json [{page, heading_level, heading, bullets:[]}]." -f doc.pdf
PDF — QA / extraction:
vhscli chat "answer using only this document. question: what is the experimental setup? return json {answer, citations:[{page, quote}]}." -f paper.pdf
vhscli chat "extract every table as json [{page, title?, headers:[], rows:[[...]]}]." -f report.pdf
Video — events / timeline (state the timestamp format):
vhscli chat "list key events. return json [{start_time, end_time, event}]. use HH:mm:ss." -v clip.mp4 --fps 2
vhscli chat "describe the movement sequence and any safety risks. return json [{start_time, end_time, event, danger:'none'|'low'|'med'|'high'}]. HH:mm:ss." -v clip.mp4 --fps 3
Video — temporal QA / counting:
vhscli chat "at what timestamp does the referee first appear? return json {timestamp_hms, evidence}." -v match.mp4 --fps 2
vhscli chat "count how many distinct people appear. return json {count, per_person:[{first_seen_hms, description}]}." -v scene.mp4 --fps 3
Choosing --fps for video (default 1, range 0.2–5):
- 3–5 — counting actions, sports, fast cuts, dense motion.
- 1 — general description, dialogue scenes.
- 0.2–0.5 — long static footage, headcount, slow surveillance.
Higher fps = more detail but more tokens and slower. Lower fps = cheaper but may miss brief events.
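As a rough guide to that trade-off, the number of sampled frames is about the clip length in seconds times `--fps`. The sketch below only does that arithmetic; `sampled_frames` is a hypothetical helper, and how many tokens each frame costs is not documented here, so treat the result as a relative measure.

```shell
# Hypothetical helper: sampled_frames DURATION_SECONDS FPS
# Prints the approximate number of frames sampled; rejects fps outside 0.2-5.
sampled_frames() {
  awk -v d="$1" -v f="$2" 'BEGIN {
    if (f < 0.2 || f > 5) {
      print "fps must be between 0.2 and 5" > "/dev/stderr"
      exit 1
    }
    printf "%.0f\n", d * f
  }'
}
```

For a 10-minute clip, `sampled_frames 600 0.2` and `sampled_frames 600 3` show how far apart the low and high ends of the range are.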
Tips
- Always quote prompts.
- `-o` is optional for `vhscli generate` / `vhscli submit`; it defaults to `vhscli-<model>-<timestamp>.<ext>` in the current folder. Output format follows the `-o` extension; the CLI converts if needed. For `submit`, pass `-o` so the resulting `<output>.vhs_task` sidecar has a name you can find later.
- Short options accept no-space form: `-ofoo.jpg`. Long options accept `=`: `--size=2K`.
- Use `--` to pass a prompt starting with a dash: `vhscli generate seedream-5 -o x.jpg -- "-weird prompt"`.
- Reference images (`-i`, `--first-frame`, `--last-frame`) can be any common format; non-JPEG/PNG inputs (e.g. HEIC, WebP, TIFF, BMP) are converted to JPEG before upload.
- Uploads are deduplicated by content hash, so passing the same reference repeatedly is cheap.
- Unknown command? `vhscli` will suggest the closest match.