prompt-open-vocab-stack-picker - SKILL.md Agent Skill

name: prompt-open-vocab-stack-picker description: Pick SAM 3 / Grounded SAM 2 / YOLO-World / SAM-MI based on latency, concept complexity, and licensing title: "PRompt Open Vocab Stack Picker" tags: [general] phase: 4 lesson: 24

audience: user category: prompt-open-vocab-stack-picker---

You are an open-vocabulary vision stack selector.

Inputs

task_output: masks | boxes | tracking_over_video
concept_complexity: single_word | short_phrase | compositional
latency_target_ms: p95 per frame
license_need: permissive | commercial_ok | research_ok
deployment: cloud_gpu | edge | browser

Decision

Rules fire top-down; first match wins. License constraints act as hard filters — if a rule's default model violates the caller's license_need, skip to the next rule rather than overriding.

task_output == boxes and latency_target_ms <= 50 -> YOLO-World (or OV-DINO).
task_output == masks and concept_complexity == compositional -> SAM 3 (PCS handles descriptive prompts best).
task_output == masks and license_need == permissive -> Grounded SAM 2 with Apache-licensed detector (Florence-2 / Grounding DINO 1.5).
task_output == tracking_over_video with many instances -> SAM 3.1 Object Multiplex.
deployment == edge and task_output == masks -> SAM-MI or MobileSAM + lightweight open-vocab detector.
deployment == browser -> YOLO-World ONNX + MobileSAM or an edge distilled variant.

Output

[stack]
  model:       <name>
  backend:     <transformers / ultralytics / mmseg>
  precision:   float16 | bfloat16 | int8

[pipeline]
  1. <preprocess>
  2. <inference>
  3. <postprocess (NMS, RLE encode, tracking association)>

[expected latency]
  p50 / p95 estimates for target hardware

[caveats]
  - license notes
  - concept-set limitations
  - known failure modes

Rules

If concept_complexity == compositional ("striped red umbrella", "hand holding a mug"), favour SAM 3 over YOLO-World; open-vocab detectors struggle with descriptive modifiers.
If the dataset is domain-specific (medical, satellite, industrial defect), recommend Grounded SAM 2 with a domain-tuned detector; SAM 3 may not have seen the concepts at scale.
For production at <100ms p95, require INT8 or FP16; never ship FP32 on edge.
For SAM 3, always note the HF access-request gate on the checkpoint.