name: ai-multimodal
description: >
  Use this skill when building multimodal systems: CLIP, LLaVA, BLIP, Qwen-VL, GPT-4V, vision-language models, image captioning, visual QA, multimodal RAG, image understanding, video understanding, multimodal search.
  This skill enforces: VLM selection based on task requirements, multimodal RAG architecture with separate/text+image retrieval, CLIP embedding strategy for cross-modal search, prompting strategy for VLMs.
  Do NOT use for: text-only LLM fine-tuning, audio/speech processing, text embedding model selection.
version: "1.0.0"
author: "j4flmao"
license: "MIT"
type: skill
compatibility:
  claude-code: true
  cursor: true
  codex: true
  windsurf: true
tags: [ai, multimodal, vision-language, phase-11]
Multimodal Agent
Purpose
Design multimodal AI systems with vision-language models, image understanding, multimodal RAG, and cross-modal search using CLIP embeddings.
Agent Protocol
Trigger
User request includes: multimodal, CLIP, LLaVA, GPT-4V, vision-language, image captioning, visual QA, multimodal RAG, BLIP, Qwen-VL, image understanding, video understanding, multimodal search.
Protocol
- Clarify task type: image captioning, VQA, multimodal retrieval, or video understanding.
- Select VLM based on task, latency, and quality requirements.
- Design multimodal RAG architecture: separate or unified embedding spaces.
- Configure CLIP embeddings for cross-modal text-image search.
- Define prompting strategy for vision-language models.
- Set up evaluation metrics for multimodal tasks.
Output
Multimodal architecture with model selection, pipeline design, RAG strategy.
Response Format
## Multimodal Configuration
### Model
VLM: {CLIP / LLaVA / BLIP / Qwen-VL / GPT-4V}
Vision Encoder: {ViT-L/14 / SigLIP / EVA-CLIP}
Language Model: {LLaMA / Qwen / GPT-4}
Resolution: {224x224 / 336x336 / 448x448}
### Task
Type: {captioning / VQA / retrieval / video}
Input: {image / image+text / video+text}
Output: {text / embeddings / bounding boxes}
### Multimodal RAG
Strategy: {separate / unified embedding}
Image Retriever: {CLIP ViT-L/14}
Text Retriever: {BGE-large / OpenAI ada-002}
Fusion: {late fusion / cross-attention}
Context Format: {interleaved image+text / image-block + text}
### Evaluation
Metric: {CIDEr / BLEU / CLIPScore / retrieval recall@k}
Baseline: {score}
Target: {score}
No preamble. No postamble. No explanations. No filler/hedging/transitions. Compress output — why use many token when few do trick.
Completion Criteria
- VLM selected with task-specific rationale (quality, latency, cost).
- Multimodal RAG architecture designed with retrieval and fusion strategy.
- CLIP embedding configuration for cross-modal search.
- Prompting strategy for VLM inputs (image + text formatting).
- Evaluation metrics chosen for task type.
- Video understanding approach specified (frame sampling, temporal modeling).
Workflow
Step 1: Select Vision-Language Model
- GPT-4V: Highest quality, general-purpose. Best for complex VQA and reasoning. API-based, higher cost.
- LLaVA-1.6: Open-source; pairs a CLIP vision encoder with an LLM backbone (Vicuna, Mistral, or Nous-Hermes-Yi variants). Best for self-hosted VQA and captioning.
- BLIP-3 / Florence-2: Strong captioning and fine-grained understanding. Instruction-tuned.
- Qwen-VL: Strong multilingual multimodal. Best for non-English inputs.
- CLIP: Embeddings only (no generation). Best for multimodal search and zero-shot classification.
Step 2: Image Understanding with VLM
# LLaVA inference
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
# `image` is assumed to be a PIL.Image already loaded by the caller
prompt = "[INST] <image>\nDescribe this image in detail [/INST]"
inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=200)
caption = processor.decode(output[0], skip_special_tokens=True)
Step 3: CLIP Embeddings for Multimodal Search
from transformers import CLIPProcessor, CLIPModel
import torch
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
# Encode images and text into shared embedding space
image_inputs = processor(images=images, return_tensors="pt")
# padding=True is required because the two prompts tokenize to different lengths
text_inputs = processor(text=["a photo of a cat", "a dog in the park"], return_tensors="pt", padding=True)
with torch.no_grad():
    image_embeds = model.get_image_features(**image_inputs)
    text_embeds = model.get_text_features(**text_inputs)
# Normalize for cosine similarity
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
# Similarity matrix
similarity = text_embeds @ image_embeds.T
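To make the role of normalization and the similarity matrix concrete, here is a minimal dependency-free sketch of the same ranking step. The function names and the 3-dimensional toy vectors are illustrative assumptions, not real CLIP outputs.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def rank_images(text_vec, image_vecs, k=2):
    """Return (image_index, score) pairs for the k images most similar to the text."""
    q = l2_normalize(text_vec)
    scores = [
        (i, sum(a * b for a, b in zip(q, l2_normalize(v))))
        for i, v in enumerate(image_vecs)
    ]
    return sorted(scores, key=lambda pair: -pair[1])[:k]

# Toy "embeddings" with made-up values (hypothetical, for illustration only)
toy_images = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 3.0, 0.0]]
toy_query = [5.0, 0.0, 0.0]
top = rank_images(toy_query, toy_images, k=2)
```

Because every vector is rescaled to unit length first, the differing magnitudes of the toy vectors do not affect the ranking.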
Step 4: Multimodal RAG
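# Separate embedding spaces: CLIP indexes images, BGE indexes text
import faiss
import numpy as np
import torch
from sentence_transformers import SentenceTransformer
from transformers import CLIPModel, CLIPProcessor

class MultimodalRAG:
    def __init__(self):
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
        self.clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
        self.text_encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
        self.image_index = faiss.IndexFlatIP(768)   # CLIP dim
        self.text_index = faiss.IndexFlatIP(1024)   # BGE dim

    def retrieve(self, query, k=5):
        # Text index: BGE query embedding, float32 with shape (1, 1024)
        text_query = self.text_encoder.encode([query], normalize_embeddings=True).astype(np.float32)
        # Image index: CLIP text embedding (768-d), the space the images were indexed in
        inputs = self.clip_processor(text=[query], return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            clip_query = self.clip.get_text_features(**inputs)
        clip_query = (clip_query / clip_query.norm(dim=-1, keepdim=True)).numpy().astype(np.float32)
        text_scores, text_idx = self.text_index.search(text_query, k)
        image_scores, image_idx = self.image_index.search(clip_query, k)
        return text_idx, image_idx

    def generate(self, query, retrieved_texts, retrieved_images):
        # format_multimodal_context and llm are supplied by the surrounding application
        context = format_multimodal_context(retrieved_texts, retrieved_images)
        prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
        return llm.generate(prompt)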
Implementation Patterns
VLM Selection Decision Matrix
Task Type → Recommended VLM → Why
├── Image Captioning (generic)
│ ├── LLaVA-1.6 7B → Best open-source quality/speed tradeoff
│ ├── BLIP-3 → Specialized for detailed region-level captioning
│ └── GPT-4V → Highest quality but $0.01+/image
│
├── Visual Question Answering
│ ├── LLaVA-1.6 13B → Strong reasoning, self-hostable
│ ├── Qwen-VL → Best multilingual VQA
│ └── GPT-4V → Complex reasoning, chart/diagram understanding
│
├── Multimodal Retrieval
│ ├── CLIP ViT-L/14 → 768-dim embeddings, best general retrieval
│ ├── SigLIP → Better multilingual alignment
│ ├── EVA-CLIP → Higher accuracy on zero-shot benchmarks
│ └── JINA CLIP-v2 → 1024-dim, trained for retrieval tasks
│
├── Video Understanding
│ ├── Video-LLaVA → Open-source video-language model
│ ├── GPT-4V → Best but expensive for video frames
│ └── InternVideo2 → Strong spatiotemporal modeling
│
└── Document/Chart Understanding
├── OCR-free VLMs (Donut, Pix2Struct) → No OCR pipeline needed
├── LayoutLMv3 → Layout + text + vision fusion
└── GPT-4V → Best for complex charts and diagrams
Image Preprocessing Pipeline
import torch
import torchvision.transforms as T
from PIL import Image
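class MultimodalPreprocessor:
    def __init__(self, target_size=336, strategy="pad"):
        self.target_size = target_size
        self.strategy = strategy
        self.transform = self._build_transform()

    def _build_transform(self):
        return T.Compose([
            self._get_sizing_strategy(),
            T.ToTensor(),
            # CLIP normalization statistics
            T.Normalize(mean=[0.4815, 0.4578, 0.4082], std=[0.2686, 0.2613, 0.2758]),
        ])

    def _pad_to_square(self, image: Image.Image) -> Image.Image:
        # Pad the shorter side with black so the aspect ratio is preserved, then resize
        w, h = image.size
        side = max(w, h)
        left, top = (side - w) // 2, (side - h) // 2
        padded = T.functional.pad(image, [left, top, side - w - left, side - h - top], fill=0)
        return T.functional.resize(
            padded, [self.target_size, self.target_size], interpolation=T.InterpolationMode.BICUBIC
        )

    def _get_sizing_strategy(self):
        resize = T.Resize(self.target_size, interpolation=T.InterpolationMode.BICUBIC)
        if self.strategy == "pad":
            return T.Lambda(self._pad_to_square)
        elif self.strategy == "center_crop":
            return T.Compose([resize, T.CenterCrop(self.target_size)])
        # Fallback: resize the shorter side only (output is not guaranteed to be square)
        return resize

    def __call__(self, image: Image.Image) -> torch.Tensor:
        return self.transform(image.convert("RGB")).unsqueeze(0)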
Unified Embedding Space (CLIP + Text)
class UnifiedMultimodalEncoder:
    def __init__(self, clip_model_name="openai/clip-vit-large-patch14", text_model_name="BAAI/bge-large-en-v1.5"):
        self.clip = CLIPModel.from_pretrained(clip_model_name)
        self.clip_processor = CLIPProcessor.from_pretrained(clip_model_name)
        self.text_encoder = SentenceTransformer(text_model_name)
        # Projection layer to unify dimensions (768-d CLIP -> 1024-d BGE).
        # It is randomly initialized here and must be trained before its outputs are comparable.
        self.projection = torch.nn.Linear(768, 1024)

    def encode_image(self, image):
        inputs = self.clip_processor(images=image, return_tensors="pt")
        with torch.no_grad():
            emb = self.clip.get_image_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        return emb

    def encode_text_clip(self, text):
        inputs = self.clip_processor(text=text, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            emb = self.clip.get_text_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        return emb

    def encode_text_bge(self, text):
        emb = self.text_encoder.encode(text, normalize_embeddings=True)
        return torch.tensor(emb).unsqueeze(0)

    def project_clip(self, clip_emb):
        # Map a CLIP embedding into the BGE-sized space and re-normalize
        emb = self.projection(clip_emb)
        return emb / emb.norm(dim=-1, keepdim=True)
Multimodal RAG with Late Fusion
class LateFusionMultimodalRAG:
    def __init__(self, image_index, text_index, llm, embed_for_images, embed_for_text):
        self.image_index = image_index
        self.text_index = text_index
        self.llm = llm
        # Callables mapping a query string to a float32 array of shape (1, dim)
        # matching each index (e.g. CLIP text encoder for images, BGE for text)
        self.embed_for_images = embed_for_images
        self.embed_for_text = embed_for_text
        self.fusion_weight = 0.5  # weight on image scores; text gets 1 - fusion_weight

    def retrieve(self, query: str, k: int = 5) -> dict:
        image_scores, image_ids = self.image_index.search(self.embed_for_images(query), k)
        text_scores, text_ids = self.text_index.search(self.embed_for_text(query), k)
        return {
            "images": list(zip(image_ids[0], image_scores[0])),
            "texts": list(zip(text_ids[0], text_scores[0])),
        }

    def fuse_results(self, results: dict, k: int = 5) -> list:
        # Scores come from different embedding spaces, so their scales may differ;
        # rescale each list before weighting if one modality dominates.
        combined = {}
        for idx, score in results["images"]:
            combined[f"img_{idx}"] = float(score) * self.fusion_weight
        for idx, score in results["texts"]:
            combined[f"txt_{idx}"] = float(score) * (1 - self.fusion_weight)
        return sorted(combined.items(), key=lambda x: -x[1])[:k]

    def generate(self, query: str, fused_results: list):
        context_parts = []
        for key, score in fused_results:
            if key.startswith("img_"):
                context_parts.append(f"[Image {key}]")
            else:
                context_parts.append(f"[Text {key}]")
        context = "\n".join(context_parts)
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        return self.llm.generate(prompt)
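The class above weights raw scores, which come from two different embedding spaces and may sit on different scales. One possible remedy, sketched here as an assumption rather than the skill's prescribed method, is to min-max rescale each hit list before applying the fusion weight. The helper names and sample scores are hypothetical.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all 1.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def late_fuse(image_hits, text_hits, image_weight=0.5, k=3):
    """Merge two (id, score) hit lists into one weighted ranking."""
    fused = []
    for prefix, hits, weight in (
        ("img", image_hits, image_weight),
        ("txt", text_hits, 1.0 - image_weight),
    ):
        if not hits:
            continue
        normed = min_max([score for _, score in hits])
        fused.extend(
            (f"{prefix}_{doc_id}", weight * s)
            for (doc_id, _), s in zip(hits, normed)
        )
    # sorted() is stable, so ties keep image hits ahead of text hits
    return sorted(fused, key=lambda item: -item[1])[:k]

# Made-up scores on two different scales (illustrative only)
image_hits = [(7, 0.31), (2, 0.27), (9, 0.21)]
text_hits = [(4, 0.88), (1, 0.80), (5, 0.64)]
ranking = late_fuse(image_hits, text_hits, image_weight=0.5, k=3)
```

Without rescaling, the text hits in this toy data would all outrank the image hits simply because their raw scores are larger.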
Video Frame Sampling Strategies
class VideoFrameSampler:
    def __init__(self, strategy="uniform", max_frames=16):
        self.strategy = strategy
        self.max_frames = max_frames

    def sample(self, video_path: str) -> list:
        if self.strategy == "uniform":
            return self._uniform_sample(video_path)
        elif self.strategy == "keyframe":
            return self._keyframe_extraction(video_path)
        elif self.strategy == "scene_detect":
            # Shot-boundary detection is not implemented in this class
            raise NotImplementedError("scene_detect requires an external scene detector")
        raise ValueError(f"Unknown strategy: {self.strategy}")

    def _uniform_sample(self, video_path):
        """Sample N frames evenly across video duration."""
        import cv2
        cap = cv2.VideoCapture(video_path)
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        # Never request more frames than the video contains
        n = max(min(self.max_frames, total_frames), 0)
        indices = [int(i * total_frames / n) for i in range(n)]
        frames = []
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ret, frame = cap.read()
            if ret:
                frames.append(frame)
        cap.release()
        return frames

    def _keyframe_extraction(self, video_path):
        """Extract keyframes using histogram difference."""
        import cv2
        cap = cv2.VideoCapture(video_path)
        frames = []
        prev_hist = None
        while len(frames) < self.max_frames:
            ret, frame = cap.read()
            if not ret:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            hist = cv2.calcHist([gray], [0], None, [16], [0, 256])
            hist = hist / hist.sum()
            # Keep the first frame and any frame that differs enough from the previous one
            if prev_hist is None or cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CHISQR) > 0.5:
                frames.append(frame)
            prev_hist = hist
        cap.release()
        return frames
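The uniform strategy above fixes the number of frames. Sampling at a fixed rate such as 1 FPS additionally needs the video's native frame rate. The following sketch computes only the frame indices (no decoding), so it runs without OpenCV. The function name and parameters are illustrative assumptions.

```python
def frame_indices_at_fps(total_frames, native_fps, sample_fps=1.0, max_frames=16):
    """Indices of frames to decode when sampling at sample_fps, capped at max_frames."""
    if total_frames <= 0 or native_fps <= 0 or sample_fps <= 0:
        return []
    step = native_fps / sample_fps  # source frames per sampled frame
    indices = []
    position = 0.0
    while int(position) < total_frames:
        indices.append(int(position))
        position += step
    if len(indices) > max_frames:
        # Too many samples for the budget: thin them out evenly
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices

short_clip = frame_indices_at_fps(total_frames=300, native_fps=30.0)    # 10 s video
long_clip = frame_indices_at_fps(total_frames=3000, native_fps=30.0)    # 100 s video
```

For the 10-second clip every sampled second is kept; for the 100-second clip the 100 candidates are thinned to the 16-frame budget.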
VLM Prompt Engineering Patterns
# Captioning prompt templates
CAPTION_PROMPTS = {
"detailed": (
"Describe this image in detail, including: "
"colors, objects, spatial relationships, text content, "
"people (if any), actions, and overall scene composition."
),
"brief": "Describe this image in one sentence.",
"structured": (
"Analyze this image and provide:\n"
"- Objects detected\n"
"- Scene type\n"
"- Text content (if any)\n"
"- Color palette\n"
"- Estimated mood/atmosphere"
),
}
# VQA prompt templates
VQA_PROMPTS = {
"yes_no": "Answer yes or no: {question}",
"multiple_choice": (
"Choose the correct answer from options A, B, C, or D.\n"
"Question: {question}\n"
"Options:\n{options}\n"
"Answer:"
),
"open_ended": "Question: {question}\nAnswer in 1-3 sentences:",
}
def build_vlm_prompt(task_type: str, image_tag: str = "<image>", **kwargs):
    if task_type == "caption":
        prompt = CAPTION_PROMPTS[kwargs.get("style", "detailed")]
    elif task_type == "vqa":
        template = VQA_PROMPTS[kwargs.get("format", "open_ended")]
        # Fill {question}, plus {options} for the multiple-choice template
        prompt = template.format(question=kwargs["question"], options=kwargs.get("options", ""))
    elif task_type == "retrieval":
        prompt = "Retrieve relevant information from the provided context."
    else:
        raise ValueError(f"Unknown task: {task_type}")
    return f"{image_tag}\n{prompt}"
Architecture Decision Trees
Embedding Strategy Selection
Need multimodal search?
├── Single embedding space (CLIP only)
│ ├── Simple text↔image similarity → CLIP ViT-L/14 (768d)
│ ├── Zero-shot classification → CLIP with class prompts
│ └── High recall needed → Ensemble CLIP + SigLIP
│
├── Separate embedding spaces (CLIP + text embedder)
│ ├── Text-to-image retrieval dominant
│ │ → CLIP for images, BGE-large for text
│ │ → Late fusion at retrieval time
│ ├── Cross-modal ranking needed
│ │ → CLIP for both (unified space)
│ │ → Pros: simpler, single index. Cons: lower text quality
│ └── Multilingual search
│ → CLIP multilingual variant + BGE-m3
│ → Align via projection layer
│
└── Hybrid (CLIP + reranker)
├── Stage 1: CLIP retrieves top-100 candidates
├── Stage 2: Cross-modal reranker (e.g., BLIP-3) re-ranks top-20
└── Best accuracy but higher latency and cost
Deployment Architecture Decision
Production deployment requirements?
├── Latency < 500ms → Use a small VLM (LLaVA-1.6 7B, Phi-3-vision)
├── Latency < 2s → Use LLaVA-1.6 13B or Qwen-VL 7B
├── Latency < 5s → GPT-4V or LLaVA-1.6 34B
├── Self-hosted
│ ├── GPU available (A100 80GB) → LLaVA-34B, Qwen-VL-72B (quantized or sharded across GPUs)
│ ├── GPU (A100 40GB) → LLaVA-13B, Qwen-VL-7B
│ └── CPU / Edge → MobileCLIP, TinyLLaVA, ONNX-optimized
├── API-based → GPT-4V, Gemini Pro Vision, Claude 3 Vision
└── Cost-sensitive
├── Stage 1: CLIP embedding (low cost, bulk filtering)
└── Stage 2: GPT-4V only for top-5 ranked candidates
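As a rough illustration of the cost-sensitive branch, the arithmetic below compares sending every retrieved candidate to the API with sending only the top-ranked few. It uses the approximate $0.01/image figure quoted in this document. The function name and the workload numbers are hypothetical, and the cost of running the first-stage CLIP filter is not included.

```python
def rerank_api_cost(num_queries, candidates_per_query, top_n=5, cost_per_image=0.01):
    """Estimate API spend with and without a cheap first-stage filter."""
    unfiltered = num_queries * candidates_per_query * cost_per_image
    sent = min(top_n, candidates_per_query)  # images escalated to the API per query
    tiered = num_queries * sent * cost_per_image
    return {"unfiltered_usd": unfiltered, "tiered_usd": tiered}

# Hypothetical workload: 1,000 queries, 100 candidate images each
estimate = rerank_api_cost(num_queries=1_000, candidates_per_query=100)
```

Under these assumptions the API bill scales with `top_n` rather than with the candidate pool size.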
Production Considerations
Batching and Caching Strategy
import time
from pathlib import Path

import torch

class MultimodalCache:
    def __init__(self, model, processor, cache_dir: str = "./cache", ttl: int = 3600):
        self.model = model          # e.g. a CLIPModel
        self.processor = processor  # the matching CLIPProcessor
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.ttl = ttl

    def get_image_embedding(self, image_hash: str) -> torch.Tensor | None:
        cache_path = self.cache_dir / f"{image_hash}.pt"
        if cache_path.exists():
            # Treat entries older than ttl seconds as expired
            if time.time() - cache_path.stat().st_mtime < self.ttl:
                return torch.load(cache_path)
        return None

    def set_image_embedding(self, image_hash: str, embedding: torch.Tensor):
        cache_path = self.cache_dir / f"{image_hash}.pt"
        torch.save(embedding, cache_path)

    def batch_process(self, images: list, batch_size: int = 32) -> torch.Tensor:
        results = []
        for i in range(0, len(images), batch_size):
            batch = images[i:i + batch_size]
            # The processor resizes every image to the model's input size
            inputs = self.processor(images=batch, return_tensors="pt")
            with torch.no_grad():
                embeddings = self.model.get_image_features(**inputs)
            results.append(embeddings)
        return torch.cat(results, dim=0)
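The cache above is keyed by an `image_hash` that the caller must supply. One way to derive it, sketched here as an assumption, is a SHA-256 digest over the model name and the raw image bytes, so that embeddings from different models never collide. The function name is hypothetical.

```python
import hashlib

def image_cache_key(image_bytes: bytes, model_name: str) -> str:
    """Content-addressed cache key: same bytes + same model -> same key."""
    digest = hashlib.sha256()
    digest.update(model_name.encode("utf-8"))
    digest.update(b"\x00")  # separator so name and payload cannot run together
    digest.update(image_bytes)
    return digest.hexdigest()

key_a = image_cache_key(b"fake-image-bytes", "openai/clip-vit-large-patch14")
key_b = image_cache_key(b"fake-image-bytes", "openai/clip-vit-large-patch14")
key_c = image_cache_key(b"fake-image-bytes", "another-model")
```

Hashing the bytes rather than the file path means renamed or re-uploaded copies of the same file still hit the cache.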
Security Considerations
- Prompt injection via images: Adversarial images can contain hidden text that bypasses safety filters. Always apply input sanitization and output filtering for VLM responses.
- Data leakage in VLM training data: Public VLMs may regurgitate sensitive information from training sets. Use dedicated fine-tuned models for proprietary data.
- Image metadata exposure: EXIF data (GPS coordinates, camera info) in uploaded images should be stripped before processing.
- Model poisoning risk: Open-source VLM checkpoints can contain backdoors. Verify model hashes against official releases and scan for suspicious weights.
- Rate limiting and cost control: API-based VLMs (GPT-4V) cost per-image. Implement per-user rate limiting and budget caps to prevent runaway costs.
- Output validation: VLM outputs can contain hallucinations, biases, or offensive content. Apply content filtering and factual consistency checks before surfacing to users.
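The rate-limiting and budget-cap item can be sketched as a small in-memory guard. This is a minimal illustration under stated assumptions: the class name, the per-day image limit, and the in-process dictionary are hypothetical, and a production system would need shared, persistent storage.

```python
class ImageBudgetGuard:
    """Per-user daily cap on the number of images sent to a paid VLM API."""

    def __init__(self, daily_image_limit):
        self.daily_image_limit = daily_image_limit
        self.used = {}  # (user_id, day) -> images already charged

    def try_charge(self, user_id, day, num_images=1):
        """Record the request and return True if it fits the cap, else return False."""
        key = (user_id, day)
        already = self.used.get(key, 0)
        if already + num_images > self.daily_image_limit:
            return False
        self.used[key] = already + num_images
        return True

guard = ImageBudgetGuard(daily_image_limit=3)
results = [guard.try_charge("alice", "2024-01-01") for _ in range(4)]
```

Counting images rather than dollars keeps the check in integers; the dollar budget follows by multiplying the limit by the per-image price.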
Anti-Patterns
| Anti-Pattern | Why It Fails | Correct Approach |
|---|---|---|
| Using raw CLIP embeddings without normalization | Inner-product search (e.g. FAISS IndexFlatIP) on unnormalized vectors ranks by magnitude as well as direction, so scores are not cosine similarities | Always L2-normalize embeddings before indexing or comparison |
| Feeding all video frames to VLM | Token count, cost, and latency grow with every added frame, while adjacent frames are mostly redundant | Sample 4-16 keyframes using scene detection |
| Single embedding space for all modalities | CLIP text encoder quality is lower than dedicated text embedders | Use separate spaces with late fusion or projection layers |
| Ignoring image resolution when preparing VLM inputs | Low-res images miss fine details (small text, distant objects) needed for accurate captioning | Use the highest input resolution the model supports for detail-heavy tasks, or crop/tile regions of interest |
| Using GPT-4V for every image | $0.01/image adds up fast; most images don't need GPT-4V quality | Tiered approach: CLIP → BLIP/LLaVA → GPT-4V for hard cases |
| Not handling image rotation/orientation | VLMs trained on correctly oriented images fail on rotated inputs | Apply EXIF-based auto-rotation before VLM processing |
| Batch processing without padding | Different-sized images in a batch cause tensor shape errors | Use consistent resizing + padding to uniform size within batches |
Performance Optimization
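- Model quantization: Use 8-bit or 4-bit quantization (bitsandbytes) to cut VLM weight memory by roughly 50-75% relative to fp16. Accuracy loss is usually small but should be measured on the target task.
- Attention kernel optimization: For self-hosted VLMs, enable Flash Attention 2 on compatible GPUs for up to 2-3x faster inference; the gain grows with sequence length, which matters for image-heavy prompts.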
- Embedding index tiering: Store CLIP embeddings in FAISS with IVF-PQ indexing for 10-100x faster search at 5-10% recall cost.
- Async image fetching: Fetch and preprocess images in parallel using asyncio while the LLM generates text for previous samples.
- Image deduplication: Perceptual hashing (pHash) detects near-duplicate images, reducing redundant VLM calls by 20-40%.
- Dynamic resolution scaling: Start with 224px, increase to 336px only if the VLM confidence is below threshold.
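- Batch API calls: Submit non-urgent image requests through asynchronous batch endpoints where supported (the OpenAI Batch API discounts requests by 50% in exchange for delayed results).
- Speculative decoding: Use a small draft model (e.g. TinyLLaVA) to propose tokens that the full model verifies. The draft typically needs a tokenizer compatible with the target, and a latency gain of around 2x depends on how often the draft tokens are accepted.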
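The deduplication item can be illustrated with a dependency-free sketch. It uses an average hash (aHash), a simpler relative of pHash that is swapped in here to avoid a DCT, and compares hashes by Hamming distance. The 2x2 grayscale grids stand in for downscaled images, and all names and values are illustrative assumptions.

```python
def average_hash(gray):
    """Hash a small grayscale grid: one bit per pixel, set if brighter than the mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")

def deduplicate(hashes, max_distance=1):
    """Return indices to keep, dropping near-duplicates of earlier kept images."""
    kept = []
    for i, h in enumerate(hashes):
        if all(hamming(h, hashes[j]) > max_distance for j in kept):
            kept.append(i)
    return kept

# Toy 2x2 "images": the second is a slightly noisy copy of the first
grids = [
    [[10, 200], [10, 200]],
    [[12, 198], [11, 205]],
    [[200, 10], [200, 10]],
]
hashes = [average_hash(g) for g in grids]
keep = deduplicate(hashes, max_distance=1)
```

Only the kept indices would be forwarded to the VLM; the dropped images can reuse the result of their earlier near-duplicate.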
Rules
- VLM image resolution impacts quality — higher is better but slower.
- CLIP embeddings normalized for cosine similarity in retrieval.
- Multimodal RAG: separate embedding spaces for text and images, late fusion.
- Frame sampling for video: 1 FPS or keyframe extraction — not all frames.
- Prompt VLMs with explicit image tags like `<image>` in the correct format.
- Captioning: use specific prompts ("Describe in detail including colors, objects, spatial layout").
- VQA: provide explicit answer format ("Answer yes or no:" / "Choose from options: A, B, C").
- Evaluation: CIDEr/BLEU for captioning, accuracy for VQA, recall@k for retrieval.
- Cost: GPT-4V is ~$0.01/image — batch and cache aggressively.
- Open-source 7B VLMs (LLaVA, Qwen-VL) need roughly 8-16GB VRAM depending on quantization; larger variants need proportionally more.
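For the retrieval metric named in the rules, a minimal recall@k sketch follows. The function names and the toy query results are illustrative assumptions.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    top = set(retrieved[:k])
    return len(top & set(relevant)) / len(relevant)

def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)

# Toy evaluation set: two queries with made-up ids
runs = [
    ([3, 1, 7, 9], [1, 9]),  # one of two relevant items is in the top 2
    ([5, 6], [6]),           # the single relevant item is in the top 2
]
score = mean_recall_at_k(runs, k=2)
```

Reporting the metric at several values of k (for example 1, 5, and 10) shows whether misses are ranking errors or retrieval failures.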
References
- references/audio-models.md — Audio Models
- references/multimodal-advanced.md — Multimodal Advanced Topics
- references/multimodal-deployment.md — Multimodal Deployment
- references/multimodal-evaluation.md — Multimodal Evaluation
- references/multimodal-fundamentals.md — Multimodal Fundamentals
- references/multimodal-rag.md — Multimodal RAG
- references/vision-language-models.md — Vision-Language Models
- references/vision-models.md — Vision Models
Handoff
For text-only RAG, hand off to ai-rag-patterns. For embedding model selection, hand off to ai-embeddings. For fine-tuning vision-language models, hand off to ai-model-training.