tao-finetune-cosmos-embed


Cosmos-Embed1 video-text embedding for text-to-video retrieval, video-to-video search, semantic deduplication, and fine-tuning. Use when the user asks to "fine-tune Cosmos-Embed1", "run cosmos-embed inference", "export Cosmos-Embed1", "embed videos", or "search videos with text".

By NVIDIA. Updated 6/8/2026.

name: tao-finetune-cosmos-embed
description: >-
  Cosmos-Embed1 video-text embedding for text-to-video retrieval, video-to-video search, semantic deduplication, and fine-tuning. Use when the user asks to "fine-tune Cosmos-Embed1", "run cosmos-embed inference", "export Cosmos-Embed1", "embed videos", or "search videos with text".
license: Apache-2.0
compatibility: Requires docker + nvidia-container-toolkit, the published Cosmos-Embed TAO container from versions.yaml, and a HuggingFace token when downloading pretrained nvidia/Cosmos-Embed1-* weights.
metadata:
  author: NVIDIA Corporation
  version: "0.1.0"
allowed-tools: Read Bash
tags:
  - video
  - vision-language
  - vlm
  - multimodal
  - retrieval
  - embedding
  - cosmos
  - fine-tuning

Cosmos-Embed

Cosmos-Embed1 is a joint video-text embedder for text-to-video retrieval, video-to-video search, zero-shot/kNN classification, and semantic deduplication. The packaged CLI is cosmos-embed1 and supports train, evaluate, inference, and export.

Container image and per-action commands are in references/skill_info.yaml. Compact starting specs are in references/spec_template_*.yaml.

Train Action Policy

This model is AutoML-enabled at the model layer. Before handling any train-stage request, read references/skill_info.yaml and resolve the run override from either an explicit automl_policy value or the user's workflow request. Treat phrases like "turn off AutoML", "disable AutoML", "no HPO", or "plain training" as automl_policy: off for this run only; otherwise default to auto.

When automl_policy: auto, automl_enabled: true, and both schemas/train.schema.json and references/spec_template_train.yaml are packaged, route the train action through tao-skill-bank:tao-run-automl by default with this model's skill_dir. Preserve workflow/application overrides for datasets, specs, output directories, GPU/platform settings, parent checkpoints, and automl_policy.

Use direct model training only when automl_policy: off or the packaged train schema/template is missing. In the missing-schema case, report that AutoML is enabled but not runnable for this model until schemas are generated.

Non-train actions such as evaluate, inference, export, and deploy flows stay in this model skill. The per-run automl_policy override does not change model metadata.
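
The routing rules in this section can be sketched as a small decision function. This is only an illustration: the names `resolve_policy`, `route_train`, and `TrainRoute` are hypothetical and not part of the skill bank. Two behaviours are assumptions rather than stated policy: an explicit automl_policy value wins over phrases in the request, and a model with automl_enabled: false falls back to direct training.

```python
from dataclasses import dataclass

# Phrases the policy text treats as "AutoML off for this run only".
OFF_PHRASES = ("turn off automl", "disable automl", "no hpo", "plain training")


@dataclass
class TrainRoute:
    target: str      # "automl" (tao-skill-bank:tao-run-automl) or "direct"
    note: str = ""   # message to report back to the user, if any


def resolve_policy(explicit_policy, user_request):
    """Return "off" or "auto" for this run; model metadata is never modified."""
    # Assumption: an explicit value takes precedence over request phrasing.
    if explicit_policy in ("off", "auto"):
        return explicit_policy
    text = (user_request or "").lower()
    return "off" if any(phrase in text for phrase in OFF_PHRASES) else "auto"


def route_train(explicit_policy, user_request, automl_enabled, has_schema, has_template):
    """Pick the train route from the per-run policy and the packaged files."""
    policy = resolve_policy(explicit_policy, user_request)
    # Assumption: automl_enabled=false is handled like policy "off".
    if policy == "off" or not automl_enabled:
        return TrainRoute("direct")
    if has_schema and has_template:
        return TrainRoute("automl")
    return TrainRoute(
        "direct",
        "AutoML is enabled but not runnable for this model until schemas are generated",
    )
```

Here `has_schema` and `has_template` stand for the presence of schemas/train.schema.json and references/spec_template_train.yaml in the skill directory.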

Quick Start

Use the published Cosmos-Embed container declared by references/skill_info.yaml and resolved through versions.yaml. Do not build from the private Cosmos-Embed1 source tree for normal skill use; build from source only when developing the container itself.

TAO_SKILL_BANK_PATH="${TAO_SKILL_BANK_PATH:-$PWD}"
COSMOS_EMBED_IMAGE="${COSMOS_EMBED_IMAGE:-$(
  python "$TAO_SKILL_BANK_PATH/scripts/resolve_tao_image.py" \
    --skill-bank "$TAO_SKILL_BANK_PATH" \
    --model cosmos-embed \
    --action train \
    --format json |
  python -c 'import json,sys; print(json.load(sys.stdin)["image"])'
)}"
docker pull "$COSMOS_EMBED_IMAGE"

Expected local workspace layout:

workspace/
├── data/
│   ├── msrvtt_test_1k.json
│   └── video/
│       ├── video7020.mp4
│       └── ...
├── model/
│   └── Cosmos-Embed1-224p/        # optional if using HF repo id
├── specs/
│   ├── train.yaml
│   ├── evaluate.yaml
│   ├── inference.yaml
│   ├── export_onnx.yaml
│   └── export_hf.yaml
└── results/

Use these Docker options for all actions unless the local Docker/platform skill gives a stricter environment-specific command:

TAO_SKILL_BANK_PATH="${TAO_SKILL_BANK_PATH:-$PWD}"
COSMOS_EMBED_IMAGE="${COSMOS_EMBED_IMAGE:-$(
  python "$TAO_SKILL_BANK_PATH/scripts/resolve_tao_image.py" \
    --skill-bank "$TAO_SKILL_BANK_PATH" \
    --model cosmos-embed \
    --action train \
    --format json |
  python -c 'import json,sys; print(json.load(sys.stdin)["image"])'
)}"
RUN_ROOT="${RUN_ROOT:-$PWD}"
DOCKER_COMMON=(
  --rm --gpus all --ipc=host --network=host
  --shm-size=64g
  --ulimit memlock=-1
  --ulimit stack=67108864
  -e HF_TOKEN
  -v "$RUN_ROOT/data:/data:ro"
  -v "$RUN_ROOT/model:/model"
  -v "$RUN_ROOT/specs:/specs:ro"
  -v "$RUN_ROOT/results:/results"
)

Train:

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 train -e /specs/train.yaml results_dir=/results

Evaluate:

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 evaluate -e /specs/evaluate.yaml results_dir=/results

Inference:

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 inference -e /specs/inference.yaml \
  'inference.query.input_texts=["a man is singing on stage"]' \
  inference.k=5 \
  results_dir=/results

Export ONNX:

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 export -e /specs/export_onnx.yaml \
  export.checkpoint=/results/train/cosmos_embed1_model_latest.pth \
  export.onnx_file=/results/export/cosmos_embed1_combined.onnx \
  results_dir=/results

Export HuggingFace format:

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 export -e /specs/export_hf.yaml \
  export.checkpoint=/results/train/cosmos_embed1_model_latest.pth \
  export.hf_output_dir=/results/export_hf/cosmos_embed1_hf \
  results_dir=/results

Smoke Overrides

For a small functional check, keep the same specs and override the expensive knobs:

train.max_iter=1
train.validation_iter=2
train.checkpoint_iter=1
train.optim.optim=adamw
dataset.train_dataset.batch_size=1
dataset.val_dataset.batch_size=1
dataset.train_dataset.workers=0
dataset.val_dataset.workers=0

If no local Cosmos-Embed1 pretrained checkpoint or HuggingFace token is available, set model.pretrained_model_path=null for a plumbing-only smoke train. The model quality is meaningless in that mode, but the train/evaluate/inference/export action paths can still be exercised.

For evaluation and inference smoke tests on a tiny subset:

evaluate.callbacks.embedding_visualization=false
evaluate.callbacks.max_eval_samples=8
dataset.test_dataset.batch_size=1
dataset.test_dataset.workers=0
inference.k=2
dataset.inference_dataset.batch_size=1
dataset.inference_dataset.workers=0
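
When the smoke overrides are assembled programmatically, a helper along these lines can turn a dictionary into the in-container command. It is a minimal sketch: `format_override` and `build_cli` are hypothetical helper names, and the rendering of None as null follows the model.pretrained_model_path=null example above rather than any documented API.

```python
import shlex

# Train smoke overrides copied from the list above.
SMOKE_TRAIN = {
    "train.max_iter": 1,
    "train.validation_iter": 2,
    "train.checkpoint_iter": 1,
    "train.optim.optim": "adamw",
    "dataset.train_dataset.batch_size": 1,
    "dataset.val_dataset.batch_size": 1,
    "dataset.train_dataset.workers": 0,
    "dataset.val_dataset.workers": 0,
}


def format_override(key, value):
    """Render one key=value override; None becomes null, booleans are lowercased."""
    if value is None:
        rendered = "null"
    elif isinstance(value, bool):  # check bool before other types: bool is an int subclass
        rendered = "true" if value else "false"
    else:
        rendered = str(value)
    return f"{key}={rendered}"


def build_cli(action, spec_path, overrides, results_dir="/results"):
    """Return the argv list for one cosmos-embed1 action inside the container."""
    args = ["cosmos-embed1", action, "-e", spec_path]
    args.extend(format_override(key, value) for key, value in overrides.items())
    args.append(f"results_dir={results_dir}")
    return args


if __name__ == "__main__":
    # Plumbing-only smoke train without pretrained weights.
    overrides = dict(SMOKE_TRAIN, **{"model.pretrained_model_path": None})
    print(shlex.join(build_cli("train", "/specs/train.yaml", overrides)))
```

The printed string can be appended to the `docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE"` prefix from the Quick Start.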

Data Format

The MSR-VTT path expects a local video glob and a JSON metadata file:

dataset:
  train_dataset:
    dataset_type: msrvtt
    mp4_urls: /data/video/*.mp4
    metadata: /data/msrvtt_test_1k.json

List-format metadata rows must include at least video and caption:

{"video_id": "video7020", "video": "video7020.mp4", "caption": "a woman creating a fondant baby and flower"}

The dataset loader derives the video id from the local .mp4 filename and filters to videos present in the metadata. If a run finds zero videos, check that mp4_urls points to a container-local glob and that metadata video names match the filenames.
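
Before launching a run, the filename match described above can be checked on the host. The sketch below is not the loader's own logic. It assumes the metadata file is a JSON list of rows shaped like the example, and the helper names `compare_names` and `check_msrvtt_layout` are hypothetical.

```python
import glob
import json
import os


def compare_names(filenames, rows):
    """Compare .mp4 filenames on disk with the "video" field of each metadata row."""
    on_disk = set(filenames)
    in_metadata = {row["video"] for row in rows}
    return {
        "matched": sorted(on_disk & in_metadata),
        "missing_on_disk": sorted(in_metadata - on_disk),
        "not_in_metadata": sorted(on_disk - in_metadata),
    }


def check_msrvtt_layout(mp4_glob, metadata_path):
    """Load the metadata file and the local glob, then report mismatches."""
    with open(metadata_path, encoding="utf-8") as handle:
        rows = json.load(handle)  # assumption: a JSON list of row objects
    filenames = [os.path.basename(path) for path in glob.glob(mp4_glob)]
    return compare_names(filenames, rows)


if __name__ == "__main__":
    report = check_msrvtt_layout("data/video/*.mp4", "data/msrvtt_test_1k.json")
    print({key: len(value) for key, value in report.items()})
```

An empty "matched" list corresponds to the zero-videos case: either the glob matched nothing or the metadata names differ from the filenames.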

Model Weights

  • Local HF directory: mount it under /model and set model.pretrained_model_path=/model/Cosmos-Embed1-224p.
  • HuggingFace repo: set model.pretrained_model_path=nvidia/Cosmos-Embed1-224p and pass HF_TOKEN if access is gated.
  • Fine-tuned checkpoint: downstream actions default to /results/train/cosmos_embed1_model_latest.pth.

Variants:

| Variant | Resolution | Frames | Embedding dim |
| --- | --- | --- | --- |
| Cosmos-Embed1-224p | 224 x 224 | 8 | 256 |
| Cosmos-Embed1-336p | 336 x 336 | 8 | 768 |
| Cosmos-Embed1-448p | 448 x 448 | 8 | 768 |

Keep model.network.embed_dim, model.input_hw, and model.network.spatial_resolution aligned with the selected variant.
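
One way to keep these three fields aligned is to derive them from a single variant table. The sketch below encodes the table above. The value formats are assumptions to verify against the packaged spec templates: model.input_hw is written as a two-element [H,W] list and model.network.spatial_resolution as a single integer. `variant_overrides` is a hypothetical helper.

```python
# Values taken from the variant table above.
VARIANTS = {
    "Cosmos-Embed1-224p": {"resolution": 224, "frames": 8, "embed_dim": 256},
    "Cosmos-Embed1-336p": {"resolution": 336, "frames": 8, "embed_dim": 768},
    "Cosmos-Embed1-448p": {"resolution": 448, "frames": 8, "embed_dim": 768},
}


def variant_overrides(name):
    """Return key=value overrides that keep the spec aligned with one variant."""
    variant = VARIANTS[name]  # raises KeyError for an unknown variant name
    side = variant["resolution"]
    return [
        f"model.pretrained_model_path=nvidia/{name}",
        f"model.network.embed_dim={variant['embed_dim']}",
        f"model.input_hw=[{side},{side}]",            # assumption: [H,W] list format
        f"model.network.spatial_resolution={side}",   # assumption: single integer
    ]


if __name__ == "__main__":
    for override in variant_overrides("Cosmos-Embed1-336p"):
        print(override)
```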

Important Parameters

| Parameter | Notes |
| --- | --- |
| train.num_gpus | 1 for single GPU, >1 auto-launches torchrun, -1 auto-detects visible GPUs. |
| train.max_iter | Main training length. Use 1 only for smoke testing. |
| train.optim.optim | fused_adamw is faster when available; adamw is safer for smoke and portability. |
| model.lora.enabled | Enables LoRA. Set model.network.visual_encoder.transformer_engine=false when LoRA is on. |
| model.lora.lora_rank | LoRA rank. Start with 8; try 4, 8, or 16 for manual or AutoML-style sweeps. |
| model.lora.lora_alpha | LoRA scaling factor. Start with 16; keep near 2 * lora_rank unless experiments show otherwise. |
| model.lora.lora_dropout | LoRA dropout. Start with 0.1; sweep 0.0, 0.05, and 0.1 for small datasets. |
| model.lora.bias | Bias policy: none, all, or lora_only. Keep none unless intentionally training biases. |
| model.lora.use_rslora / use_dora | Optional LoRA variants. Enable one at a time and record the setting with the checkpoint. |
| model.lora.target_modules | Optional module-name patterns for LoRA injection. Leave empty for the default ViT + Q-Former attention/MLP targets. |
| model.lora.modules_to_save | Optional modules to keep fully trainable alongside LoRA. Leave empty unless preserving a task-specific head. |
| evaluate.load_dataset_pkl / save_dataset_pkl | Cache evaluation embeddings. |
| inference.load_dataset_pkl / save_dataset_pkl | Cache the search database for repeated retrieval. |
| export.mode | video, text, combined, or huggingface. |
| export.on_cpu | Recommended for export to avoid device mismatch issues. |

LoRA and AutoML Notes

For parameter-efficient fine-tuning, set model.lora.enabled=true and keep model.network.visual_encoder.transformer_engine=false; TAO Core's Cosmos-Embed1 config notes that PEFT cannot inject adapters into Transformer Engine layers. Treat the LoRA fields above as the first candidate parameters for manual tuning or AutoML-style search before unfreezing larger model blocks. Avoid changing target_modules or modules_to_save unless the user explicitly needs custom adapter placement.
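
For a manual sweep, the candidate values from the parameter notes can be enumerated as override sets. This is a sketch, not the AutoML search itself: `lora_candidates` is a hypothetical helper, and fixing lora_alpha at exactly 2 * lora_rank is one reading of the "keep near 2 * lora_rank" guidance.

```python
from itertools import product

RANKS = (4, 8, 16)            # suggested lora_rank values
DROPOUTS = (0.0, 0.05, 0.1)   # suggested lora_dropout values


def lora_candidates(ranks=RANKS, dropouts=DROPOUTS):
    """Yield one override dict per (rank, dropout) combination."""
    for rank, dropout in product(ranks, dropouts):
        yield {
            "model.lora.enabled": "true",
            # PEFT cannot inject adapters into Transformer Engine layers.
            "model.network.visual_encoder.transformer_engine": "false",
            "model.lora.lora_rank": rank,
            "model.lora.lora_alpha": 2 * rank,  # assumption: exactly 2 * rank
            "model.lora.lora_dropout": dropout,
            "model.lora.bias": "none",
        }


if __name__ == "__main__":
    for candidate in lora_candidates():
        print(" ".join(f"{key}={value}" for key, value in candidate.items()))
```

Each printed line is a set of key=value overrides that can be appended to the train command. target_modules and modules_to_save are deliberately left at their defaults.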

S3 Staging

The Cosmos-Embed1 CLI consumes local paths and Python globs, not raw s3://.../*.mp4 URIs. For S3-backed runs, first stage a subset or full dataset to the execution host/container filesystem, then use local paths such as /data/video/*.mp4 in the spec.

Recommended S3 layout for staged MSR-VTT data:

s3://bucket/path/cosmos-embed/msrvtt-subset/
├── msrvtt_test_1k.json
└── video/
    ├── video7020.mp4
    └── ...

After downloading/syncing that prefix into the mounted data/ directory, use the same Docker commands above.
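
The staging step could look like the sketch below, which assumes the third-party boto3 package and AWS credentials are available on the host; any other sync tool works equally well. The helper names `local_path_for_key` and `stage_prefix` are hypothetical, and the bucket and prefix are the placeholders from the layout above.

```python
import os


def local_path_for_key(key, prefix, data_root):
    """Map an S3 object key under `prefix` to a path under the local data/ directory."""
    relative = key[len(prefix):].lstrip("/")
    return os.path.join(data_root, *relative.split("/"))


def stage_prefix(bucket, prefix, data_root):
    """Download every object under s3://bucket/prefix into data_root."""
    import boto3  # third-party; imported lazily so the path helper works without it

    client = boto3.client("s3")
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip folder placeholder objects
            target = local_path_for_key(key, prefix, data_root)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            client.download_file(bucket, key, target)


if __name__ == "__main__":
    # Placeholder bucket and prefix; replace with the real location.
    stage_prefix("bucket", "path/cosmos-embed/msrvtt-subset/", "data")
```

With this mapping, video/video7020.mp4 under the prefix lands at data/video/video7020.mp4, which the container sees as /data/video/video7020.mp4.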

Outputs

results/
├── train/
│   ├── cosmos_embed1_model_latest.pth
│   ├── cosmos_embed1_model_<iter>.pth
│   └── experiment.yaml
├── evaluate/
│   ├── metrics.json
│   └── experiment.yaml
├── inference/
│   ├── results.json
│   └── experiment.yaml
├── export/
│   ├── cosmos_embed1_combined.onnx
│   └── export_config.yaml
└── export_hf/
    └── cosmos_embed1_hf/
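
After an action finishes, the layout above can be checked with a short script. This is a sketch with hypothetical names (`EXPECTED`, `missing_outputs`). It assumes the default file names shown in the tree, so the ONNX entry changes if export.onnx_file was overridden.

```python
from pathlib import Path

# Files each action is expected to leave under results/, per the tree above.
EXPECTED = {
    "train": ["train/cosmos_embed1_model_latest.pth", "train/experiment.yaml"],
    "evaluate": ["evaluate/metrics.json", "evaluate/experiment.yaml"],
    "inference": ["inference/results.json", "inference/experiment.yaml"],
    "export": ["export/cosmos_embed1_combined.onnx", "export/export_config.yaml"],
}


def missing_outputs(results_dir, action):
    """Return the expected files for `action` that are absent from results_dir."""
    root = Path(results_dir)
    return [rel for rel in EXPECTED[action] if not (root / rel).exists()]


if __name__ == "__main__":
    for action in EXPECTED:
        print(action, missing_outputs("results", action))
```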

Known Pitfalls

| Symptom | Cause | Fix |
| --- | --- | --- |
| MSRVTTDataset: 0 videos found | mp4_urls is not a local glob or metadata filenames do not match videos. | Mount data into the container and set mp4_urls=/data/video/*.mp4. |
| HF download/auth failure | Missing or invalid HF_TOKEN, or model agreement not accepted. | Accept the model terms and pass -e HF_TOKEN. |
| LoRA injection failure | Transformer Engine visual encoder is enabled. | Set model.network.visual_encoder.transformer_engine=false. |
| ONNX/HF export complains about missing components | Export checkpoint is partial or adapter-only. | Use a full checkpoint or configure pretrained visual/text sources before export. |
| CUDA OOM | Batch/resolution too high for the GPU. | Reduce batch size, use 224p, enable LoRA, or use more GPUs. |

Install via CLI
npx skills add https://github.com/NVIDIA/skills --skill tao-finetune-cosmos-embed