ltx-video - SKILL.md Agent Skill

name: ltx-video description: > Set up, configure, and run LTX-2/LTX-2.3 (Lightricks) for AI video and audio generation. Use when: (1) Installing LTX-2 locally or via ComfyUI, (2) Generating video from text, image, audio, or keyframes, (3) Running inference with the 22B dev/distilled models, (4) Fine-tuning with LoRA/IC-LoRA, (5) Configuring upscalers (spatial/temporal), (6) Optimizing VRAM usage (FP8, quantization, low-VRAM modes), (7) Building content-generation pipelines with LTX-2, (8) Writing prompts for video generation, (9) Integrating LTX-2 into ComfyUI workflows. Triggers on: ltx, ltx-video, ltx-2, lightricks video, text-to-video generation, video diffusion model.

LTX-2 Video Generation

LTX-2.3 is the first DiT-based audio-video foundation model — 22B parameters, synchronized audio+video, real-time capable. Developed by Lightricks.

Paper: arXiv:2601.03233
Repo: https://github.com/Lightricks/LTX-2
Models: https://huggingface.co/Lightricks/LTX-2.3
ComfyUI: https://github.com/Lightricks/ComfyUI-LTXVideo

Quick Setup

# Clone and install (requires Python >=3.12, CUDA >12.7, uv)
git clone https://github.com/Lightricks/LTX-2.git
cd LTX-2
uv sync --frozen
source .venv/bin/activate

Or run scripts/setup-ltx.sh for automated setup with model downloads.

Model Selection

Model	Steps	Speed	Quality	Min VRAM
`ltx-2.3-22b-dev`	40	Baseline	Highest	24 GB
`ltx-2.3-22b-distilled`	8	~3x faster	Very good	16 GB
`ltx-2.3-22b-distilled` + FP8	8	~3x faster	Very good	12 GB

Supporting models (download alongside):

ltx-2.3-spatial-upscaler-x2-1.1 — 2x resolution enhancement
ltx-2.3-spatial-upscaler-x1.5-1.0 — 1.5x resolution (less VRAM)
ltx-2.3-temporal-upscaler-x2-1.0 — doubles FPS (smoother motion)
ltx-2.3-22b-distilled-lora-384 — distilled LoRA for dev model
Gemma 3 text encoder (12B) — required for all pipelines

Pipelines

Choose based on use case:

Pipeline	Use Case	Notes
`TI2VidTwoStagesPipeline`	Default production — text/image to video	2x spatial upsampling, best quality
`TI2VidTwoStagesHQPipeline`	Higher quality, fewer steps needed	Second-order sampling
`TI2VidOneStagePipeline`	Quick prototyping	Lower res, faster
`DistilledPipeline`	Fastest inference	8 fixed steps, CFG=1
`ICLoraPipeline`	Video/image transformations	Style transfer, control
`KeyframeInterpolationPipeline`	Animate between keyframes	Start/end frame interpolation
`A2VidPipelineTwoStage`	Audio-conditioned video	Synced audio+video
`RetakePipeline`	Regenerate video regions	Selective re-generation

Running Inference

CLI

# Text-to-video (recommended two-stage pipeline)
python -m ltx_pipelines.run \
  --config configs/ltx-2.3-22b-dev-2stage.yaml \
  --prompt "A golden retriever running through autumn leaves in a park" \
  --height 704 --width 1216 \
  --num_frames 97 \
  --output output.mp4

# Image-to-video
python -m ltx_pipelines.run \
  --config configs/ltx-2.3-22b-dev-2stage.yaml \
  --prompt "The scene comes alive with motion..." \
  --conditioning_image path/to/image.png \
  --height 704 --width 1216 \
  --num_frames 97 \
  --output output.mp4

# Distilled (fast mode)
python -m ltx_pipelines.run \
  --config configs/ltx-2.3-22b-distilled-2stage.yaml \
  --prompt "..." \
  --output output.mp4

# FP8 quantization for lower VRAM
python -m ltx_pipelines.run \
  --config configs/ltx-2.3-22b-dev-2stage.yaml \
  --quantization fp8-cast \
  --prompt "..." \
  --output output.mp4

Python API

from ltx_video.inference import infer, InferenceConfig

infer(InferenceConfig(
    pipeline_config="configs/ltx-2.3-22b-dev-2stage.yaml",
    prompt="A cinematic shot of ocean waves crashing on volcanic rocks at sunset",
    height=704,
    width=1216,
    num_frames=97,
    output_path="output.mp4",
    enhance_prompt=True,  # auto-enhance with Gemma 3
))

Resolution & Frame Rules

Width/height: must be divisible by 32 (e.g., 480x832, 704x1216, 768x512)
Frame count: must be 8n + 1 (e.g., 33, 65, 97, 129, 257)
Default: 1216x704 at 30 FPS
Max frames: 1,441 (long-form with distilled models)
FPS: 1-60, default 24-30

Prompting

Write detailed, chronological scene descriptions in ~200 words. Include:

Main action and specific movements
Character/object appearance details
Environment and setting
Camera angle and movement
Lighting and atmosphere
Notable transitions or changes

Use enhance_prompt=True for automatic prompt elaboration.

See references/prompting-guide.md for advanced techniques and examples.

VRAM Optimization

Technique	Savings	Flag/Config
FP8 quantization	~40%	`--quantization fp8-cast`
FP8 scaled (Hopper GPUs)	~45%	`--quantization fp8-scaled-mm`
Single-stage pipeline	~30%	Use `OneStagePipeline` config
Distilled model	Fewer steps = less peak VRAM	Use distilled config
xFormers attention	~15%	Install xformers, auto-detected
Flash Attention 3	~20%	Hopper GPUs, auto-detected
Gradient estimation	Reduce steps 40 to 20	`--gradient_estimation`

Apple Silicon (MLX)

LTX-2 runs on Apple Silicon via MLX-native ports or the MPS PyTorch backend. Q4 quantized models fit on 24 GB machines; full precision needs 64 GB+. Pin PyTorch to 2.4.1 for MPS (2.5+ has Conv3d regression). See references/apple-silicon.md for setup, patches, and memory tables.

Cloud API

Use hosted APIs when local hardware is insufficient: LTX API ($0.04-0.08/sec, official), fal.ai (~~$0.20/video), or Replicate (~~$0.10/run). For self-hosted cloud, RunPod H100 runs at ~$2.49/hr and Vast.ai from ~$0.50/hr. See references/cloud-api.md for Python examples and cost comparison.

Low-VRAM CUDA (12 GB)

Run LTX-2 on 12 GB GPUs (RTX 4070, 3060 12GB) by combining three techniques:

FP8 quantization (--quantization fp8-cast) -- reduces model memory ~40%
Distilled model (ltx-2.3-22b-distilled) -- 8 fixed steps instead of 40, lower peak VRAM
Single-stage pipeline (TI2VidOneStagePipeline or DistilledPipeline) -- avoids loading the spatial upscaler

python -m ltx_pipelines.run \
  --config configs/ltx-2.3-22b-distilled-1stage.yaml \
  --quantization fp8-cast \
  --prompt "A cat sitting on a windowsill watching rain" \
  --height 480 --width 832 \
  --num_frames 33 \
  --output output.mp4

At 480x832 with 33 frames, this fits comfortably in 12 GB. Increase resolution or frame count only if monitoring shows headroom. Adding xFormers (pip install xformers) saves another ~15% VRAM when available.

LoRA & Fine-Tuning

Train custom LoRAs in <1 hour for motion, style, or likeness:

# See packages/ltx-trainer/README.md for full training guide
cd packages/ltx-trainer
python train.py --config configs/lora_training.yaml \
  --data_dir /path/to/videos \
  --output_dir /path/to/output

Specialized control LoRAs available:

Union IC-LoRA (depth + Canny edge)
Motion tracking, Detailer, Pose control, Camera movement LoRAs

ComfyUI Integration

See references/comfyui-setup.md for full ComfyUI workflow setup.

Quick start: ComfyUI Manager → Install Custom Nodes → search "LTXVideo" → Install. Requires: CUDA GPU with 32GB+ VRAM, 100GB+ disk space.

Troubleshooting

CUDA OOM: Add --quantization fp8-cast, use distilled model, or reduce resolution
Slow generation: Use distilled pipeline (8 steps vs 40), enable xFormers
Poor quality: Write longer prompts (~200 words), enable enhance_prompt, use two-stage pipeline
Audio issues: Audio quality lower without speech; provide audio conditioning when possible
Frame count error: Ensure frames = 8n+1 (33, 65, 97, 129...)
Resolution error: Ensure width/height divisible by 32