name: content-engine-opencaptions category: broomva version: "1.0.0" kind: python slug: opencaptions hook: post-production description: "OpenCaptions extension for Content Engine — adds intent-driven CWI (Caption With Intent) captions to the post-production pipeline. Hooks into the grade → caption → final stage. Generates captions that understand video intent (pitch, volume, emotion, emphasis) and style themselves accordingly with variable font weight, size, and color. Uses the OpenCaptions CLI or MCP server. Triggers on: 'add captions', 'opencaptions', 'CWI captions', 'intent captions'." inputs: - "graded video asset from post-production pipeline" - "compiled/brands/{brand_slug}.md (optional)" - "scene brief (passed at runtime)" outputs: - "extensions/opencaptions/output/{asset_id}.cwi.json" - "extensions/opencaptions/output/{asset_id}.vtt" - "extensions/opencaptions/output/{asset_id}-report.json" - "extensions/opencaptions/output/{asset_id}-captioned.mp4 (Remotion render)" dependencies: []
OpenCaptions Extension
Hook CWI (Caption With Intent) captions into the Content Engine post-production pipeline. Instead of flat, static subtitles, CWI encodes the felt experience of speech — pitch drives font weight, volume drives font size, emphasis triggers glow animations, and speaker identity maps to WCAG AA colors. The result is captions that communicate how something was said, not just what was said.
CWI is the Oscar-winning (2025) captioning standard by FCB Chicago and Chicago Hearing Society. OpenCaptions is the first open-source programmatic toolchain for it.
Pipeline Hook Point
Stage: Post-Production (Stage 3) Position: After upscale and color grade, before final export.
Post-Production Chain:
1. Upscale (Real-ESRGAN / Topaz) --> Higher resolution
2. Color Grade (Brand LUT / ffmpeg) --> On-brand color
3. OpenCaptions (CWI) --> Intent-driven captions
4. Final Export (ffmpeg) --> Production-ready file
The extension receives a graded video file and produces:
- A CWI document (
.cwi.json) with word-level intent annotations - A WebVTT fallback (
.vtt) for platforms that need standard subtitles - A validation report (JSON) with pillar scores
- A captioned video (
.mp4) rendered via Remotion or ffmpeg burn-in
How It Works
Graded Video + Scene Brief
|
├── [1] Transcribe (faster-whisper) --> word-level timestamps
├── [2] Diarize (pyannote-audio) --> speaker attribution
├── [3] Extract Intent --> pitch, volume, emotion, emphasis per utterance
| V1: parselmouth + FER (audio+vision)
| V2: V-JEPA2 world model
| V3: TRIBE v2 neural prediction
├── [4] Map Intent --> CWI --> weight, size, emphasis, animation per word
| V1: RulesMapper (pitch-->weight, volume-->size)
| V2: LearnedMapper (trained on correction data)
| V3: NeuralMapper (brain-region activations drive styling)
├── [5] Validate (12 rules) --> Attribution + Synchronization + Intonation scores
└── [6] Render --> Remotion overlay OR ffmpeg burn-in
Step 1-2: Transcription + Diarization
The pipeline calls the OpenCaptions CLI or MCP server, which runs faster-whisper for word-level timestamps and pyannote-audio for speaker detection. These stages produce a DiarizedTranscript with each word tagged to a speaker.
Step 3: Intent Extraction
The intent extractor analyzes audio and video to determine the felt register of each utterance:
- Vocal signals: Pitch (Hz), volume (dB), speech rate (WPM), pause timing
- Affect signals: Valence (-1 to 1), arousal (0 to 1), dominant emotion
- Semantic signals: Sarcasm probability, emphasis words, rhetorical devices
These signals form an IntentFrame for each utterance. Individual words that deviate from the utterance baseline receive WordIntent overrides.
Step 4: Intent Mapping
The mapper converts intent signals to CWI visual parameters:
| Intent Signal | CWI Property | Mapping (V1 RulesMapper) |
|---|---|---|
| Pitch (normalized 0-1) | weight (100-900) |
Linear interpolation: 100 + pitch_norm * 800 |
| Volume (normalized 0-1) | size (0.7-1.5) |
Linear interpolation: 0.8 + volume_norm * 0.55 |
| Semantic emphasis + high volume | emphasis (boolean) |
Word in emphasis_words OR volume > p90 |
| Speaker identity | Speaker color |
Assigned from 12-color WCAG AA palette |
| Whisper (low volume) | weight < 300 |
Override: weight clamped to 100-250 |
| Shout (high volume) | size > 1.3 + emphasis |
Override: size 1.35-1.5, emphasis true |
Step 5: Validation
The @opencaptions/spec validator checks the CWI document against 12 rules across three pillars:
Attribution (3 rules):
- ATT_001: Every caption has a speaker in the cast
- ATT_002: Speakers have unique colors
- ATT_003: Colors meet WCAG AA contrast (4.5:1 against #1a1a1a)
Synchronization (5 rules):
- SYN_001: All words have timestamps
- SYN_002: Timestamps monotonically increasing
- SYN_003: Caption events do not overlap
- SYN_004: Animation duration 600ms per spec
- FCC_001: No gaps > 3s during speech
- FCC_002: Max 42 chars per line (FCC compliant)
Intonation (3 rules):
- INT_001: Weight in valid range (100-900)
- INT_002: Size in valid range (0.7-1.5)
- INT_003: > 20% of words have non-default weight (captions actually vary)
Each pillar scores 0-100. A CWI document passes if all three pillars score >= 80.
Step 6: Rendering
Two rendering paths:
Remotion path (preferred for Content Engine): CWI words become React components in a <CaptionOverlay> composition. Each word appears at its timestamp with spring() entrance animation, speaker color, variable font weight, and emphasis glow. See references/cwi-remotion-bridge.md for the full mapping.
ffmpeg burn-in (fallback): Generate an ASS subtitle file from the CWI document and burn it into the video via ffmpeg -vf ass=captions.ass. Loses per-word animation but preserves speaker colors and font weight variation.
CLI Usage
# Generate CWI captions from a video
npx opencaptions generate video.mp4 --output captions.cwi.json
# Generate with TRIBE v2 neural backend
npx opencaptions generate video.mp4 --backend tribe --output captions.cwi.json
# Validate a CWI document
npx opencaptions validate captions.cwi.json
# Preview captions in the terminal (ANSI-colored)
npx opencaptions preview captions.cwi.json
# Export to WebVTT
npx opencaptions export captions.cwi.json --format webvtt
# Export to After Effects ExtendScript
npx opencaptions export captions.cwi.json --format ae-json
# Export to Premiere Pro XML
npx opencaptions export captions.cwi.json --format premiere-xml
# Check system dependencies
npx opencaptions doctor
# Install Python dependencies
npx opencaptions setup
MCP Server Usage
The OpenCaptions MCP server exposes four tools for agent integration:
| Tool | Description |
|---|---|
generate_captions |
Run the full pipeline on a video file |
validate_captions |
Validate a CWI document against 12 rules |
preview_captions |
Terminal-formatted preview at a given time |
export_captions |
Export to WebVTT format |
To use via Claude Code, add the MCP server to your settings:
{
"mcpServers": {
"opencaptions": {
"command": "npx",
"args": ["@opencaptions/mcp"]
}
}
}
Then invoke in conversation: "Generate CWI captions for video.mp4" triggers generate_captions.
compose-video.py Integration
The Content Engine's compose-video.py accepts a --captions flag:
python3 scripts/compose-video.py storyboard.md --brand arcan-studio --captions
When --captions is passed, after all clips are generated and stitched, the pipeline:
- Runs
npx opencaptions generate {final-video}.mp4 --output {output-dir}/captions.cwi.json - Validates the CWI document (warns if score < 80 but does not block)
- If
--remotionis also passed, injects the CWI JSON into the Remotion composition as<CaptionOverlay>props - Otherwise, generates WebVTT as a sidecar file alongside the final video
If OpenCaptions is not installed (npx opencaptions doctor fails), the pipeline degrades gracefully:
- Falls back to scene brief text overlay (static text from the storyboard
## Shotdescriptions) - Logs a warning: "OpenCaptions not available, using scene brief text overlay"
- Still produces a
.vttfile with timing derived from clip boundaries
Caption Styles
Three built-in caption style presets, selectable via configuration:
1. Word-by-Word (brainrot)
High-retention style: each word appears individually, centered, maximum size variation. Used for social/TikTok/Reels content optimized for dopamine hooks.
- One word visible at a time
- Large base size (64px at 1080p)
- Extreme weight variation (100-900 full range)
- Emphasis words scale to 1.5x with glow
- Background: none (words float over video)
- Font: Roboto Flex variable
2. Narrative (editorial)
Full-sentence display with per-word styling. Used for long-form content, documentaries, blog video embeds.
- Full caption event visible, words styled inline
- Medium base size (48px at 1080p)
- Moderate weight variation (300-700)
- Emphasis words get subtle size bump (1.15x)
- Background: semi-transparent pill (rgba(0,0,0,0.6))
- Font: Roboto Flex variable
3. Minimal (lower-third)
Clean lower-third presentation. Speaker name + text. Minimal visual distraction. Used for interviews, talking heads, conference recordings.
- Speaker name label in speaker color, caption text in white
- Small base size (36px at 1080p)
- Weight variation only for emphasis (400 default, 700 emphasis)
- No size variation
- Background: gradient bar (bottom 15% of frame)
- Font: system sans-serif
Data Contract
Reads (immutable)
- Graded video asset from prior post-production step
knowledge/compiled/brands/{brand_slug}.md-- for caption color palette and typography preferences- Scene brief (runtime parameter) -- for context on intended emotional tone
Writes (own namespace only)
extensions/opencaptions/output/{asset_id}.cwi.json-- CWI documentextensions/opencaptions/output/{asset_id}.vtt-- WebVTT fallbackextensions/opencaptions/output/{asset_id}-report.json-- validation reportextensions/opencaptions/output/{asset_id}-captioned.mp4-- rendered video (Remotion path only)
Configuration
opencaptions:
# Caption style preset
style: "narrative" # word-by-word | narrative | minimal
# Visual parameters
font_family: "Roboto Flex" # Variable font for weight axis
base_font_size: 48 # Pixels at 1080p
position: "bottom" # top | center | bottom
max_chars_per_line: 42 # FCC compliance
background: "pill" # pill | shadow | gradient | none
background_opacity: 0.6 # 0.0 - 1.0
# Pipeline backend
backend: "av" # av (V1) | jepa (V2) | tribe (V3)
language: "en" # ISO 639-1 hint for transcription
# Speaker overrides (optional, auto-detected by default)
speakers:
- name: "Narrator"
color: "#6B8AFF"
- name: "Guest"
color: "#FF6B6B"
# Rendering
render_mode: "remotion" # remotion | ffmpeg | sidecar-only
output_format: "mp4" # mp4 | webm
fps: 30 # Match source video
# Validation
min_score: 80 # Minimum overall score to proceed
fail_on_low_score: false # If true, block pipeline on low score
# TRIBE v2 neural mapping (V3 backend only)
neural_mapping: false # Enable predicted neural response styling
Graceful Degradation
The extension never blocks the Content Engine pipeline. Degradation ladder:
| Condition | Behavior |
|---|---|
| OpenCaptions CLI installed + all Python deps | Full CWI pipeline (transcription, diarization, intent, mapping) |
| OpenCaptions CLI installed, no Python deps | Mock backends with placeholder captions + validation |
| OpenCaptions CLI not installed | Scene brief text overlay as static captions at clip boundaries |
| Video has no audio track | Skip caption generation, log info message |
| Validation score < 80 | Warn but proceed (unless fail_on_low_score: true) |
| Remotion not available | Fall back to ffmpeg ASS burn-in |
| ffmpeg not available | Output CWI JSON + WebVTT as sidecar only (no rendered video) |
Related
- OpenCaptions project:
~/broomva/apps/opencaptions/ - CWI-Remotion bridge:
references/cwi-remotion-bridge.md - Integration guide:
references/integration-guide.md - Content Engine SKILL.md:
../../SKILL.md - Extension framework:
../README.md