name: clipcannon
description: >
AI-powered video understanding, editing, voice synthesis, and real-time voice agent via 51 MCP
tools. 22-stage analysis pipeline with 5 embedding spaces (SigLIP, Nomic, Wav2Vec2, WavLM,
ECAPA-TDNN). Declarative EDL editing with adaptive captions, face-tracking crop, split-screen,
PIP, canvas compositing, motion effects. Voice cloning (Qwen3-TTS 1.7B), lip-sync avatars
(LatentSync 1.6), AI music (ACE-Step), text-to-video generation. Voice Agent ("Jarvis") with
wake-word ASR + local LLM. 7 platform profiles (TikTok, Reels, Shorts, YouTube, YouTube 4K,
Facebook, LinkedIn). Tamper-evident SHA-256 provenance chain. 100% local GPU. Use when the user
says "edit this video", "find the best moments", "create a highlight reel", "add captions",
"clone voice", "lip sync", "render for TikTok", "talk to Jarvis".
version: 0.1.0
author: ChrisRoyse
repo: https://github.com/JLMA-Agentic-Ai/jlma-clipcannon
mcp_server: true
protocol: stdio
entry_point: clipcannon serve
tags:
- video
- editing
- voice-cloning
- lip-sync
- transcription
- voice-agent
- ai-music
- text-to-video
- mcp
- gpu
env_vars:
- CLIPCANNON_DATA_DIR
- CLIPCANNON_GPU_DEVICE
- CLIPCANNON_NVENC
ClipCannon -- AI Video Editor via MCP
Turns Claude into a professional video editor. Ingest video, run a 22-stage AI analysis DAG, then use 51 MCP tools across 12 categories to find moments, create edits, render platform-ready clips, generate music, clone voices, produce lip-synced talking-head videos, and converse via a real-time voice agent. 14 ML models, 5 embedding spaces, 626 tests. Everything runs locally on GPU.
When to Use This Skill
- Video editing: "edit this video", "cut the boring parts", "create a highlight reel"
- Content discovery: "find the most emotional moments", "find where they talk about X"
- Platform rendering: "render for TikTok", "create Instagram Reels version"
- Voice: "clone this speaker's voice", "generate narration", "lip sync"
- Audio: "add background music", "generate sound effects", "compose a score"
- Analysis: "transcribe this video", "who are the speakers?", "scene breakdown"
- Text-to-video: "generate a video from this script" (end-to-end voice + lip-sync)
- Voice Agent: "talk to Jarvis", real-time conversational AI with wake-word activation
When Not to Use
- For simple video format conversion -- use
ffmpeg-processing
- For AI image generation -- use
comfyui or art
- For agentic video production from scratch -- use
open-montage
- For meeting transcription -- use
echoloop
- For audio-only processing -- use
ffmpeg-processing
Architecture
+-----------------+
| AI Assistant | (Claude, etc.)
| (MCP Client) |
+--------+--------+
| MCP Protocol (stdio)
+--------v--------+
| ClipCannon |
| MCP Server | 51 tools / 12 categories
+--------+--------+
|
+------------------+------------------+
| | |
+-------v------+ +-------v------+ +-------v-------+
| Analysis | | Editing | | Voice/Avatar |
| Pipeline | | + Rendering | | Engine |
| (22 stages) | | (FFmpeg + | | (Qwen3-TTS + |
| | | NVENC) | | LatentSync) |
| 5 embedding | | 7 profiles | | ECAPA-TDNN |
| spaces | | ASS captions | | verification |
| sqlite-vec | | Smart crop | | Resemble |
| | | Canvas comp | | Enhance |
+--------------+ +--------------+ +---------------+
|
+--------v--------+
| SQLite + vec | Per-project DB
| (analysis.db) | 4 vector tables
+-----------------+ 31 core tables
Separate processes:
+------------------+ +------------------+ +------------------+
| License Server | | Dashboard | | Voice Agent |
| (port 3100) | | (port 3200) | | ("Jarvis") |
| HMAC billing | | Web UI | | Wake word + ASR |
| Stripe webhooks | | Projects/Credits | | + LLM + TTS |
+------------------+ +------------------+ +------------------+
MCP Tools (51 across 12 categories)
Project (5)
| Tool |
Description |
clipcannon_project_create |
Create a new project |
clipcannon_project_open |
Open an existing project |
clipcannon_project_list |
List all projects |
clipcannon_project_status |
Get project analysis status |
clipcannon_project_delete |
Delete a project |
Understanding (4)
| Tool |
Description |
clipcannon_ingest |
Ingest video, run 22-stage analysis pipeline |
clipcannon_get_transcript |
Get full transcript with timestamps |
clipcannon_get_frame |
Extract specific frame as image |
clipcannon_search_content |
Semantic search across all 5 embedding spaces |
Discovery (4)
| Tool |
Description |
clipcannon_find_best_moments |
AI-ranked highlight moments |
clipcannon_find_cut_points |
Optimal cut points for editing |
clipcannon_get_narrative_flow |
Narrative structure and flow analysis |
clipcannon_find_safe_cuts |
Find edit-safe cut points |
Editing (11)
| Tool |
Description |
clipcannon_create_edit |
Create declarative EDL edit |
clipcannon_modify_edit |
Modify existing edit |
clipcannon_auto_trim |
Auto-trim dead space |
clipcannon_color_adjust |
Colour correction |
clipcannon_add_motion |
Motion effects (ken burns, zoom, pan) |
clipcannon_add_overlay |
Add overlay/watermark |
clipcannon_apply_feedback |
Apply review feedback to edit |
clipcannon_branch_edit |
Branch edit for A/B versions |
clipcannon_edit_history |
View edit revision history |
clipcannon_revert_edit |
Revert to previous edit version |
| (adaptive captions, face-tracking crop, split-screen, PIP, canvas compositing) |
|
Rendering (8)
| Tool |
Description |
clipcannon_render |
Render final output (7 platform profiles) |
clipcannon_preview_clip |
Preview at 540p (free, no credits) |
clipcannon_preview_layout |
Preview layout/composition |
clipcannon_inspect_render |
Inspect render output quality |
clipcannon_get_scene_map |
Get scene map with timestamps |
clipcannon_get_editing_context |
Get editing context for a segment |
clipcannon_analyze_frame |
Analyse specific frame |
| (NVENC GPU acceleration, 7 profiles: TikTok, Reels, Shorts, YouTube, YouTube 4K, Facebook, LinkedIn) |
|
Audio (4)
| Tool |
Description |
clipcannon_generate_music |
ACE-Step diffusion music generation |
clipcannon_compose_midi |
6 MIDI presets with FluidSynth |
clipcannon_generate_sfx |
9 DSP sound effects |
clipcannon_audio_cleanup |
Noise reduction, normalisation, speech-aware ducking |
Voice (4)
| Tool |
Description |
clipcannon_prepare_voice_data |
Prepare voice data for cloning |
clipcannon_voice_profiles |
List/manage voice profiles |
clipcannon_speak |
Generate speech with cloned voice (Qwen3-TTS 1.7B) |
clipcannon_speak_optimized |
Best-of-N optimised speech with verification |
Avatar (1)
| Tool |
Description |
clipcannon_lip_sync |
LatentSync 1.6 (ByteDance) diffusion lip-sync avatar |
Video Gen (1)
| Tool |
Description |
clipcannon_generate_video |
End-to-end text to voice to lip-sync video |
Billing (4)
| Tool |
Description |
clipcannon_credits_balance |
Check credit balance |
clipcannon_credits_history |
Transaction history |
clipcannon_credits_estimate |
Estimate cost for operation |
clipcannon_spending_limit |
Set/view spending limits |
Disk (2)
| Tool |
Description |
clipcannon_disk_status |
Disk usage per project |
clipcannon_disk_cleanup |
Clean up old renders/cache |
Config (3)
| Tool |
Description |
clipcannon_config_get |
Get config value |
clipcannon_config_set |
Set config value |
clipcannon_config_list |
List all config settings |
Voice Agent ("Jarvis")
Real-time conversational AI with wake-word activation. All local, zero cloud.
# Recommended: Pipecat + Ollama (all local)
python -m voiceagent talk --voice boris
# WebSocket server for remote clients
python -m voiceagent serve --port 8765
Lifecycle: DORMANT (CPU only, wake word listening) -> LOADING (~10-20s) -> ACTIVE (full conversation, ~30 GB VRAM) -> DORMANT
Components: Whisper Large v3 ASR, Qwen3-14B FP8 local LLM (120 tok/s), faster-qwen3-tts 0.6B (500ms TTFB), Silero VAD, "Hey Jarvis" wake word.
Pauses other GPU workers on activation and resumes them on deactivation to share VRAM on a single GPU.
14 ML Models
| Model |
Provider |
Purpose |
VRAM |
| SigLIP-SO400M |
Google |
Visual embeddings + shot classification |
~2 GB |
| Nomic Embed v1.5 |
Nomic AI |
Semantic text embeddings |
~1 GB |
| Wav2Vec2-large |
Meta |
Emotion embeddings |
~2 GB |
| WavLM-base-plus-sv |
Microsoft |
Speaker diarisation |
~1 GB |
| WhisperX Large v3 |
OpenAI |
Speech-to-text |
~3 GB |
| HTDemucs v4 |
Meta |
Audio source separation |
~2 GB |
| Qwen3-8B |
Qwen |
Narrative analysis |
~8 GB |
| Qwen3-TTS 1.7B |
Qwen |
Voice cloning (video) |
~4 GB |
| faster-qwen3-tts 0.6B |
Qwen |
Voice Agent (real-time) |
~4 GB |
| LatentSync 1.6 |
ByteDance |
Lip-sync avatars |
~4 GB |
| ACE-Step v1.5 |
ACE |
AI music generation |
~4 GB |
| SenseVoice Small |
FunASR |
Reaction detection |
~1 GB |
| Silero VAD |
Silero |
Voice activity detection |
CPU |
| PaddleOCR v5 |
PaddlePaddle |
On-screen text detection |
~1 GB |
Models loaded on-demand with LRU eviction. GPUs with >16 GB run models concurrently; smaller GPUs load sequentially. Auto-detects GPU precision: Blackwell (nvfp4), Ada Lovelace (int8), Ampere (int8), Turing (fp16), CPU (fp32).
5 Embedding Spaces
| Space |
Model |
Dimensions |
Use |
| Visual |
SigLIP-SO400M |
1152 |
Scene similarity, visual search |
| Semantic |
Nomic Embed v1.5 |
768 |
Transcript/meaning search |
| Emotion |
Wav2Vec2-large |
1024 |
Emotional moment detection |
| Speaker |
WavLM-base-plus-sv |
512 |
Speaker diarisation |
| Voice ID |
ECAPA-TDNN |
2048 |
Voice cloning verification |
All stored in sqlite-vec for local KNN search. Per-project SQLite database with 31 core tables + 4 vector tables.
Credit System
| Operation |
Credits |
| Analyze (ingest) |
10 |
| Render |
2 |
| Preview |
0 |
| Metadata |
1 |
Dev mode starts with 100 credits. Production billing via Stripe webhooks. HMAC-signed balance with spending limits and transaction history.
Setup
# Install (requires Python 3.12+, CUDA GPU, 8+ GB VRAM minimum, 24+ GB recommended)
pip install clipcannon
# Or from source
cd /tmp && git clone https://github.com/JLMA-Agentic-Ai/jlma-clipcannon.git
cd jlma-clipcannon && pip install -e .
# Install ML dependencies
pip install -e ".[ml]"
# Install Phase 2 audio/video
pip install -e ".[phase2]"
# Start MCP server
clipcannon serve
# Docker
cd config && docker compose up -d
# Dashboard: http://localhost:3200 | License server: http://localhost:3100
Environment Variables
| Variable |
Default |
Description |
CLIPCANNON_DATA_DIR |
~/.clipcannon |
Data/model storage directory |
CLIPCANNON_GPU_DEVICE |
cuda:0 |
GPU device for inference |
CLIPCANNON_NVENC |
true |
Use NVENC GPU encoding for renders |
Integration with Other Skills
| Skill |
Relationship |
open-montage |
OpenMontage produces from scratch; ClipCannon edits existing footage |
ffmpeg-processing |
ClipCannon uses FFmpeg internally; the skill is for standalone conversions |
echoloop |
EchoLoop captures meeting audio; ClipCannon edits the resulting video |
notebooklm |
Feed video transcripts as NotebookLM sources for study materials |
art |
Generate thumbnails/overlays via Nano Banana 2 |
comfyui |
Generate AI video segments to splice into edits |
Provenance
SHA-256 hash chain links every pipeline operation. Every output is traceable to its source. Tamper-evident provenance chain stored in per-project SQLite database.
Attribution
ClipCannon by Chris Royse. BSL 1.1 License. Repo: https://github.com/JLMA-Agentic-Ai/jlma-clipcannon