name: thumbnail-extraction description: "Extracts the most interesting frames from video files for thumbnail compositing. Detects faces, expressions, smiles, and presentation slides. Outputs full frames, face crops, and transparent cutouts. Use when asked to extract thumbnails, find interesting frames, grab screenshots from video, or create thumbnail candidates from recordings." version: 0.1.0
Video Thumbnail Extraction
Overview
Automatically scan a local MP4 video (or YouTube URL via yt-dlp) and extract the 4 most visually interesting frames — prioritizing expressive faces (laughing, shocked, smiling) and engaging presentation slides. Outputs full frames, face crops, and background-removed transparent PNGs ready for compositing.
When to Use
- Before creating YouTube thumbnails (feeds into
youtube-thumbnailsskill) - When you need the best screenshot from a long video recording
- When compositing a thumbnail and need transparent guest cutouts
- Processing Zoom gallery recordings, interviews, or presentations
Dependencies
Python packages (install once)
# In sandbox (Cowork VM):
pip install opencv-python scenedetect deepface pillow numpy --break-system-packages
# On host Mac (for background removal — sandbox can't download the model):
pip3 install 'rembg[cpu]' pillow --break-system-packages
System tools
ffmpeg(usually pre-installed)python3(3.10+)yt-dlp(optional, for YouTube URLs):pip install yt-dlp --break-system-packages
Model downloads (first run)
- DeepFace expression model (~1MB): Downloads automatically on first use. If blocked by proxy, expression detection falls back to OpenCV smile cascade (still effective).
- rembg u2net model (~176MB): Downloads on first use. Must run on host Mac if sandbox blocks GitHub releases.
Pipeline Architecture
Two-Pass Design (memory-efficient)
Pass 1 — Quick Scan (OpenCV only, no deep learning)
- Sample a frame every 10 seconds across the video
- Skip first/last 60 seconds (intro/outro)
- For each frame:
- Detect faces via Haar cascade (fast, no GPU needed)
- Detect smiles within face regions
- Compute visual variance (proxy for "interesting" content)
- Detect presentation slides (high edge density + low color saturation)
- Score each frame based on: face count, smile count, smile size, visual variance
- Select top 12 diverse candidates using quadrant system: divide video into N time segments, pick best from each → ensures temporal spread
Pass 2 — Deep Analysis (DeepFace, only on top 12 candidates)
- Re-read only the selected frames from video
- Run DeepFace emotion detection (happy, surprise, fear, sad, angry, disgust, neutral)
- Weight emotions by thumbnail value: happy > surprise > fear > angry > sad > neutral
- Combine Pass 1 score with expression score
- Final selection: divide candidates into N time segments, pick best from each → guarantees spread across the full video
Pass 3 — Output (rembg, only on final 4 frames)
- Save full frame as JPG (95% quality)
- Crop largest detected face with generous padding (0.5x)
- Run background removal on face crop → transparent PNG
- Generate manifest JSON with metadata
Scoring Heuristics
| Signal | Weight | Notes |
|---|---|---|
| Face detected | +2.0 per face (cap 3) | Gallery views score high |
| Smile detected | +3.0 per smile | Cascade-based, no model needed |
| Smile size ratio | +5.0 × ratio | Bigger smiles = more expressive |
| Multi-person shot | +1.0 bonus | 2+ faces = engaging |
| Happy expression | +2.0 bonus (Pass 2) | Best for thumbnails |
| Surprise expression | +2.0 bonus (Pass 2) | Eye-catching |
| Fear/angry expression | +1.0 bonus (Pass 2) | "Shocked" reactions |
| Visual variance | +0.0–1.5 | Normalized by frame complexity |
| Presentation slide | baseline 1.5 | Useful for slide screenshots |
Temporal Diversity Algorithm
The pipeline enforces temporal spread to avoid clustering picks in one segment:
- Quadrant selection (Pass 1 → Pass 2): Divide video duration into N segments, pick the highest-scoring frame from each segment
- Segment-forced selection (Pass 2 → Final): Divide top candidates into
top_nequal time segments, pick best from each - Fallback: If any segment is empty, fill from overall top scores
This ensures a 76-minute video yields picks from different parts (e.g., 1:00, 2:10, 21:50, 48:50) rather than clustering in the most face-heavy section.
Usage
Command Line
python3 thumbnail_extractor.py <video_path> [output_dir] [top_n]
Arguments:
video_path— Path to MP4 file (required)output_dir— Where to save outputs (default:~/Downloads/thumb_candidates)top_n— Number of candidates to extract (default: 4)
Examples:
# Basic — extract 4 best frames
python3 thumbnail_extractor.py "GMT20260130-210038_Recording_gallery_2380x1544.mp4"
# Custom output dir and count
python3 thumbnail_extractor.py recording.mp4 ./thumbs 6
# YouTube video (download first)
yt-dlp -o "video.mp4" "https://youtube.com/watch?v=..."
python3 thumbnail_extractor.py video.mp4
Output Files
For each candidate, the pipeline generates:
| File | Format | Description |
|---|---|---|
{name}_{n}_{emotion}_{timestamp}_full.jpg |
JPG 95% | Full video frame |
{name}_{n}_{emotion}_{timestamp}_face.jpg |
JPG 95% | Cropped face with padding |
{name}_{n}_{emotion}_{timestamp}_transparent.png |
PNG w/ alpha | Background-removed face cutout |
{name}_manifest.json |
JSON | Metadata for all candidates |
Naming example: GMT20260130-210038_3_happy_2-10_full.jpg
GMT20260130-210038— video name (truncated for Zoom recordings)3— candidate number (ranked by score)happy— detected dominant emotion2-10— timestamp (2 minutes 10 seconds)full/face/transparent— file type
Manifest JSON Structure
{
"video": "GMT20260130-210038",
"candidates": [
{
"index": 1,
"timestamp": "2:10",
"timestamp_sec": 130.0,
"emotion": "happy",
"emotion_score": 0.85,
"combined_score": 12.4,
"num_faces": 3,
"is_presentation": false,
"files": {
"full": "..._full.jpg",
"face_crop": "..._face.jpg",
"transparent": "..._transparent.png"
}
}
]
}
Background Removal (Separate Step)
Since the Cowork sandbox may block model downloads, run rembg on the host Mac:
# On host Mac (via osascript or Terminal)
cd ~/Downloads/thumb_candidates
python3 -c "
from rembg import remove
from PIL import Image
import glob, os
for f in sorted(glob.glob('*_face.jpg')):
out = f.replace('_face.jpg', '_transparent.png')
print(f'Processing {f}...')
img = Image.open(f)
result = remove(img)
result.save(out)
print(f' -> {out} ({os.path.getsize(out)//1024}KB)')
"
This takes 10-15 seconds per image on Apple Silicon. The u2net model downloads automatically on first run (176MB).
Integration with Other Skills
Feeding into youtube-thumbnails
After extraction, use the transparent PNGs as compositing elements:
- Pick the best face cutout from the candidates
- Use it in the Gemini thumbnail prompt as a reference, or
- Composite it manually onto the generated Gemini background using ImageMagick:
# Composite transparent face onto Gemini-generated background
convert gemini_background.jpg transparent_face.png \
-gravity southeast -geometry +50+50 \
-composite final_thumbnail.jpg
Pipeline flow
[Video MP4] → thumbnail-extraction → [face crops + transparent PNGs]
↓
youtube-thumbnails → [Gemini background]
↓
[Composite final thumbnail]
Tuning Parameters
Edit these at the top of thumbnail_extractor.py:
| Parameter | Default | Effect |
|---|---|---|
SAMPLE_INTERVAL_SEC |
10 | Lower = more frames scanned, slower |
ANALYSIS_SCALE |
0.5 | Lower = faster face detection, less accurate |
SCENE_THRESHOLD |
27.0 | Lower = more scene boundaries detected |
MIN_FACE_CONFIDENCE |
0.80 | Higher = fewer false positive faces |
top_n |
4 | Number of final candidates |
For short videos (<10 min), consider SAMPLE_INTERVAL_SEC=5 for finer coverage.
Troubleshooting
- OOM / killed process: The v2 pipeline never holds more than 1 frame in memory during Pass 1. If still OOM, increase
SAMPLE_INTERVAL_SECto 15-20. - All emotions "neutral": DeepFace model couldn't download (proxy block). Pass 1 smile detection still works — look at the
num_smilesfield in the manifest. - Face crop is wrong person: The pipeline picks the largest detected face. In screenshare mode, this may be a profile picture rather than a webcam face. Check the full frame to verify.
- No faces detected: Zoom gallery recordings with "shared screen with gallery view" work best. Solo speaker view may have the face too close/large for the cascade detector — try lowering
ANALYSIS_SCALEto 0.3. - Background removal artifacts: rembg's u2net can produce halos around hair. For cleaner results, try the
u2net_human_segmodel:remove(img, model_name='u2net_human_seg'). - Slow processing: A 76-minute video takes ~2 minutes for Pass 1, ~15 seconds for Pass 2 (12 candidates), and ~60 seconds for bg removal (4 faces). Most time is in Pass 1 scanning.