thumbnail-extraction

name: thumbnail-extraction
description: "Extracts the most interesting frames from video files for thumbnail compositing. Detects faces, expressions, smiles, and presentation slides. Outputs full frames, face crops, and transparent cutouts. Use when asked to extract thumbnails, find interesting frames, grab screenshots from video, or create thumbnail candidates from recordings."
version: 0.1.0

Video Thumbnail Extraction

Overview

Automatically scan a local MP4 video (or YouTube URL via yt-dlp) and extract the 4 most visually interesting frames — prioritizing expressive faces (laughing, shocked, smiling) and engaging presentation slides. Outputs full frames, face crops, and background-removed transparent PNGs ready for compositing.

When to Use

Before creating YouTube thumbnails (feeds into youtube-thumbnails skill)
When you need the best screenshot from a long video recording
When compositing a thumbnail and need transparent guest cutouts
Processing Zoom gallery recordings, interviews, or presentations

Dependencies

Python packages (install once)

# In sandbox (Cowork VM):
pip install opencv-python scenedetect deepface pillow numpy --break-system-packages

# On host Mac (for background removal — sandbox can't download the model):
pip3 install 'rembg[cpu]' pillow --break-system-packages

System tools

ffmpeg (usually pre-installed)
python3 (3.10+)
yt-dlp (optional, for YouTube URLs): pip install yt-dlp --break-system-packages

Model downloads (first run)

DeepFace expression model (~1MB): Downloads automatically on first use. If blocked by proxy, expression detection falls back to OpenCV smile cascade (still effective).
rembg u2net model (~176MB): Downloads on first use. Must run on host Mac if sandbox blocks GitHub releases.

Pipeline Architecture

Two-Pass Design (memory-efficient)

Pass 1 — Quick Scan (OpenCV only, no deep learning)

Sample a frame every 10 seconds across the video
Skip first/last 60 seconds (intro/outro)
For each frame:
- Detect faces via Haar cascade (fast, no GPU needed)
- Detect smiles within face regions
- Compute visual variance (proxy for "interesting" content)
- Detect presentation slides (high edge density + low color saturation)
Score each frame based on: face count, smile count, smile size, visual variance
Select top 12 diverse candidates using quadrant system: divide video into N time segments, pick best from each → ensures temporal spread
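
The sampling rule above (one frame every 10 seconds, skipping the first and last 60 seconds) can be sketched as follows. This is a minimal illustration, not the code in thumbnail_extractor.py: the name `sample_timestamps` and the midpoint fallback for very short clips are assumptions.

```python
def sample_timestamps(duration_sec: float,
                      interval_sec: float = 10.0,
                      skip_sec: float = 60.0) -> list:
    """Return the timestamps (in seconds) that a Pass 1 style scan would visit."""
    start = float(skip_sec)
    end = duration_sec - skip_sec
    if end <= start:
        # Assumed fallback: the clip is too short to trim intro and outro,
        # so sample only its midpoint.
        return [duration_sec / 2.0]
    # Index-based stepping avoids accumulating floating-point error.
    count = int((end - start) // interval_sec) + 1
    return [start + i * interval_sec for i in range(count)]


timestamps = sample_timestamps(76 * 60)
print(len(timestamps), timestamps[0], timestamps[-1])
```

Only one frame per timestamp needs to be decoded and scored at a time, which is what keeps memory use flat during the scan.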

Pass 2 — Deep Analysis (DeepFace, only on top 12 candidates)

Re-read only the selected frames from video
Run DeepFace emotion detection (happy, surprise, fear, sad, angry, disgust, neutral)
Weight emotions by thumbnail value: happy and surprise rank highest, then fear and angry; the Scoring Heuristics table lists no bonus for sad, disgust, or neutral
Combine Pass 1 score with expression score
Final selection: divide candidates into N time segments, pick best from each → guarantees spread across the full video
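
One way the Pass 1 score and the expression result might be combined is sketched below. It assumes the emotion scores arrive as a plain dict keyed by emotion name (similar to the per-face `emotion` dict that DeepFace returns), that the bonuses are the ones in the Scoring Heuristics table, and that a missing result is treated as neutral. `EXPRESSION_BONUS` and `combine_scores` are hypothetical names.

```python
# Bonuses copied from the Scoring Heuristics table; emotions not listed there get 0.
EXPRESSION_BONUS = {"happy": 2.0, "surprise": 2.0, "fear": 1.0, "angry": 1.0}


def combine_scores(pass1_score: float, emotions: dict) -> tuple:
    """Return (dominant_emotion, combined_score) for one candidate frame."""
    if not emotions:
        # Assumed behaviour when no expression result is available:
        # keep the Pass 1 score and label the frame neutral.
        return "neutral", pass1_score
    dominant = max(emotions, key=emotions.get)
    return dominant, pass1_score + EXPRESSION_BONUS.get(dominant, 0.0)


label, score = combine_scores(10.4, {"happy": 85.0, "neutral": 10.0, "sad": 5.0})
print(label, round(score, 2))
```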

Output Stage (rembg, only on the final 4 frames)

Save full frame as JPG (95% quality)
Crop largest detected face with generous padding (0.5x)
Run background removal on face crop → transparent PNG
Generate manifest JSON with metadata
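
The face-crop step can be sketched with plain arithmetic. The sketch assumes faces are (x, y, w, h) boxes, the form OpenCV cascade detectors return, and that "0.5x padding" means extending each side by half the face width or height. Both function names are hypothetical.

```python
def largest_face(faces: list) -> tuple:
    """Pick the face box with the largest area from (x, y, w, h) tuples."""
    return max(faces, key=lambda f: f[2] * f[3])


def padded_crop_box(face: tuple, frame_w: int, frame_h: int,
                    padding: float = 0.5) -> tuple:
    """Return (left, top, right, bottom), expanded by padding and clamped to the frame."""
    x, y, w, h = face
    pad_x = int(w * padding)
    pad_y = int(h * padding)
    left = max(0, x - pad_x)
    top = max(0, y - pad_y)
    right = min(frame_w, x + w + pad_x)
    bottom = min(frame_h, y + h + pad_y)
    return left, top, right, bottom


faces = [(100, 100, 50, 50), (1000, 200, 200, 200)]
box = padded_crop_box(largest_face(faces), frame_w=2380, frame_h=1544)
print(box)
```

With a NumPy frame, the box would be applied as `frame[top:bottom, left:right]`. Clamping matters for gallery recordings, where faces often sit near a tile edge.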

Scoring Heuristics

Signal	Weight	Notes
Face detected	+2.0 per face (cap 3)	Gallery views score high
Smile detected	+3.0 per smile	Cascade-based, no model needed
Smile size ratio	+5.0 × ratio	Bigger smiles = more expressive
Multi-person shot	+1.0 bonus	2+ faces = engaging
Happy expression	+2.0 bonus (Pass 2)	Best for thumbnails
Surprise expression	+2.0 bonus (Pass 2)	Eye-catching
Fear/angry expression	+1.0 bonus (Pass 2)	"Shocked" reactions
Visual variance	+0.0–1.5	Normalized by frame complexity
Presentation slide	baseline 1.5	Useful for slide screenshots
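
As a rough sketch, the Pass 1 rows of the table could be summed as shown below. Several details are assumptions rather than documented behaviour: that the terms are simply added, that "cap 3" limits the number of counted faces, that visual variance arrives already normalized to 0..1, and that the slide baseline acts as a minimum score. `FrameStats` and `pass1_score` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class FrameStats:
    num_faces: int
    num_smiles: int
    smile_size_ratio: float   # assumed: smile area relative to its face, 0..1
    variance_norm: float      # assumed: visual variance normalized to 0..1
    is_presentation: bool = False


def pass1_score(s: FrameStats) -> float:
    score = 2.0 * min(s.num_faces, 3)      # +2.0 per face, at most 3 faces counted
    score += 3.0 * s.num_smiles            # +3.0 per smile
    score += 5.0 * s.smile_size_ratio      # bigger smiles score higher
    if s.num_faces >= 2:
        score += 1.0                       # multi-person bonus
    score += 1.5 * min(max(s.variance_norm, 0.0), 1.0)  # contributes 0.0 to 1.5
    if s.is_presentation:
        score = max(score, 1.5)            # assumed: slide baseline is a floor
    return score


print(pass1_score(FrameStats(num_faces=3, num_smiles=2,
                             smile_size_ratio=0.2, variance_norm=0.5)))
```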

Temporal Diversity Algorithm

The pipeline enforces temporal spread to avoid clustering picks in one segment:

Quadrant selection (Pass 1 → Pass 2): Divide video duration into N segments, pick the highest-scoring frame from each segment
Segment-forced selection (Pass 2 → Final): Divide top candidates into top_n equal time segments, pick best from each
Fallback: If any segment is empty, fill from overall top scores

This ensures a 76-minute video yields picks from different parts (e.g., 1:00, 2:10, 21:50, 48:50) rather than clustering in the most face-heavy section.
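
The segment-forced selection with its fallback can be sketched as below. The sketch assumes candidates are (timestamp_sec, score) pairs and that the segments split the full video duration evenly; `pick_diverse` is a hypothetical name, not a function from thumbnail_extractor.py.

```python
def pick_diverse(candidates: list, duration_sec: float, top_n: int = 4) -> list:
    """Pick up to top_n (timestamp_sec, score) pairs spread across the video."""
    if top_n <= 0 or duration_sec <= 0:
        return []
    seg_len = duration_sec / top_n
    best = {}  # segment index -> best (timestamp_sec, score) seen so far
    for ts, score in candidates:
        seg = min(int(ts // seg_len), top_n - 1)
        if seg not in best or score > best[seg][1]:
            best[seg] = (ts, score)
    picks = list(best.values())
    if len(picks) < top_n:
        # Fallback: some segments were empty, so fill from the overall top scores.
        chosen = set(picks)
        leftovers = sorted((c for c in candidates if c not in chosen),
                           key=lambda c: c[1], reverse=True)
        picks.extend(leftovers[:top_n - len(picks)])
    return sorted(picks)  # chronological order


candidates = [(60, 5.0), (130, 12.4), (1310, 9.0), (1400, 8.0), (2930, 7.5), (4000, 3.0)]
print(pick_diverse(candidates, duration_sec=76 * 60))
```

Note that the pick at 4000 seconds wins its segment despite a low score. That is the trade-off of forcing spread: coverage is guaranteed, but a weak segment can displace a stronger frame from a crowded one.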

Usage

Command Line

python3 thumbnail_extractor.py <video_path> [output_dir] [top_n]

Arguments:

video_path — Path to MP4 file (required)
output_dir — Where to save outputs (default: ~/Downloads/thumb_candidates)
top_n — Number of candidates to extract (default: 4)

Examples:

# Basic — extract 4 best frames
python3 thumbnail_extractor.py "GMT20260130-210038_Recording_gallery_2380x1544.mp4"

# Custom output dir and count
python3 thumbnail_extractor.py recording.mp4 ./thumbs 6

# YouTube video (download first)
yt-dlp -o "video.mp4" "https://youtube.com/watch?v=..."
python3 thumbnail_extractor.py video.mp4

Output Files

For each candidate, the pipeline generates:

File	Format	Description
`{name}_{n}_{emotion}_{timestamp}_full.jpg`	JPG 95%	Full video frame
`{name}_{n}_{emotion}_{timestamp}_face.jpg`	JPG 95%	Cropped face with padding
`{name}_{n}_{emotion}_{timestamp}_transparent.png`	PNG w/ alpha	Background-removed face cutout
`{name}_manifest.json`	JSON	Metadata for all candidates

Naming example: GMT20260130-210038_3_happy_2-10_full.jpg

GMT20260130-210038 — video name (truncated for Zoom recordings)
3 — candidate number (ranked by score)
happy — detected dominant emotion
2-10 — timestamp (2 minutes 10 seconds)
full / face / transparent — file type
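
A small helper that reproduces this naming scheme might look like the sketch below. It assumes timestamps are written as minutes-seconds with zero-padded seconds and that minutes keep counting past 59; the source only shows the 2-10 case, so treat both as guesses. The function names are hypothetical.

```python
# File kinds and their extensions, per the Output Files table.
KIND_EXTENSIONS = {"full": "jpg", "face": "jpg", "transparent": "png"}


def format_timestamp(seconds: float) -> str:
    """130.0 -> '2-10' (assumed format: minutes-seconds, seconds zero-padded)."""
    total = int(seconds)
    return f"{total // 60}-{total % 60:02d}"


def candidate_filename(video_name: str, index: int, emotion: str,
                       timestamp_sec: float, kind: str) -> str:
    if kind not in KIND_EXTENSIONS:
        raise ValueError(f"unknown file kind: {kind!r}")
    stamp = format_timestamp(timestamp_sec)
    return f"{video_name}_{index}_{emotion}_{stamp}_{kind}.{KIND_EXTENSIONS[kind]}"


print(candidate_filename("GMT20260130-210038", 3, "happy", 130.0, "full"))
```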

Manifest JSON Structure

{
  "video": "GMT20260130-210038",
  "candidates": [
    {
      "index": 1,
      "timestamp": "2:10",
      "timestamp_sec": 130.0,
      "emotion": "happy",
      "emotion_score": 0.85,
      "combined_score": 12.4,
      "num_faces": 3,
      "is_presentation": false,
      "files": {
        "full": "..._full.jpg",
        "face_crop": "..._face.jpg",
        "transparent": "..._transparent.png"
      }
    }
  ]
}
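
Downstream scripts can read the manifest with the standard library. The sketch below picks the highest-scoring candidate, optionally filtered by emotion. The second candidate in the sample is made-up illustration data, the file names are left as placeholders, and `best_candidate` is a hypothetical helper.

```python
import json
from typing import Optional

# First candidate mirrors the structure above; the second is invented sample data.
SAMPLE_MANIFEST = """
{
  "video": "GMT20260130-210038",
  "candidates": [
    {"index": 1, "timestamp": "2:10", "timestamp_sec": 130.0,
     "emotion": "happy", "emotion_score": 0.85, "combined_score": 12.4,
     "num_faces": 3, "is_presentation": false,
     "files": {"full": "..._full.jpg", "face_crop": "..._face.jpg",
               "transparent": "..._transparent.png"}},
    {"index": 2, "timestamp": "21:50", "timestamp_sec": 1310.0,
     "emotion": "surprise", "emotion_score": 0.6, "combined_score": 9.0,
     "num_faces": 2, "is_presentation": false,
     "files": {"full": "..._full.jpg", "face_crop": "..._face.jpg",
               "transparent": "..._transparent.png"}}
  ]
}
"""


def best_candidate(manifest: dict, emotion: Optional[str] = None) -> Optional[dict]:
    """Return the candidate with the highest combined_score, or None if none match."""
    candidates = manifest.get("candidates", [])
    if emotion is not None:
        candidates = [c for c in candidates if c.get("emotion") == emotion]
    return max(candidates, key=lambda c: c.get("combined_score", 0.0), default=None)


manifest = json.loads(SAMPLE_MANIFEST)
best = best_candidate(manifest)
print(best["timestamp"], best["files"]["transparent"])
```

In practice the manifest would be loaded from `{name}_manifest.json` in the output directory rather than from an inline string.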

Background Removal (Separate Step)

Since the Cowork sandbox may block model downloads, run rembg on the host Mac:

# On host Mac (via osascript or Terminal)
cd ~/Downloads/thumb_candidates
python3 -c "
from rembg import remove
from PIL import Image
import glob, os

for f in sorted(glob.glob('*_face.jpg')):
    out = f.replace('_face.jpg', '_transparent.png')
    print(f'Processing {f}...')
    img = Image.open(f)
    result = remove(img)
    result.save(out)
    print(f'  -> {out} ({os.path.getsize(out)//1024}KB)')
"

This takes ~10-15 seconds per image on Apple Silicon. The u2net model downloads automatically on first run (~176MB).

Integration with Other Skills

Feeding into `youtube-thumbnails`

After extraction, use the transparent PNGs as compositing elements:

Pick the best face cutout from the candidates
Use it in the Gemini thumbnail prompt as a reference, or
Composite it manually onto the generated Gemini background using ImageMagick:

# Composite transparent face onto Gemini-generated background
convert gemini_background.jpg transparent_face.png \
  -gravity southeast -geometry +50+50 \
  -composite final_thumbnail.jpg

Pipeline flow

[Video MP4] → thumbnail-extraction → [face crops + transparent PNGs]
                                          ↓
                                   youtube-thumbnails → [Gemini background]
                                          ↓
                                   [Composite final thumbnail]

Tuning Parameters

Edit these at the top of thumbnail_extractor.py:

Parameter	Default	Effect
`SAMPLE_INTERVAL_SEC`	10	Lower = more frames scanned, slower
`ANALYSIS_SCALE`	0.5	Lower = faster face detection, less accurate
`SCENE_THRESHOLD`	27.0	Lower = more scene boundaries detected
`MIN_FACE_CONFIDENCE`	0.80	Higher = fewer false positive faces
`top_n`	4	Number of final candidates

For short videos (<10 min), consider SAMPLE_INTERVAL_SEC=5 for finer coverage.

Troubleshooting

OOM / killed process: The v2 pipeline never holds more than 1 frame in memory during Pass 1. If still OOM, increase SAMPLE_INTERVAL_SEC to 15-20.
All emotions "neutral": DeepFace model couldn't download (proxy block). Pass 1 smile detection still works; check the num_smiles field in the manifest (the sample structure above does not show this field).
Face crop is wrong person: The pipeline picks the largest detected face. In screenshare mode, this may be a profile picture rather than a webcam face. Check the full frame to verify.
No faces detected: Zoom gallery recordings with "shared screen with gallery view" work best. Solo speaker view may have the face too close/large for the cascade detector — try lowering ANALYSIS_SCALE to 0.3.
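Background removal artifacts: rembg's u2net can produce halos around hair. For cleaner results, try the u2net_human_seg model. rembg selects the model through a session rather than a model_name argument: remove(img, session=new_session('u2net_human_seg')), with new_session imported from rembg.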
Slow processing: A 76-minute video takes ~2 minutes for Pass 1, ~15 seconds for Pass 2 (12 candidates), and ~60 seconds for bg removal (4 faces). Most time is in Pass 1 scanning.