thumbnail-extraction

Extracts the most interesting frames from video files for thumbnail compositing. Detects faces, expressions, smiles, and presentation slides. Outputs full frames, face crops, and transparent cutouts. Use when asked to extract thumbnails, find interesting frames, grab screenshots from video, or create thumbnail candidates from recordings.

By swyxio. Updated 4/8/2026

name: thumbnail-extraction
description: "Extracts the most interesting frames from video files for thumbnail compositing. Detects faces, expressions, smiles, and presentation slides. Outputs full frames, face crops, and transparent cutouts. Use when asked to extract thumbnails, find interesting frames, grab screenshots from video, or create thumbnail candidates from recordings."
version: 0.1.0

Video Thumbnail Extraction

Overview

Automatically scan a local MP4 video (or a YouTube video downloaded first with yt-dlp) and extract the most visually interesting frames (4 by default), prioritizing expressive faces (laughing, shocked, smiling) and engaging presentation slides. Outputs full frames, face crops, and background-removed transparent PNGs ready for compositing.

When to Use

  • Before creating YouTube thumbnails (feeds into youtube-thumbnails skill)
  • When you need the best screenshot from a long video recording
  • When compositing a thumbnail and need transparent guest cutouts
  • Processing Zoom gallery recordings, interviews, or presentations

Dependencies

Python packages (install once)

# In sandbox (Cowork VM):
pip install opencv-python scenedetect deepface pillow numpy --break-system-packages

# On host Mac (for background removal — sandbox can't download the model):
pip3 install 'rembg[cpu]' pillow --break-system-packages

System tools

  • ffmpeg (usually pre-installed)
  • python3 (3.10+)
  • yt-dlp (optional, for YouTube URLs): pip install yt-dlp --break-system-packages

Model downloads (first run)

  • DeepFace expression model (~1MB): Downloads automatically on first use. If blocked by proxy, expression detection falls back to OpenCV smile cascade (still effective).
  • rembg u2net model (~176MB): Downloads on first use. Must run on host Mac if sandbox blocks GitHub releases.

Pipeline Architecture

Two Analysis Passes plus an Output Pass (memory-efficient)

Pass 1 — Quick Scan (OpenCV only, no deep learning)

  • Sample a frame every 10 seconds across the video
  • Skip first/last 60 seconds (intro/outro)
  • For each frame:
    • Detect faces via Haar cascade (fast, no GPU needed)
    • Detect smiles within face regions
    • Compute visual variance (proxy for "interesting" content)
    • Detect presentation slides (high edge density + low color saturation)
  • Score each frame based on: face count, smile count, smile size, visual variance
  • Select top 12 diverse candidates using quadrant system: divide video into N time segments, pick best from each → ensures temporal spread
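
The sampling rule in Pass 1 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `sample_timestamps` is hypothetical, and whether the real script includes the exact boundary timestamps (60 s in, 60 s before the end) is a guess.

```python
def sample_timestamps(duration_sec, interval_sec=10, skip_sec=60):
    """Return the timestamps (in seconds) a Pass 1 scan would visit.

    Hypothetical helper: samples every `interval_sec` seconds and skips the
    first and last `skip_sec` seconds. Boundary timestamps are included,
    which is an assumption about the real script.
    """
    start = skip_sec
    end = duration_sec - skip_sec
    if end < start:
        return []  # too short to skip both intro and outro
    timestamps = []
    t = start
    while t <= end:
        timestamps.append(t)
        t += interval_sec
    return timestamps


# A 76-minute recording sampled every 10 seconds
stamps = sample_timestamps(76 * 60)
print(len(stamps), "frames to scan, from", stamps[0], "s to", stamps[-1], "s")
```

Under these assumptions, halving the interval roughly doubles the number of frames Pass 1 has to decode and score.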

Pass 2 — Deep Analysis (DeepFace, only on top 12 candidates)

  • Re-read only the selected frames from video
  • Run DeepFace emotion detection (happy, surprise, fear, sad, angry, disgust, neutral)
  • Weight emotions by thumbnail value: happy > surprise > fear > angry > sad > neutral
  • Combine Pass 1 score with expression score
  • Final selection: divide candidates into N time segments, pick best from each → guarantees spread across the full video
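
As a rough sketch of the emotion-weighting step, the function below picks the most thumbnail-worthy expression from a list of per-face results. It assumes the input is shaped like what recent versions of DeepFace.analyze return for the emotion action (one dict per face with a "dominant_emotion" string and an "emotion" dict of percentages). The names `THUMBNAIL_RANK` and `best_expression` are hypothetical, and putting "disgust" last is an assumption because the ordering above does not rank it.

```python
# Ordering taken from the text above; "disgust" is not ranked there,
# so this sketch (an assumption) places it last.
THUMBNAIL_RANK = ["happy", "surprise", "fear", "angry", "sad", "neutral", "disgust"]


def best_expression(results):
    """Pick the most thumbnail-worthy (emotion, confidence) across faces.

    `results` is assumed to be a list with one dict per face, each holding
    "dominant_emotion" and an "emotion" dict of percentages (0-100).
    Confidence is rescaled to 0-1.
    """
    best = ("neutral", 0.0)
    best_key = None
    for face in results:
        emotion = face.get("dominant_emotion", "neutral")
        confidence = face.get("emotion", {}).get(emotion, 0.0) / 100.0
        if emotion in THUMBNAIL_RANK:
            rank = THUMBNAIL_RANK.index(emotion)
        else:
            rank = len(THUMBNAIL_RANK)
        key = (-rank, confidence)  # better-ranked emotion first, then confidence
        if best_key is None or key > best_key:
            best_key = key
            best = (emotion, confidence)
    return best


faces = [
    {"dominant_emotion": "neutral", "emotion": {"neutral": 90.0, "happy": 10.0}},
    {"dominant_emotion": "happy", "emotion": {"happy": 85.0, "neutral": 15.0}},
]
print(best_expression(faces))
```

Returning ("neutral", 0.0) for an empty list mirrors the fallback described under Troubleshooting, where frames read as neutral when no expression data is available.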

Pass 3 — Output (rembg, only on final 4 frames)

  • Save full frame as JPG (95% quality)
  • Crop largest detected face with generous padding (0.5x)
  • Run background removal on face crop → transparent PNG
  • Generate manifest JSON with metadata
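
The face-crop step can be illustrated with a small geometry helper. This sketch assumes "0.5x padding" means half the face width and height added on every side, and that face boxes arrive as (x, y, w, h) tuples as OpenCV cascades report them. The names `largest_face` and `padded_face_box` are hypothetical.

```python
def largest_face(faces):
    """Return the (x, y, w, h) box with the largest area."""
    return max(faces, key=lambda box: box[2] * box[3])


def padded_face_box(x, y, w, h, frame_w, frame_h, pad=0.5):
    """Expand a face box by `pad` times its size per side, clamped to the frame.

    Returns (left, top, right, bottom) in pixel coordinates.
    Assumption: padding is applied symmetrically on all four sides.
    """
    pad_x = int(w * pad)
    pad_y = int(h * pad)
    left = max(0, x - pad_x)
    top = max(0, y - pad_y)
    right = min(frame_w, x + w + pad_x)
    bottom = min(frame_h, y + h + pad_y)
    return left, top, right, bottom


faces = [(0, 0, 50, 50), (100, 100, 200, 200)]
x, y, w, h = largest_face(faces)
print(padded_face_box(x, y, w, h, frame_w=1920, frame_h=1080))
```

Clamping to the frame keeps the crop valid when the face sits near an edge, which is common in gallery-view recordings.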

Scoring Heuristics

| Signal | Weight | Notes |
|---|---|---|
| Face detected | +2.0 per face (cap 3) | Gallery views score high |
| Smile detected | +3.0 per smile | Cascade-based, no model needed |
| Smile size ratio | +5.0 × ratio | Bigger smiles = more expressive |
| Multi-person shot | +1.0 bonus | 2+ faces = engaging |
| Happy expression | +2.0 bonus (Pass 2) | Best for thumbnails |
| Surprise expression | +2.0 bonus (Pass 2) | Eye-catching |
| Fear/angry expression | +1.0 bonus (Pass 2) | "Shocked" reactions |
| Visual variance | +0.0–1.5 | Normalized by frame complexity |
| Presentation slide | baseline 1.5 | Useful for slide screenshots |
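
One way to read the weights above is as a simple additive score. The sketch below is an interpretation, not the script's actual code: it assumes "cap 3" means at most three faces are counted, that the slide baseline acts as a floor rather than an addend, and that the variance term arrives already normalized. The names `pass1_score`, `combined_score`, and `EXPRESSION_BONUS` are hypothetical.

```python
# Pass 2 bonuses from the weights listed above; other emotions get no bonus.
EXPRESSION_BONUS = {"happy": 2.0, "surprise": 2.0, "fear": 1.0, "angry": 1.0}


def pass1_score(num_faces, num_smiles, smile_ratio, variance_score,
                is_presentation=False):
    """Additive Pass 1 score built from the listed weights."""
    score = 2.0 * min(num_faces, 3)       # assumption: cap counts 3 faces
    score += 3.0 * num_smiles
    score += 5.0 * smile_ratio
    if num_faces >= 2:
        score += 1.0                      # multi-person bonus
    score += max(0.0, min(variance_score, 1.5))
    if is_presentation:
        score = max(score, 1.5)           # assumption: baseline is a floor
    return score


def combined_score(pass1, emotion):
    """Add the Pass 2 expression bonus to a Pass 1 score."""
    return pass1 + EXPRESSION_BONUS.get(emotion, 0.0)


base = pass1_score(num_faces=3, num_smiles=1, smile_ratio=0.2, variance_score=1.0)
print(base, combined_score(base, "happy"))
```

Treating the slide baseline as a floor means a faceless slide still outranks an empty frame without outscoring a frame that has smiling faces.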

Temporal Diversity Algorithm

The pipeline enforces temporal spread to avoid clustering picks in one segment:

  1. Quadrant selection (Pass 1 → Pass 2): Divide video duration into N segments, pick the highest-scoring frame from each segment
  2. Segment-forced selection (Pass 2 → Final): Divide top candidates into top_n equal time segments, pick best from each
  3. Fallback: If any segment is empty, fill from overall top scores

This ensures a 76-minute video yields picks from different parts (e.g., 1:00, 2:10, 21:50, 48:50) rather than clustering in the most face-heavy section.
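
The segment-plus-fallback idea can be sketched as follows. This is a minimal illustration, assuming candidates are (timestamp_sec, score) pairs. The function name `pick_spread` and the sample scores are made up, with timestamps chosen to mirror the example above.

```python
def pick_spread(candidates, duration_sec, top_n=4):
    """Pick up to `top_n` candidates spread across equal time segments.

    `candidates` is a list of (timestamp_sec, score) tuples. The best
    candidate in each segment is kept; empty segments are back-filled
    from the highest-scoring leftovers.
    """
    if top_n <= 0 or duration_sec <= 0:
        return []
    segment_len = duration_sec / top_n
    best_per_segment = {}
    for cand in candidates:
        timestamp, score = cand
        segment = min(int(timestamp // segment_len), top_n - 1)
        current = best_per_segment.get(segment)
        if current is None or score > current[1]:
            best_per_segment[segment] = cand
    picks = list(best_per_segment.values())
    if len(picks) < top_n:
        # Fallback: fill remaining slots from the overall top scores
        leftovers = sorted(
            (c for c in candidates if c not in picks),
            key=lambda c: c[1],
            reverse=True,
        )
        picks.extend(leftovers[: top_n - len(picks)])
    return sorted(picks)  # chronological order


# Made-up scores for a 76-minute video; the last quarter has no candidates
candidates = [(60, 9.0), (130, 12.4), (1310, 8.0), (1400, 6.0), (2930, 7.5)]
for timestamp, score in pick_spread(candidates, duration_sec=76 * 60):
    print(f"{timestamp // 60}:{timestamp % 60:02d}", score)
```

In this sample the final segment is empty, so the fallback pulls a second pick from the first segment. That is why two of the four results can still land close together.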

Usage

Command Line

python3 thumbnail_extractor.py <video_path> [output_dir] [top_n]

Arguments:

  • video_path — Path to MP4 file (required)
  • output_dir — Where to save outputs (default: ~/Downloads/thumb_candidates)
  • top_n — Number of candidates to extract (default: 4)

Examples:

# Basic — extract 4 best frames
python3 thumbnail_extractor.py "GMT20260130-210038_Recording_gallery_2380x1544.mp4"

# Custom output dir and count
python3 thumbnail_extractor.py recording.mp4 ./thumbs 6

# YouTube video (download first)
yt-dlp -o "video.mp4" "https://youtube.com/watch?v=..."
python3 thumbnail_extractor.py video.mp4

Output Files

For each candidate, the pipeline generates:

| File | Format | Description |
|---|---|---|
| {name}_{n}_{emotion}_{timestamp}_full.jpg | JPG 95% | Full video frame |
| {name}_{n}_{emotion}_{timestamp}_face.jpg | JPG 95% | Cropped face with padding |
| {name}_{n}_{emotion}_{timestamp}_transparent.png | PNG w/ alpha | Background-removed face cutout |
| {name}_manifest.json | JSON | Metadata for all candidates |

Naming example: GMT20260130-210038_3_happy_2-10_full.jpg

  • GMT20260130-210038 — video name (truncated for Zoom recordings)
  • 3 — candidate number (ranked by score)
  • happy — detected dominant emotion
  • 2-10 — timestamp (2 minutes 10 seconds)
  • full / face / transparent — file type
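
The naming scheme can be reproduced with a short helper, useful when locating outputs from another script. This is a sketch under assumptions: the function names are hypothetical, seconds are assumed to be zero-padded to two digits, and minutes are assumed not to roll over into hours.

```python
VALID_KINDS = {"full": "jpg", "face": "jpg", "transparent": "png"}


def format_timestamp(timestamp_sec):
    """Render seconds as minutes-seconds, e.g. 130 becomes "2-10"."""
    minutes, seconds = divmod(int(timestamp_sec), 60)
    return f"{minutes}-{seconds:02d}"  # assumption: seconds are zero-padded


def candidate_filename(video_name, index, emotion, timestamp_sec, kind):
    """Build an output filename following the pattern described above."""
    if kind not in VALID_KINDS:
        raise ValueError(f"unknown file type: {kind}")
    stamp = format_timestamp(timestamp_sec)
    return f"{video_name}_{index}_{emotion}_{stamp}_{kind}.{VALID_KINDS[kind]}"


print(candidate_filename("GMT20260130-210038", 3, "happy", 130, "full"))
```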

Manifest JSON Structure

{
  "video": "GMT20260130-210038",
  "candidates": [
    {
      "index": 1,
      "timestamp": "2:10",
      "timestamp_sec": 130.0,
      "emotion": "happy",
      "emotion_score": 0.85,
      "combined_score": 12.4,
      "num_faces": 3,
      "is_presentation": false,
      "files": {
        "full": "..._full.jpg",
        "face_crop": "..._face.jpg",
        "transparent": "..._transparent.png"
      }
    }
  ]
}
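
A downstream script might read the manifest to choose a cutout automatically. The sketch below relies only on the fields shown in the structure above. The function names and the rule of skipping presentation frames are assumptions for illustration.

```python
import json


def load_manifest(path):
    """Read a manifest JSON file from disk."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def best_face_candidate(manifest):
    """Return the highest-scoring candidate that shows a face, or None.

    Assumption: presentation frames are skipped because they are unlikely
    to yield a useful face cutout.
    """
    usable = [
        c for c in manifest.get("candidates", [])
        if c.get("num_faces", 0) > 0 and not c.get("is_presentation", False)
    ]
    if not usable:
        return None
    return max(usable, key=lambda c: c.get("combined_score", 0.0))


# Inline sample mirroring the structure above (values are illustrative)
sample = json.loads("""
{
  "video": "GMT20260130-210038",
  "candidates": [
    {"index": 1, "timestamp": "2:10", "combined_score": 12.4,
     "num_faces": 3, "is_presentation": false},
    {"index": 2, "timestamp": "21:50", "combined_score": 15.0,
     "num_faces": 0, "is_presentation": true}
  ]
}
""")
print(best_face_candidate(sample)["timestamp"])
```

With real output, `load_manifest` would be pointed at the `{name}_manifest.json` file in the output directory.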

Background Removal (Separate Step)

Since the Cowork sandbox may block model downloads, run rembg on the host Mac:

# On host Mac (via osascript or Terminal)
cd ~/Downloads/thumb_candidates
python3 -c "
from rembg import remove
from PIL import Image
import glob, os

for f in sorted(glob.glob('*_face.jpg')):
    out = f.replace('_face.jpg', '_transparent.png')
    print(f'Processing {f}...')
    img = Image.open(f)
    result = remove(img)
    result.save(out)
    print(f'  -> {out} ({os.path.getsize(out)//1024}KB)')
"

This takes 10-15 seconds per image on Apple Silicon. The u2net model downloads automatically on first run (176MB).

Integration with Other Skills

Feeding into youtube-thumbnails

After extraction, use the transparent PNGs as compositing elements:

  1. Pick the best face cutout from the candidates
  2. Use it in the Gemini thumbnail prompt as a reference, or
  3. Composite it manually onto the generated Gemini background using ImageMagick:
# Composite transparent face onto Gemini-generated background
# (on ImageMagick 7, use `magick` in place of `convert`)
convert gemini_background.jpg transparent_face.png \
  -gravity southeast -geometry +50+50 \
  -composite final_thumbnail.jpg

Pipeline flow

[Video MP4] → thumbnail-extraction → [face crops + transparent PNGs]
                                          ↓
                                   youtube-thumbnails → [Gemini background]
                                          ↓
                                   [Composite final thumbnail]

Tuning Parameters

Edit these at the top of thumbnail_extractor.py:

| Parameter | Default | Effect |
|---|---|---|
| SAMPLE_INTERVAL_SEC | 10 | Lower = more frames scanned, slower |
| ANALYSIS_SCALE | 0.5 | Lower = faster face detection, less accurate |
| SCENE_THRESHOLD | 27.0 | Lower = more scene boundaries detected |
| MIN_FACE_CONFIDENCE | 0.80 | Higher = fewer false positive faces |
| top_n | 4 | Number of final candidates |

For short videos (<10 min), consider SAMPLE_INTERVAL_SEC=5 for finer coverage.

Troubleshooting

  • OOM / killed process: The v2 pipeline never holds more than 1 frame in memory during Pass 1. If still OOM, increase SAMPLE_INTERVAL_SEC to 15-20.
  • All emotions "neutral": DeepFace model couldn't download (proxy block). Pass 1 smile detection still works — look at the num_smiles field in the manifest.
  • Face crop is wrong person: The pipeline picks the largest detected face. In screenshare mode, this may be a profile picture rather than a webcam face. Check the full frame to verify.
  • No faces detected: Zoom gallery recordings with "shared screen with gallery view" work best. Solo speaker view may have the face too close/large for the cascade detector — try lowering ANALYSIS_SCALE to 0.3.
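  • Background removal artifacts: rembg's u2net can produce halos around hair. For cleaner results, try the u2net_human_seg model. In current rembg releases the model is selected through a session object: import new_session from rembg and call remove(img, session=new_session('u2net_human_seg')).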
  • Slow processing: A 76-minute video takes ~2 minutes for Pass 1, ~15 seconds for Pass 2 (12 candidates), and ~60 seconds for bg removal (4 faces). Most time is in Pass 1 scanning.

Install via CLI

npx skills add https://github.com/swyxio/skills --skill thumbnail-extraction

Repository Details

  • Stars: 141
  • Forks: 8
  • Branch: main
  • Path: SKILL.md