name: story-video description: Convert narrated stories (audio + text) into YouTube Shorts videos (9:16 portrait) with synced subtitles, dynamic background images matched to story content, and professional subtitle effects.
Story-to-Video Skill
Convert bedtime stories, narrations, or any spoken content into engaging YouTube Shorts videos with:
- Synced subtitles - Words highlighted in real-time as spoken
- Dynamic backgrounds - Images searched and selected based on story content/section
- YouTube Shorts format - 9:16 portrait video optimized for mobile
- Professional styling - Centered, animated subtitle effects
Quick Start
Input Requirements
- Audio file - MP3/WAV with narration (e.g., from ElevenLabs TTS)
- Full text transcript - Complete story/narration text
- Story sections (optional) - If available, define sections for targeted background images
Basic Workflow
# 1. Transcribe audio to get word timing (automatic)
story-video transcribe --audio story.mp3 --output story.json
# 2. Generate video with auto-searched backgrounds
story-video generate \
--audio story.mp3 \
--text "Once upon a time..." \
--title "Bedtime Story" \
--output story.mp4
Output: story.mp4 (9:16 portrait, YouTube Shorts ready)
Advanced: Custom Sections & Backgrounds
# Create a config with sections and suggested image searches
story-video generate \
--audio story.mp3 \
--text full_text.txt \
--config story-config.json \
--output story.mp4
story-config.json:
{
"title": "The Snail Designer",
"sections": [
{
"start_time": 0,
"end_time": 15,
"text": "Once upon a time, in the beautiful city of El Paso...",
"search_query": "El Paso desert sunset"
},
{
"start_time": 15,
"end_time": 35,
"text": "...a gentle snail named Snail was a designer.",
"search_query": "3D design workshop creative tools"
}
]
}
How It Works
1. Audio Transcription + Timing
- Uses Groq Whisper (or local speech-to-text) to get word-level timing
- Outputs JSON with
{word, start_ms, end_ms}for each word - Enables precise subtitle sync
2. Section Detection
- Divides audio into chunks (10-30s sections)
- Generates targeted image search queries from text content
- Searches Unsplash/Pexels for relevant high-quality images
3. Video Composition
- Creates 9:16 canvas (1080x1920 pixels)
- Layers background image (center-cropped, subtle zoom)
- Renders subtitles centered, synchronized to audio
- Applies subtitle effects:
- Fade in/out as words appear/disappear
- Color highlight - Current word in bright color, context in white
- Scale animation - Current word slightly larger
- Drop shadow - Professional readability on any background
4. Video Export
- Combines audio + video layers
- H.264 codec, optimized bitrate for YouTube
- Metadata tags for YouTube Shorts (aspect ratio, duration)
Configuration Options
Subtitle Styling
{
"subtitles": {
"font": "Inter",
"size": 48,
"color_current": "#FFD700",
"color_context": "#FFFFFF",
"shadow": true,
"shadow_blur": 8,
"shadow_color": "#000000",
"shadow_offset_y": 3,
"animation_type": "fade_scale",
"animation_duration_ms": 200
}
}
Background Options
{
"background": {
"source": "unsplash",
"fallback_color": "#1a1a1a",
"zoom_effect": "subtle",
"zoom_speed": 0.3,
"fade_between_sections": true,
"fade_duration_ms": 500
}
}
Commands
story-video transcribe
Generate word-level timing from audio.
story-video transcribe --audio input.mp3 --output timing.json
Options:
--audio(required) - Audio file path--output(required) - JSON output with timings--engine(optional) - groq, google, or local (default: groq)
Output format:
{
"duration_ms": 45000,
"words": [
{"word": "Once", "start_ms": 0, "end_ms": 250},
{"word": "upon", "start_ms": 250, "end_ms": 450},
...
]
}
story-video generate
Create video from audio + text.
story-video generate \
--audio input.mp3 \
--text "Story text..." \
--output output.mp4
Options:
--audio(required) - MP3/WAV file--text(required) - Full transcript text--output(required) - MP4 output path--config(optional) - JSON config file (sections, styling, etc.)--title(optional) - Video title (for metadata)--subtitle-style(optional) - Preset: minimal, bold, elegant (default: bold)--background-source(optional) - unsplash, pexels, local_dir (default: unsplash)
story-video style-preset
List available subtitle style presets.
story-video style-preset list
story-video style-preset preview bold
Presets:
- minimal - Small, centered, subtle animation
- bold - Large, bright highlight, dynamic animation
- elegant - Serif font, refined colors, smooth fade
- neon - Bright colors, glow effect, fast animation
Image Search Strategy
The skill auto-generates search queries based on story content:
- Noun extraction - Identifies key nouns (snail, designer, El Paso, daughters)
- Context keywords - Adds context (sunset, desert, workshop, family)
- Search execution - Finds relevant images from Unsplash
- Quality filter - Prefers high-res, professional photos
- Caching - Saves images locally to avoid repeated searches
Example:
Text: "snail named Snail who was a three-dimensional designer"
→ Search: "3D design workshop creative snail"
→ Results: [image1, image2, image3]
→ Select: Best match for this section
Requirements
System Dependencies
ffmpeg- Video composition (brew install ffmpeg)python3- Image processing (PIL/Pillow)
API Keys
- Groq API - Audio transcription (set GROQ_API_KEY)
- Unsplash API (optional) - Image search (set UNSPLASH_API_KEY for more requests)
- ElevenLabs API (optional) - If generating TTS from text first
Python Libraries
ffmpeg-python
pydub
pillow
requests
Workflow Examples
Example 1: Bedtime Story from TTS
# 1. Generate audio (your voice) via ElevenLabs
tts "Once upon a time..." --voice hjX6Urz6dBwVkFdr87DB --output story.mp3
# 2. Convert to video
story-video generate \
--audio story.mp3 \
--text "Once upon a time..." \
--subtitle-style bold \
--background-source unsplash \
--output story-video.mp4
# 3. Upload to YouTube Shorts
# (9:16 format is ready!)
Example 2: Existing Audio with Custom Sections
# Create config with specific sections and background queries
cat > config.json << EOF
{
"title": "The Snail",
"sections": [
{
"start_time": 0,
"end_time": 20,
"search_query": "El Paso desert landscape"
},
{
"start_time": 20,
"end_time": 45,
"search_query": "3D design studio workspace"
}
]
}
EOF
# Generate video with custom sections
story-video generate \
--audio narration.mp3 \
--text transcript.txt \
--config config.json \
--output output.mp4
Example 3: Multiple Stories as Shorts Series
# Generate videos for each story
for story in stories/*.txt; do
audio="${story%.txt}.mp3"
output="videos/$(basename $story .txt).mp4"
story-video generate \
--audio "$audio" \
--text "$story" \
--subtitle-style elegant \
--output "$output"
done
# All ready for YouTube Shorts series
ls -lh videos/*.mp4
Troubleshooting
Video is too fast/slow
Adjust audio speed before generating (use ffmpeg -filter:a "atempo=0.9" to slow down).
Background images not matching content
Customize search queries in config.json sections[].search_query field.
Subtitle readability on bright backgrounds
Switch to --subtitle-style elegant (adds stronger shadow) or use the shadow config option.
ffmpeg not found
Install: brew install ffmpeg
API rate limits
- Groq: Free tier has rate limits; use local Whisper if needed
- Unsplash: Free tier is 50 requests/hour; cache images locally
Bundled Resources
- scripts/generate_video.py - Main video composition logic
- scripts/transcribe_audio.py - Word-level timing extraction
- scripts/search_images.py - Unsplash/Pexels image search
- scripts/subtitle_renderer.py - Animated subtitle rendering
- references/ffmpeg_settings.md - FFmpeg optimization for YouTube Shorts
- references/subtitle_effects.md - Available animation effects and customization
- assets/fonts/ - Default fonts (Inter, Serif fallback)