name: lecture-to-notes description: Generate professional, information-dense, figure-rich LaTeX course notes and compiled PDF from a YouTube or Bilibili lecture video. Use when the user provides a video URL and wants structured Chinese teaching notes. Key features include smart slide-region cropping (removes lecturer, keeps only slide content), three-level subtitle fallback (CC → Whisper → visual-only), dense frame sampling with contact-sheet review, and high information density writing. Trigger words include lecture notes, 课程笔记, 视频转PDF, 讲义, YouTube笔记, B站笔记, BV号.
Lecture to Notes
Turn a lecture video (YouTube or Bilibili) into a complete, compilable .tex note set and a rendered PDF.
Dependencies
Check before starting (use which). Prompt the user to install any missing tools.
| Tool | Required | Purpose |
|---|---|---|
yt-dlp |
Always | Video/subtitle/metadata download (supports YouTube + Bilibili) |
ffmpeg |
Always | Frame extraction, audio extraction |
xelatex |
Always | LaTeX compilation (TeX Live + CTeX for Chinese) |
magick |
Always | Frame montage, contact sheets, cropping |
python3 |
Always | Smart crop script, Whisper |
whisper |
Bilibili / no-CC | Speech-to-text fallback (openai-whisper) |
Additional Python dependencies: Pillow (pip install Pillow).
YouTube Cookie Notice
YouTube may require authentication to avoid bot detection. When yt-dlp fails with "Sign in to confirm you're not a bot", add --cookies-from-browser chrome (or safari/firefox/edge) to all yt-dlp commands.
Goal
Produce a professional Chinese lecture note from a video URL. The output must:
- use the video's actual teaching content, not just subtitle transcription
- place the video's original cover image on the front page
- include smart-cropped slide figures (lecturer removed, only slide/PPT content)
- achieve high information density — every figure, box, and paragraph earns its space
- be structurally organized with
\section{}/\subsection{} - end with a synthesis section combining speaker's conclusions and your own distillation
- be a complete
.texfrom\documentclassto\end{document} - compile successfully to PDF
Platform Detection
Detect the platform from the URL:
| Pattern | Platform |
|---|---|
youtube.com, youtu.be |
YouTube |
bilibili.com/video/BV, b23.tv |
Bilibili |
Adapt the acquisition workflow accordingly (see below).
Workflow
Working Directory Convention
CRITICAL: Always use absolute paths for background commands (Whisper, video download). Claude Code's shell resets the working directory between commands. Background tasks that use relative paths will write output to the wrong location.
Recommended naming: <course_id>_<lecture_number>_<short_title>/
- Example:
nju_os_01_intro/,nju_os_02_app_view/,tamu_biegler_nlp/
Phase 1: Source Acquisition
1a. Metadata Inspection
yt-dlp --dump-json --no-download "<URL>" > metadata.json
Extract: title, uploader, duration, chapters, thumbnail URL, subtitle availability. For Bilibili, also check for multi-part (分P) videos.
1b. Subtitle Acquisition (Three-Level Fallback)
Priority 1 — CC subtitles:
# YouTube
yt-dlp --write-subs --sub-langs "zh.*,en.*" --convert-subs srt --skip-download "<URL>"
# Bilibili
yt-dlp --write-subs --sub-langs "zh-Hans,zh-CN,zh,ai-zh" --convert-subs srt --skip-download "<URL>"
Priority 1.5 — YouTube auto-generated subtitles (when no manual CC):
yt-dlp --write-auto-subs --sub-langs "en" --convert-subs srt --skip-download "<URL>"
# IMPORTANT: Clean duplicates — YouTube auto-subs repeat every line 2-3x
python3 scripts/clean_subs.py subs.en.srt --stats
Priority 2 — Whisper speech-to-text (when no CC subtitles):
yt-dlp -x --audio-format wav -o "audio.%(ext)s" "<URL>"
# IMPORTANT: Use absolute paths for Whisper to avoid working directory issues
# when running in background. The shell may reset cwd between commands.
# IMPORTANT: ffmpeg must be on PATH — Whisper uses it internally to read audio.
WORKDIR="$(pwd)"
whisper "$WORKDIR/audio.wav" --model small --language zh \
--output_format srt --output_dir "$WORKDIR" --fp16 False \
--initial_prompt "$(cat scripts/whisper_prompts/nju_os.txt)" # Optional: domain glossary
Whisper initial_prompt (strongly recommended for technical lectures):
Point --initial_prompt at a plain-text file enumerating domain terms (syscalls, APIs,
speaker names, course-specific jargon). See scripts/whisper_prompts/nju_os.txt for a
working example. This dramatically reduces same-sound errors like
"PASSNAME" instead of "pathname" or "SAM" instead of "sum".
Priority 2.5 — SRT correction passes (after Whisper):
# Stage A — fast dictionary-level fix (wrong → right pairs)
python3 scripts/correct_srt.py audio.srt \
-g scripts/whisper_prompts/glossary_nju_os.json --stats
# Stage B — slow LLM + multimodal fix (uses Claude Code CLI, no API key needed)
python3 scripts/llm_correct_srt.py \
--srt audio.srt --frames frames/ --out corrected.srt \
--context "南京大学操作系统原理,讲师 jyy"
Stage A is essentially free and catches 80% of wrong characters. Stage B is expensive (one Claude call per ~90s of audio) and only worth running for notes you plan to publish.
Priority 3 — Visual-only mode (when audio quality is unusable): Skip subtitles. Use dense frame sampling (fps=1) and rely entirely on visual content.
1c. Video and Cover Download
# Cover image (may be webp/png/jpg depending on platform)
yt-dlp --write-thumbnail --skip-download -o "cover" "<URL>"
# Convert to jpg for xelatex compatibility
bash scripts/prepare_cover.sh .
# Video (for frame extraction)
yt-dlp -f "bestvideo+bestaudio/best" --merge-output-format mp4 -o "video.mp4" "<URL>"
# Bilibili 1080P+ (if user has logged in):
# yt-dlp --cookies-from-browser chrome -f "bestvideo+bestaudio/best" -o "video.mp4" "<URL>"
1d. Bilibili Multi-Part (分P) Handling
yt-dlp --flat-playlist --dump-json "<URL>" # List all parts
yt-dlp --playlist-items 1-3 -o "P%(playlist_index)s.%(ext)s" "<URL>" # Download specific parts
Always ask the user which parts to process before downloading.
Phase 2: Frame Extraction and Smart Cropping
This is where we differ most from other tools. Two-stage process:
Stage 1: Dense frame extraction by chapter
mkdir -p frames
# Extract 1 frame every 15 seconds per chapter
ffmpeg -ss <start> -to <end> -i video.mp4 -vf "fps=1/15" frames/ch<N>_%03d.png
Stage 2: Frame selection (no automatic cropping)
Use the original full frames directly. Do NOT apply automatic cropping — heuristic-based cropping is unreliable (misidentifies blackboard content as "low information" regions).
Future direction: Use multimodal LLM (e.g., Claude Vision API) to classify frames and decide cropping per-frame. For now, use full frames and select the best ones manually via contact sheet review.
Stage 3: Contact sheet review
# Generate contact sheets per chapter (keeps montage size manageable)
# For chapters: montage per chapter directory
magick montage frames/ch<N>_*.png -tile 5x -geometry 384x216+2+2 contact_ch<N>.png
# For unchaptered videos with many frames (>100):
# Split into batches of 50 to avoid montage failures
for i in $(seq 1 50 $(ls frames/*.png | wc -l)); do
ls frames/*.png | tail -n +$i | head -50 | xargs magick montage \
-tile 5x -geometry 384x216+2+2 contact_batch_${i}.png
done
Review contact sheets to select the best frames. Criteria:
- Pick the final fully-populated state of progressive reveals
- Prefer the frame with the most complete and readable information
- Drop repetitive or low-information frames
- Keep every frame that teaches something distinct
Stage 4: Figure Verification (CRITICAL — prevents figure-text mismatch)
Problem: ~40% of figures in early versions had mismatched timestamps, captions, or surrounding text. Root cause: timestamps were estimated from frame numbers, captions were written from "what should be here" rather than "what IS here", and content was inferred from section structure rather than verified against the actual frame.
Before writing ANY figure into LaTeX, perform this three-way verification:
For each candidate figure:
Compute exact timestamp:
chapter_start_seconds + (frame_number - 1) × 15Cross-reference with subtitle: Find the Whisper/CC subtitle entry at that timestamp. Read 2-3 subtitle lines before and after. This tells you what the speaker is ACTUALLY saying at the moment of that frame.
# Quick lookup: what was being said at timestamp T? import re target_sec = 285 # example: 4:45 for entry in srt_entries: if abs(entry.start - target_sec) < 15: print(f"{entry.start}s: {entry.text}")Read the frame at full resolution: Use the Read tool to view the actual frame image. Identify the ACTUAL text/diagram/code shown on screen — not what you think should be there.
Three-way match check: Verify that ALL THREE align:
- ✅ Frame visual content (what the slide/screen actually shows)
- ✅ Subtitle content (what the speaker is saying at that moment)
- ✅ Your caption + surrounding text (what you plan to write)
If any mismatch: either pick a different frame, adjust the timestamp, or rewrite the caption.
Common failure modes to watch for:
| Failure | Example | Fix |
|---|---|---|
| Frame shows slide A, caption describes slide B | Frame shows "Language" section but caption says "Transformer" | Read frame at full res before writing caption |
| Timestamp off by 1-2 minutes | @07:00 claimed but actual content is at @09:00 | Cross-check with subtitle timestamps |
| Caption describes the section topic, not the frame | "Scaling Law 幂律关系" but frame shows a chess board | Write caption from frame content, not section title |
| Frame is transitional (between slides) | Half old slide, half new slide | Pick a frame 15s earlier or later |
Phase 3: Writing
Teaching Content Rules
Include: title, chapters, on-screen diagrams/formulas/tables/code, subtitle explanations, speaker emphasis.
Exclude: greetings, small talk, sponsorship, channel logistics, 一键三连, 关注投币, closing pleasantries.
Preserve: speaker's closing discussion when it carries teaching value (synthesis, limitations, advice, open questions).
Writing Rules
Chinese by default unless user requests otherwise.
Organize with
\section{}/\subsection{}. Reconstruct the teaching flow — don't mirror subtitle order.Start from
assets/notes-template.tex. Fill metadata and replace the body block.Front page cover: video's original cover image, visually distinct from in-body figures.
Figures: use full frames. Use as many figures as needed for teaching clarity — do not optimize for a low count. Every figure MUST pass the Stage 4 three-way verification before being written into LaTeX. Never write a caption from section context alone — always read the actual frame first.
No figures inside boxes.
importantbox,knowledgebox,warningboxmust not contain\includegraphics.Math: display math
$$...$$followed immediately by a symbol explanation list.Code: wrap in
lstlistingwith descriptivecaption.Box strategy — no quota, use as many as the teaching signal demands:
importantbox: core concepts, definitions, key mechanisms, theorem-like statementsknowledgebox: background, history, design tradeoffs, terminology, analogieswarningbox: common mistakes, hidden assumptions, pitfalls, causal confusions
Every major
\sectionends with\subsection{本章小结}. Add\subsection{拓展阅读}when worthwhile.Final section
\section{总结与延伸}:- Speaker's substantive closing (no sign-off fluff)
- Your structured distillation of core claims and mechanisms
- Cross-section synthesis, conceptual compression
- Concrete takeaways, open questions, next steps
No
[cite]placeholders.
Figure Time Provenance
Every figure from a video frame must have a same-page footnote with the source time interval:
\begin{figure}[H]
\centering
\includegraphics[width=\textwidth]{figures/fig_xxx.png}
\caption{描述\protect\footnotemark}
\end{figure}
\footnotetext{视频画面时间区间:00:12:31--00:12:46。}
- Time intervals come from subtitle alignment, not chapter-level guesses.
- Use
[H]or stable placement to keep figure and footnote on the same page.
Visualization
For concepts that screenshots and prose can't explain clearly, add visualizations:
- LaTeX-native: TikZ / PGFPlots
- Pre-generated: Python matplotlib scripts
Use for: process flows, architecture layouts, scaling-law plots, comparison charts. No decorative graphics.
Phase 4: Compilation and Delivery
xelatex -interaction=nonstopmode notes.tex && xelatex -interaction=nonstopmode notes.tex
Delivery Checklist
- Final
.texfile - Cover image (local file)
- All smart-cropped figure assets in
figures/ - Compiled PDF (two-pass xelatex for TOC)
- Whisper-generated SRT file (if speech-to-text was used)
Assets
assets/notes-template.tex: LaTeX templatescripts/clean_subs.py: YouTube auto-subtitle deduplicationscripts/correct_srt.py: Whisper SRT dictionary-level fix (fast, data-driven)scripts/llm_correct_srt.py: Whisper SRT LLM + multimodal segment-level fix (slow, uses Claude Code CLI — no API key needed)scripts/verify_figures.py: Three-way figure verification (timestamp × subtitle × frame)scripts/prepare_cover.sh: Cover image format conversion (webp/png → jpg)scripts/smart_crop.py: Slide region detection (experimental — production flow uses full frames instead)scripts/whisper_prompts/nju_os.txt: Whisper--initial_promptglossary examplescripts/whisper_prompts/glossary_nju_os.json: Dictionary ofwrong → rightpairs forcorrect_srt.py