p2v-phase-1-repo-script - SKILL.md Agent Skill

name: p2v-phase-1-repo-script
description: Generate and validate video_script.jsonl from a code repository (GitHub URL or local path). Use this when running phase 1 of the repo-to-video pipeline.
metadata:
  short-description: P2V phase 1 repo script generation

P2V Phase 1: Repo Script

When to use

Use this skill when the user wants phase 1 of repo-to-video: repository input to validated video_script.jsonl.

Inputs

repository URL (e.g. https://github.com/org/repo) or local path (e.g. /path/to/repo)
output run directory (default: outputs/<video_id>-<timestamp>)
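
As an illustration of the default run directory pattern, here is a minimal sketch. The helper name `default_run_dir` and the timestamp layout are assumptions; the skill only specifies the `outputs/<video_id>-<timestamp>` shape.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def default_run_dir(
    video_id: str,
    root: Path = Path("outputs"),
    now: Optional[datetime] = None,
) -> Path:
    """Build outputs/<video_id>-<timestamp> (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    # Assumed timestamp layout; the skill does not prescribe one.
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return root / f"{video_id}-{stamp}"
```

Passing `now` explicitly keeps the path reproducible when a run is re-created.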

Input Handling

GitHub URL: Clone or fetch the repo contents. Read README, key source files, config, and docs.
Local path: Read directly from disk. Resolve the path and explore the directory tree.

In both cases, build a mental model of the repo before writing any script.
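
The two input cases could be told apart roughly as follows. This is a sketch under assumptions: the function name `classify_repo_input` is hypothetical, and treating any http(s) URL as a remote repo is a simplification.

```python
from pathlib import Path
from typing import Tuple
from urllib.parse import urlparse


def classify_repo_input(raw: str) -> Tuple[str, str]:
    """Return ("url", raw) for http(s) inputs, else ("local", resolved_path)."""
    parsed = urlparse(raw)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        # Remote case: cloning or fetching happens elsewhere.
        return "url", raw
    # Local case: resolve the path before exploring the directory tree.
    return "local", str(Path(raw).expanduser().resolve())
```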

Workflow

Run a mandatory preparation pass before drafting any script lines.
Follow the pedagogical framework and the system contract at:
- docs/educational-video-pedagogy-framework.md
- docs/00-system-contract.md
Draft one coherent educational script from the preparation results (not directly from raw code).
Enforce the contract fields in video_script.jsonl.
Save as video_script.jsonl in the run folder.
Validate:

uv run python -c "from pathlib import Path; from paper2video.contracts.io import validate_artifact; validate_artifact(Path('<video_script.jsonl>'), artifact_type='video_script'); print('video_script contract ok')"

Required output

<run_dir>/video_script.jsonl

Mandatory Preparation Pass (Internal, Phase-1 only)

Before writing the first record, the agent must do this internally:

Codebase extraction
- primary purpose: what problem does this repo solve?
- architecture: high-level components, layers, data flow
- key abstractions: the 3-5 core types/interfaces/patterns that define the system
- design decisions: why was it built this way? what tradeoffs were made?
- dependencies and integration points
- known limitations, tech debt, or open issues (from README, issues, TODOs)
Pedagogical recomposition
- learner-first sequence (not directory/module order)
- narrative arc: hook -> architecture overview -> core mechanism -> how pieces compose -> tradeoffs -> synthesis
- prerequisite and misconception map (what must viewers already know?)
Script planning
- chapter plan with explicit didactic objective per chapter
- segment purpose statements that justify each segment
- duration estimate based on repo complexity

Do not ask the user for these artifacts. Build them internally, then emit only video_script.jsonl.

Codebase Exploration Strategy

To build the codebase extraction, follow this order:

Orientation: README, package manifest (pyproject.toml, package.json, Cargo.toml, etc.), top-level directory structure
Entry points: CLI commands, main functions, API routers, app factory
Core domain: The 3-5 most important modules/classes that implement the primary purpose
Data flow: How data moves through the system (request lifecycle, pipeline stages, event flow)
Configuration: Settings, environment variables, feature flags
Tests: Scan test files for usage patterns and edge cases that reveal design intent

Do NOT try to read every file. Focus on the files that reveal architecture and intent.
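
The orientation step might start with something like the sketch below, which lists only the README and package manifests at the repo root. The helper name `orientation_files` is hypothetical, and the manifest tuple covers just the three examples named above.

```python
from pathlib import Path
from typing import List

# Only the manifests named in the orientation step; extend as needed.
MANIFESTS = ("pyproject.toml", "package.json", "Cargo.toml")


def orientation_files(repo_root: Path) -> List[Path]:
    """Return top-level README and manifest files (hypothetical helper)."""
    found = []
    for entry in sorted(repo_root.iterdir()):
        if not entry.is_file():
            continue
        if entry.name.lower().startswith("readme") or entry.name in MANIFESTS:
            found.append(entry)
    return found
```

It deliberately stays at the top level; entry points and core modules need judgment rather than a filename match.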

Complexity-To-Depth Policy (Required)

Before drafting, assign a complexity tier using repo content:

tier_1 (simple utility/library): single responsibility, few modules, straightforward API
tier_2 (moderate application): multiple subsystems, some non-trivial patterns
tier_3 (complex system): many interacting components, non-trivial architecture, significant design decisions
tier_4 (very complex): tier_3 plus distributed components, multiple protocols, or heavy infrastructure

Use this mapping for script depth:

tier_1: 700-1100 words (~5-8 min)
tier_2: 1100-1700 words (~8-13 min)
tier_3: 1700-2600 words (~13-20 min)
tier_4: 2400-3600 words (~18-28 min)

If the draft word count is below the tier minimum, expand with:

deeper walkthrough of core data flow
concrete examples of how key abstractions compose
design decision rationale and alternatives considered
edge cases and failure modes
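
The tier mapping can be checked mechanically. The sketch below copies the word ranges from the table above; the names `TIER_WORD_RANGES` and `depth_status` are hypothetical, whitespace splitting is an assumed way to count words, and the "over" status is an assumption because the policy only describes what to do below the minimum.

```python
from typing import Iterable

# Word ranges copied from the complexity-to-depth mapping.
TIER_WORD_RANGES = {
    "tier_1": (700, 1100),
    "tier_2": (1100, 1700),
    "tier_3": (1700, 2600),
    "tier_4": (2400, 3600),
}


def depth_status(tier: str, narration_texts: Iterable[str]) -> str:
    """Return "expand", "ok", or "over" for the total narration word count."""
    low, high = TIER_WORD_RANGES[tier]
    # Assumed counting rule: split on whitespace.
    total = sum(len(text.split()) for text in narration_texts)
    if total < low:
        return "expand"  # below the tier minimum: deepen the script
    if total > high:
        return "over"  # assumption: the policy does not define this case
    return "ok"
```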

Canonical Narrative Arc for Repos

Adapt the pedagogical framework's arc to code:

Hook through tension: Start with the problem the repo solves. Make viewers feel the pain point.
- "Imagine you need to X, but Y makes it hard..."
- "Every time you do X, you hit this wall..."
Promise and scope: What will the viewer understand by the end?
Architecture overview: The 10,000-foot view. Key components and how they connect.
Toy world / minimal example: Show the simplest use case that exercises the core path.
Core mechanism walkthrough: Walk through the main code path with concrete examples.
How pieces compose: Show how modules interact, data transforms, state flows.
Design decisions and tradeoffs: Why this architecture? What alternatives were considered?
Limits and future: Known limitations, tech debt, roadmap.
Synthesis: Tie back to the opening, now that the viewer understands how the system works.

Story Archetype Selection

Choose the best archetype for the repo:

A) Architecture Explainer: For libraries/frameworks. "Here's how it works under the hood."
B) Problem-Solution Journey: For applications. "Here's the problem → here's how this repo solves it."
C) Design Decision Deep-Dive: For repos with interesting engineering tradeoffs. "Why was it built this way?"

VideoMetaRecord Format

The first record in video_script.jsonl must be:

{
  "record_type": "video_meta",
  "video_id": "<repo-name>",
  "paper": {
    "source_type": "repository",
    "repo_url_or_path": "<input>",
    "repo_name": "<name>",
    "primary_language": "<lang>",
    "description": "<one-line description>"
  },
  "primary_thesis": "<what this repo does and why it matters>"
}
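
A minimal sketch of emitting this record as the first JSONL line, using only the fields shown above. The helper names are hypothetical, and this is a quick local check, not a substitute for the `validate_artifact` contract validation.

```python
import json


def build_video_meta(repo_input, repo_name, primary_language, description, primary_thesis):
    """Assemble the video_meta record with the fields listed above."""
    return {
        "record_type": "video_meta",
        "video_id": repo_name,
        "paper": {
            "source_type": "repository",
            "repo_url_or_path": repo_input,
            "repo_name": repo_name,
            "primary_language": primary_language,
            "description": description,
        },
        "primary_thesis": primary_thesis,
    }


def first_record_is_meta(jsonl_text: str) -> bool:
    """Check that the first JSONL line is a video_meta record."""
    lines = jsonl_text.splitlines()
    if not lines:
        return False
    return json.loads(lines[0]).get("record_type") == "video_meta"
```

`json.dumps` with default arguments produces a single line, which is what JSONL expects for each record.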

Depth And Specificity Rules

The script must reflect genuine understanding of the codebase:

Include concrete code details where possible:
- actual class/function names from the repo
- real data flow paths
- specific design patterns used
- concrete configuration or API surface
Avoid generic descriptions that could apply to any repo.
Tie architectural claims to actual code structure.
Use explicit transitions that preserve technical continuity.
Duration is repo-dependent:
- do not force a fixed runtime target
- simple utilities can be shorter
- complex systems should expand enough to cover architecture thoroughly
Do not collapse complex repos into a marketing summary.

If the current draft feels generic, refine before finalizing.

Narration Voice Rules (Required)

narration_text must sound like an educational video, not a lecture outline:

Never use meta-outline phrasing inside narration text:
- avoid: Chapter 1, Chapter 2, Section, Lecture, In this chapter
Keep chapter metadata in fields (record_type=chapter, chapter_id) but keep spoken text natural.
Prefer direct viewer-facing transitions:
- examples: Now let's look at how..., Next we trace the data through..., Here's where it gets interesting...
Avoid production/meta instructions in narration:
- no references to script-writing process, tiers, or internal planning artifacts.
Use code-native vocabulary naturally:
- "the handler grabs the request and...", "this decorator wraps...", "the pipeline stages chain together..."
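
The meta-outline rule can be approximated with a rough lint pass over narration_text. This is a heuristic sketch: the pattern list covers only the phrases named above, the name `find_meta_phrases` is hypothetical, and words such as "section" will produce false positives that need human review.

```python
import re
from typing import List

# Patterns for the meta-outline phrases listed in the voice rules.
META_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"\bchapter\s+\d+\b",
        r"\bin this chapter\b",
        r"\bsection\b",
        r"\blecture\b",
    )
]


def find_meta_phrases(narration_text: str) -> List[str]:
    """Return lowercased meta-outline phrases found in the narration."""
    hits = []
    for pattern in META_PATTERNS:
        hits.extend(m.group(0).lower() for m in pattern.finditer(narration_text))
    return hits
```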

Didactic Density Rules (Required)

Keep the script teachable for video viewers:

One core idea per narration unit.
- each segment should deliver one primary teaching point plus at most one supporting point.
Control spoken technical load.
- don't enumerate every method signature or config option in speech.
- focus on the "why" and "how"; leave the exhaustive API surface to the docs.
Split dense units.
- if a segment covers more than two major components, split it into sequential segments.
Keep recaps short and retrieval-oriented.
Code snippets are for visuals, not speech.
- narration should describe what the code does conceptually; the animation shows the code itself.