ppt-speech-script - SKILL.md Agent Skill

name: ppt-speech-script description: "Generate a speech script from an existing PowerPoint presentation. Use this skill when the user has a .pptx file and wants a speaker script, presentation notes, or talk script to accompany it. Triggers on: 'speech script', 'speaker notes', 'talk script', 'presentation script', '演讲稿', '讲稿'" license: Proprietary. LICENSE.txt has complete terms

PPT Speech Script Generator

Generate a structured, oral-style speech script from an existing PowerPoint presentation.

When to Use

User has an existing .pptx file and wants a speech script to go with it
User says: "generate a speech script", "write speaker notes", "写演讲稿", "生成讲稿"
This is NOT for creating PPT files — use the pptx skill for that

Workflow

Step 1: Extract Content

Extract text, visual layout, and embedded images from the presentation using three methods in parallel:

Text extraction:

python3 -m markitdown <pptx_file>

Visual layout (thumbnail grid):

python3 skills/pptx/scripts/thumbnail.py <pptx_file> <output_dir>/thumbnails --cols 5

Image extraction (unpack media files):

python3 skills/pptx/ooxml/scripts/unpack.py <pptx_file> <output_dir>/unpacked

Then:

Read the thumbnail grid image(s) to understand the visual structure of each slide
Build a slide-to-image mapping by reading the relationship files:
- Parse ppt/slides/_rels/slideN.xml.rels for each slide
- Extract Target="../media/imageX.ext" references to find which images belong to which slide
For slides with important images (charts, architecture diagrams, screenshots), read the original image from ppt/media/ using the Read tool for detailed visual analysis
- Skip generic/decorative images (backgrounds, logos, icons) — focus on content-carrying images
- Prioritize: architecture diagrams, data charts, comparison tables, workflow diagrams, product screenshots

Image analysis guidelines:

For architecture/flow diagrams: describe the components, data flow direction, and key relationships
For data charts: read exact numbers, axis labels, and trends
For comparison tables: extract the key differentiators
For product screenshots: describe what the user interface shows
For photos/decorative images: brief description only, don't over-analyze

This image analysis enables the speech script to accurately describe visual content that markitdown cannot capture (since markitdown only extracts text, not image content).

Step 2: Analyze Presentation Structure

Before writing, analyze the deck:

Identify sections — group slides into logical sections (intro, body sections, conclusion)
Identify hidden slides — thumbnail.py reports hidden slides; exclude them from the script
Identify slide types — title, content, data, diagram, comparison, section divider, closing
Note visual elements — charts, images, diagrams that need verbal explanation
Estimate timing — allocate time per slide based on content density

Timing guidelines:

Slide type	Suggested time
Title/cover	30s - 1min
Agenda	30s - 1min
Section divider	15 - 30s
Content (light)	1 - 1.5min
Content (dense)	1.5 - 2.5min
Data/chart	1.5 - 2min
Diagram/architecture	2 - 3min
Demo/code	2 - 3min
Summary	1 - 2min
Closing/Q&A	30s

Step 3: Write the Speech Script

Generate a complete speech script following these principles:

Output Format

---
title: "PPT Title — Speech Script"
slides: <total visible slides>
estimated_time: "XX-XX minutes"
audience: "<target audience>"
---

# PPT Title — Speech Script

> Audience: <target audience> | Date: <date if available>

Estimated duration: XX minutes (including Q&A). Suggested time per slide is noted in brackets.

---

## Slide 0 — Slide Title [30s]

Speech content here...

---

## Slide 1 — Slide Title [1min]

Speech content here...

---

...

## Predicted Q&A

### Q1: <likely question>?
Answer key points...

### Q2: <likely question>?
Answer key points...

### Q3: <likely question>?
Answer key points...

Writing Principles

Oral style — Write as if speaking to the audience, not reading a document
- Use conversational connectors: "Let's look at...", "The key takeaway here is...", "Now, moving on to..."
- Avoid academic or written-style phrasing
Don't read the bullets — The script should EXPLAIN and EXPAND on slide content, not repeat it
- Slide says "Cost reduced 34.8%" → Script says "We brought per-query cost down by over a third — from 9 cents to under 6 cents. At million-user scale, that's millions of dollars saved per month."
Per-slide length: 80-200 words (Chinese) or 60-150 words (English)
- Section dividers and title pages: shorter (30-60 words)
- Dense data or architecture slides: longer (150-250 words)
Smooth transitions — Each slide's script should naturally flow from the previous one
- End of previous slide's conclusion → Beginning of next slide's topic
- Use transitional phrases: "That brings us to...", "With that context in mind...", "So how do we solve this?"
Highlight key points — Use bold or verbal cues for emphasis
- "The most important number on this slide is..."
- "If you remember one thing from today..."
Explain visuals — For charts, diagrams, and images, guide the audience through what they're seeing
- "Looking at this architecture diagram, data flows from left to right..."
- "The blue bars represent the baseline, and the orange bars are our optimized results..."
Audience awareness — Tailor depth and terminology to the stated audience
- CTO audience → focus on strategic impact, cost, and scalability
- Developer audience → focus on implementation details and code
- Business audience → focus on ROI, user impact, and market context
Q&A section — Prepare 3-5 predicted questions
- Include "tough but fair" questions the audience is likely to ask
- Provide concise answer key points (not full scripts)
- Consider the audience's perspective and concerns

Language

Default: Match the language of the PPT content
User override: If user specifies a language (e.g., "in Chinese", "in English"), use that
Mixed content: If PPT has mixed languages, use the dominant language unless told otherwise

Step 4: Save Output

Save the speech script to the same directory as the PPT file:

# Output path: same directory as input, named speech-script.md
<pptx_dir>/speech-script.md

If a speech-script.md already exists, ask the user before overwriting.

Dependencies

These should already be available from the pptx skill:

markitdown: pip install "markitdown[pptx]" — text extraction
thumbnail.py: skills/pptx/scripts/thumbnail.py — visual layout analysis
unpack.py: skills/pptx/ooxml/scripts/unpack.py — PPTX unpacking for media extraction
LibreOffice: for PDF conversion (used by thumbnail.py)
Poppler: for PDF-to-image conversion (used by thumbnail.py)

Edge Cases

Very large presentations (50+ slides):

Group consecutive similar slides into sections
Summarize repetitive slides rather than scripting each individually
Note in the script: "Slides X-Y cover [topic] — walk through highlights"

Image-heavy / text-light slides:

Use the slide-to-image mapping to read original images from ppt/media/ at full resolution
Describe what the audience sees based on direct image analysis
For images that cannot be read (e.g., unsupported format), fall back to thumbnail grid analysis
Flag slides where content is still unclear: "[Note: This slide contains a visual element — verify description against actual slide]"

Hidden slides:

Exclude from the main script
Optionally note them at the end: "Note: Slides X, Y, Z are hidden and not included in this script"

No text content (pure image deck):

Extract all images via unpack and read each one directly for full-resolution analysis
Use thumbnail grid for overall slide layout understanding
Generate descriptive narration based on per-image analysis
Clearly note which descriptions are based on visual interpretation

Quality Checklist

After generating the script, verify:

Every visible slide has a corresponding section
Slide numbering matches the PPT (0-indexed)
Transitions between slides are smooth
No bullet points are simply repeated verbatim
Time estimates per slide are reasonable
Total estimated time is realistic for the slide count
Q&A section includes 3-5 relevant questions
Language matches user's request or PPT's dominant language