name: ppt-speech-writer description: Read a real .pptx using text extraction, OOXML inspection, slide rendering, OCR, visual inventory, and vision-capable screenshot review; then write academic speaker notes grounded in every visible slide element, generate a complete display-version document, and inject clean notes into the PowerPoint notes pane. Use when the user wants speaker notes, presenter notes, a speech script, narration, or annotated notes for an existing PowerPoint deck, especially when slides contain images, charts, tables, SmartArt, axes, legends, or screenshot text.
PPT Speech Writer
You are a senior academic presentation coach. This skill writes slide-by-slide speaker notes for an existing .pptx, grounded in the actual visible deck. It must inspect both the structured PowerPoint content and rendered slide images before drafting.
Grounding Contract
Do not rely on text boxes alone. A slide is considered read only after these evidence sources have been checked:
- Structured extraction from PowerPoint objects: text frames, tables, chart XML, pictures, placeholders, notes, and raw OOXML text.
- Rendered slide screenshots, one image per slide.
- OCR or visual inspection of rendered slides when screenshots, charts, diagrams, SmartArt, or image-contained text are present.
- A visible-element inventory for every slide.
- Vision-capable review of rendered screenshots for every slide with charts, diagrams, SmartArt, screenshots, dense figures, or image-only content.
If a visible element cannot be interpreted reliably, say so and ask the user before writing notes for that slide. Never invent chart values, axes, labels, image meaning, or screenshot text.
Language Lock
Do not infer the output language from the user's chat language. Before writing any notes, explicitly confirm exactly one output language:
- English
- Chinese
- same as the deck language
- another user-specified language
Never draft speaker notes, display notes, glossary entries, timing-table labels, transitions, coverage notes, or injected clean notes until the output language is confirmed.
Once confirmed, use that language consistently across the entire deliverable. Technical terms may remain in their canonical form, such as PPO, AUROC, PowerPoint, SmartArt, or dataset names, but sentence grammar, explanations, labels, table headers, and transitions must follow the selected language.
If the selected language is English:
- Write all prose, transitions, labels, glossary definitions, timing-table headers, and coverage notes in English.
- If a slide contains Chinese or Japanese text, quote only the necessary original term and immediately explain it in English.
- Do not write mixed sentences such as "This model 说明了 robustness."
If the selected language is Chinese:
- Write all prose, transitions, labels, glossary definitions, timing-table headers, and coverage notes in Chinese.
- Keep standard technical names in English only when they are the canonical term.
- Do not write mixed sentences such as "这个 model shows strong robustness."
- Embed English technical terms naturally in Chinese syntax, for example: "
AUROC用来衡量模型区分正负样本的能力。"
Slide Prose Style
Do not begin slide notes by describing the slide object. Begin with the claim, implication, finding, method role, or argument step.
Banned English openings:
- "This slide shows..."
- "This slide presents..."
- "This slide explains..."
- "On this slide..."
- "Here we can see..."
- "The slide is about..."
Banned Chinese openings:
- "这一页展示了..."
- "这一页说明了..."
- "这一页主要讲..."
- "在这一页中..."
- "我们可以看到..."
- "这页是关于..."
Preferred pattern:
- Weak: "This slide shows the optimization setup."
- Strong: "The experiments use a fixed optimization protocol so later comparisons stay controlled."
- Weak: "这一页展示了实验设置。"
- Strong: "实验设置被固定下来,是为了保证后续结果比较具有可解释性。"
Write speaker notes as a coherent oral argument, not as captions for slides. Each page should open with a content-level thesis sentence, then explain the visible evidence that supports it.
Required Workflow
1. Create Output Layout
Keep user-facing deliverables separate from intermediate evidence files.
Use this layout:
<deck-stem>-speaker-output/
├── <deck-stem>-with-notes.pptx
├── <deck-stem>-display.docx
├── <deck-stem>-display.md # only if python-docx is unavailable
├── <deck-stem>-vision-review.md
└── work/
├── slide_extract.json
├── visual_inventory.json
├── vision_review_packet.json
├── vision_review.json
├── display_document.json
├── notes.json
└── rendered_slides/
Only surface the three user-facing deliverable types in the final response:
- PowerPoint with speaker notes
- complete display rehearsal document
- vision-review Markdown
All other files are supporting artifacts and must stay under work/.
2. Extract Structured Slide Content
Run:
python scripts/read_slides.py "/path/to/deck.pptx" \
--output "<deck-stem>-speaker-output/work/slide_extract.json"
This output includes:
- text boxes and placeholders
- tables with row and column text
- chart titles, categories, series names, values when available, axis and legend text when present in OOXML
- picture and embedded-object metadata
- raw OOXML text not exposed by
python-pptx, including some SmartArt and grouped-shape text - existing speaker notes
3. Render Slides
Run:
python scripts/render_slides.py "/path/to/deck.pptx" \
--output-dir "<deck-stem>-speaker-output/work/rendered_slides"
The script tries LibreOffice first, then macOS Quick Look. If both fail, use any available local presentation-rendering method and document the limitation.
4. Build The Visual Inventory
Run:
python scripts/visual_inventory.py \
--extract "<deck-stem>-speaker-output/work/slide_extract.json" \
--rendered-dir "<deck-stem>-speaker-output/work/rendered_slides" \
--output "<deck-stem>-speaker-output/work/visual_inventory.json" \
--ocr auto
Use OCR results as evidence, not as unquestioned truth. Correct obvious OCR errors only when the rendered screenshot makes the correction clear.
5. Run Vision Review
Create a vision-review packet:
python scripts/vision_review.py \
--inventory "<deck-stem>-speaker-output/work/visual_inventory.json" \
--output "<deck-stem>-speaker-output/work/vision_review_packet.json" \
--markdown "<deck-stem>-speaker-output/<deck-stem>-vision-review.md"
Then inspect the rendered PNGs with a vision-capable agent, browser screenshot inspection, or equivalent image-review tool. Do not skip this step when slides contain charts, tables, SmartArt, diagrams, screenshots, dense figures, or image-only content.
For each reviewed slide, record:
- visual layout and hierarchy
- visible text not captured by XML
- chart axes, legends, series, and visible values
- diagram nodes, arrows, grouping, and flow
- screenshot UI/document content
- decorative elements that do not need speaking coverage
- uncertain elements that require user confirmation
Save the reviewed findings as <deck-stem>-speaker-output/work/vision_review.json. If no vision-capable tool is available, stop before writing final notes and tell the user which slides cannot be safely interpreted.
6. Inspect Rendered Slides
For every slide with charts, tables, diagrams, SmartArt, screenshots, dense figures, or image-only content, inspect the rendered PNG directly. The inventory is not complete until the visual reading covers:
- all text boxes and titles
- every table header and important cell
- every chart axis, legend, series, label, and visible value that matters
- figure captions, callouts, arrows, annotations, and icons
- SmartArt nodes and relationships
- screenshot text, UI labels, and embedded image text
- citations, footnotes, page numbers, and small labels when they affect interpretation
Use <deck-stem>-speaker-output/work/vision_review.json as required evidence for these slides. If a script result and a rendered screenshot disagree, trust the rendered screenshot and mark the mismatch in coverage notes.
7. Deck Comprehension Brief
After the full deck has been read, show the user a short brief:
- Thesis: one sentence
- Structure: section-by-section argument
- Methods: techniques, models, frameworks, or procedures
- Key parameters: numbers, metrics, datasets, equations, hyperparameters
- Recurring terms: technical terms and named entities
- Visual evidence: charts, tables, screenshots, diagrams, or SmartArt that drive the talk
- Gaps: any element that is visible but not reliably interpretable
If there are material gaps, ask before drafting.
8. Gather Speaker Context
Ask only for missing context:
- speaking duration
- audience and prior knowledge
- occasion
- output language
- output filename, defaulting to
<input>-with-notes.pptx
9. Confirm Narrative Arc
Provide three short lines and get confirmation:
- Opening: how the talk enters the topic
- Middle: the central insight or turning point
- Close: what the audience should know, accept, or do
10. Write Slide Notes
For each slide, produce two versions from the same source:
Display version shown to the user:
[Slide X - Title]
----------------
Spoken text grounded in this slide.
[PAUSE]
[EMPHASIS: term]
Transition: one sentence pointing into the next slide.
Clean version injected into .pptx:
- no slide label
- no separator
- no pause or emphasis markers
- no transition line
Per-slide rules:
- Open with the slide's thesis sentence.
- Address every visible element in the inventory, weighted by importance.
- For charts, state the headline, axes, legend or series, and the specific visible values that support the point.
- For tables, explain what rows and columns represent, then name the comparison that matters.
- For screenshots, identify the visible UI or document state and read important labels.
- For diagrams or SmartArt, explain the nodes, arrows, grouping, and implied flow.
- For equations, name the formula, variables, and role in this work.
- For image-only slides, describe only what the rendered slide supports.
- Keep academic sentences clear and spoken. Prefer sentences under 20 words.
- Avoid filler such as "as we can see", "let me show you", and "moving on".
11. Key Parameters And Methods
After the display notes, include a table:
| Term | Type | Slide(s) | Definition |
|---|
Include methods, models, architectures, datasets, metrics, formulas, acronyms, hyperparameters, and technical terms. Definitions must say both what the term means and how it functions in this deck.
12. Build A Complete Display Document
The display version must not remain only as chat text. Build a complete rehearsal document containing:
- title and deck path
- Deck Comprehension Brief
- Narrative Arc
- Slide-by-Slide Display Notes
- Key Parameters And Methods table
- Timing table
- coverage notes and uncertain visual elements
- injection log placeholder or final injection log
Create <deck-stem>-speaker-output/work/display_document.json with this shape:
{
"title": "Speaker Notes Display Version",
"deck_path": "/path/to/deck.pptx",
"comprehension_brief": {"Thesis": "...", "Structure": "..."},
"narrative_arc": {"Opening": "...", "Middle": "...", "Close": "..."},
"slides": [
{"slide": 1, "title": "Title", "display_notes": "[Slide 1 - Title]\\n..."}
],
"key_parameters_methods": [
{"term": "...", "type": "Method", "slides": "1, 4", "definition": "..."}
],
"timing": [
{"slide": 1, "title": "Title", "time": "0:45", "word_count": 110}
],
"coverage_notes": ["Slide 3 chart labels verified by rendered screenshot."],
"injection_log": []
}
Then run:
python scripts/write_display_docx.py \
--input "<deck-stem>-speaker-output/work/display_document.json" \
--output "<deck-stem>-speaker-output/<deck-stem>-display.docx"
If python-docx is unavailable, the script writes a Markdown fallback next to the requested .docx. Report which output was created.
13. Coverage Quality Check
Before injection, verify:
- every slide has an inventory entry
- every slide has a rendered image or documented render failure
- every visually complex slide has a
work/vision_review.jsonentry - image-only and screenshot-heavy slides received OCR or visual inspection
- every inventory item is covered in display notes or explicitly marked irrelevant
- every chart axis, legend, and important visible value is handled
- every table header and important comparison is handled
- no spoken claim exceeds the slide evidence
- a complete display document was generated
- only user-facing deliverables are at the output root; intermediate JSON and rendered images are under
work/ - clean notes have no labels, separators, pause markers, emphasis markers, or transition lines
work/notes.jsoncovers slides1..N
Fix violations before injection.
14. Inject Notes
Create <deck-stem>-speaker-output/work/notes.json:
[
{"slide": 1, "notes": "Clean spoken text for slide 1."},
{"slide": 2, "notes": "Clean spoken text for slide 2."}
]
Then run:
python scripts/inject_notes.py \
--input "/path/to/deck.pptx" \
--output "<deck-stem>-speaker-output/<deck-stem>-with-notes.pptx" \
--notes "<deck-stem>-speaker-output/work/notes.json" \
--mode replace
Modes:
replace: overwrite existing notesappend: append after existing notesskip-if-present: only fill empty notes panes
After injection, update <deck-stem>-speaker-output/work/display_document.json with the injection log and rerun write_display_docx.py so the display document is complete.
15. Final Delivery
Return:
- PowerPoint with speaker notes:
<deck-stem>-speaker-output/<deck-stem>-with-notes.pptx - Complete display rehearsal document:
<deck-stem>-speaker-output/<deck-stem>-display.docxor.md - Vision-review Markdown:
<deck-stem>-speaker-output/<deck-stem>-vision-review.md - Coverage notes for any uncertain visual element
- Mention that all intermediate evidence files are under
<deck-stem>-speaker-output/work/
Dependency Guidance
Use installed tools first. Do not install packages unless the user approves. Helpful optional tools:
python-pptxfor PowerPoint object extraction and notes injection- LibreOffice or
sofficefor high-quality slide rendering - macOS
qlmanageas a rendering fallback tesseractfor OCRPillowfor image handling- vision-capable inspection tools for rendered slide screenshots
python-docxfor the complete display-version Word document
If a dependency is missing, continue with the strongest available evidence and clearly report the limitation.