---
name: tts-voiceover
description: 'Text-to-speech voice-over generation from YAML speaker notes using Azure Speech SDK with SSML pronunciation control'
metadata:
  authors: "microsoft/hve-core"
  spec_version: "1.0"
---
TTS Voice Over Skill
Generates per-slide WAV voice-over files from YAML speaker_notes using Azure Speech SDK with SSML pronunciation control.
Overview
This skill reads content.yaml files from a PowerPoint skill content directory, extracts the speaker_notes fields, applies SSML acronym aliases so that technical terms are pronounced correctly, and produces one WAV file per slide. A dry-run mode verifies the SSML templates without Azure credentials.
Prerequisites
- Azure Speech resource: Free tier provides 500K characters per month.
- Authentication: Key-based (SPEECH_KEY) or Microsoft Entra ID (SPEECH_RESOURCE_ID).
- Python 3.11+ with uv for virtual environment management.
Key-Based Auth
export SPEECH_KEY="your-speech-key"
export SPEECH_REGION="eastus"
Microsoft Entra ID Auth
Requires a custom domain on the Speech resource and the Cognitive Services Speech User role.
export SPEECH_RESOURCE_ID="/subscriptions/.../Microsoft.CognitiveServices/accounts/your-resource"
export SPEECH_REGION="eastus"
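The two authentication modes above can be told apart from the environment alone. The following is a minimal sketch, not the script's actual logic: the function name `select_auth`, the returned dictionary shape, and the rule that key-based auth wins when both variables are set are all assumptions made for illustration.

```python
import os
from typing import Mapping


def select_auth(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Pick an auth mode from environment variables (hypothetical helper).

    Assumption: key-based auth takes precedence when both SPEECH_KEY and
    SPEECH_RESOURCE_ID are set; the real script may decide differently.
    """
    region = env.get("SPEECH_REGION")
    if not region:
        raise ValueError("SPEECH_REGION is not set")
    if env.get("SPEECH_KEY"):
        return {"mode": "key", "region": region, "value": env["SPEECH_KEY"]}
    if env.get("SPEECH_RESOURCE_ID"):
        return {"mode": "entra", "region": region, "value": env["SPEECH_RESOURCE_ID"]}
    raise ValueError("neither SPEECH_KEY nor SPEECH_RESOURCE_ID is set")
```

Passing the environment as a mapping keeps the check easy to exercise without touching real credentials.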
Install dependencies:
# run from this skill folder
uv sync
Quick Start
Verify SSML templates without generating audio:
uv run scripts/generate_voiceover.py --dry-run --content-dir path/to/content
Generate voice-over WAV files:
uv run scripts/generate_voiceover.py --content-dir path/to/content --output-dir voice-over
Embed audio into a PPTX deck:
uv run scripts/embed_audio.py --input deck.pptx --audio-dir voice-over --output deck-narrated.pptx
Parameters Reference
generate_voiceover.py
| Parameter | Type | Default | Description |
|---|---|---|---|
| --dry-run | flag | false | Print SSML templates without generating audio |
| --voice | string | en-US-Andrew:DragonHDLatestNeural | Azure TTS voice name |
| --rate | string | +10% | Speech prosody rate |
| --content-dir | path | content | Path to slide content directory |
| --output-dir | path | voice-over | Path to WAV output directory |
| --lexicon | path | (auto-detect) | Custom acronyms.yaml path |
| --verbose / -v | flag | false | Enable verbose (DEBUG) logging output |
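For readers who want to see how the parameters above fit together, here is a hedged `argparse` sketch that mirrors the table. It is an illustration only: the function name `build_parser` is hypothetical, and the real script may define its arguments differently.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="generate_voiceover.py")
    parser.add_argument("--dry-run", action="store_true",
                        help="Print SSML templates without generating audio")
    parser.add_argument("--voice", default="en-US-Andrew:DragonHDLatestNeural",
                        help="Azure TTS voice name")
    parser.add_argument("--rate", default="+10%",
                        help="Speech prosody rate")
    parser.add_argument("--content-dir", type=Path, default=Path("content"),
                        help="Path to slide content directory")
    parser.add_argument("--output-dir", type=Path, default=Path("voice-over"),
                        help="Path to WAV output directory")
    # None stands in for the documented "(auto-detect)" default.
    parser.add_argument("--lexicon", type=Path, default=None,
                        help="Custom acronyms.yaml path")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Enable verbose (DEBUG) logging output")
    return parser
```

Calling `build_parser().parse_args([])` yields the defaults listed in the table.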
embed_audio.py
Embeds WAV files into corresponding PPTX slides and adds narration timing XML so PowerPoint recognizes the audio for video export via File > Export > Create a Video > Use Recorded Timings and Narrations.
| Parameter | Type | Default | Description |
|---|---|---|---|
| --input | path | (required) | Source PPTX file path |
| --audio-dir | path | voice-over | Directory with slide-NNN.wav |
| --output | path | *-narrated.pptx | Output PPTX file path |
| --verbose / -v | flag | false | Enable verbose (DEBUG) logging output |
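The `*-narrated.pptx` default suggests that the output name is derived from the input name. A minimal sketch of that derivation follows, assuming the `*` stands for the input file's stem and that the output lands next to the input. The helper name `default_output_path` is hypothetical.

```python
from pathlib import Path


def default_output_path(input_path: Path) -> Path:
    """Derive '<stem>-narrated<suffix>' beside the input file (assumed rule)."""
    return input_path.with_name(f"{input_path.stem}-narrated{input_path.suffix}")
```

Under this assumption, `deck.pptx` maps to `deck-narrated.pptx`, matching the Quick Start example.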
Script Reference
Generate with custom voice and rate:
uv run scripts/generate_voiceover.py \
--content-dir content \
--output-dir voice-over \
--voice "en-US-Jenny:DragonHDLatestNeural" \
--rate "+5%"
Use a custom lexicon:
uv run scripts/generate_voiceover.py \
--content-dir content \
--lexicon custom-acronyms.yaml
Embed generated audio:
uv run scripts/embed_audio.py \
--input slide-deck/presentation.pptx \
--audio-dir voice-over \
--output slide-deck/presentation-narrated.pptx
Acronym Lexicon
The lexicon controls SSML <sub alias> replacements for acronyms and technical terms. Create an acronyms.yaml file:
acronyms:
HVE-Core: "H V E Core"
OWASP: "Oh wasp"
SBOM: "S Bomb"
SLSA: "Salsa"
CI/CD: "C I C D"
Lexicon resolution order:
- Path specified via the --lexicon argument.
- acronyms.yaml in the content directory.
- Built-in defaults covering common technical acronyms.
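The resolution order above can be sketched as a small function. This is an illustration under assumptions, not the script's code: `resolve_lexicon` is a hypothetical name, and returning `None` is used here to mean "fall back to the built-in defaults".

```python
from pathlib import Path
from typing import Optional


def resolve_lexicon(cli_path: Optional[Path], content_dir: Path) -> Optional[Path]:
    """Return the lexicon file to load, or None for built-in defaults."""
    # 1. An explicit --lexicon path always wins.
    if cli_path is not None:
        return cli_path
    # 2. Otherwise look for acronyms.yaml inside the content directory.
    candidate = content_dir / "acronyms.yaml"
    if candidate.is_file():
        return candidate
    # 3. Nothing found: the caller falls back to built-in defaults.
    return None
```

This sketch does not check that an explicit path exists; a real implementation would likely report a missing file rather than silently falling through.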
SSML Template
Each slide produces an SSML document:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">
<voice name="en-US-Andrew:DragonHDLatestNeural">
<prosody rate="+10%">
Text with <sub alias="Oh wasp">OWASP</sub> aliases applied.
</prosody>
</voice>
</speak>
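To show how a template like the one above could be produced, here is a minimal sketch using only the Python standard library. The function names `apply_aliases` and `build_ssml` are hypothetical, and the matching rules (case-sensitive, whole-term, longest term first) are assumptions rather than the script's documented behavior.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def apply_aliases(text: str, acronyms: dict[str, str]) -> str:
    """XML-escape text and wrap known terms in <sub alias="..."> elements."""
    if not acronyms:
        return escape(text)
    # Longest terms first so a longer term wins over a shorter overlapping one.
    terms = sorted(acronyms, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in terms) + r")(?!\w)"
    )
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the plain text between matches, then emit the alias element.
        parts.append(escape(text[last:match.start()]))
        term = match.group(1)
        parts.append(f"<sub alias={quoteattr(acronyms[term])}>{escape(term)}</sub>")
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


def build_ssml(text: str, acronyms: dict[str, str],
               voice: str = "en-US-Andrew:DragonHDLatestNeural",
               rate: str = "+10%") -> str:
    """Wrap aliased text in the speak/voice/prosody structure shown above."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"\n'
        '       xmlns:mstts="http://www.w3.org/2001/mstts" xml:lang="en-US">\n'
        f"  <voice name={quoteattr(voice)}>\n"
        f"    <prosody rate={quoteattr(rate)}>\n"
        f"      {apply_aliases(text, acronyms)}\n"
        "    </prosody>\n"
        "  </voice>\n"
        "</speak>\n"
    )
```

Escaping the narration text before inserting it matters because speaker notes can contain characters such as `&` or `<` that would otherwise make the SSML invalid.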
Integration with PowerPoint Skill
This skill reads from the PowerPoint skill's content directory structure:
content/
├── slide-001/
│ └── content.yaml # Must include speaker_notes: field
├── slide-002/
│ └── content.yaml
└── ...
Each content.yaml should contain a speaker_notes: field with the narration text. The generated WAV files are named slide-NNN.wav matching the directory names.
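The directory-to-file naming described above can be sketched as follows. This is a hedged illustration: `plan_outputs` is a hypothetical helper, it only pairs each content.yaml with its target WAV path, and it does not parse the YAML or check that speaker_notes is present.

```python
from pathlib import Path


def plan_outputs(content_dir: Path, output_dir: Path) -> list[tuple[Path, Path]]:
    """Pair each slide's content.yaml with the WAV path it would produce."""
    plan = []
    # Sorting keeps slides in slide-001, slide-002, ... order.
    for slide_dir in sorted(content_dir.glob("slide-*")):
        content_file = slide_dir / "content.yaml"
        # Skip anything that does not contain a content.yaml file.
        if content_file.is_file():
            plan.append((content_file, output_dir / f"{slide_dir.name}.wav"))
    return plan
```

For a content directory holding slide-001 and slide-002, this yields targets named slide-001.wav and slide-002.wav in the output directory.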
Troubleshooting
| Issue | Solution |
|---|---|
| Set SPEECH_KEY ... or SPEECH_RESOURCE_ID | Export SPEECH_KEY (key auth) or SPEECH_RESOURCE_ID (Entra ID) with SPEECH_REGION. |
| 401 with Entra ID auth | Verify custom domain on the Speech resource and Cognitive Services Speech User role. RBAC propagation takes up to 5 minutes. |
| Empty WAV files or skipped slides | Verify speaker_notes: is present and non-empty in content.yaml. |
| Mispronounced acronyms | Add entries to acronyms.yaml with phonetic aliases. |
| azure-cognitiveservices-speech package is required | Run uv sync in the skill directory. |
| Audio icon visible in PPTX | Reposition or resize the audio object in PowerPoint after embedding. |
| Authored slide animations missing after embedding | embed_audio.py replaces existing p:timing with narration timing; re-apply animations in PowerPoint after embedding audio. |
| Slides no longer advance on click after embedding | embed_audio.py sets advClick="0" for auto-advance. To re-enable, select all slides in PowerPoint and check Advance Slide > On Mouse Click in the Transitions tab. |
| Video export shows "No timings recorded" | Re-embed the audio with the updated embed_audio.py, which adds narration timing XML automatically. |
Brought to you by microsoft/hve-core