name: Voice Mastery (Sonic Identity)
description: "The Voice God Protocol". Mastering Speech-to-Speech, Empathic AI, and Video-to-Audio to bring characters to life.
Voice Mastery: The Sonic Identity Engine
Core Philosophy
We do not use "Text-to-Speech" (robotic). We use "Speech-to-Speech" (sentient). This system is the blueprint for Project Genesis: the eventual ability for App Owners to build their own brand characters.
The Triple-Layer Protocol
1. The "Acting" Layer (Cinematic Assets)
For fixed, high-value assets (Intro Video, Welcome Message):
- Tool: Respeecher or ElevenLabs STS.
- Workflow:
  - Reference: Record the line yourself (or hire a voice actor) with the exact intended emotion (whisper, excitement, authority).
  - Conversion: Use the AI to "skin" this reference with the Angelina Model.
  - Result: The output keeps the emotional delivery of the reference performance instead of the flat reading typical of text-to-speech.
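The conversion step can be sketched as a request description for the ElevenLabs option. This is a minimal sketch, not a definitive integration: the endpoint path, the `xi-api-key` header, and the `model_id` value follow ElevenLabs' public REST API as best understood here and should be checked against the current documentation. The voice ID, file name, and the `StsRequest` / `build_sts_request` names are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path

# Assumption: speech-to-speech endpoint shape from ElevenLabs' public REST API.
# Verify against the current API reference before relying on it.
STS_URL = "https://api.elevenlabs.io/v1/speech-to-speech/{voice_id}"


@dataclass(frozen=True)
class StsRequest:
    """Everything needed to send one reference recording for conversion."""
    url: str
    headers: dict
    data: dict
    audio_path: Path


def build_sts_request(reference_audio, voice_id, api_key,
                      model_id="eleven_multilingual_sts_v2"):
    """Describe the conversion request without sending it.

    reference_audio: path to the acted reference line (hypothetical file).
    voice_id: ID of the target voice, e.g. the Angelina model (hypothetical).
    model_id: assumed default; confirm the available model names.
    """
    return StsRequest(
        url=STS_URL.format(voice_id=voice_id),
        headers={"xi-api-key": api_key},
        data={"model_id": model_id},
        audio_path=Path(reference_audio),
    )


req = build_sts_request("intro_whisper.wav", "angelina-voice-id", "YOUR_API_KEY")
```

With the third-party `requests` package installed, the description could be sent as `requests.post(req.url, headers=req.headers, data=req.data, files={"audio": open(req.audio_path, "rb")})`. Keeping the build step separate from the send step makes each conversion easy to log and repeat.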
2. The "Empathic" Layer (Real-Time Interaction)
For dynamic chat and user interaction:
- Tool: Hume AI (EVI).
- Workflow:
  - Listen: Analyze the user's audio for prosody (tone, speed, pitch).
  - Respond: Generate audio that either matches or balances the user's state.
  - Latency: Target a response time under 500 ms so the exchange feels "alive".
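The "match or balance" decision and the latency target can be illustrated without any vendor SDK. The sketch below is an assumption-laden illustration, not Hume's API: the `Prosody` fields, the 0.7 threshold, and the damping factors are all hypothetical values chosen for the example.

```python
import time
from dataclasses import dataclass

LATENCY_BUDGET_S = 0.5  # the <500 ms target from the workflow


@dataclass(frozen=True)
class Prosody:
    arousal: float  # hypothetical scale: 0.0 calm .. 1.0 agitated
    pace: float     # relative speaking rate, 1.0 = neutral


def choose_response_prosody(user, calm_threshold=0.7):
    """Match the user's state, unless they are highly aroused; then balance."""
    if user.arousal >= calm_threshold:
        # Balance: answer calmer and a little slower than a neutral pace.
        return Prosody(arousal=round(user.arousal * 0.5, 3),
                       pace=round(min(user.pace, 1.0) * 0.9, 3))
    # Match: mirror the user's energy and pace.
    return Prosody(arousal=user.arousal, pace=user.pace)


def timed(fn, *args):
    """Run fn and report (result, elapsed seconds, within budget?)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= LATENCY_BUDGET_S


reply_style, elapsed, within_budget = timed(
    choose_response_prosody, Prosody(arousal=0.9, pace=1.4)
)
```

In a real pipeline the `timed` wrapper would surround the whole listen-and-respond round trip, so that the 500 ms budget covers analysis, generation, and network time together.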
3. The "Soundscape" Layer (Video Immersion)
For video generative assets (Whisk output):
- Tool: Google DeepMind V2A (Video-to-Audio).
- Workflow:
  - Input: Feed in the "Whisk" generated video (e.g., Sizzling Steak, Laughing Angelina).
  - Generate: Produce audio synchronized to the on-screen action (sizzle, clinking, breath).
  - Mix: Blend with the Voice track for a cinema-grade result.
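The mix step can be sketched in plain Python. This is a minimal illustration that assumes both tracks are mono lists of float samples in the range -1.0 to 1.0 at the same sample rate; the `mix_tracks` name and the 0.3 default gain are hypothetical, and a production mix would normally use an audio library or a DAW.

```python
def mix_tracks(voice, ambience, ambience_gain=0.3):
    """Blend a voice track with a generated soundscape.

    The voice stays at full level; the ambience is scaled by ambience_gain.
    The shorter track is padded with silence, and the sum is hard-clipped
    to [-1.0, 1.0] so it remains a valid sample.
    """
    length = max(len(voice), len(ambience))
    mixed = []
    for i in range(length):
        v = voice[i] if i < len(voice) else 0.0
        a = ambience[i] if i < len(ambience) else 0.0
        sample = v + ambience_gain * a
        mixed.append(max(-1.0, min(1.0, sample)))
    return mixed


blended = mix_tracks([0.5, 0.9], [1.0, 1.0, 1.0], ambience_gain=0.5)
```

Keeping the ambience gain as an explicit parameter means the chosen level can be recorded alongside the other settings for each asset.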
Future: The Character Builder
Everything done for Angelina must be:
- Reproducible: Documented seeds/settings.
- Templated: Ready to be applied to "John" and future user-created characters.
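One way to meet both requirements is to keep each character's settings in a single serializable record. The sketch below is only an illustration: the `CharacterTemplate` fields, the model names, and the seed value are hypothetical placeholders, not documented settings.

```python
import json
from dataclasses import asdict, dataclass, field, replace


@dataclass(frozen=True)
class CharacterTemplate:
    """All settings needed to reproduce a character's voice pipeline."""
    name: str
    voice_model: str                      # hypothetical model identifier
    acting_tool: str = "ElevenLabs STS"
    empathic_tool: str = "Hume AI (EVI)"
    soundscape_tool: str = "Google DeepMind V2A"
    seed: int = 0                         # placeholder; record the real seed
    latency_budget_ms: int = 500
    settings: dict = field(default_factory=dict)


def to_json(template):
    """Serialize a template so it can be committed next to the assets."""
    return json.dumps(asdict(template), sort_keys=True, indent=2)


# Placeholder values for illustration only.
angelina = CharacterTemplate(
    name="Angelina",
    voice_model="angelina-v1",
    seed=1234,
    settings={"ambience_gain": 0.3},
)

# Apply the same template to a new character by overriding what differs.
john = replace(angelina, name="John", voice_model="john-v1")
```

Because the record is frozen and serialized with sorted keys, two runs with the same template produce identical JSON, which makes differences between characters easy to review.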