25-voice-clone-podcast-global - SKILL.md Agent Skill

name: 25-voice-clone-podcast-global description: "Audio AI for global personal brand — voice clone (ElevenLabs, Murf, PlayHT), podcast, audiobook, voiceover. 3 use cases: short voiceover (TikTok/Reels), podcast 30-60min, audiobook. 1:10 repurpose (1 podcast → 10 short clips). English (US/UK/AU/SG accents available). Trigger: 'voice clone global', 'ElevenLabs', 'podcast AI', 'audiobook AI', 'voiceover AI'." metadata: version: 1.0.0 category: content license: MIT triggers: - "voice clone global" - "ElevenLabs" - "podcast AI" - "audiobook AI" - "voiceover AI" related: - 24-ai-avatar-production-global - 26-thought-leadership-content-global - references/voice-clone-prompts-global - references/ai-video-disclosure-global

Voice Clone & Podcast — Audio AI for Personal Brand (Global)

This skill focuses on audio AI — voice clone, podcast, audiobook, voiceover. Pairs with 24-ai-avatar-production-global (video) — combine both for full content stack coverage.

1. Newbie Guide

What is audio AI and how is it different from video AI?

Audio AI is the tech behind synthetic voices that sound nearly human — from a sample of your voice, AI learns and produces a synthetic clone (voice clone). You write text -> AI reads it back (Text-to-Speech).

Differences vs video AI:

Video AI (skill 24): produces video with face + voice -> talking head, social video
Audio AI (this skill): produces voice only -> podcast, audiobook, voiceover, narration

When to use audio AI instead of video?

Situation	Pick audio AI	Pick video AI
Long-form content (>10 min)	YES — podcast format	NO — too long for video
Don't want to be on camera	YES	NO
Need volume content fast	YES — 1 podcast = 10 shorts	YES but more expensive
Audience listens while driving / at gym	YES	NO
Need visuals to demo	NO	YES
Personal brand thought leader	YES — podcast = authority	YES — if face brand exists

Main tools (international)

ElevenLabs: Best in class for voice clone — top-tier English voices (US/UK/AU/IN), 30+ languages
Murf: 120+ voice library, strong for corporate voiceover, multilingual
PlayHT: API-friendly, instant clone, 800+ voices
HeyGen Voice: Bundles with HeyGen avatars — seamless voice + video pipeline
Descript: AI editing — cut audio by editing text, voice clone (Overdub)
Resemble.ai: Custom emotion control, brand-grade APIs
Riverside: Studio-quality podcast recording with AI Magic Clips repurpose

Time and cost

Task	Time	Cost (USD/mo)
Voice clone setup	30-60 min	$5-22 (ElevenLabs Starter/Pro)
60s voiceover (TikTok)	5-10 min	$5-22
30 min solo podcast	1-2 hrs	$22-99 (ElevenLabs + Riverside)
Audiobook chapter (15 min)	30-45 min	$22-99
1 podcast -> 10 clips	1-2 hrs	$0-30 (Descript/Opus Clip)

5 common mistakes

AI voice sounds robotic: sample too short or monotonic. Fix: re-record 3-5 minutes with varied emotions (happy, serious, sad).
Mispronounced names/jargon: TTS engines mishandle proper nouns. Fix: use phonetic spelling (e.g., "Anthropic" -> "an-THROW-pic") in the script.
Audio clipping: levels too hot. Fix: target -3dB peak, -16 LUFS loudness.
Background noise/echo: untreated room. Fix: small room with curtains and rugs, or apply NVIDIA Broadcast / Krisp / Adobe Enhance Speech.
Boring podcast: no editing, too many "ums". Fix: Descript auto-removes filler words, add light background music (-25dB).

2. Information collection

Ask up to 4 questions before starting:

Main use case? Short voiceover (TikTok/Reels) / Podcast 30-60 min / Audiobook?
Language(s)? English (US/UK/AU/IN) / multilingual / single non-English?
Total length? <60s / 5-30 min / 30-60 min / >60 min (audiobook)?
Budget tier? Free ($0) / Starter ($5-22) / Pro ($22-99) / Business ($99+)?

Based on the answers, pick the appropriate use case + tool stack.

3. Voice clone setup

Sample requirements

Criterion	Minimum	Optimal
Length	1 min (Free tier)	3-5 min (Pro tier)
Room	Quiet, no echo	Acoustic treatment, rugs, curtains
Mic	Phone + headset mic	Condenser mic (AT2020, $80-100)
Distance	20-30 cm	15-20 cm with pop filter
Format	MP3 128 kbps	WAV 44.1 kHz
Content	One pre-written passage	Three passages: business / casual / emotional

Full reference: references/voice-clone-prompts-global.md — sample scripts across English variants (US/UK/AU/SG/IN) and 3 topics (business / lifestyle / educational).

Tool comparison (global)

Tool	English clone quality	Price/mo	Setup time	Best for
ElevenLabs Pro	Excellent (10/10)	$22	30 min	Multilingual, content creator
HeyGen Voice	Good (8/10)	Bundled with avatar	15 min	Combo with video AI
Murf	Excellent (9/10)	$29-79	30 min	Corporate voiceover, e-learning
PlayHT	Excellent (9.5/10)	$39-99	30 min	API-driven, instant clone
Descript Overdub	Good (8/10)	$24 (Hobbyist)	30 min	Podcast editing
Resemble.ai	Excellent (9/10)	$30-99	1 hr	Brand custom voice, emotion control

Recommendations:

English-only creator: ElevenLabs Pro ($22) — best balance of quality and price
Multilingual creator: ElevenLabs Pro (30+ languages built in)
Combo with video: HeyGen (single platform — voice + avatar)
Brand/agency at scale: Resemble.ai or PlayHT (API + custom emotion)

Consent form template

VOICE CLONE LICENSE AGREEMENT

I, [Full name], ID/passport: [number], grant [Brand/Company]:
1. Permission to use samples of my voice to create an AI voice clone.
2. Use of the voice clone in [scope: internal / advertising / podcast / etc.].
3. Term: from [DD/MM/YYYY] to [DD/MM/YYYY].
4. Right of withdrawal: I may request deletion of the voice clone at any time
   in writing; the brand has 7 days to fully remove it.
5. Disclosure: the brand commits to disclose "AI-generated voice" wherever
   required by applicable law (FTC, EU AI Act, etc.).

Signed: ____________   Date: ____________

4. Three use cases

Use case A: Short voiceover for TikTok/Reels (Energetic)

Spec:

Length: 15-60s
Pace: fast (180-220 wpm) — younger English-speaking audience
Tone: energetic, slightly higher pitch, exciting
Audio levels: -14 LUFS (TikTok), peak -1 dB
CTA: clear in the last 5 seconds

Script template (30s):

[HOOK 0-3s] "Did you know [shocking stat]?"
[PROBLEM 3-10s] "Most people are still stuck in [wrong loop]"
[SOLUTION 10-22s] "I tried [method], and here are 3 things..."
[PAYOFF 22-27s] "Result: [specific number]"
[CTA 27-30s] "Comment 'YES' to get the full breakdown"

Voice settings (ElevenLabs):

Stability: 35-45 (low — allows variation)
Similarity: 75-85
Style: 50-65 (boost expressiveness)
Speaker Boost: ON

Use case B: Podcast 30-60 min (Conversational)

Structure:

Intro (1-2 min): hook + introduce topic + welcome listeners
Body (25-50 min): 3-5 main segments, each 5-10 min
Ad slot (optional): 3-5 min after intro, or mid-body
Outro (1-2 min): recap + CTA + thanks

Pacing:

Conversational pace: 140-160 wpm
1-2s pause after important sentences
Segment transitions: 2-3s pause + audio sting

Sound design:

Background music: -25 to -30 dB (very subtle)
Stings/transitions: -15 dB, 1-2s
Voice levels: -16 LUFS (podcast standard), peak -1 dB

Voice settings (ElevenLabs):

Stability: 60-75 (high — consistent across 30+ minutes)
Similarity: 85-95
Style: 30-40 (natural, not over-expressive)
Speaker Boost: ON

Use case C: Audiobook (Mid-tempo)

Structure:

Chapter intro: "Chapter [X]: [Title]" — 2s pause
Chapter body: 10-20 min/chapter, 1s pause between paragraphs
Chapter end: 3s pause before next chapter

Pacing:

Mid-tempo: 150-170 wpm
Natural breath every 2-3 sentences
Dialogue: subtle voice shifts per character (fiction)

Consistency check (most important):

Render Chapter 1 and Chapter 5 -> compare voice -> must match 95%+
If voices drift: re-clone with a longer sample (5+ min)
Pronunciation guide: build a database of proper nouns + custom phonetics

Voice settings (ElevenLabs):

Stability: 70-85 (very high — consistent for hours)
Similarity: 90-95
Style: 20-30 (calm, even)
Speaker Boost: ON

5. Tool comparison (global)

Tool	Price/mo	English quality	Multilingual	Setup	Pros	Cons	Best for
ElevenLabs	$5-99	10/10	30+ langs	30 min	Best clone, multilingual	Pricier high tiers	Multilingual creator
HeyGen Voice	Bundle w/ avatar	8/10	40+ langs	15 min	Combo with avatar	Voice clone less expressive	Combo with video
Descript	$24-30	9/10	EN focus	30 min	Audio editing first	Multilingual weaker	Podcast editing
Riverside	$19-29	n/a (recording)	n/a	5 min	Studio recording	Not TTS	Live podcast
Murf	$29-79	9/10	20+ langs	30 min	120+ voice library	Voice clone limited tier	Corporate voiceover
PlayHT	$39-99	9.5/10	100+ langs	30 min	Strong API, instant clone	UI dense	Developer/API
Resemble.ai	$30-99	9/10	60+ langs	1 hr	Custom emotion control	Steep learning curve	Brand custom voice

Recommended combos 2025-2026:

English solo creator: ElevenLabs Pro ($22) + Riverside Free + Descript Hobbyist ($24)
Multilingual creator: ElevenLabs Pro ($22) + Riverside Standard ($19) + Descript Pro ($30)
Brand/agency: ElevenLabs Creator ($99) + Resemble.ai + Riverside Pro ($29)

6. 1-on-1 podcast with an AI co-host

Use case: solo podcaster who wants conversational format but can't find a co-host. AI co-host = a second AI voice that asks questions while you answer.

Setup — prompt-engineering the AI personality

Step 1: Define the AI co-host's personality

Name: [AI co-host name]
Personality: curious, asks deep follow-ups, occasionally light humor
Role: asks the host questions, doesn't talk too much
Speaking style: casual, natural, addresses the host by first name
Knowledge level: average — asks questions like a listener would
Catchphrases: "Wow, that's wild." / "What does that mean exactly?" / "Can you go deeper?"

Step 2: Create a separate voice clone for the AI co-host

Use a different voice than the host (e.g., woman vs man, or different accent)
Clone from a consenting friend, or use one of ElevenLabs' built-in voices

Step 3: Tool stack

ElevenLabs: generate the AI co-host voice
Riverside: record the host live
Descript: edit + splice in the AI co-host (text-to-audio)

Q&A script template

[INTRO]
Host: Hey everyone, today [AI co-host] and I are diving into...
AI co-host: Hi all, I'm [name]. Today I want to dig into [topic] from [host]'s
            point of view. Let's go!

[BODY — 5-7 Q&A pairs]
AI co-host: [Broad opening question]
Host: [Answers 2-3 minutes]
AI co-host: [Deeper follow-up]
Host: [Answers with a concrete example]
... repeat 5-7 times ...

[OUTRO]
AI co-host: Thanks [host] for sharing. The biggest thing I learned was...
Host: Thanks [AI co-host]. If you have questions, drop them in the comments...

Tip: pre-write 7-10 AI co-host questions in a doc, record host responses in one go. Then generate AI co-host audio in ElevenLabs and splice in via Descript.

7. Repurpose pipeline 1:10 (1 podcast -> 10 short clips)

Workflow overview

[1] Record 60-min podcast (Riverside)
        v
[2] Auto-transcript (Descript / Riverside)
        v
[3] Identify hooks (10-15 quotable lines)
        v
[4] Cut 30-60s clips per quote (Opus Clip / Descript)
        v
[5] Add captions (auto-caption)
        v
[6] Distribute across 4 platforms

How to identify hooks

Find moments in the transcript with these traits:

Bold statement: "I think 90% of founders are doing this wrong"
Counter-intuitive: "Raising prices actually grew revenue"
Specific number: "Went from $0 to $1M in 6 months"
Personal story: "On my first startup I lost $200K"
Actionable tip: "3 specific steps you can take today"

Target: 10-15 hooks per 60-min podcast. Pick the 10 best.

Tool stack

Descript: auto-clip — select sentences, export short clips (free tier 1 hr/month)
Opus Clip: AI auto-finds viral moments + auto-format vertical/horizontal ($19-99)
Riverside Magic Clips: built-in to Riverside Pro ($29)
CapCut + ChatGPT: manual but free — paste transcript, ChatGPT extracts hook candidates

Distribution across 4 platforms

Platform	Format	Length	Caption	Bonus
TikTok	9:16 (1080×1920)	30-60s	Bold caption on top	Trend audio overlay (low volume)
Instagram Reels	9:16	15-90s	Clean subtitle, sans-serif font	Strong cover image
YouTube Shorts	9:16	<60s	Auto-caption	Title with target keyword
LinkedIn audio	1:1 (square video w/ audio)	60-120s	Subtitle below	Long-form thread (carousel)

Pro tip: each clip should target one platform with platform-specific captions and cover image. Maximizes reach.

8. Audio QA + Disclosure

5 QA criteria

Clarity (10 pts): voice clear, no rasp, no stuttering. Test: play on phone speaker, still intelligible.
No clipping (10 pts): peak below -1 dB. Tools: Audacity, Adobe Audition, Reaper.
No background noise (10 pts): no fans, traffic, neighbors. Tools: Krisp, NVIDIA Broadcast, Adobe Enhance Speech.
Consistent loudness (10 pts): stable -16 LUFS (podcast) or -14 LUFS (TikTok). Tool: loudness meter in DAW.
Natural pauses (10 pts): human-like pacing. Manual review: listen 3 times.

Pass: 40+/50. Below 40 = re-render or re-record.

Global disclosure — when required

Situation	Disclosure	Placement
Commercial advertising	REQUIRED	Caption + end of audio ("This audio uses an AI voice clone")
Personal brand podcast	RECOMMENDED — transparency	Episode description
Fiction audiobook	OPTIONAL	Optional — credits at end
News/educational	REQUIRED	Beginning of audio + caption
Internal corporate content	NOT REQUIRED	n/a

Disclosure caption template:

This audio uses AI voice cloning technology
(ElevenLabs / Murf / [tool name]). Content was written and reviewed by [Name].

Full reference: references/ai-video-disclosure-global.md — FTC, EU AI Act, FCC, and OFCOM requirements; 3-tier disclosure framework, situational templates (also applies to audio).

9. Quality checklist

Before publishing audio:

Voice-clone sample 3-5 min, quiet room
Signed consent form (if cloning someone else's voice)
Use case matches: voiceover (energetic) / podcast (conversational) / audiobook (mid-tempo)
Voice settings appropriate for use case (Stability/Similarity/Style)
Loudness correct: -14 LUFS (TikTok) / -16 LUFS (podcast/audiobook)
Peak below -1 dB (no clipping)
No background noise (Krisp/NVIDIA Broadcast pass)
Pacing correct: 180-220 wpm (TikTok) / 140-160 wpm (podcast) / 150-170 wpm (audiobook)
QA Score 40+/50
Disclosure caption (if commercial use)
Repurpose plan: 1 podcast -> 10 clips across 4 platforms

Skill 25 (Global) | v1.0.0