name: hyperframes-media
description: Audio and media assets for HyperFrames compositions, produced by one shared audio engine (scripts/audio.mjs) — multi-provider TTS (HeyGen / ElevenLabs / Kokoro local), background music + sound effects (HeyGen audio-library retrieval by default, with local Lyria / MusicGen BGM generation and a bundled SFX library as the no-credential fallback), Whisper transcription, background removal, and caption authoring. Use for voiceover / TTS, BGM, SFX / sound effects, transcription, captions / subtitles / lyrics / karaoke / per-word styling, voice + provider selection, and music-mood prompting.
HyperFrames Media
Create the audio and media assets a composition needs — voiceover (TTS), background music + sound effects, transcription, captions, background removal — then consume and animate that data in HTML. For placing assets into compositions, see hyperframes-core.
The audio engine — one source for TTS · BGM · SFX
Workflows do NOT hand-roll audio or vendor a copy. There is one engine — scripts/audio.mjs — that takes a neutral audio_request.json and writes audio_meta.json (plus assets under assets/voice|bgm|sfx):
# <MEDIA_DIR> = this skill's directory
node <MEDIA_DIR>/scripts/audio.mjs --request ./audio_request.json --hyperframes . --out ./audio_meta.json
All three capabilities degrade on ONE switch — whether a HeyGen credential is present (resolved from $HEYGEN_API_KEY / $HYPERFRAMES_API_KEY / ~/.heygen, not the CLI):
| Capability | HeyGen credential present | absent |
|---|---|---|
| TTS | HeyGen Starfish REST (native word timestamps) | → ElevenLabs → Kokoro (chain transcribe for words) |
| BGM | HeyGen music retrieval | Lyria → MusicGen local generation (detached) |
| SFX | HeyGen sound-effects retrieval (min_score 0.4) | bundled 21-file library (assets/sfx/) |
- Request (audio_request.json): { provider?, lang?, speed?, lines: [{ id, text, sfx?: [names] }], bgm: { mode?, query?, prompt? } }. The id joins each line back to the caller's model (a frame number, a scene id, …). bgm.mode = retrieve | generate | none; omit for auto (retrieve when credentialed, else generate). An explicit retrieve is strict: it skips rather than starting a detached generate (for callers with no wait-bgm step).
- Output (audio_meta.json, id-keyed): { tts_provider, voice_id, bgm, bgm_pending, …, voices: [{ id, path, duration_s, words }], sfx: [{ id, name, file, source, offset_s, duration_s, volume }], total_duration_s }.
- --only tts,bgm,sfx runs a subset and merges into an existing --out (e.g. TTS+BGM early, SFX once cues exist).
- BGM generate is spawned detached (bgm_pending: true); run scripts/wait-bgm.mjs before assembling.
- scripts/heygen-tts.mjs is a single-shot CLI over the same code (one text → wav + words) for when you just need HeyGen TTS without a request file.
Full flag list + the audio_meta.json schema live in the header of scripts/audio.mjs. The references below cover the provider details and edge cases behind each capability.
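As a rough illustration of the adapter idea, the sketch below builds an audio_request.json object from a workflow's own scene list and checks the bgm_pending flag in the resulting meta. The scene shape ({ frame, narration, cues }) and both function names are hypothetical; only the request and meta field names come from the schema summarized above.

```javascript
// Sketch only: the scene shape ({ frame, narration, cues }) is hypothetical.
// Request fields (lang, lines[].id/text/sfx, bgm.mode/query) follow the
// schema summarized above.
function buildAudioRequest(scenes, { lang, bgmQuery } = {}) {
  const lines = scenes
    .filter((scene) => typeof scene.narration === "string" && scene.narration.trim() !== "")
    .map((scene) => {
      // id lets the caller join each voice entry back to its own model.
      const line = { id: scene.frame, text: scene.narration.trim() };
      // sfx is optional, so attach it only when the scene has cues.
      if (Array.isArray(scene.cues) && scene.cues.length > 0) {
        line.sfx = [...scene.cues];
      }
      return line;
    });

  // Leaving bgm.mode out keeps the engine in auto mode
  // (retrieve when credentialed, else generate).
  const bgm = bgmQuery ? { query: bgmQuery } : { mode: "none" };

  const request = { lines, bgm };
  if (lang !== undefined) request.lang = lang; // lang is optional
  return request;
}

// audio_meta.json carries bgm_pending: true while a detached BGM generate is
// still running; in that case run scripts/wait-bgm.mjs before assembling.
function needsBgmWait(meta) {
  return Boolean(meta) && meta.bgm_pending === true;
}
```

A workflow would serialize the returned object with JSON.stringify, write it to ./audio_request.json, and then call the engine as shown earlier.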
Preflight — show sign-in status before any audio
Always run this before generating voice or BGM, whether inside a full workflow or for a one-off "generate me a BGM/voiceover" request. A missing HeyGen credential is not a reason to silently fall back to local engines: first recommend signing in and let the user decide. Run the shared preflight and relay its output verbatim. Don't improvise your own "missing key" prompt, and don't offer to write keys into a per-repo .env:
npx hyperframes auth status
- Signed in → it prints the account; proceed.
- Not signed in (exit 1 is expected here; "not signed in" is a normal state, not a failure) → it prints registration-first guidance. Recommend signing in: npx hyperframes auth login is browser OAuth; it signs in and creates an account (always available through this repo's CLI). To use an existing HeyGen API key (from app.heygen.com/settings/api), run npx hyperframes auth login --api-key; it saves to the shared ~/.heygen (no per-repo .env). The output also lists the local engines voice/BGM will fall back to and a pip hint when deps are missing. Relay this output as-is; don't paraphrase it into your own wording. Then STOP and wait for the user to choose (sign in, or say "go" / "local" to continue offline) before generating anything. This is a real decision point, not a passing note: don't fold it into another question, and don't proceed past it on your own. (Exception: in autonomous / non-interactive mode, note the status and continue offline.)
- npx hyperframes auth status --json returns { configured, recommended_action, offline_engines } for deterministic branching.
- If the CLI can't run (not on PATH and npx can't fetch it) → still recommend signing in (npx hyperframes auth login) and STOP for the user's choice. Don't treat "no credential" as a silent green light for local generation.
Credential resolution, full key priority, and the local-dependency list are in references/requirements.md.
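The branching described above can be sketched as a small pure function over the parsed --json output. This is a minimal sketch that assumes configured is a boolean; the interactive flag, the function name, and the returned action strings are hypothetical and not part of the CLI.

```javascript
// Sketch only: assumes `configured` is a boolean in the JSON printed by
// `npx hyperframes auth status --json`. The `interactive` option and the
// returned action names are hypothetical.
function decidePreflight(status, { interactive = true } = {}) {
  if (status && status.configured === true) {
    return "proceed"; // signed in: go ahead with generation
  }
  // Not signed in, or no status at all because the CLI could not run.
  return interactive
    ? "stop-and-ask" // relay the CLI output verbatim, then wait for the user
    : "continue-offline"; // autonomous mode: note the status, use local engines
}
```

Passing null for status models the case where the CLI cannot run, which in an interactive session still ends in a stop rather than a silent local fallback.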
Provider chains (the detail behind the engine)
TTS — first available provider wins (the engine, or npx hyperframes tts "..."):
| Order | Provider | Detected when | Word timestamps |
|---|---|---|---|
| 1 | HeyGen (Starfish) | $HEYGEN_API_KEY / hyperframes auth login | Yes, native; pass --words narration.words.json to capture |
| 2 | ElevenLabs | $ELEVENLABS_API_KEY set | No; chain transcribe after |
| 3 | Kokoro-82M (local, 54 voices) | always (no key required) | No; chain transcribe after |
The published hyperframes tts CLI is often the local-only build (its --help says "Kokoro-82M", no --provider / --words) and silently falls back to Kokoro even with $HEYGEN_API_KEY set. That is why the engine's HeyGen path is the self-contained scripts/heygen-tts.mjs (REST), NOT the CLI; the CLI is used only for the Kokoro path. See references/tts.md.
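The "first available provider wins" order in the table can be sketched as follows. This is an illustration rather than the engine's code: heygenCredential stands in for whatever credential the engine resolved, elevenLabsKey mirrors $ELEVENLABS_API_KEY, and the provider strings and result shape are hypothetical.

```javascript
// Sketch only: provider strings and the result shape are hypothetical.
// Order follows the table: HeyGen, then ElevenLabs, then local Kokoro.
function pickTtsProvider({ heygenCredential, elevenLabsKey } = {}) {
  if (heygenCredential) {
    // HeyGen returns word timestamps natively, so no transcribe pass.
    return { provider: "heygen", nativeWords: true, chainTranscribe: false };
  }
  if (elevenLabsKey) {
    // No native word timestamps: chain transcribe after synthesis.
    return { provider: "elevenlabs", nativeWords: false, chainTranscribe: true };
  }
  // Kokoro needs no key, so it is always the last resort.
  return { provider: "kokoro", nativeWords: false, chainTranscribe: true };
}
```

Taking the credentials as arguments instead of reading process.env inside the function keeps the decision testable and makes it obvious when a changed environment would switch providers.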
BGM & SFX — by default retrieved from the HeyGen audio library (/v3/audio/sounds), same credential as HeyGen TTS, with the no-credential fallback from the switch above:
| Asset | HeyGen type | Lands in | Fallback (no credential) |
|---|---|---|---|
| BGM | music | assets/bgm/track.mp3 (retrieve) · track.wav (generate) | Lyria / MusicGen generation |
| SFX | sound_effects (min_score 0.4) | assets/sfx/<slug>.mp3 | bundled 21-file library (assets/sfx/* + manifest.json) |
See references/bgm.md and references/sfx.md.
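To make the interaction between bgm.mode in the request and the credential switch concrete, here is a minimal sketch. The function name, the action strings, and the { action, path } shape are hypothetical, and it assumes a strict retrieve with no credential resolves to a skip; the output paths are the ones listed in the table above.

```javascript
// Sketch only: action names and the { action, path } shape are hypothetical.
// Assumes an explicit "retrieve" without a credential skips BGM instead of
// starting a detached generate.
function resolveBgm(mode, hasHeygenCredential) {
  const retrieve = { action: "retrieve", path: "assets/bgm/track.mp3" };
  const generate = { action: "generate", path: "assets/bgm/track.wav" };

  switch (mode) {
    case "none":
      return { action: "none", path: null };
    case "generate":
      return generate;
    case "retrieve":
      // Strict: never falls through to generation.
      return hasHeygenCredential ? retrieve : { action: "skip", path: null };
    case undefined:
      // Auto: retrieve when credentialed, else generate.
      return hasHeygenCredential ? retrieve : generate;
    default:
      throw new Error(`unknown bgm.mode: ${mode}`);
  }
}
```

A generate result implies bgm_pending: true in audio_meta.json, so the caller still needs the wait-bgm step before assembling.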
Routing
| Task | Read |
|---|---|
| The audio engine: request/meta schema, --only, the switch | scripts/audio.mjs (header comment) |
| npx hyperframes tts / heygen-tts.mjs: providers, voices, words | references/tts.md |
| BGM: HeyGen retrieval + local Lyria / MusicGen generation | references/bgm.md |
| SFX: HeyGen retrieval (min_score 0.4) + bundled local library | references/sfx.md |
| npx hyperframes transcribe: Whisper, model rules, output shape | references/transcribe.md |
| npx hyperframes remove-background: transparent cutouts | references/remove-background.md |
| TTS → transcription → captions (no recorded voiceover) | references/tts-to-captions.md |
| Caption authoring: style detection, layout, word grouping, exit | references/captions/authoring.md |
| Transcript handling: input formats, quality gates, cleanup, APIs | references/captions/transcript-handling.md |
| Caption motion: karaoke, marker effects, audio-reactive | references/captions/motion.md |
| Model caches, system dependencies, troubleshooting | references/requirements.md |
Non-negotiable rules
- One engine, no vendored copies. Produce audio via scripts/audio.mjs (or heygen-tts.mjs for one-shot HeyGen TTS). Don't re-implement TTS/BGM/SFX inside a workflow; write an audio_request.json adapter and call the engine.
- "HeyGen available" = a resolvable credential, not the CLI. The whole switch keys off heygenCredential(); the published hyperframes tts may be Kokoro-only, and there is no hyperframes bgm / hyperframes sfx command at all.
- Voice IDs are provider-specific. am_michael is Kokoro-only; HeyGen UUIDs don't work on Kokoro. If you pass --voice, also pin --provider to avoid silent provider drift when the user's env changes.
- Always pass --model to transcribe. The CLI default small.en silently translates non-English audio. See references/transcribe.md → "Language Rule".
- HeyGen returns word timestamps; ElevenLabs / Kokoro do not. The engine chains transcribe automatically for the latter two; standalone, pass --words to HeyGen or run transcribe against the audio file.
- Captions consume the flat word-array format with { id, text, start, end }. See references/transcribe.md → "Output Shape".
- remove-background --background-output is hole-cut, not inpainted. For "scene without the person", a different tool is needed. See references/remove-background.md → "When NOT the right tool".
- BGM/SFX default to HeyGen retrieval; the no-credential fallback is generation (BGM) or the bundled library (SFX). /audio/sounds ranks by a text query, so name effects concretely (glass shatter, not dramatic sound); a no-match skips and never blocks the render. SFX sit at volume ~0.35 under voice + BGM. See references/sfx.md / references/bgm.md.
- Treat workflow caption HTML as generated output. For preset-backed videos, the reusable skin source lives at .hyperframes/caption-skin.html and the workflow script writes compositions/captions.html; do not edit generated compositions/captions.html to fix the skin. Rebuild via the workflow's captions.mjs, or use that workflow's explicit overrides mechanism when present.
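Before handing words to caption code, a quick shape check against the flat word-array rule can catch malformed input early. The sketch below is illustrative only: the function name is hypothetical, and it assumes start and end are finite numbers with end not earlier than start, which the authoritative "Output Shape" section in references/transcribe.md may define differently.

```javascript
// Sketch only: assumes start/end are finite numbers and end >= start.
// Returns the entries that do not match { id, text, start, end }.
function findInvalidWords(words) {
  if (!Array.isArray(words)) {
    throw new TypeError("words must be a flat array");
  }
  return words.filter(
    (w) =>
      w === null ||
      typeof w !== "object" ||
      w.id === undefined ||
      typeof w.text !== "string" ||
      !Number.isFinite(w.start) ||
      !Number.isFinite(w.end) ||
      w.end < w.start
  );
}
```

An empty result means every entry has the four expected fields; anything returned is worth inspecting before caption authoring starts.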