hyperframes-media


Asset preprocessing for HyperFrames compositions — multi-provider TTS (HeyGen / ElevenLabs / Kokoro local), multi-provider BGM (Google Lyria / local MusicGen), Whisper transcription, background removal, and caption authoring. Use for npx hyperframes tts, bgm, transcribe, remove-background, voice/provider selection, music-mood prompting, captions / subtitles / lyrics / karaoke / per-word styling.

By heygen-com · Updated 6/14/2026

name: hyperframes-media
description: Audio and media assets for HyperFrames compositions, produced by one shared audio engine (scripts/audio.mjs): multi-provider TTS (HeyGen / ElevenLabs / Kokoro local), background music + sound effects (HeyGen audio-library retrieval by default, with local Lyria / MusicGen BGM generation and a bundled SFX library as the no-credential fallback), Whisper transcription, background removal, and caption authoring. Use for voiceover / TTS, BGM, SFX / sound effects, transcription, captions / subtitles / lyrics / karaoke / per-word styling, voice + provider selection, and music-mood prompting.

HyperFrames Media

Create the audio and media assets a composition needs — voiceover (TTS), background music + sound effects, transcription, captions, background removal — then consume and animate that data in HTML. For placing assets into compositions, see hyperframes-core.

The audio engine — one source for TTS · BGM · SFX

Workflows do NOT hand-roll audio or vendor a copy. There is one engine — scripts/audio.mjs — that takes a neutral audio_request.json and writes audio_meta.json (plus assets under assets/voice|bgm|sfx):

# <MEDIA_DIR> = this skill's directory
node <MEDIA_DIR>/scripts/audio.mjs --request ./audio_request.json --hyperframes . --out ./audio_meta.json

All three capabilities degrade on ONE switch — whether a HeyGen credential is present (resolved from $HEYGEN_API_KEY / $HYPERFRAMES_API_KEY / ~/.heygen, not the CLI):

| Capability | HeyGen credential present | HeyGen credential absent |
| --- | --- | --- |
| TTS | HeyGen Starfish REST (native word timestamps) | → ElevenLabs → Kokoro (chain transcribe for words) |
| BGM | HeyGen music retrieval | Lyria → MusicGen local generation (detached) |
| SFX | HeyGen sound-effects retrieval (min_score 0.4) | bundled 21-file library (assets/sfx/) |

  • Request (audio_request.json): { provider?, lang?, speed?, lines: [{ id, text, sfx?: [names] }], bgm: { mode?, query?, prompt? } }. id joins each line back to the caller's model (a frame number, a scene id, …). bgm.mode = retrieve | generate | none; omit for auto (retrieve when credentialed, else generate). An explicit retrieve is strict — it skips rather than starting a detached generate (for callers with no wait-bgm step).
  • Output (audio_meta.json, id-keyed): { tts_provider, voice_id, bgm, bgm_pending, …, voices: [{ id, path, duration_s, words }], sfx: [{ id, name, file, source, offset_s, duration_s, volume }], total_duration_s }.
  • --only tts,bgm,sfx runs a subset and merges into an existing --out (e.g. TTS+BGM early, SFX once cues exist).
  • BGM generate is spawned detached (bgm_pending: true) — run scripts/wait-bgm.mjs before assembling.
  • scripts/heygen-tts.mjs is a single-shot CLI over the same code (one text → wav + words) for when you just need HeyGen TTS without a request file.
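
The bgm.mode rules in the request bullet above can be read as a small decision table. The sketch below is only an illustration of that reading: resolveBgmAction and its return shape are hypothetical names, and the real decision lives inside scripts/audio.mjs.

```javascript
// Hypothetical helper, not part of the skill: it restates the bgm.mode rules
// as code. The real logic lives in scripts/audio.mjs.
// mode: "retrieve" | "generate" | "none" | undefined (undefined means auto)
// credentialed: whether a HeyGen credential was resolved
function resolveBgmAction(mode, credentialed) {
  switch (mode) {
    case "none":
      return { action: "none", pending: false };
    case "retrieve":
      // An explicit retrieve is strict: with no credential it skips
      // instead of starting a detached generate.
      return { action: credentialed ? "retrieve" : "skip", pending: false };
    case "generate":
      // Generation is spawned detached, so the caller has to wait for it.
      return { action: "generate", pending: true };
    case undefined:
      // Auto: retrieve when credentialed, otherwise generate.
      return credentialed
        ? { action: "retrieve", pending: false }
        : { action: "generate", pending: true };
    default:
      throw new Error(`Unknown bgm.mode: ${mode}`);
  }
}

console.log(resolveBgmAction(undefined, false));
console.log(resolveBgmAction("retrieve", false));
```

A pending result corresponds to bgm_pending: true in audio_meta.json, which is the case where scripts/wait-bgm.mjs needs to run before assembly.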

Full flag list + the audio_meta.json schema live in the header of scripts/audio.mjs. The references below cover the provider details and edge cases behind each capability.
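
As a rough illustration of the adapter a workflow might write, the sketch below builds a request object in the shape listed above and indexes the resulting voices by id. The scene objects, buildAudioRequest, voicesById, and the sample values ("en", the bgm query) are all assumptions made for the example; the authoritative schema is the header of scripts/audio.mjs.

```javascript
// Sketch only. Field names follow the request shape listed above; the scene
// objects and both helper names are hypothetical.
function buildAudioRequest(scenes, options = {}) {
  const allowedModes = ["retrieve", "generate", "none"];
  const { provider, lang, speed, bgm = {} } = options;
  if (bgm.mode !== undefined && !allowedModes.includes(bgm.mode)) {
    throw new Error(`bgm.mode must be one of: ${allowedModes.join(", ")}`);
  }
  const seen = new Set();
  const lines = scenes.map((scene) => {
    // id is what joins each voice back to the caller's model, so keep it unique.
    if (seen.has(scene.id)) throw new Error(`Duplicate line id: ${scene.id}`);
    seen.add(scene.id);
    const line = { id: scene.id, text: scene.narration.trim() };
    if (scene.sfx && scene.sfx.length > 0) line.sfx = [...scene.sfx];
    return line;
  });
  const request = { lines, bgm };
  // Optional fields are left out entirely when the caller did not set them.
  if (provider !== undefined) request.provider = provider;
  if (lang !== undefined) request.lang = lang;
  if (speed !== undefined) request.speed = speed;
  return request;
}

// Index the id-keyed voices from audio_meta.json for lookup by scene id.
function voicesById(meta) {
  return new Map(meta.voices.map((voice) => [voice.id, voice]));
}

const request = buildAudioRequest(
  [
    { id: "scene-1", narration: "Welcome to the demo.", sfx: ["glass shatter"] },
    { id: "scene-2", narration: "Here is how it works." },
  ],
  { lang: "en", bgm: { query: "upbeat corporate" } } // no mode: auto
);
console.log(JSON.stringify(request, null, 2));
```

Serializing the returned object to audio_request.json and passing it via --request is then the only contact point with the engine.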

Preflight — show sign-in status before any audio

Always run this before generating voice or BGM, whether inside a full workflow or for a one-off "generate me a BGM/voiceover" request. A missing HeyGen credential is not a reason to silently fall back to local engines: first recommend signing in and let the user decide. Run the shared preflight and relay its output verbatim. Don't improvise your own "missing key" prompt, and don't offer to write keys into a per-repo .env:

npx hyperframes auth status
  • Signed in → it prints the account; proceed.
  • Not signed in (exit 1 is expected here — "not signed in" is a normal state, not a failure) → it prints registration-first guidance. Recommend signing in: npx hyperframes auth login is browser OAuth — it signs in and creates an account (always available through this repo's CLI). To use an existing HeyGen API key (from app.heygen.com/settings/api), run npx hyperframes auth login --api-key — it saves to the shared ~/.heygen (no per-repo .env). The output also lists the local engines voice/BGM will fall back to and a pip hint when deps are missing. Relay this output as-is — don't paraphrase it into your own wording. Then STOP and wait for the user to choose — sign in, or say "go" / "local" to continue offline — before generating anything. This is a real decision point, not a passing note: don't fold it into another question, and don't proceed past it on your own. (Exception: in autonomous / non-interactive mode, note the status and continue offline.)
  • npx hyperframes auth status --json returns { configured, recommended_action, offline_engines } for deterministic branching.
  • If the CLI can't run (not on PATH and npx can't fetch it) → still recommend signing in (npx hyperframes auth login) and STOP for the user's choice — don't treat "no credential" as a silent green light for local generation.
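
The branching described in the bullets above might be sketched as follows. This is an assumption-laden illustration: it treats configured as a boolean, uses hypothetical step labels, and extends the autonomous-mode exception to the case where the CLI could not run, which the text does not state explicitly.

```javascript
// status: the parsed output of `npx hyperframes auth status --json`,
// or null when the CLI could not run. Only `configured` is read, and
// treating it as a boolean is an assumption.
// The returned labels are hypothetical names for the three outcomes.
function nextPreflightStep(status, { interactive = true } = {}) {
  const signedIn = status !== null && status.configured === true;
  if (signedIn) return "proceed";
  // Not signed in (or status unknown): recommend signing in, then stop and
  // wait for the user's choice. Autonomous runs note the status and go on.
  return interactive ? "wait-for-user" : "continue-offline";
}

console.log(nextPreflightStep({ configured: true }));
console.log(nextPreflightStep({ configured: false }));
console.log(nextPreflightStep(null, { interactive: false }));
```

Keeping "wait-for-user" as its own outcome mirrors the rule that the sign-in prompt is a real decision point rather than a note folded into another question.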

Credential resolution, full key priority, and the local-dependency list are in references/requirements.md.

Provider chains (the detail behind the engine)

TTS — first available provider wins (the engine, or npx hyperframes tts "..."):

| Order | Provider | Detected when | Word timestamps |
| --- | --- | --- | --- |
| 1 | HeyGen (Starfish) | $HEYGEN_API_KEY / hyperframes auth login | Yes, native; pass --words narration.words.json to capture |
| 2 | ElevenLabs | $ELEVENLABS_API_KEY set | No; chain transcribe after |
| 3 | Kokoro-82M (local, 54 voices) | always (no key required) | No; chain transcribe after |

The published hyperframes tts CLI is often the local-only build (its --help says "Kokoro-82M", no --provider/--words) and silently falls back to Kokoro even with $HEYGEN_API_KEY set. That is why the engine's HeyGen path is the self-contained scripts/heygen-tts.mjs (REST), NOT the CLI; the CLI is used only for the Kokoro path. See references/tts.md.
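
A minimal sketch of the "first available provider wins" order, assuming the credential check has already been done elsewhere. The function name, the provider strings, and the result fields are hypothetical; only the ordering and the word-timestamp behavior come from the table above.

```javascript
// Hypothetical sketch of the TTS provider order. `heygenCredential` stands in
// for whatever the engine's own credential check returns.
function pickTtsProvider({ heygenCredential = false, env = {} } = {}) {
  if (heygenCredential) {
    // HeyGen returns word timestamps natively.
    return { provider: "heygen", nativeWords: true, needsTranscribe: false };
  }
  if (env.ELEVENLABS_API_KEY) {
    // No word timestamps: chain transcribe afterwards.
    return { provider: "elevenlabs", nativeWords: false, needsTranscribe: true };
  }
  // Kokoro is local and needs no key, so it is always the last resort.
  return { provider: "kokoro", nativeWords: false, needsTranscribe: true };
}

console.log(pickTtsProvider({ heygenCredential: true }));
console.log(pickTtsProvider({ env: { ELEVENLABS_API_KEY: "set" } }));
console.log(pickTtsProvider());
```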

BGM & SFX — by default retrieved from the HeyGen audio library (/v3/audio/sounds), same credential as HeyGen TTS, with the no-credential fallback from the switch above:

| Asset | HeyGen type | Lands in | Fallback (no credential) |
| --- | --- | --- | --- |
| BGM | music | assets/bgm/track.mp3 (retrieve) · track.wav (generate) | Lyria / MusicGen generation |
| SFX | sound_effects (min_score 0.4) | assets/sfx/<slug>.mp3 | bundled 21-file library (assets/sfx/* + manifest.json) |

See references/bgm.md and references/sfx.md.
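
To make the min_score 0.4 threshold concrete, here is a hedged sketch of choosing an SFX match from ranked results. The result shape ({ name, score }) and pickSfxMatch are assumptions for illustration; whether the threshold is applied by the API or by the engine, and whether it is inclusive, is not stated here.

```javascript
// Hypothetical: pick the best-scoring result at or above the threshold.
// Returns null when nothing qualifies, so the caller can skip the cue
// instead of blocking the render.
function pickSfxMatch(results, minScore = 0.4) {
  const eligible = results.filter(
    (result) => typeof result.score === "number" && result.score >= minScore
  );
  if (eligible.length === 0) return null;
  return eligible.reduce((best, result) =>
    result.score > best.score ? result : best
  );
}

// Illustrative data only, not real API output.
const sample = [
  { name: "glass shatter close", score: 0.82 },
  { name: "glass clink", score: 0.35 },
];
console.log(pickSfxMatch(sample));
console.log(pickSfxMatch([{ name: "vague match", score: 0.2 }]));
```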

Routing

| Task | Read |
| --- | --- |
| The audio engine: request/meta schema, --only, the switch | scripts/audio.mjs (header comment) |
| npx hyperframes tts / heygen-tts.mjs: providers, voices, words | references/tts.md |
| BGM: HeyGen retrieval + local Lyria / MusicGen generation | references/bgm.md |
| SFX: HeyGen retrieval (min_score 0.4) + bundled local library | references/sfx.md |
| npx hyperframes transcribe: Whisper, model rules, output shape | references/transcribe.md |
| npx hyperframes remove-background: transparent cutouts | references/remove-background.md |
| TTS → transcription → captions (no recorded voiceover) | references/tts-to-captions.md |
| Caption authoring: style detection, layout, word grouping, exit | references/captions/authoring.md |
| Transcript handling: input formats, quality gates, cleanup, APIs | references/captions/transcript-handling.md |
| Caption motion: karaoke, marker effects, audio-reactive | references/captions/motion.md |
| Model caches, system dependencies, troubleshooting | references/requirements.md |

Non-negotiable rules

  • One engine, no vendored copies. Produce audio via scripts/audio.mjs (or heygen-tts.mjs for one-shot HeyGen TTS). Don't re-implement TTS/BGM/SFX inside a workflow — write an audio_request.json adapter and call the engine.
  • "HeyGen available" = a resolvable credential, not the CLI. The whole switch keys off heygenCredential(); the published hyperframes tts may be Kokoro-only, and there is no hyperframes bgm / hyperframes sfx command at all.
  • Voice IDs are provider-specific. am_michael is Kokoro-only; HeyGen UUIDs don't work on Kokoro. If you pass --voice, also pin --provider to avoid silent provider drift when the user's env changes.
  • Always pass --model to transcribe. The CLI default small.en silently translates non-English audio. See references/transcribe.md → "Language Rule".
  • HeyGen returns word timestamps; ElevenLabs / Kokoro do not. The engine chains transcribe automatically for the latter two; standalone, pass --words to HeyGen or run transcribe against the audio file.
  • Captions consume the flat word-array format with { id, text, start, end }. See references/transcribe.md → "Output Shape".
  • remove-background --background-output is hole-cut, not inpainted. For "scene without the person", a different tool is needed. See references/remove-background.md → "When NOT the right tool".
  • BGM/SFX default to HeyGen retrieval; the no-credential fallback is generation (BGM) or the bundled library (SFX). /audio/sounds ranks by a text query — name effects concretely (glass shatter, not dramatic sound); a no-match skips, never blocks the render. SFX sit at volume ~0.35 under voice + BGM. See references/sfx.md / references/bgm.md.
  • Treat workflow caption HTML as generated output. For preset-backed videos, the reusable skin source lives at .hyperframes/caption-skin.html and the workflow script writes compositions/captions.html; do not edit generated compositions/captions.html to fix the skin. Rebuild via the workflow's captions.mjs, or use that workflow's explicit overrides mechanism when present.
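
As an example of consuming the flat word-array format, the sketch below groups words into caption chunks. It is not the skill's grouping algorithm (that is described in references/captions/authoring.md); groupWords, the maxWords and maxGapS parameters, their defaults, and the assumption that start and end are in seconds are all illustrative.

```javascript
// Hypothetical grouping of flat words ({ id, text, start, end }) into
// caption chunks. A new chunk starts when the current one is full or when
// the pause since the previous word exceeds maxGapS (assumed seconds).
function groupWords(words, { maxWords = 4, maxGapS = 0.6 } = {}) {
  const groups = [];
  let current = [];
  for (const word of words) {
    const previous = current[current.length - 1];
    const gapTooLong =
      previous !== undefined && word.start - previous.end > maxGapS;
    if (current.length >= maxWords || gapTooLong) {
      groups.push(current);
      current = [];
    }
    current.push(word);
  }
  if (current.length > 0) groups.push(current);
  // Each chunk spans from its first word's start to its last word's end.
  return groups.map((group) => ({
    text: group.map((word) => word.text).join(" "),
    start: group[0].start,
    end: group[group.length - 1].end,
    words: group,
  }));
}
```

Keeping the original word objects on each chunk leaves per-word timing available for karaoke-style highlighting.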

Install via CLI

npx skills add https://github.com/heygen-com/hyperframes --skill hyperframes-media