generate-music - SKILL.md Agent Skill

name: generate_music description: Generate original music from a text description and optional lyrics using ACE-Step v1.5 (local GPU, ~1–12s for 30s of audio).

generate_music

Generate original music locally using ACE-Step v1.5, a flow-matching diffusion model. Runs on the local RTX 3090 GPU. No API keys needed.

Daemon socket: /tmp/ace-step-gen.sock

The daemon keeps the model weights resident in VRAM across requests (no 2 GB reload per call). If the socket is not available, fall back to running the CLI binary directly (see Fallback section below).

Workflow

Gather the required inputs from the user's request (see Parameters below).
Choose an output path under /tmp/ with a .ogg extension (smaller, good for Telegram/Discord).
Send a JSON request to the daemon socket and read the response.
Use send_file to deliver the audio file to the user.
Report the generation time if available.

Unicode / non-ASCII lyrics

CRITICAL: Always preserve the original Unicode characters in lyrics and captions. Never transliterate umlauts or accented characters to ASCII equivalents. For example:

Write Österreich, NOT Oesterreich
Write schönes Stück, NOT schoenes Stueck
Write Gemütlichkeit, NOT Gemuetlichkeit
Write Straße, NOT Strasse

The model was trained on real Unicode text and produces significantly better pronunciation when given proper characters. ASCII transliteration (ö→oe, ü→ue, ä→ae, ß→ss) will cause wrong pronunciation in the generated audio.

Sending a request to the daemon

The daemon speaks line-delimited JSON over a Unix socket. Send one JSON line, read one JSON response line.

Recommended method: Write the JSON to a temp file first, then pipe it to socat. This avoids shell quoting issues with Unicode, newlines, and special characters in lyrics:

# 1. Write JSON request to a temp file using the file tool
# 2. Then pipe it to the daemon:
cat /tmp/music_request.json | socat -t 120 - UNIX-CONNECT:/tmp/ace-step-gen.sock

Alternative: inline shell command (short ASCII-only requests)

For simple requests without lyrics or with ASCII-only text:

echo '{"caption":"upbeat jazz, 120 BPM","duration_s":30,"output":"/tmp/music_1234.ogg"}' \
  | socat -t 120 - UNIX-CONNECT:/tmp/ace-step-gen.sock

Do NOT use inline echo for requests with non-ASCII lyrics — use the file-based method above instead.

Request JSON fields

Field	Type	Default	Description
`caption`	string	required	Style description: genre, mood, tempo, instruments
`output`	string	auto `/tmp/ace-step-<ms>.ogg`	Output file path (.wav or .ogg)
`lyrics`	string	`""`	Lyrics with `[verse]`/`[chorus]`/`[bridge]` tags; `""` = instrumental
`metas`	string	`""`	Metadata: `"bpm: 128, key: A minor, genre: electronic"`
`language`	string	`"en"`	Lyrics language code (`"zh"` for Chinese)
`duration_s`	float	LM suggestion or 30.0	Duration in seconds (1–600). If omitted and LM is running, the LM may suggest a duration based on the caption.
`shift`	float	`3.0`	ODE schedule shift (1–3); lower = more faithful, less variation
`seed`	int\|null	`null`	Fixed seed for reproducibility; `null` = random

Response JSON fields (success)

{"ok": true, "path": "/tmp/music.ogg", "duration_s": 30.0, "sample_rate": 48000, "channels": 2}

Response JSON fields (error)

{"ok": false, "error": "generation failed: ..."}

Commands

The daemon also accepts command messages to manage the pipeline.

Unload (free VRAM)

Drops the pipeline from VRAM. The next generation request will reload it automatically (~10–20s reload time).

echo '{"command":"unload"}' | socat - UNIX-CONNECT:/tmp/ace-step-gen.sock

Response: {"ok":true,"message":"pipeline unloaded"}

Use this when VRAM is needed for other tasks (e.g. other GPU workloads). No need to restart the daemon — it stays running and reloads on demand.

Caption writing guide

Be specific — genre, mood, tempo, instruments, vibe:

"upbeat electronic dance music, 128 BPM, four-on-the-floor kick, synth arpeggios, euphoric build"
"melancholic lo-fi hip-hop, slow 70 BPM, dusty vinyl samples, soft piano, rain atmosphere"
"fast punk rock, distorted guitars, driving drums, raw energy, 180 BPM"
"cinematic orchestral trailer music, epic brass, driving strings, 140 BPM, intense build to climax"

Full example

Step 1 — Use the file tool to write the JSON request:

{
  "caption": "indie pop with dreamy synths, gentle vocals, 100 BPM, wistful and nostalgic",
  "lyrics": "[verse]\nNeon lights on rainy streets\nWhere the city never sleeps\n[chorus]\nWe were infinite, we were free\nJust the stars and you and me",
  "metas": "bpm: 100, key: G major, genre: indie pop, instruments: synth, guitar, drums",
  "duration_s": 45,
  "output": "/tmp/music_indie.ogg"
}

Save this to /tmp/music_request.json.

Step 2 — Use the shell tool to send it to the daemon:

cat /tmp/music_request.json | socat -t 120 - UNIX-CONNECT:/tmp/ace-step-gen.sock

The response will be a JSON line like {"ok":true,"path":"/tmp/music_indie.ogg",...}.

After generation

Always use send_file to deliver the audio to the user:

send_file(file_path="/tmp/music.ogg", caption="Here's your 30s cinematic trailer music!")

Fallback: CLI binary (if daemon is not running)

If socat returns an error connecting to the socket, the daemon is not running. Fall back to the CLI binary:

Binary: /home/marenz/Projects/ace-step-rs-no-cudnn/target/release/ace-step

/home/marenz/Projects/ace-step-rs-no-cudnn/target/release/ace-step \
  --caption "cinematic orchestral trailer music, epic brass, 140 BPM" \
  --duration 30 \
  --output /tmp/music.ogg

The binary reloads 2 GB of weights each run (~10–20s extra on first call after a cold start). Output format is the same JSON line to stdout on success.

Troubleshooting

Connection refused / no such file — daemon not running. Use fallback CLI binary.
ok: false — check the error field. Common causes: invalid caption, CUDA OOM, bad output path.
OGG not supported — use .wav extension in the output path instead.
Long generation time — reduce duration_s. 30s ≈ 1–2s on RTX 3090.