music-stem-separation - SKILL.md Agent Skill

name: music-stem-separation
description: "Cross-platform AI music stem separation pipeline (Windows/macOS/Linux): vocal isolation, harmony splitting, and reverb removal via an ultra-high-fidelity audio-separator ensemble, with GPU acceleration where available."
tags: [audio, music, stem-separation, vocals, audio-separator, dereverb, ensemble, cross-platform, windows, macos, linux]
triggers:
  - stem separation
  - vocal isolation
  - extract vocals
  - remove vocals
  - separate instruments
  - 干声分离
  - 人声分离
  - 去混响
  - dereverb

Music Stem Separation Pipeline (cross-platform)

Ensemble pipeline for extracting clean, dropout-free dry vocals from mixed audio. Instead of relying on a single aggressive model, which tends to drop vocal notes ("vocal dropouts"), it combines two models per stage and finishes with a light non-AI EQ pass in ffmpeg. The same command works on Windows, macOS, and Linux: all OS-specific logic lives in the bundled scripts/separate.py, so the agent only needs to install the prerequisites and run one command.

Prerequisites

| Requirement | Windows | macOS | Linux |
| --- | --- | --- | --- |
| Python 3.10+ | `winget install -e --id Python.Python.3.12` | `brew install python` | distro package (e.g. `apt install python3`) |
| `uv` package manager | `winget install -e --id astral-sh.uv` | `brew install uv` | `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| ffmpeg | `winget install -e --id Gyan.FFmpeg` | `brew install ffmpeg` | `apt install ffmpeg` / `dnf install ffmpeg` / `pacman -S ffmpeg` |
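Before installing audio-separator it can help to confirm that the prerequisites are actually on PATH. The following is a minimal sketch, not part of the bundled script; the function name `missing_prerequisites` and its injectable `which` and `version` parameters are hypothetical and exist only to make the check easy to test.

```python
import shutil
import sys

# Executables named in the prerequisites table.
REQUIRED_TOOLS = ("uv", "ffmpeg")
MIN_PYTHON = (3, 10)


def missing_prerequisites(which=shutil.which, version=sys.version_info):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if tuple(version[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version[0]}.{version[1]}"
        )
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing:", problem)
```

Running it prints one line per missing prerequisite and nothing when everything is in place.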

Install audio-separator (choose the right extra for your hardware)

# NVIDIA GPU (Windows / Linux with CUDA) — fastest:
uv tool install "audio-separator[gpu]"

# macOS (Apple Silicon or Intel), or any machine without an NVIDIA GPU:
uv tool install "audio-separator[cpu]"

audio-separator auto-selects the accelerator at runtime — CUDA on NVIDIA, CoreML/MPS on Apple Silicon, CPU otherwise — so no device flags are needed. GPU is strongly recommended; CPU works but is very slow.
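Choosing between the two extras can be automated with a simple heuristic. The sketch below assumes that an installed NVIDIA driver exposes `nvidia-smi` on PATH; that is only a hint, not proof of a working CUDA setup, and the helper names `pick_extra` and `install_command` are hypothetical.

```python
import shutil


def pick_extra(which=shutil.which):
    """Return "gpu" if nvidia-smi is on PATH, else "cpu".

    Assumption: the presence of nvidia-smi is used as a rough signal for
    an NVIDIA GPU. macOS machines fall through to "cpu", matching the
    install instructions above.
    """
    return "gpu" if which("nvidia-smi") else "cpu"


def install_command(extra):
    """Build the `uv tool install` command as an argument list."""
    return ["uv", "tool", "install", f"audio-separator[{extra}]"]


if __name__ == "__main__":
    print(" ".join(install_command(pick_extra())))
```

The script only prints the command it would run, so the choice can be reviewed before anything is installed.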

Model storage (important pitfall)

By default audio-separator caches models under the system temp dir, which the OS may clear. Always keep them in a persistent folder. The script defaults to ~/models/audio-separator-models (works on every OS) and passes --model_file_dir for you. Override with --models-dir if you want a different location. Models download automatically on first use.
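A portable way to resolve that persistent folder is `pathlib.Path.home()`, which expands correctly on Windows, macOS, and Linux. This is a sketch of how such a default might be resolved, not the script's actual code; `resolve_models_dir` and its `create` flag are hypothetical names.

```python
from pathlib import Path

# Default location described above: ~/models/audio-separator-models
DEFAULT_MODELS_DIR = Path.home() / "models" / "audio-separator-models"


def resolve_models_dir(override=None, create=True):
    """Return the model directory, preferring an explicit override.

    `override` plays the role of the --models-dir option; "~" is expanded
    so that user-relative paths work on every OS.
    """
    path = Path(override).expanduser() if override else DEFAULT_MODELS_DIR
    if create:
        # Create the folder up front so the first download has a target.
        path.mkdir(parents=True, exist_ok=True)
    return path
```

The returned path is what would be passed to audio-separator as `--model_file_dir`.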

Quick start

Run the bundled cross-platform driver (<SKILL_DIR> is this skill's folder; 歌名 is a placeholder for the song title):

# macOS / Linux
python "<SKILL_DIR>/scripts/separate.py" --input "/path/to/song.flac" --song "歌名"

# Windows (PowerShell)
python "<SKILL_DIR>\scripts\separate.py" --input "C:\path\to\song.mp3" --song "歌名"

Options: --outdir <dir> (default ~/Music/<song>/干声分离, where 干声分离 means "dry vocal separation"), --models-dir <dir>, --keep-temp (retain intermediate folders for debugging). The input may be any format ffmpeg can read; it is normalized to WAV automatically.
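For orientation, the command-line interface described above could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not the contents of scripts/separate.py: whether `--input` and `--song` are required, and the helper names `build_parser` and `resolve_outdir`, are guesses made for illustration.

```python
import argparse
from pathlib import Path


def build_parser():
    """Declare the options listed above (required-ness is an assumption)."""
    parser = argparse.ArgumentParser(description="Dry-vocal separation driver")
    parser.add_argument("--input", required=True, type=Path,
                        help="source audio in any format ffmpeg can read")
    parser.add_argument("--song", required=True,
                        help="song title, used for the output folder name")
    parser.add_argument("--outdir", type=Path, default=None,
                        help="output folder (default ~/Music/<song>/干声分离)")
    parser.add_argument("--models-dir", type=Path,
                        default=Path.home() / "models" / "audio-separator-models",
                        help="persistent model cache")
    parser.add_argument("--keep-temp", action="store_true",
                        help="retain intermediate folders for debugging")
    return parser


def resolve_outdir(args):
    """Apply the documented default when --outdir is not given."""
    return args.outdir or Path.home() / "Music" / args.song / "干声分离"
```

For example, `build_parser().parse_args(["--input", "song.flac", "--song", "demo"])` yields a namespace whose resolved output folder is `~/Music/demo/干声分离`.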

What the pipeline does (3 ensemble stages plus ffmpeg pre- and post-processing)

To keep the models from "eating" (dropping) notes, the pipeline skips demucs entirely and runs two-model ensembles combined with --ensemble_algorithm avg_fft. The script runs the steps below; the model names are identical on every OS:

| Step | Purpose | Main model | Ensemble model |
| --- | --- | --- | --- |
| 0 | Normalize input → 44.1 kHz / 16-bit WAV | (ffmpeg) | — |
| 1 | Vocal extraction, no dropouts | `model_bs_roformer_ep_368_sdr_12.9628.ckpt` | `MDX23C-8KFFT-InstVoc_HQ.ckpt` |
| 2 | Remove backing harmonies (karaoke) | `mel_band_roformer_karaoke_aufr33_viperx_sdr_10.1956.ckpt` | `UVR_MDXNET_KARA_2.onnx` |
| 3 | Gentle dereverb | `UVR-DeEcho-DeReverb.pth` | `Reverb_HQ_By_FoxJoy.onnx` |
| 4 | Ultra-light de-essing (physical EQ) | (ffmpeg `deesser` + gentle high-shelf cut) | — |
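The stage table maps naturally onto a small data structure, and step 0 onto a single ffmpeg call. The sketch below is illustrative only: the `Stage` type and `normalize_cmd` helper are hypothetical names, and the ffmpeg flags shown (`-ar 44100`, `-c:a pcm_s16le`) are one common way to get 44.1 kHz / 16-bit PCM WAV, not necessarily the exact flags the bundled script uses.

```python
from typing import NamedTuple


class Stage(NamedTuple):
    step: int
    purpose: str
    main_model: str
    ensemble_model: str


# Steps 1 to 3 from the table: each pairs a main model with one extra model.
STAGES = [
    Stage(1, "vocal extraction",
          "model_bs_roformer_ep_368_sdr_12.9628.ckpt",
          "MDX23C-8KFFT-InstVoc_HQ.ckpt"),
    Stage(2, "remove backing harmonies",
          "mel_band_roformer_karaoke_aufr33_viperx_sdr_10.1956.ckpt",
          "UVR_MDXNET_KARA_2.onnx"),
    Stage(3, "gentle dereverb",
          "UVR-DeEcho-DeReverb.pth",
          "Reverb_HQ_By_FoxJoy.onnx"),
]


def normalize_cmd(src, dst):
    """Step 0: decode any ffmpeg-readable input to 44.1 kHz 16-bit WAV."""
    return ["ffmpeg", "-y", "-i", str(src),
            "-ar", "44100",        # resample to 44.1 kHz
            "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
            str(dst)]
```

Keeping the model names in one list makes it easy to see that the same pairs are used on every OS.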

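The audio-separator invocation for each ensemble stage (steps 1 to 3) has the following shape. It is shown with POSIX shell line continuations; in PowerShell, put it on one line or replace each trailing \ with a backtick: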

audio-separator "<input>.wav" \
  --model_file_dir "<models-dir>" \
  -m "<main model>" \
  --extra_models "<ensemble model>" \
  --ensemble_algorithm avg_fft \
  --output_format WAV \
  --output_dir "<temp-dir>"
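When driving this command from Python, building it as an argument list avoids the quoting and line-continuation differences between bash and PowerShell. The sketch below uses only the flags shown in the invocation above; the helper names `separator_cmd` and `run_stage` are hypothetical and not taken from the bundled script.

```python
import subprocess


def separator_cmd(input_wav, models_dir, main_model, extra_model, out_dir):
    """Build one ensemble-stage command as a list (no shell quoting needed)."""
    return [
        "audio-separator", str(input_wav),
        "--model_file_dir", str(models_dir),
        "-m", main_model,
        "--extra_models", extra_model,
        "--ensemble_algorithm", "avg_fft",
        "--output_format", "WAV",
        "--output_dir", str(out_dir),
    ]


def run_stage(input_wav, models_dir, main_model, extra_model, out_dir):
    """Run one stage; check=True raises CalledProcessError on failure."""
    cmd = separator_cmd(input_wav, models_dir, main_model, extra_model, out_dir)
    subprocess.run(cmd, check=True)
```

Because `subprocess.run` receives a list, paths containing spaces or non-ASCII characters are passed through unchanged on every OS.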

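The final de-essing pass (step 4) is a single ffmpeg filter chain: deesser at intensity 0.2, followed by a 1 dB high-shelf cut at 7.5 kHz: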

ffmpeg -y -i "3_纯主唱_已去混响_未去刺.wav" \
  -af "deesser=i=0.2,treble=g=-1:f=7500:w=1" \
  "4_终极干声_全集成保真去刺版.wav"

Output organization

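Music/
  └── <歌名>/    (song title)
        └── 干声分离/    (dry vocal separation)
              1_伴奏.wav    ← instrumental
              1_全人声_含和声混响.wav    ← all vocals, with harmonies and reverb
              2_和声.wav    ← backing harmonies
              2_纯主唱_含混响.wav    ← lead vocal only, with reverb
              3_被抽离的混响.wav    ← extracted reverb
              3_纯主唱_已去混响_未去刺.wav    ← lead vocal, dereverbed, not yet de-essed
              4_终极干声_全集成保真去刺版.wav    ← Final deliverable (final dry vocal, de-essed)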
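After a run, the output folder can be checked against this layout. The following is a minimal sketch, assuming all seven files land directly in the output folder as shown; `missing_outputs` is a hypothetical helper name.

```python
from pathlib import Path

# File names from the output layout above, in pipeline order.
OUTPUT_NAMES = [
    "1_伴奏.wav",
    "1_全人声_含和声混响.wav",
    "2_和声.wav",
    "2_纯主唱_含混响.wav",
    "3_被抽离的混响.wav",
    "3_纯主唱_已去混响_未去刺.wav",
    "4_终极干声_全集成保真去刺版.wav",
]
FINAL_NAME = OUTPUT_NAMES[-1]  # the final deliverable


def missing_outputs(outdir):
    """Return the expected file names that are absent from outdir."""
    outdir = Path(outdir)
    return [name for name in OUTPUT_NAMES if not (outdir / name).is_file()]
```

An empty list means every stage produced its files; otherwise the first missing name points to the stage that failed.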

Rules

Project folder: outputs go to Music/<歌名>/干声分离/ (override with --outdir).
Format: all intermediate and final files are WAV (--output_format WAV); the source may be any format ffmpeg can read, and the script converts it to WAV first.
No Demucs: never use demucs, which aggressively drops vocals. Stick to the audio-separator ensemble pipeline.
Persistent models: always use a stable --models-dir (default ~/models/audio-separator-models) so models aren't re-downloaded into a temp dir and erased.
Cleanup: the script removes _临时_* folders on success; pass --keep-temp to keep them when debugging.
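The cleanup rule amounts to deleting every folder whose name starts with `_临时_` ("temporary") unless `--keep-temp` was passed. The sketch below assumes those folders sit directly under a single working directory, which the text above does not specify; `cleanup_temp` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def cleanup_temp(workdir, keep_temp=False):
    """Delete `_临时_*` folders under workdir and return the names removed.

    Assumption: the temporary folders are direct children of workdir.
    """
    removed = []
    if keep_temp:
        # Mirrors --keep-temp: leave everything in place for debugging.
        return removed
    for path in sorted(Path(workdir).glob("_临时_*")):
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(path.name)
    return removed
```

Calling it only after the final file has been written keeps the intermediates available if a stage fails midway.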