stem-separation


AI-powered audio stem separation using Demucs and BSRNN for music production workflows

By SpiralCloudOmega · Updated 3/7/2026

name: stem-separation
description: AI-powered audio stem separation using Demucs and BSRNN for music production workflows

AI Stem Separation

Architecture Overview

Demucs v4 (Hybrid Transformer) encodes audio through parallel waveform and spectrogram paths, fuses them inside a transformer with cross-attention between the two domains, then decodes back to isolated stems. The hybrid design captures temporal transients (waveform) and harmonic content (spectrogram). The authors report about 9.0 dB average SDR on the MUSDB18-HQ test set for the fine-tuned model trained with extra data.
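To make the two input views concrete, here is a minimal NumPy sketch that builds both from the same signal. It is not the Demucs encoder; the helper `stft_magnitude`, the 8192 Hz sample rate, and the test signal are illustrative assumptions. A click is easy to locate in the waveform, while a steady tone is easy to locate in the spectrogram.

```python
import numpy as np


def stft_magnitude(wav: np.ndarray, n_fft: int = 1024, hop: int = 256) -> np.ndarray:
    """Magnitude spectrogram with a Hann window, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))


sr = 8192                                  # assumed rate, chosen so bins are 8 Hz wide
t = np.arange(sr) / sr
wav = np.sin(2 * np.pi * 440 * t)          # steady tone: sharp in frequency
wav[4000] += 5.0                           # click: sharp in time

spec = stft_magnitude(wav)
click_sample = int(np.argmax(np.abs(wav)))          # waveform view finds the click
tone_bin = int(np.argmax(spec.mean(axis=0)))        # spectrogram view finds the tone
print(click_sample, tone_bin * sr / 1024)
```

A hybrid model sees both views at once, which is why it can handle drum transients and sustained harmonics in the same pass.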

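BSRNN (Band-Split RNN) splits the spectrogram into predefined frequency sub-bands, projects each band with its own learned layer, models the result with interleaved sequence-level and band-level RNNs, then estimates a mask per band. It excels at vocal separation, where harmonics occupy narrow frequency regions and benefit from finer bands at low frequencies.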
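A minimal sketch of the band-splitting step, mapping a band plan in Hz onto STFT bin ranges. The plan in `BAND_PLAN` and the helper name `split_bins` are illustrative assumptions, not the configuration used by any published BSRNN model.

```python
from typing import List, Tuple

# Hypothetical band plan: (upper frequency in Hz, bandwidth in Hz) per region.
# Narrow bands at low frequencies, wider bands higher up.
BAND_PLAN = [(1000, 100), (4000, 250), (8000, 500), (16000, 1000)]


def split_bins(n_fft: int, sample_rate: int,
               plan: List[Tuple[int, int]] = BAND_PLAN) -> List[Tuple[int, int]]:
    """Return (start_bin, end_bin) pairs that cover every STFT bin exactly once."""
    n_bins = n_fft // 2 + 1
    hz_per_bin = sample_rate / n_fft

    # Walk the plan and collect the upper edge of each band in Hz.
    edges_hz = []
    lo = 0
    for upper, width in plan:
        while lo < upper:
            lo = min(lo + width, upper)
            edges_hz.append(lo)

    # Convert edges to bin indices, skipping any band that would be empty.
    bands = []
    start = 0
    for edge in edges_hz:
        end = min(n_bins, int(round(edge / hz_per_bin)))
        if end > start:
            bands.append((start, end))
            start = end
    if start < n_bins:
        bands.append((start, n_bins))  # everything above the plan becomes one band
    return bands


bands = split_bins(n_fft=2048, sample_rate=44100)
print(len(bands), bands[0], bands[-1])
```

Each `(start, end)` slice would then get its own projection layer before the shared RNN stack.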

Output Configurations

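| Model | Stems | Targets |
|---|---|---|
| htdemucs | 4 | vocals, drums, bass, other |
| htdemucs_ft | 4 | vocals, drums, bass, other (fine-tuned; about 4x slower, often slightly higher SDR) |
| htdemucs_6s | 6 | vocals, drums, bass, guitar, piano, other |
| mdx_extra | 4 | vocals, drums, bass, other (Hybrid Demucs v3 trained with extra data for the MDX challenge) |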

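Use htdemucs_ft for maximum-quality 4-stem output when the roughly 4x longer processing time is acceptable; use htdemucs_6s when you need guitar/piano isolation, keeping in mind that the Demucs authors describe the piano stem as the weakest of the six.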
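The selection rule above can be sketched as a small lookup. The helper name `pick_model` is hypothetical; the stem lists are copied from the table.

```python
from typing import Iterable

# Stem lists as given in the table above; 4-stem model listed first so it wins ties.
MODEL_STEMS = {
    "htdemucs_ft": ("vocals", "drums", "bass", "other"),
    "htdemucs_6s": ("vocals", "drums", "bass", "guitar", "piano", "other"),
}


def pick_model(wanted: Iterable[str]) -> str:
    """Return the first model whose stems include every requested stem."""
    wanted = set(wanted)
    for name, stems in MODEL_STEMS.items():
        if wanted <= set(stems):
            return name
    raise ValueError(f"no listed model provides: {sorted(wanted)}")


print(pick_model(["vocals", "drums"]))
print(pick_model(["vocals", "guitar"]))
```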

CLI Usage

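# Basic 4-stem separation
demucs -n htdemucs_ft input.wav

# Two-stem separation (writes vocals and no_vocals only)
demucs --two-stems=vocals -n htdemucs_ft input.wav

# 6-stem separation on GPU; --overlap is a fraction of the segment (0.25 = 25%), not seconds
demucs -n htdemucs_6s --device cuda --overlap 0.25 --clip-mode rescale input.wav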

# Batch process a folder, output as float32 WAV
demucs -n htdemucs_ft --device cuda --float32 --out ./stems/ tracks/*.wav
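When the CLI is driven from a script, building the argument list in one place avoids quoting mistakes. This is a sketch using only the flags shown above; the helper name `demucs_command` and its parameters are assumptions for illustration.

```python
import shlex
from typing import List, Optional, Sequence


def demucs_command(inputs: Sequence[str], model: str = "htdemucs_ft",
                   device: Optional[str] = None, two_stems: Optional[str] = None,
                   overlap: Optional[float] = None, float32: bool = False,
                   out_dir: Optional[str] = None) -> List[str]:
    """Build an argv list for the demucs CLI from the flags used in this section."""
    cmd = ["demucs", "-n", model]
    if device:
        cmd += ["--device", device]
    if two_stems:
        cmd.append(f"--two-stems={two_stems}")
    if overlap is not None:
        cmd += ["--overlap", str(overlap)]
    if float32:
        cmd.append("--float32")
    if out_dir:
        cmd += ["--out", out_dir]
    return cmd + list(inputs)


cmd = demucs_command(["tracks/a.wav", "tracks/b.wav"], device="cuda",
                     float32=True, out_dir="./stems/")
print(shlex.join(cmd))
```

Pass the list to `subprocess.run(cmd, check=True)` to execute it; a list avoids shell interpretation of file names with spaces.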

Python — Single-Track Separation

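import torch, torchaudio
from demucs.apply import apply_model
from demucs.audio import convert_audio
from demucs.pretrained import get_model

device = "cuda" if torch.cuda.is_available() else "cpu"
model = get_model("htdemucs_ft")  # a bag of models, so do not rely on a .device attribute
model.to(device).eval()

wav, sr = torchaudio.load("mix.wav")  # (channels, samples)
# Match the sample rate and channel count the model was trained on.
wav = convert_audio(wav, sr, model.samplerate, model.audio_channels)

with torch.no_grad():
    # The mix can stay on the CPU; chunks are processed on `device`.
    # Output shape: (batch, sources, channels, samples)
    sources = apply_model(model, wav.unsqueeze(0), overlap=0.25, device=device)

for i, name in enumerate(model.sources):  # ['drums', 'bass', 'other', 'vocals']
    torchaudio.save(f"{name}.wav", sources[0, i].cpu(), model.samplerate)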

Python — GPU Batch Processing

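import glob, os
import torch, torchaudio
from demucs.apply import apply_model
from demucs.audio import convert_audio
from demucs.pretrained import get_model

model = get_model("htdemucs_ft").to("cuda").eval()
os.makedirs("stems", exist_ok=True)  # torchaudio.save does not create directories
for path in glob.glob("tracks/*.wav"):
    wav, sr = torchaudio.load(path)
    wav = convert_audio(wav, sr, model.samplerate, model.audio_channels)
    with torch.no_grad():
        # split=True chunks the track for VRAM safety. The segment kwarg assumes a
        # recent Demucs release; drop it to use the model's default segment length.
        sources = apply_model(model, wav.unsqueeze(0), overlap=0.25,
                              split=True, segment=7.8, device="cuda")
    track = os.path.splitext(os.path.basename(path))[0]  # "song.wav" -> "song"
    for i, name in enumerate(model.sources):
        torchaudio.save(f"stems/{track}_{name}.wav", sources[0, i].cpu(), model.samplerate)
    del sources  # release the tensor before clearing the allocator cache
    torch.cuda.empty_cache()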

Quality Metrics

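SDR (Signal-to-Distortion Ratio) is the primary benchmark; higher is better. SIR (Signal-to-Interference Ratio) measures bleed from other sources, and SAR (Signal-to-Artifacts Ratio) measures artifacts introduced by the separation itself.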

import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-8))

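Indicative htdemucs_ft figures on MUSDB18-HQ: vocals 8.1 dB, drums 8.4 dB, bass 7.7 dB, other 5.6 dB. Treat these as approximate: the Demucs authors report about 9.0 dB averaged over the four stems, and exact values depend on the SDR definition and evaluation protocol, so re-measure on your own material before relying on them.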
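As a sanity check on the scale, the following worked example applies the same SDR formula to a synthetic case: an estimate that is the reference scaled to 90% amplitude. The error then has 1% of the reference energy, so the result is about 20 dB. The test signal is an assumption chosen for illustration.

```python
import numpy as np


def sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Energy of the reference over energy of the error, in dB."""
    noise = reference - estimate
    return float(10 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-8)))


t = np.arange(44100) / 44100.0
reference = np.sin(2 * np.pi * 440 * t)   # one second of a 440 Hz tone
estimate = 0.9 * reference                # 10% amplitude error

value = sdr(reference, estimate)
print(f"{value:.2f} dB")                  # about 20 dB: error energy is 1% of the signal
```

Every additional 10 dB means the error energy dropped by another factor of ten, which helps when comparing per-stem scores.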

Real-Time vs Offline

| Aspect | Real-time | Offline |
|---|---|---|
| Latency | <50 ms (causal model) | Minutes per track |
| Quality | Lower SDR (no future context) | Best SDR (bidirectional) |
| Use case | Live monitoring, DJ sets | Final mix, stem mastering |

For REVITHION, run offline htdemucs_ft for master-quality stems; use real-time for preview only.
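The latency budget can be checked with simple arithmetic: a model cannot emit output for a chunk until it has received the whole chunk plus any lookahead. A minimal sketch, with the helper name `buffering_latency_ms` and the chunk sizes as illustrative assumptions; model compute time comes on top of this figure.

```python
def buffering_latency_ms(chunk_samples: int, lookahead_samples: int,
                         sample_rate: int) -> float:
    """Minimum delay from buffering alone, before any compute time."""
    return 1000.0 * (chunk_samples + lookahead_samples) / sample_rate


realtime = buffering_latency_ms(2048, 0, 44100)       # small causal chunk
offline = buffering_latency_ms(343980, 0, 44100)      # one 7.8 s segment at 44.1 kHz
print(f"{realtime:.1f} ms vs {offline:.0f} ms")
```

A 2048-sample chunk already uses most of a 50 ms budget, which is why real-time models give up the future context that offline models exploit.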

DAW Integration

Route each stem to its own mixer bus for independent EQ/compression/spatial processing.

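# Pseudocode: `session` stands in for your DAW's scripting API (hypothetical, not a real library).
stem_files = ["stems/mix_drums.wav", "stems/mix_bass.wav", "stems/mix_other.wav", "stems/mix_vocals.wav"]  # example paths
for stem_path, bus_id in zip(stem_files, [1, 2, 3, 4]):
    session.import_audio(stem_path, target_bus=bus_id, align_to="project_start")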
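Relying on list order to pair stems with buses is fragile, since a glob may return files in any order. A minimal sketch that derives the bus from the stem name at the end of each file name instead; `BUS_ORDER`, `assign_buses`, and the `<track>_<stem>.wav` naming are assumptions for illustration.

```python
from pathlib import Path
from typing import Dict, List

# Assumed bus layout: bus 1 = drums, 2 = bass, 3 = other, 4 = vocals.
BUS_ORDER = ["drums", "bass", "other", "vocals"]


def assign_buses(stem_files: List[str], order: List[str] = BUS_ORDER) -> Dict[str, int]:
    """Map each stem file to a 1-based bus number using its trailing stem name."""
    mapping = {}
    for path in stem_files:
        stem_name = Path(path).stem.rsplit("_", 1)[-1]   # "song_vocals" -> "vocals"
        if stem_name not in order:
            raise ValueError(f"unrecognised stem name in {path!r}")
        mapping[path] = order.index(stem_name) + 1
    return mapping


routing = assign_buses(["stems/song_vocals.wav", "stems/song_drums.wav"])
print(routing)
```

The resulting mapping can feed whatever import call your DAW exposes, and the bus assignment stays stable regardless of file order.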

Anti-Patterns

  • Separating already-separated stems: cascading models multiplies artifacts; always start from the original mix.
  • Ignoring overlap: overlap=0 creates audible seams at chunk boundaries because adjacent chunks are never crossfaded.
  • FP16 on long tracks: half-precision can accumulate rounding error over millions of samples; use FP32.
  • Skipping VRAM management: holding on to the previous track's tensors and never calling torch.cuda.empty_cache() between tracks can lead to OOM on long batches.
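The overlap point can be demonstrated without a model. The sketch below processes a signal in chunks and blends them with triangular weights, in the spirit of chunked inference; it is not the Demucs implementation. The stand-in "model" (`remove_mean`) depends on chunk context, so hard chunk boundaries produce jumps that the crossfade smooths out. All names and sizes are illustrative assumptions.

```python
import numpy as np


def overlap_add(signal: np.ndarray, process, segment: int, overlap: float) -> np.ndarray:
    """Process `signal` in chunks of `segment` samples and blend the overlaps."""
    stride = max(1, int(segment * (1 - overlap)))
    # Triangular weights: chunk centres count more than chunk edges.
    weight = np.concatenate([np.arange(1, segment // 2 + 1),
                             np.arange(segment - segment // 2, 0, -1)]).astype(float)
    out = np.zeros(len(signal))
    weight_sum = np.zeros(len(signal))
    for start in range(0, len(signal), stride):
        chunk = signal[start:start + segment]
        n = len(chunk)
        out[start:start + n] += weight[:n] * process(chunk)
        weight_sum[start:start + n] += weight[:n]
    return out / weight_sum


def remove_mean(chunk: np.ndarray) -> np.ndarray:
    """Stand-in for a model whose output depends on the chunk it sees."""
    return chunk - chunk.mean()


x = np.arange(48, dtype=float)   # a ramp makes boundary jumps easy to see
seam_hard = np.max(np.abs(np.diff(overlap_add(x, remove_mean, 8, 0.0))))
seam_soft = np.max(np.abs(np.diff(overlap_add(x, remove_mean, 8, 0.25))))
print(seam_hard, seam_soft)      # the overlapped run has much smaller jumps
```

With an identity `process`, the function returns the input unchanged for any overlap, which is a quick way to verify the blending weights.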

Checklist

  • Select model matching target stem count (4-stem vs 6-stem)
  • Verify GPU memory headroom before processing (segment size × batch)
  • Use --float32 and overlap ≥ 0.25 for production renders
  • Validate output with SDR against reference where available
  • Route stems to dedicated mixer buses before applying effects
  • Archive original mix alongside stems for reproducibility
Install via CLI
npx skills add https://github.com/SpiralCloudOmega/DevTeam6 --skill stem-separation
Repository Details
Stars: 1
Forks: 0
Branch: main
Path: SKILL.md