
A practical comparison of three leading open-source audio separation models — covering SDR scores, inference cost, real-world latency, and when each one actually makes sense in production.
If you've spent any time looking at AI music separation in the last twelve months, you've probably run into the same three names: Spleeter, htdemucs (Hybrid Transformer Demucs), and BS-RoFormer. They show up in every comparison post, every research paper, and every "how to extract vocals" tutorial — but the way they're compared is usually wrong. Most posts cite a single SDR number from a 2019 paper and call it a day.
That's not useful if you're trying to ship a product, build a pipeline, or pick a model for real audio.
This post compares the three on the dimensions that actually matter when you're deploying audio separation: per-stem quality, real-world speed, what a call actually costs on your bill, and which stem configurations each model can produce.
Everything below is based on published benchmarks plus our own production deployment of htdemucs at scale. Where we cite numbers, we cite the source.
| Model | Best for | Output stems | Quality (avg SDR) | Speed |
|---|---|---|---|---|
| Spleeter | Real-time, low-resource, batch processing | 2, 4, or 5 | ~5.9 dB (vocals) | ~100× real-time on GPU |
| htdemucs | Consumer-facing production apps, balance of quality and speed | 4 or 6 | ~9.0 dB (avg) | ~5–8× real-time on A40 |
| BS-RoFormer | Highest-fidelity offline work, mastering, archival | 4 (typically) | ~9.80 dB (avg) | ~2–3× real-time on A40 |
If you take only one thing from this post: htdemucs is the right default for almost any product, and you should probably be running htdemucs_ft rather than the default checkpoint. On Replicate's serverless pricing, all three Demucs variants (default, 6s, ft) cost essentially the same per call — but ft delivers meaningfully better separation. We didn't expect this when we started; it only became clear after looking at our actual billing.
BS-RoFormer is meaningfully better only on bass and only when latency doesn't matter. Spleeter is a 2019 model running on 2026 hardware — fast, but the quality gap is now audible.
The rest of this post explains why.
Music source separation quality is usually measured in Signal-to-Distortion Ratio (SDR), in decibels. Higher is better. The reference dataset is MUSDB18 (or MUSDB18-HQ for high-quality audio), which contains 150 full-length tracks with isolated stems for vocals, drums, bass, and "other."
A practical anchor: anything above ~9 dB on vocals is generally past the point where most listeners can tell the difference in a blind test. The gains from there are about edge cases — heavy reverb, doubled vocals, complex mixes.
A note on SI-SDR: Some recent papers report SI-SDR (scale-invariant SDR), which corrects for simple gain differences and is more robust. When numbers in this post differ from other sources, the metric definition is usually the reason.
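Since metric definitions cause most cross-post discrepancies, here's a minimal numpy sketch of both metrics (the standard formulas, not tied to any particular benchmark harness) showing how a pure gain error separates them:

```python
import numpy as np

def sdr(reference, estimate):
    """Plain SDR in dB: reference energy over residual energy."""
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference**2) / np.sum(noise**2))

def si_sdr(reference, estimate):
    """Scale-invariant SDR: project out any overall gain error first."""
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10(np.sum(target**2) / np.sum(noise**2))

# A pure gain error: separation is near-perfect, but the estimate is 6 dB hot.
rng = np.random.default_rng(0)
ref = rng.standard_normal(44100)                          # "ground truth" stem
est = 2.0 * ref + 0.01 * rng.standard_normal(44100)       # right content, wrong gain

print(round(sdr(ref, est), 1))     # plain SDR collapses to around 0 dB
print(round(si_sdr(ref, est), 1))  # SI-SDR still reports excellent separation
```

The same estimated stem scores wildly differently under the two definitions, which is exactly the kind of gap you see when comparing numbers across papers.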
Released by the Deezer research team in 2019, Spleeter is a U-Net architecture operating in the spectrogram domain. It comes in 2-stem (vocals/accompaniment), 4-stem (vocals/drums/bass/other), and 5-stem (adds piano) configurations.
It was a landmark release at the time — the first time anyone could run good-enough source separation on a laptop CPU without licensing fees. Six years later, it's been overtaken on quality by every modern model, but it remains the fastest and lightest option by a wide margin.
The fourth-generation Demucs model from Meta AI's research team. Unlike Spleeter, htdemucs is a hybrid model — it operates in both the time domain (waveform) and frequency domain (spectrogram), with a Transformer backbone connecting them. The original paper reports a 1.4 dB SDR improvement over the previous Demucs generation on MUSDB-HQ.
Two variants matter in practice:
- htdemucs — the standard 4-stem model
- htdemucs_6s — a 6-stem variant that adds isolated guitar and piano stems

There's also htdemucs_ft, a fine-tuned version that's slower but slightly more accurate on individual stems.
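Picking between the variants is just a checkpoint name on the command line. A sketch of building the invocation from Python — the `-n` and `--two-stems` flags exist in recent demucs releases, but check `demucs --help` against your installed version:

```python
import subprocess  # used if you actually run the command at the bottom
from typing import List, Optional

def demucs_cmd(track: str, model: str = "htdemucs_ft",
               two_stems: Optional[str] = None) -> List[str]:
    """Build a Demucs CLI invocation. -n picks the checkpoint;
    --two-stems collapses output to <stem> plus everything else."""
    cmd = ["demucs", "-n", model]
    if two_stems:
        cmd.append(f"--two-stems={two_stems}")
    cmd.append(track)
    return cmd

# Full 6-stem separation:
print(demucs_cmd("song.mp3", model="htdemucs_6s"))
# Vocals/instrumental split with the fine-tuned model (uncomment to run):
# subprocess.run(demucs_cmd("song.mp3", two_stems="vocals"), check=True)
```

Swapping `htdemucs` for `htdemucs_ft` is a one-word change, which matters later when we get to cost.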
Hybrid Demucs, its direct predecessor, won the 2021 Sony Music Demixing Challenge, and htdemucs remains the default for most production pipelines that aren't chasing the absolute SOTA.
The current state of the art on MUSDB18-HQ, BS-RoFormer (Band-Split RoPE Transformer) is a pure-Transformer architecture that replaces RNN modules with a hierarchical RoPE Transformer. It splits the input spectrogram into multiple non-overlapping frequency sub-bands, exploiting the fact that different instruments occupy characteristic frequency ranges (bass low, cymbals high, etc.).
BS-RoFormer trained on MUSDB18-HQ plus 500 extra songs won first place in the Music Source Separation track of the Sound Demixing Challenge 2023 (SDX23). Even the smaller version trained without extra data reports 9.80 dB average SDR on MUSDB18-HQ.
The downside: it's slower and more memory-intensive than htdemucs, and the production-ready open weights are still scattered across community implementations rather than a single canonical release.
This is where most comparison posts fall apart — they cherry-pick a single number. Here are the per-stem SDR scores from the published literature, on MUSDB18-HQ (no extra training data unless noted):
| Model | Vocals | Drums | Bass | Other | Average |
|---|---|---|---|---|---|
| Spleeter (4-stem) | ~5.9 dB | ~5.9 dB | ~5.5 dB | ~4.5 dB | ~5.4 dB |
| htdemucs (default) | ~8.1 dB | ~8.4 dB | ~8.6 dB | ~5.9 dB | ~7.7 dB |
| htdemucs_ft (fine-tuned) | ~8.9 dB | ~9.5 dB | ~9.4 dB | ~6.4 dB | ~8.5 dB |
| BS-RoFormer (no extra data) | — | — | ~11.28 dB | — | ~9.80 dB |
| BS-RoFormer (with 500 extra songs) | — | — | — | — | ~9.76 dB+ |
Sources: Spleeter scores from the Spleeter JOSS paper and the BeatsToRapOn separation benchmark. htdemucs scores from "Hybrid Spectrogram and Waveform Source Separation" and "Benchmarks and leaderboards for sound demixing tasks". BS-RoFormer scores from the SDX23 results documented in the same paper.
A few observations from the table:
The Spleeter → htdemucs gap is bigger than the htdemucs → BS-RoFormer gap. Going from Spleeter to htdemucs gets you roughly +2.3 dB on average. Going from htdemucs_ft to BS-RoFormer gets you roughly +1.3 dB. This is why htdemucs is the practical sweet spot for most use cases.
BS-RoFormer's biggest win is on bass. Bass separation jumps from ~8.6 dB (htdemucs) to ~11.28 dB (BS-RoFormer) — a difference you can hear in a blind test. The vocal and drum gains are smaller. If you're building something that specifically needs clean bass (DJ tools, transcription, music education for bass players), BS-RoFormer is worth the extra compute. For everything else, the gain is on the edge of perceptible.
htdemucs_ft is underrated. Many comparison posts only test the default htdemucs checkpoint. The fine-tuned version (htdemucs_ft) closes most of the gap to BS-RoFormer at the cost of roughly 4× the inference time — still faster than BS-RoFormer in practice.
Approximate end-to-end time for a 3-minute song on a single A40 GPU, measured from API call to download-ready output:
| Model | End-to-end time | Real-time multiplier |
|---|---|---|
| Spleeter (4-stem, GPU) | ~2–5 seconds | ~40–90× real-time |
| htdemucs (default, 4-stem) | ~30–45 seconds | ~4–6× real-time |
| htdemucs_6s (6-stem) | ~40–60 seconds | ~3–5× real-time |
| htdemucs_ft (fine-tuned) | ~90–150 seconds | ~1.2–2× real-time |
| BS-RoFormer | ~60–120 seconds | ~1.5–3× real-time |
Notes:
- The overlap parameter is a big speed lever. The default overlap=0.25 is a reasonable trade-off; setting overlap=0.5 improves quality slightly at ~2× the cost; setting overlap=0 makes it noticeably faster but introduces audible chunking artifacts at segment boundaries.

If you're shipping a consumer product where users wait for results, anything slower than ~60 seconds for a 3-minute song starts to hurt conversion in our experience. That keeps htdemucs (default and 6s) inside acceptable territory and pushes htdemucs_ft and BS-RoFormer toward async/queued flows where the user can come back later.
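The overlap trade-off has a simple first-order model: segments are processed with a stride of (1 − overlap) × segment length, so compute scales with how many segments you run. This is a sketch, not a measurement — real cost also includes the cross-fade blending at boundaries, so measured ratios can run a bit higher than the model predicts:

```python
def relative_cost(overlap: float) -> float:
    """Segments are processed with stride (1 - overlap) * segment_length,
    so compute scales roughly as 1 / (1 - overlap)."""
    assert 0.0 <= overlap < 1.0
    return 1.0 / (1.0 - overlap)

print(relative_cost(0.0))   # 1.0  -> fastest, audible chunking artifacts
print(relative_cost(0.25))  # ~1.33 -> the default trade-off
print(relative_cost(0.5))   # 2.0  -> the quality-leaning setting
```

Handy when budgeting GPU time: doubling overlap from the default does not double your segment count, but pushing it toward 1.0 blows up fast.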
This is the section where most online comparisons are completely wrong. Public pricing on Replicate looks straightforward — A40 at $0.000725/second, multiply by inference time, done. In practice, that calculation is off by roughly 2× from your actual bill, and there's a more interesting wrinkle that almost no comparison post mentions.
We've been running htdemucs in production at aistemsplitter.org for several months across all three Demucs variants — htdemucs (default 4-stem), htdemucs_6s (6-stem), and htdemucs_ft (fine-tuned). On Replicate's A40 GPU instances, all three variants cost approximately the same per call in our actual billing: roughly 22 calls per $1, or about $0.045 per song.
That's worth pausing on, because it contradicts what you'd expect from the published inference times.
| Model | Naive cost (public pricing × inference time) | Our actual measured cost |
|---|---|---|
| Spleeter (GPU) | <$0.002 | <$0.005 |
| htdemucs (default) | ~$0.022 | ~$0.045 |
| htdemucs_6s (6-stem) | ~$0.029 | ~$0.045 |
| htdemucs_ft (fine-tuned) | ~$0.11 | ~$0.045 |
| BS-RoFormer | ~$0.065 | ~$0.06–0.10 (varies) |
The naive pricing model assumes you pay only for pure GPU inference time. In reality, every Replicate call also includes per-call overhead — container startup, loading model weights onto the GPU, downloading the input audio, and encoding and uploading the output stems — all billed as instance time.
These overheads are roughly fixed costs per invocation — they don't scale with how complex your model is. When the GPU forward pass goes from 30 seconds (htdemucs default) to 90 seconds (htdemucs_ft), the additional compute matters less to the bill than you'd expect, because the per-call overhead is already eating most of the budget.
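A toy cost model makes the effect concrete. The $0.000725/s A40 rate is Replicate's posted price cited above; the 30-second fixed overhead is an assumed figure for illustration, not a measured one, and this sketch narrows the gap rather than fully flattening it the way our actual bill did:

```python
A40_RATE = 0.000725  # $/second, Replicate's posted A40 price

def naive_cost(inference_s: float) -> float:
    """What the back-of-envelope calculation predicts."""
    return inference_s * A40_RATE

def call_cost(inference_s: float, overhead_s: float = 30.0) -> float:
    """Per-call cost once a fixed overhead (boot, model load, I/O) is
    billed alongside the forward pass. overhead_s=30 is an assumption."""
    return (overhead_s + inference_s) * A40_RATE

# 30s (htdemucs default) vs 90s (htdemucs_ft) of pure inference:
print(naive_cost(90) / naive_cost(30))  # 3.0x apart on paper
print(call_cost(90) / call_cost(30))    # 2.0x apart once overhead is billed
```

The bigger the fixed overhead relative to the forward pass, the closer the variants converge — which is the direction our billing data points.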
The practical implication: if you're already on the htdemucs platform, there's almost no economic reason not to use the highest-quality variant your latency budget allows. If your users will wait 60 seconds, use htdemucs_6s (6 stems, default speed). If they'll wait 2 minutes, use htdemucs_ft (fine-tuned, near-BS-RoFormer quality on most stems). The bill is the same.
This is the opposite of the conclusion you'd reach by reading academic papers and Replicate's posted GPU pricing. It only shows up when you actually look at your bill at the end of the month.
If you're modeling unit economics for a stem separation product, plan for $0.04–$0.05 per song as your floor, regardless of which Demucs variant you choose.
Two important caveats: these numbers come from one provider's A40 serverless pricing and our particular traffic pattern, so your bill may differ; and BS-RoFormer's cost varies more because its open weights are scattered across community deployments rather than a single canonical release.
| Model | Available stem configurations | Notes |
|---|---|---|
| Spleeter | 2, 4, or 5 stems | 5-stem adds piano (separate model) |
| htdemucs | 4 or 6 stems | htdemucs_6s adds guitar + piano |
| BS-RoFormer | 4 stems (mostly); some 6-stem community builds | Quality drops on the rarer guitar/piano stems |
This is where htdemucs_6s genuinely stands alone. If your use case requires isolated guitar or piano stems (music education, multi-track remixing, transcription), htdemucs_6s is the only widely-deployed model that delivers them at production quality. BS-RoFormer 6-stem variants exist in the community but are less mature; the canonical BS-RoFormer is a 4-stem system.
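As a concrete sketch of what 6-stem output looks like on disk — assuming Demucs's default layout of separated/&lt;model&gt;/&lt;track&gt;/&lt;stem&gt;.wav, which recent releases use (check the --out flag for your version):

```python
from pathlib import Path

# Stem lists follow the model configurations described above.
STEMS = {
    "htdemucs": ["vocals", "drums", "bass", "other"],
    "htdemucs_6s": ["vocals", "drums", "bass", "other", "guitar", "piano"],
}

def stem_paths(model: str, track: str, out_dir: str = "separated"):
    """Expected output paths for one separated track."""
    return [Path(out_dir) / model / track / f"{s}.wav" for s in STEMS[model]]

for p in stem_paths("htdemucs_6s", "my_song"):
    print(p.as_posix())
```

Useful for the unglamorous part of a pipeline: knowing which files to wait for, upload, or expose to the user after a job finishes.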
For "vocals only" or "instrumental only" use cases (the karaoke crowd), all three models work fine, and you should pick on speed, not quality. Spleeter at 90× real-time will give you a usable instrumental in a few seconds.
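For the two-stem case there's also a common shortcut, independent of which model you use: if you only have a vocal stem, the instrumental is just the mix minus the vocals — assuming the stem is time-aligned with the mix and at mix gain, which separation models generally aim for. A toy numpy sketch:

```python
import numpy as np

def instrumental_from_vocals(mix: np.ndarray, vocals: np.ndarray) -> np.ndarray:
    """Subtract an aligned vocal estimate from the mix to recover
    the accompaniment. Assumes matching sample rate and gain."""
    n = min(len(mix), len(vocals))  # guard against off-by-one lengths
    return mix[:n] - vocals[:n]

mix = np.array([0.5, 0.2, -0.1, 0.4])     # toy mono samples
vocals = np.array([0.3, 0.0, -0.1, 0.1])  # estimated vocal stem
print(instrumental_from_vocals(mix, vocals))
```

In practice the subtraction inherits whatever error the vocal estimate has, so artifacts in the vocal stem show up inverted in the instrumental.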
After running these in production for several months, here's the simple decision tree we'd give someone starting from scratch:
Pick Spleeter when:
- You need real-time or near-real-time output, or you're running on CPU or constrained hardware
- You're batch-processing a large catalog and throughput matters more than quality

Pick htdemucs when:
- You're shipping a production app and want the best quality per dollar
- You need isolated guitar or piano stems (htdemucs_6s)

Pick BS-RoFormer when:
- Quality is the only priority and latency isn't — offline mastering or archival work
- You specifically need the cleanest possible bass stem
We run htdemucs_6s in production at aistemsplitter.org — a hosted version of 6-stem separation aimed at people who don't want to set up the local toolchain (which, between PyTorch versions, CUDA versions, and audio dependency hell, takes most people a full afternoon).
A few things we learned that aren't in the papers:
- Per-call cost is effectively flat across htdemucs, htdemucs_6s, and htdemucs_ft. The fixed overhead per call swamps the marginal compute difference between models. This single fact changed how we think about model selection: pick on quality, not on theoretical compute cost, because the cost difference doesn't actually show up in your bill.

If you want to hear what 6-stem htdemucs sounds like on real audio without setting up the toolchain, our site has free credits to try a few songs.
If you're working in this space and have data we'd find interesting — or you've hit something on these models we haven't — drop us a line.
Last updated: April 2026. If you find an error in the data, the SDR numbers, or any of the practical claims, send us a correction and we'll update the post with attribution.
