Music Source Separation in the Waveform Domain
Summary
This paper introduces Demucs, a waveform-to-waveform model for music source separation that operates directly on raw audio rather than on spectrogram masks. The authors compare two waveform-domain architectures: an adaptation of Conv-Tasnet (originally for speech separation) and their proposed Demucs architecture. Conv-Tasnet beats many spectrogram-domain methods but introduces significant artifacts according to human evaluations.
Demucs uses a U-Net structure with bidirectional LSTM layers and operates end-to-end on waveforms. On the MUSDB dataset with proper data augmentation, Demucs achieves 6.3 dB SDR on average, outperforming all state-of-the-art architectures at the time, including Conv-Tasnet. With 150 extra training songs, it reaches 6.8 dB SDR and even surpasses the IRM oracle for the bass source. The paper also demonstrates that Demucs can be compressed to 120 MB via quantization without any loss of accuracy. Human evaluations show Demucs produces more natural-sounding separations than competitors, though some inter-source bleeding remains, particularly between vocals and other sources.
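To make the dB SDR figures above more concrete, here is a minimal sketch of a signal-to-distortion ratio in its simplest form: the energy of the reference source divided by the energy of the estimation error, on a log scale. This is an illustrative simplification, not the paper's evaluation protocol; benchmark numbers are typically produced with dedicated evaluation tooling that this sketch does not reproduce. The function names `sdr_db` and `mean_sdr_db` and the `eps` stabilizer are assumptions made for the example.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Simplified SDR in dB for one mono source (assumed definition).

    reference, estimate: equal-length sequences of float samples.
    eps is a hypothetical stabilizer to avoid division by zero.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


def mean_sdr_db(references, estimates):
    """Average the per-source SDR over all sources, keyed by source name."""
    scores = [sdr_db(references[name], estimates[name]) for name in references]
    return sum(scores) / len(scores)


# Toy usage: an estimate that is 10% too quiet on every sample.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
score = sdr_db(reference, estimate)
```

Under this simplified definition, each additional 10 dB corresponds to a tenfold reduction in error energy relative to the source energy, so a gap of half a decibel between systems is a modest but measurable difference.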
Key Claims
- Waveform-domain U-Net with bidirectional LSTM outperforms spectrogram-mask approaches on music
- Conv-Tasnet adapted to music beats spectrogram methods but has audible artifacts
- Demucs achieves 6.3 dB SDR on MUSDB (6.8 with extra data), surpassing IRM oracle for bass
- Model quantizes to 120 MB with no loss of accuracy
- Human evaluations confirm superior naturalness, but bleeding persists, particularly between vocals and other sources
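The IRM oracle mentioned in the claims above is a reference point for spectrogram masking: it builds masks from the ground-truth sources, which a real system never sees. The sketch below shows one common magnitude-based definition of an ideal ratio mask on toy values. It is an assumption for illustration; the paper's oracle may use a different variant (for example, power instead of magnitude), and the helper names `ideal_ratio_masks` and `apply_mask` are hypothetical.

```python
def ideal_ratio_masks(source_mags, eps=1e-12):
    """Compute a magnitude-ratio mask per source (assumed IRM variant).

    source_mags: dict mapping source name to a list of spectrogram
    magnitudes, one value per time-frequency bin, all of equal length.
    """
    names = list(source_mags)
    n_bins = len(source_mags[names[0]])
    # Total magnitude across all ground-truth sources in each bin.
    totals = [sum(source_mags[name][i] for name in names) for i in range(n_bins)]
    return {
        name: [source_mags[name][i] / (totals[i] + eps) for i in range(n_bins)]
        for name in names
    }


def apply_mask(mask, mixture_mags):
    """Scale each mixture bin by the mask value to estimate one source."""
    return [m * x for m, x in zip(mask, mixture_mags)]


# Toy usage with three time-frequency bins and two sources.
sources = {"bass": [3.0, 0.0, 1.0], "vocals": [1.0, 2.0, 1.0]}
masks = ideal_ratio_masks(sources)
# Toy mixture magnitudes; real mixture magnitudes are not simply the sum
# of source magnitudes, because the sources' phases interact.
mixture = [4.0, 2.0, 2.0]
bass_estimate = apply_mask(masks["bass"], mixture)
```

A masked estimate of this kind is usually resynthesized with the mixture's phase, which caps what any masking approach can achieve. A waveform model is not bound by that cap, which is how it can exceed the oracle on a source such as bass.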
Related
- ../concepts/spectrogram-unets — Demucs contrasts with spectrogram-based U-Net mask inference approaches
- ../concepts/synthetic-mixing-pipelines — Data augmentation via synthetic mixes critical to performance
- ../entities/demucs — The model introduced by this paper
- ../entities/musdb18 — Benchmark dataset used for evaluation
- ../entities/asteroid — Open-source toolkit that includes Demucs implementation