Hybrid Transformers for Music Source Separation
Summary
This paper introduces Hybrid Transformer Demucs (HT Demucs), a hybrid temporal/spectral bi-U-Net architecture for music source separation. Building on the Hybrid Demucs architecture, the authors replace the innermost layers with a cross-domain Transformer Encoder that applies self-attention within each domain (temporal and spectral) and cross-attention across domains. This design allows the model to integrate long-range contextual information that pure convolutional architectures may miss.
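As a rough illustration of the cross-domain idea described above, the following dependency-free Python sketch applies self-attention within each branch and then cross-attention between the two. It is a toy, not the authors' implementation: the function names (`attention`, `cross_domain_layer`) are hypothetical, there are no learned projections, multiple heads, normalization, or feed-forward sublayers, and folding both attention types into a single function is an assumption made for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def add(a, b):
    """Element-wise sum of two equally shaped token lists (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def cross_domain_layer(time_tokens, spec_tokens):
    """Hypothetical layer: self-attention per domain, then cross-attention."""
    # Self-attention: each branch attends only to its own tokens.
    t = add(time_tokens, attention(time_tokens, time_tokens, time_tokens))
    s = add(spec_tokens, attention(spec_tokens, spec_tokens, spec_tokens))
    # Cross-attention: queries come from one branch, keys/values from the other.
    t_out = add(t, attention(t, s, s))
    s_out = add(s, attention(s, t, t))
    return t_out, s_out


if __name__ == "__main__":
    time_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 temporal tokens
    spec_tokens = [[0.5, 0.5]] * 5                        # 5 spectral tokens
    t_out, s_out = cross_domain_layer(time_tokens, spec_tokens)
    print(len(t_out), len(s_out))
```

The two branches may have different sequence lengths, because the output length of an attention call follows the queries while the keys and values can come from a sequence of any length.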
The authors find that HT Demucs performs poorly when trained solely on MUSDB, suggesting that Transformers are data-hungry. With 800 extra training songs, however, it outperforms Hybrid Demucs trained on the same data by 0.45 dB of SDR. Further improvements come from sparse attention kernels, which extend the receptive field, and from per-source fine-tuning, yielding a state-of-the-art 9.20 dB of SDR on MUSDB with extra training data. The work demonstrates that attention-based architectures can benefit music source separation when sufficient training data is available.
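To give a feel for the scale of the SDR figures quoted above, here is a minimal sketch of a signal-to-distortion ratio computed as a plain energy ratio in decibels. This is a simplified definition chosen for illustration; benchmark numbers are normally produced by a dedicated evaluation toolkit with per-chunk and per-track aggregation, which this sketch does not reproduce. The helper names `sdr_db` and `error_energy_ratio` are hypothetical.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Simplified SDR: 10 * log10(signal energy / error energy)."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps guards against division by zero for silent or perfect signals.
    return 10.0 * math.log10((signal + eps) / (error + eps))


def error_energy_ratio(sdr_gain_db):
    """Factor by which error energy changes for a given SDR gain,
    assuming the reference signal energy stays fixed."""
    return 10.0 ** (-sdr_gain_db / 10.0)


if __name__ == "__main__":
    reference = [1.0, 0.0, -1.0, 0.0]
    estimate = [0.9, 0.0, -0.9, 0.0]
    print(round(sdr_db(reference, estimate), 2))
    print(round(error_energy_ratio(0.45), 3))
```

Under this simplified definition, a 0.45 dB gain corresponds to roughly 10% less residual error energy for the same reference signal.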
Key Claims
- Cross-domain Transformer Encoder (self-attention within domains, cross-attention across) improves over pure convolutional bi-U-Net
- Transformers require more training data: HT Demucs underperforms when trained on MUSDB alone but outperforms Hybrid Demucs once 800 extra songs are added
- Sparse attention kernels extend effective receptive field without quadratic cost
- Per-source fine-tuning yields additional SDR gains
- Achieved SOTA 9.20 dB SDR on MUSDB with extra training data at time of publication
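The sparse-attention claim above can be made concrete with a small sketch of how a sparsity pattern might be chosen. This summary does not say how the pattern is selected, so the example assumes one possible scheme, locality-sensitive hashing with random hyperplanes: tokens that land in the same hash bucket often enough are allowed to attend to each other. The function name `lsh_sparse_mask` and all parameter values are hypothetical. The toy also materializes the full mask, so it only illustrates the pattern; an actual sparse kernel would avoid building the dense matrix.

```python
import random


def lsh_sparse_mask(tokens, n_rounds=8, n_bits=2, min_matches=2, seed=0):
    """Boolean mask where mask[i][j] is True if tokens i and j share a
    hash bucket in at least `min_matches` of `n_rounds` hashing rounds."""
    rng = random.Random(seed)
    n, dim = len(tokens), len(tokens[0])
    matches = [[0] * n for _ in range(n)]
    for _ in range(n_rounds):
        # Each round draws fresh random hyperplanes; a bucket is the tuple
        # of signs of the token's projections onto those hyperplanes.
        planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(n_bits)]
        buckets = [
            tuple(sum(p * x for p, x in zip(plane, tok)) >= 0.0
                  for plane in planes)
            for tok in tokens
        ]
        for i in range(n):
            for j in range(n):
                if buckets[i] == buckets[j]:
                    matches[i][j] += 1
    return [[matches[i][j] >= min_matches for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
    for row in lsh_sparse_mask(tokens):
        print(row)
```

Attention scores would then be computed only where the mask is True, so similar tokens can interact over long distances while most pairs are skipped.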
Related
- ../concepts/spectrogram-unets — HT Demucs builds on the hybrid spectrogram/waveform U-Net architecture
- ../concepts/synthetic-mixing-pipelines — Extra training songs created via synthetic mixing for data augmentation
- ../entities/demucs — Direct predecessor; HT Demucs adds cross-domain Transformer layers
- ../entities/musdb18 — Primary benchmark dataset used for evaluation