Hybrid Transformers for Music Source Separation

Summary

This paper introduces Hybrid Transformer Demucs (HT Demucs), a hybrid temporal/spectral bi-U-Net architecture for music source separation. Building on the Hybrid Demucs architecture, the authors replace the innermost layers with a cross-domain Transformer Encoder that applies self-attention within each domain (temporal and spectral) and cross-attention across domains. This design allows the model to integrate long-range contextual information that pure convolutional architectures may miss.
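The cross-domain idea can be sketched in a few lines of NumPy: tokens from each branch first attend to themselves, then each branch queries the other. This is only an illustration under simplifying assumptions, not the authors' implementation. It omits learned projections, multiple heads, normalization, positional embeddings, and feed-forward blocks, and the names `attention` and `cross_domain_layer_pair` as well as the token shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention on (tokens, dim) arrays."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values

def cross_domain_layer_pair(time_tokens, spec_tokens):
    """Hypothetical layer pair: self-attention, then cross-attention."""
    # Self-attention within each domain, with a residual connection.
    t = time_tokens + attention(time_tokens, time_tokens, time_tokens)
    s = spec_tokens + attention(spec_tokens, spec_tokens, spec_tokens)
    # Cross-attention: each domain queries the other domain's tokens.
    t_out = t + attention(t, s, s)
    s_out = s + attention(s, t, t)
    return t_out, s_out

rng = np.random.default_rng(0)
time_tokens = rng.standard_normal((6, 8))   # 6 temporal tokens, dim 8
spec_tokens = rng.standard_normal((10, 8))  # 10 spectral tokens, dim 8
t_out, s_out = cross_domain_layer_pair(time_tokens, spec_tokens)
print(t_out.shape, s_out.shape)  # (6, 8) (10, 8)
```

The two branches deliberately have different token counts here: cross-attention only requires a shared feature dimension, so each output keeps the length of its own query sequence.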

The authors find that HT Demucs performs poorly when trained solely on MUSDB, which suggests that Transformers are data-hungry. With 800 extra training songs, however, it outperforms Hybrid Demucs trained on the same data by 0.45 dB of SDR. Sparse attention kernels that extend the receptive field, combined with per-source fine-tuning, bring further improvements and yield a state-of-the-art 9.20 dB of SDR on MUSDB with extra training data. The work demonstrates that attention-based architectures can benefit music source separation when sufficient training data is available.
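To give a feel for the size of a 0.45 dB gain, the sketch below uses the plain energy-ratio definition of SDR, 10·log10(‖s‖² / ‖s − ŝ‖²). This is an assumption made for illustration: benchmark figures are normally produced with dedicated evaluation tooling and a specific aggregation over tracks and time windows, which this snippet does not reproduce. The test signal, noise levels, and the function name `sdr_db` are likewise hypothetical.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-10):
    """Signal-to-distortion ratio in dB (simplified energy-ratio form)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    # eps guards against division by zero for a perfect estimate.
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))

# Toy example: a 440 Hz tone with two levels of residual error.
t = np.linspace(0.0, 1.0, 8000, endpoint=False)
reference = np.sin(2 * np.pi * 440.0 * t)
noise = np.random.default_rng(0).standard_normal(t.shape)

baseline = reference + 0.12 * noise   # larger residual
improved = reference + 0.114 * noise  # residual amplitude reduced by 5%

gain = sdr_db(reference, improved) - sdr_db(reference, baseline)
print(f"{gain:.2f} dB")  # 0.45 dB
```

Under this simplified definition, a 0.45 dB improvement corresponds to a reduction of about 5% in residual amplitude, or about 10% in residual energy.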

Key Claims

Cross-domain Transformer Encoder (self-attention within domains, cross-attention across) improves over a purely convolutional bi-U-Net
Transformers require more training data: HT Demucs underperforms when trained on MUSDB alone but excels with 800 extra training songs
Sparse attention kernels extend effective receptive field without quadratic cost
Per-source fine-tuning yields additional SDR gains
Achieved SOTA 9.20 dB SDR on MUSDB with extra training data at time of publication
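The sparse-attention claim can be illustrated with a masked attention sketch in which each token attends only to a small subset of the others. The sparsity pattern below is chosen by a random-hyperplane hashing heuristic, which is an assumption for illustration, as are the names `lsh_match_counts` and `sparse_attention` and all parameter values. This version still builds a dense score matrix, so it shows the pattern only; real savings require kernels that never materialize the masked entries.

```python
import numpy as np

def lsh_match_counts(tokens, rounds=32, bits=2, seed=0):
    """Count, per token pair, how many hashing rounds share a bucket."""
    rng = np.random.default_rng(seed)
    n, dim = tokens.shape
    counts = np.zeros((n, n), dtype=int)
    for _ in range(rounds):
        planes = rng.standard_normal((dim, bits))
        signs = (tokens @ planes) > 0            # (n, bits) sign pattern
        bucket = signs @ (1 << np.arange(bits))  # id in [0, 2**bits)
        counts += bucket[:, None] == bucket[None, :]
    return counts

def sparse_attention(tokens, keep_fraction=0.1, **lsh_kwargs):
    """Self-attention restricted to frequently co-hashed token pairs."""
    n, dim = tokens.shape
    counts = lsh_match_counts(tokens, **lsh_kwargs)
    # Threshold chosen so that roughly keep_fraction of pairs survive.
    threshold = np.quantile(counts, 1.0 - keep_fraction)
    mask = counts >= threshold
    np.fill_diagonal(mask, True)  # every token may attend to itself
    scores = tokens @ tokens.T / np.sqrt(dim)
    scores = np.where(mask, scores, -np.inf)  # drop masked pairs
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens, mask

rng = np.random.default_rng(1)
tokens = rng.standard_normal((64, 16))
output, mask = sparse_attention(tokens, keep_fraction=0.1)
print(output.shape, f"kept {mask.mean():.0%} of pairs")
```

Because the match counts are integers, ties at the threshold mean the kept fraction is only approximately `keep_fraction`.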

Related

../concepts/spectrogram-unets: HT Demucs builds on the hybrid spectrogram/waveform U-Net architecture
../concepts/synthetic-mixing-pipelines: Extra training songs created via synthetic mixing for data augmentation
../entities/demucs: Direct predecessor; HT Demucs adds cross-domain Transformer layers
../entities/musdb18: Primary benchmark dataset used for evaluation