Hybrid Transformers for Music Source Separation

Summary

This paper introduces Hybrid Transformer Demucs (HT Demucs), a hybrid temporal/spectral bi-U-Net architecture for music source separation. Building on the Hybrid Demucs architecture, the authors replace the innermost layers with a cross-domain Transformer Encoder that applies self-attention within each domain (temporal and spectral) and cross-attention across domains. This design allows the model to integrate long-range contextual information that pure convolutional architectures may miss.
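The self-attention and cross-attention pattern described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes single-head attention with no learned projections, and omits layer normalization, feed-forward blocks and positional embeddings. The names `attention` and `cross_domain_layer` and the toy shapes are hypothetical.

```python
import numpy as np

def attention(queries, keys, values):
    """Scaled dot-product attention.

    queries: (Tq, d); keys and values: (Tk, d). Returns (Tq, d).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)              # (Tq, Tk)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values

def cross_domain_layer(temporal, spectral):
    """One illustrative block: self-attention within each domain,
    then cross-attention where each domain queries the other."""
    # Self-attention: each branch attends only to its own sequence.
    temporal = temporal + attention(temporal, temporal, temporal)
    spectral = spectral + attention(spectral, spectral, spectral)
    # Cross-attention: queries come from one branch, keys/values from the other.
    new_temporal = temporal + attention(temporal, spectral, spectral)
    new_spectral = spectral + attention(spectral, temporal, temporal)
    return new_temporal, new_spectral

# Toy inputs: 12 temporal steps and 20 flattened spectral positions,
# both with an assumed shared channel width of 8.
rng = np.random.default_rng(0)
temporal_in = rng.standard_normal((12, 8))
spectral_in = rng.standard_normal((20, 8))
temporal_out, spectral_out = cross_domain_layer(temporal_in, spectral_in)
```

Each branch keeps its own sequence length, so the two domains can exchange information without being resampled to a common grid.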

The authors find that HT Demucs performs poorly when trained solely on MUSDB, suggesting that Transformers are data-hungry. With 800 extra training songs, however, it outperforms Hybrid Demucs trained on the same data by 0.45 dB of SDR. Sparse attention kernels, used to extend the receptive field, and per-source fine-tuning bring further gains, for a state-of-the-art 9.20 dB SDR on MUSDB with extra training data. The work demonstrates that attention-based architectures can benefit music source separation when sufficient training data is available.
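To make the dB figures concrete, here is a sketch of a simplified signal-to-distortion ratio, 10·log10(‖s‖² / ‖s − ŝ‖²). This plain energy ratio is an assumption for illustration and does not reproduce the evaluation protocol behind the reported numbers. The function name `simple_sdr` and the `eps` guard are hypothetical.

```python
import numpy as np

def simple_sdr(reference, estimate, eps=1e-8):
    """Simplified SDR in dB: reference energy over residual error energy.

    `eps` (an assumed guard value) avoids division by zero and log(0).
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))

# Toy check: an estimate that is uniformly 10% too quiet.
reference = np.ones(100)
estimate = 0.9 * reference
sdr_db = simple_sdr(reference, estimate)

# Under this simplified definition, a gain of g dB scales the
# residual error energy by 10 ** (-g / 10).
error_ratio_for_gain = 10 ** (-0.45 / 10)
```

Under this simplified definition, a 0.45 dB gain corresponds to the residual error energy shrinking to about 0.90 of its previous value.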

Key Claims

  • Cross-domain Transformer Encoder (self-attention within each domain, cross-attention across domains) improves over the Hybrid Demucs bi-U-Net baseline
  • Transformers require more training data: HT Demucs performs poorly when trained on MUSDB alone but outperforms Hybrid Demucs with 800 extra training songs
  • Sparse attention kernels extend effective receptive field without quadratic cost
  • Per-source fine-tuning yields additional SDR gains
  • Achieves 9.20 dB SDR on MUSDB with extra training data, state of the art at the time of publication
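
The sparse-attention claim can be illustrated with a toy in which each query attends to only a few keys. The selection rule used here (keep the `keep` highest-scoring keys per query) is a swapped-in assumption chosen for brevity, not the selection scheme used in the paper. This sketch also still builds the dense score matrix, so it shows only the sparsity pattern and not the memory or compute savings of a real sparse kernel. The name `topk_sparse_attention` is hypothetical.

```python
import numpy as np

def topk_sparse_attention(queries, keys, values, keep):
    """Attention restricted to the `keep` highest-scoring keys per query.

    queries: (Tq, d); keys and values: (Tk, d).
    Returns the output (Tq, d) and the sparse weight matrix (Tq, Tk).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)                  # dense (Tq, Tk)
    # Per-row threshold: the keep-th largest score.
    kth = np.partition(scores, -keep, axis=-1)[:, -keep][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)       # drop the rest
    masked = masked - masked.max(axis=-1, keepdims=True)    # stability
    weights = np.exp(masked)                                # exp(-inf) == 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
q = rng.standard_normal((6, 4))
k = rng.standard_normal((10, 4))
v = rng.standard_normal((10, 4))
out, weights = topk_sparse_attention(q, k, v, keep=3)
nonzero_per_query = (weights > 0).sum(axis=1)  # each query keeps `keep` keys
```

Because each query touches only a fixed number of keys, a kernel that never materializes the dropped entries can afford longer sequences for the same memory budget, which is how sparsity allows a wider receptive field.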