Music Source Separation with Band-Split RoPE Transformer
Summary
This paper proposes BS-RoFormer, a frequency-domain approach to music source separation (MSS) based on a Band-Split RoPE Transformer. Unlike waveform-domain models such as Demucs, BS-RoFormer operates on complex spectrograms. Its core component is a band-split module that projects the input spectrogram into subband-level representations, followed by a stack of hierarchical Transformers that model both inner-band and inter-band sequences for multi-band mask estimation.
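To make the band-split idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a toy spectrogram stored as nested lists `spec[t][f]` of complex values and hypothetical band edges `[0, 2, 4, 8]`; the projection of each band to a common feature size is omitted. The helper names `band_split`, `inner_band_sequences`, and `inter_band_sequences` are illustrative only, and the reading of "inner-band" as a sequence along time within one band and "inter-band" as a sequence across bands within one frame is an assumption of this sketch.

```python
def band_split(spec, band_edges):
    """Split each frame of a complex spectrogram into subbands.

    spec: list of T frames, each a list of F complex bins.
    band_edges: hypothetical increasing bin indices, e.g. [0, 2, 4, 8].
    Returns bands[t][k]: real parts then imaginary parts of band k at frame t.
    """
    if band_edges[0] != 0 or band_edges[-1] != len(spec[0]):
        raise ValueError("band edges must cover all frequency bins")
    out = []
    for frame in spec:
        frame_bands = []
        for lo, hi in zip(band_edges[:-1], band_edges[1:]):
            chunk = frame[lo:hi]
            # Concatenate real and imaginary parts into one real-valued vector.
            frame_bands.append([c.real for c in chunk] + [c.imag for c in chunk])
        out.append(frame_bands)
    return out


def inner_band_sequences(bands):
    """One sequence per band, running along time (assumed meaning of inner-band)."""
    n_bands = len(bands[0])
    return [[frame[k] for frame in bands] for k in range(n_bands)]


def inter_band_sequences(bands):
    """One sequence per time frame, running across bands (assumed meaning of inter-band)."""
    return [list(frame) for frame in bands]


# Toy input: 3 time frames, 8 frequency bins.
spec = [[complex(t, f) for f in range(8)] for t in range(3)]
bands = band_split(spec, [0, 2, 4, 8])
```

In this toy setup the bands are 2, 2, and 4 bins wide, so the per-band vectors have 4, 4, and 8 real features. A real model would map each of them to the same dimension before the Transformer stack, which then alternates between the two sequence views.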
The authors employ Rotary Position Embedding (RoPE) to facilitate Transformer training for the MSS task. The full BS-RoFormer system, trained on MUSDB18HQ and 500 extra songs, won first place in the MSS track of the Sound Demixing Challenge 2023 (SDX23). A smaller version of BS-RoFormer achieves state-of-the-art results on MUSDB18HQ without any extra training data, reaching an average SDR of 9.80 dB. This suggests that the result stems from the architecture itself, band-split processing combined with RoPE-based Transformers, rather than from additional data.
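As a rough illustration of what RoPE does, the sketch below rotates consecutive feature pairs of a query or key vector by an angle proportional to its position. It follows the commonly used RoPE formulation with base 10000; whether BS-RoFormer uses exactly these frequencies is an assumption, and the function names `rope` and `dot` are hypothetical. The point it shows is that the attention score between a rotated query and a rotated key depends only on their relative offset.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    if len(vec) % 2:
        raise ValueError("feature dimension must be even")
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Pair i/2 uses frequency base ** (-i / dim); base=10000 is an assumption.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    """Plain dot product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_a = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2, offset 3
score_b = dot(rope(q, 105), rope(k, 102))  # same offset, shifted by 100
```

Here `score_a` and `score_b` agree up to floating-point error, because shifting both positions by the same amount leaves the relative rotation unchanged. The rotation also preserves vector norms.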
Key Claims
- Band-split module enables subband-level processing, reducing computation while preserving frequency resolution
- Hierarchical Transformers model both inner-band and inter-band dependencies
- Rotary Position Embedding (RoPE) improves Transformer training for MSS
- Won first place in SDX23 MSS track (full system with extra data)
- Smaller BS-RoFormer achieves SOTA 9.80 dB SDR on MUSDB18HQ without extra training data
- Frequency-domain approach surpasses waveform-domain models such as Demucs in SDR
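For readers unfamiliar with the metric behind the SDR claims above, the following sketch computes a signal-to-distortion ratio in its simplest energy-ratio form, 10·log10(‖s‖² / ‖s − ŝ‖²). This is an assumption about the basic definition only: benchmark tooling may compute and aggregate SDR differently (for example per chunk or per song), so values from this sketch are not directly comparable to the 9.80 dB figure. The function name `sdr_db` and the `eps` guard are illustrative.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Energy-ratio SDR in dB between a reference signal and an estimate."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps avoids division by zero or log of zero for silent or perfect inputs.
    return 10.0 * math.log10((signal + eps) / (noise + eps))


# Toy example: the estimate is the reference with a 10% amplitude error.
reference = [math.sin(0.1 * n) for n in range(1000)]
estimate = [0.9 * r for r in reference]
value = sdr_db(reference, estimate)  # about 20 dB
```

A 10% amplitude error leaves an error energy of 1% of the signal energy, which corresponds to 20 dB. Each additional 10 dB means the error energy is ten times smaller.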
Related
- ../concepts/spectrogram-unets — BS-RoFormer is a frequency-domain alternative to U-Net based mask inference
- ../concepts/permutation-invariant-training — Training paradigm relevant to multi-source separation
- ../entities/bs-roformer — The model introduced by this paper
- ../entities/musdb18 — Primary benchmark (MUSDB18HQ used for evaluation)
- ../entities/demucs — Contrasting waveform-domain approach; BS-RoFormer achieves higher SDR