Music Source Separation with Band-Split RoPE Transformer

Summary

This paper proposes BS-RoFormer, a frequency-domain approach to music source separation (MSS) based on a Band-Split RoPE Transformer. Unlike waveform-domain models such as Demucs, BS-RoFormer operates on complex spectrograms. Its key component is a band-split module that projects the input spectrogram into subband-level representations. A stack of hierarchical Transformers then models both the inner-band and inter-band sequences to estimate multi-band masks.
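The band-split step can be illustrated with a small sketch. It is a toy version in plain Python, not the authors' implementation: the band widths, feature size, random weights, and helper names (`make_band_edges`, `linear`, `band_split`) are all assumptions made for illustration. It also assumes that "inner-band" refers to the time sequence within one band and "inter-band" to the sequence across bands within one frame.

```python
import random

def make_band_edges(num_bins, band_widths):
    """Turn a list of band widths (hypothetical values) into (start, end) bin ranges."""
    edges, start = [], 0
    for width in band_widths:
        edges.append((start, start + width))
        start += width
    if start != num_bins:
        raise ValueError("band widths must cover every frequency bin exactly once")
    return edges

def linear(vec, weight):
    """Apply a weight matrix stored as a list of rows (one row per output feature)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def band_split(spec, edges, weights):
    """Map a complex spectrogram [T][F] to per-band features [T][K][D]."""
    features = []
    for frame in spec:
        bands = []
        for (lo, hi), weight in zip(edges, weights):
            # Flatten the real and imaginary parts of this band's bins.
            flat = []
            for value in frame[lo:hi]:
                flat.extend((value.real, value.imag))
            # Each band has its own projection to a shared feature size D.
            bands.append(linear(flat, weight))
        features.append(bands)
    return features

random.seed(0)
T, F, D = 4, 8, 3                      # frames, frequency bins, feature size (toy values)
edges = make_band_edges(F, [2, 2, 4])  # narrower bands at low frequencies (assumed layout)
weights = [
    [[random.gauss(0.0, 1.0) for _ in range(2 * (hi - lo))] for _ in range(D)]
    for lo, hi in edges
]
spec = [[complex(random.random(), random.random()) for _ in range(F)] for _ in range(T)]

features = band_split(spec, edges, weights)

# Two views of the same tensor that the hierarchical Transformers would attend over:
inner_band_seq = [features[t][0] for t in range(T)]  # band 0 followed across time
inter_band_seq = features[0]                         # frame 0 followed across bands
```

Because every band is projected to the same feature size, bands of different widths can be stacked into one `[T][K][D]` array, and attention can then alternate between the time axis and the band axis.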

The authors use Rotary Position Embedding (RoPE) to facilitate Transformer training for the MSS task. The full BS-RoFormer system, trained on MUSDB18HQ plus 500 extra songs, won first place in the Music Separation track of the Sound Demixing Challenge (SDX23). Notably, a smaller version of BS-RoFormer achieves state-of-the-art results on MUSDB18HQ without any extra training data, reaching an average SDR of 9.80 dB. This result suggests that band-split processing combined with RoPE-based Transformers is effective even without additional training data.
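For readers unfamiliar with RoPE, the following sketch shows the general mechanism in plain Python: each consecutive pair of features in a query or key vector is rotated by an angle proportional to the token position. The base of 10000 follows the original RoPE formulation and is an assumption here, since this summary does not say which settings BS-RoFormer uses. The function names are illustrative.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * base**(-i/d)."""
    d = len(vec)
    if d % 2:
        raise ValueError("feature dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend((x * c - y * s, x * s + y * c))
    return out

def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Both pairs of positions are 2 steps apart, so the scores should agree.
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 10), rope(k, 8))
```

The two scores match up to floating-point error because the rotations cancel down to the position difference. Attention scores therefore depend on relative rather than absolute position, which is the property RoPE is designed to provide.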

Key Claims

  • Band-split module enables subband-level processing, reducing computation while preserving frequency resolution
  • Hierarchical Transformers model both inner-band and inter-band dependencies
  • Rotary Position Embedding (RoPE) improves Transformer training for MSS
  • Won first place in SDX23 MSS track (full system with extra data)
  • Smaller BS-RoFormer achieves SOTA 9.80 dB SDR on MUSDB18HQ without extra training data
  • Frequency-domain approach is competitive with, and on MUSDB18HQ surpasses, waveform-domain models
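The SDR figures quoted above are in decibels. As a rough guide to how to read them, the sketch below computes the basic energy-ratio form of signal-to-distortion ratio in plain Python. This is an assumption about the general definition only: this summary does not state the exact SDR variant or how per-song and per-source scores are aggregated, and evaluation toolkits apply their own framing and averaging, so the function will not reproduce the reported 9.80 dB protocol.

```python
import math

def sdr_db(reference, estimate, eps=1e-10):
    """Basic SDR in dB: 10 * log10(reference energy / error energy)."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps guards against division by zero for silent or perfect inputs.
    return 10.0 * math.log10((signal + eps) / (error + eps))

reference = [1.0, -1.0, 0.5, -0.5]
estimate = [0.9 * r for r in reference]  # every sample is 10% too quiet

value = sdr_db(reference, estimate)      # error energy is about 1% of signal energy
```

In this toy case the error carries about one hundredth of the reference energy, so the result is close to 20 dB. Each additional 10 dB corresponds to a tenfold reduction in error energy.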