Separate Anything You Describe

Summary

AudioSep introduces a foundation model for open-domain audio source separation using natural language queries: you describe the sound to extract, and the model separates it from the mixture. This "Separate Anything You Describe" paradigm moves beyond traditional source separation systems, which are limited to fixed source classes (e.g., "vocals," "drums," "bass," "other"). AudioSep uses CLAP (Contrastive Language-Audio Pretraining) to encode the text query into an embedding, which conditions a ResUNet-based separation network operating in the frequency domain. Trained on large-scale multimodal datasets spanning audio events, musical instruments, and speech, AudioSep demonstrates strong zero-shot generalization: given only a descriptive text query, it can separate sounds it was never explicitly trained on.
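To make the conditioning idea concrete, here is a minimal sketch of how a text embedding could steer a frequency-domain mask. It is not the AudioSep implementation: the ResUNet is replaced by a single per-bin scale-and-shift (FiLM-style) modulation, which is an assumption, since this summary does not say how the embedding is injected. The function names, the hand-picked weights, and the 2-dimensional "query embeddings" are all hypothetical stand-ins for learned parameters and real CLAP outputs.

```python
import math

def linear(vec, weights, bias):
    """Dense layer: `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conditioned_mask(mix_mag, text_emb, w_scale, b_scale, w_shift, b_shift):
    """Return a mask in (0, 1) for every time-frequency bin.

    mix_mag is a magnitude spectrogram: a list of frames, each a list of
    frequency bins. The text embedding is projected to one scale and one
    shift per frequency bin (assumed FiLM-style conditioning).
    """
    scale = linear(text_emb, w_scale, b_scale)
    shift = linear(text_emb, w_shift, b_shift)
    return [[sigmoid(s * m + t) for m, s, t in zip(frame, scale, shift)]
            for frame in mix_mag]

def separate(mix_mag, mask):
    """Apply the mask to the mixture magnitudes, bin by bin."""
    return [[m * k for m, k in zip(frame, mask_row)]
            for frame, mask_row in zip(mix_mag, mask)]

# Toy mixture: 2 frames x 3 frequency bins.
mix_mag = [[1.0, 1.0, 1.0], [0.5, 2.0, 0.5]]

# Hand-picked projection weights (a trained model would learn these).
w_scale = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
w_shift = [[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0]]
zeros = [0.0, 0.0, 0.0]

# Hypothetical embeddings for two different text queries.
emb_banjo = [1.0, 0.0]
emb_fiddle = [0.0, 1.0]

mask_banjo = conditioned_mask(mix_mag, emb_banjo, w_scale, zeros, w_shift, zeros)
mask_fiddle = conditioned_mask(mix_mag, emb_fiddle, w_scale, zeros, w_shift, zeros)
est_banjo = separate(mix_mag, mask_banjo)
```

The same mixture produces different masks for the two queries, which is the property the text-conditioned design relies on. In a full system the masked spectrogram would then be inverted back to a waveform.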

For bluegrass music, AudioSep opens up several practical uses. One could query "separate the banjo" or "separate the mandolin" on a bluegrass ensemble recording, even though the model was not specifically trained on bluegrass instruments. The language interface makes it accessible to non-technical users: a musician could type "extract the fiddle solo" and get that isolated track. Because the model is open-domain, it could potentially also separate crowd noise from live recordings, isolate individual instruments for practice, or extract vocals from old recordings. AudioSep substantially outperforms previous audio-queried and language-queried separation models, which made it the state of the art for query-based separation at the time of its release. The model is released as open source with pre-trained checkpoints, and its GitHub repository has 1,900 stars.

Key Claims

  • First foundation model for open-domain language-queried audio source separation (LASS)
  • Uses CLAP (Contrastive Language-Audio Pretraining) for text-audio embedding alignment
  • ResUNet-based separation network, conditioned on CLAP text embeddings, operating in the frequency domain
  • Strong zero-shot generalization to unseen sound categories via descriptive text queries
  • Trained on large-scale multimodal datasets combining audio events, music, and speech
  • Evaluated on audio event separation, musical instrument separation, and speech enhancement
  • Substantially outperforms previous audio-queried and language-queried separation models
  • Open source with pre-trained models and evaluation benchmark
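
The performance claims above are comparative, so it helps to see how a separation score can be computed. This summary does not say which metrics the evaluation used; the sketch below assumes signal-to-distortion ratio (SDR) and its improvement over the unprocessed mixture (SDRi), which are common choices in source separation work. The function names and the toy signals are illustrative only.

```python
import math

def sdr_db(reference, estimate, eps=1e-12):
    """Signal-to-distortion ratio in dB between a clean reference and an estimate."""
    signal = sum(r * r for r in reference)
    distortion = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal + eps) / (distortion + eps))

def sdr_improvement_db(reference, mixture, estimate):
    """SDRi: how much closer the estimate is to the reference than the raw mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)

# Toy example: a target source plus an interfering source.
reference = [1.0, -1.0, 1.0, -1.0]
interference = [0.5, 0.5, -0.5, -0.5]
mixture = [r + i for r, i in zip(reference, interference)]

# Pretend a separator removed 90% of the interference amplitude.
estimate = [r + 0.1 * i for r, i in zip(reference, interference)]

sdri = sdr_improvement_db(reference, mixture, estimate)
```

In this toy case the residual interference is ten times smaller in amplitude than in the mixture, so the improvement works out to roughly 20 dB. Scoring against the mixture rather than in absolute terms keeps easy and hard mixtures comparable.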