Separate Anything You Describe
Summary
AudioSep is a foundation model for open-domain audio source separation driven by natural language queries: you describe the sound to extract, and the model separates it from the mixture. This "Separate Anything You Describe" paradigm goes beyond traditional source separation systems, which are limited to a fixed set of source classes (e.g., "vocals," "drums," "bass," "other"). AudioSep uses CLAP (Contrastive Language-Audio Pretraining) to encode the text query into an embedding, and that embedding conditions a ResUNet-based separation network operating in the frequency domain. Trained on large-scale multimodal datasets spanning audio events, musical instruments, and speech, AudioSep shows strong zero-shot generalization: given only a descriptive text query, it can separate sounds it was never explicitly trained on.
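The conditioning idea in the paragraph above can be illustrated with a toy sketch. It assumes FiLM-style modulation (a per-channel scale and shift computed from the text embedding) followed by a sigmoid mask on the mixture magnitudes. This summary does not specify the conditioning mechanism, so treat that choice, along with the names `film` and `apply_mask` and all weights below, as hypothetical stand-ins rather than AudioSep's actual implementation.

```python
import math


def film(features, embedding, w_gamma, w_beta):
    """Feature-wise linear modulation of per-channel features.

    features:  list of channels, each a list of floats (time-frequency bins)
    embedding: text embedding vector, standing in for a CLAP output
    w_gamma, w_beta: one weight row per channel, mapping the embedding
                     to that channel's scale (gamma) and shift (beta)
    """
    out = []
    for channel, wg, wb in zip(features, w_gamma, w_beta):
        gamma = sum(w * e for w, e in zip(wg, embedding))
        beta = sum(w * e for w, e in zip(wb, embedding))
        out.append([gamma * x + beta for x in channel])
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def apply_mask(mixture_mag, logits):
    """Squash logits into a [0, 1] mask and scale each mixture bin by it."""
    return [m * sigmoid(z) for m, z in zip(mixture_mag, logits)]


# Toy data: a 2-dim "text embedding" modulating 2 channels over 3 bins.
embedding = [1.0, -0.5]               # stand-in for a CLAP text embedding
features = [[0.5, 1.0, 2.0],          # channel 0
            [1.0, 1.0, 1.0]]          # channel 1
w_gamma = [[1.0, 0.0], [0.0, 2.0]]    # hypothetical learned weights
w_beta = [[0.0, 0.0], [0.5, 0.0]]     # hypothetical learned weights

modulated = film(features, embedding, w_gamma, w_beta)
logits = [sum(column) for column in zip(*modulated)]  # collapse channels per bin
mixture_mag = [0.8, 0.3, 1.2]
separated_mag = apply_mask(mixture_mag, logits)
```

Changing `embedding` changes every channel's scale and shift, and therefore the mask. That is the sense in which a different text query extracts a different source from the same mixture.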
For bluegrass music applications, AudioSep opens up several possibilities. One could query "separate the banjo" or "separate the mandolin" from a bluegrass ensemble recording, even though the model was not trained specifically on bluegrass instruments. The language interface makes it accessible to non-technical users: a musician could type "extract the fiddle solo" and get that part as an isolated track. Because the model is open-domain, it could potentially also separate crowd noise from live recordings, isolate individual instruments for practice, or extract vocals from old recordings. AudioSep substantially outperforms previous audio-queried and language-queried separation models, which made it the state of the art for query-based separation at its release. The model is released as open source with pre-trained checkpoints, and its repository has 1,900 GitHub stars.
Key Claims
- First foundation model for open-domain language-queried audio source separation (LASS)
- Uses CLAP (Contrastive Language-Audio Pretraining) for text-audio embedding alignment
- ResUNet-based separation network conditioned on CLAP text embeddings, operating in the frequency domain
- Strong zero-shot generalization to unseen sound categories via descriptive text queries
- Trained on large-scale multimodal datasets combining audio events, music, and speech
- Evaluated on audio event separation, musical instrument separation, and speech enhancement
- Substantially outperforms previous audio-queried and language-queried separation models
- Open source with pre-trained models and evaluation benchmark
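To make the "outperforms" claim concrete, separation quality is commonly reported as a signal-to-distortion ratio (SDR) and its improvement over the unprocessed mixture (SDRi). This summary does not name the metric behind the claim, so the sketch below is an assumption about how such a comparison is typically scored, with hypothetical function names and toy signals.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Signal-to-distortion ratio in dB: reference energy over error energy."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal + eps) / (error + eps))


def sdr_improvement_db(reference, mixture, estimate):
    """SDRi: how much closer the estimate is to the reference than the mixture was."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy signals: the "separator" removes 90% of the interference amplitude.
reference = [1.0, -1.0, 1.0, -1.0]
interference = [0.5, 0.5, -0.5, -0.5]
mixture = [r + i for r, i in zip(reference, interference)]
estimate = [r + 0.1 * i for r, i in zip(reference, interference)]

improvement = sdr_improvement_db(reference, mixture, estimate)
```

Scoring the improvement rather than the raw SDR keeps results comparable across mixtures of different difficulty, since an easy mixture already has a high SDR before any separation is applied.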
Related
- ../concepts/query-based-source-separation — AudioSep is the canonical example of language-queried source separation
- ../concepts/spectrogram-unets — the ResUNet backbone architecture operating on spectrograms
- ../concepts/self-supervised-audio-representation — CLAP embeddings provide the self-supervised audio-text alignment
- ../concepts/synthetic-mixing-pipelines — training uses synthetic mixtures from isolated source datasets
- ../entities/audiosep — the model and codebase
- ../entities/clap — the Contrastive Language-Audio Pretraining model used as the query encoder
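The synthetic-mixing item in the list above can be sketched as follows: take an isolated target clip and an interfering clip, scale the interference to a chosen signal-to-noise ratio, sum them, and keep the target's text label as the query. This is a minimal illustration under those assumptions. The function names, the 6 dB setting, the sine-wave "clips", and the caption are all hypothetical and do not describe AudioSep's actual data pipeline.

```python
import math


def rms(signal):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))


def mix_at_snr(target, interference, snr_db):
    """Scale the interference so target/interference RMS matches snr_db, then sum."""
    interference_rms = rms(interference)
    if interference_rms == 0.0:
        raise ValueError("interference is silent; cannot set an SNR")
    gain = rms(target) / (interference_rms * 10.0 ** (snr_db / 20.0))
    scaled = [gain * v for v in interference]
    mixture = [t + s for t, s in zip(target, scaled)]
    return mixture, scaled


# Two toy "clips": sine waves standing in for isolated recordings.
n = 100
target = [math.sin(2.0 * math.pi * 5.0 * k / n) for k in range(n)]
interference = [0.3 * math.sin(2.0 * math.pi * 13.0 * k / n) for k in range(n)]

mixture, scaled = mix_at_snr(target, interference, snr_db=6.0)
achieved_snr = 20.0 * math.log10(rms(target) / rms(scaled))

# One training example: the network sees (mixture, query) and regresses to target.
example = {"mixture": mixture, "query": "banjo picking", "target": target}
```

Because the target is known exactly, every mixture comes with a ground-truth answer for free, which is what makes training on synthetic mixtures practical at scale.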