Separate Anything You Describe

Summary

AudioSep is a foundation model for open-domain audio source separation driven by natural language queries: you describe the sound to extract, and the model separates it from the mixture. This "Separate Anything You Describe" paradigm goes beyond traditional source separation systems, which are limited to fixed source classes (e.g., "vocals," "drums," "bass," "other"). AudioSep uses CLAP (Contrastive Language-Audio Pretraining) to encode text queries into embeddings that condition a ResUNet-based separation network operating in the frequency domain. Trained on large-scale multimodal datasets spanning audio events, musical instruments, and speech, AudioSep demonstrates strong zero-shot generalization: given only a descriptive text query, it can separate sounds it was never explicitly trained on.
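
The conditioning path described above (text query, query embedding, conditioned separation network, frequency-domain output) can be illustrated with a toy sketch. It assumes feature-wise linear modulation (FiLM) as the conditioning mechanism and a sigmoid magnitude mask as the output; this note states neither detail, and every function name and weight below is hypothetical. The query embedding is a hand-written stand-in vector, not a real CLAP output.

```python
import math


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features, embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation (assumed conditioning mechanism).

    features: list of channels, each a list of values.
    The embedding is projected to one scale (gamma) and one shift (beta)
    per channel, then applied as gamma * value + beta.
    """
    gamma = linear(w_gamma, b_gamma, embedding)
    beta = linear(w_beta, b_beta, embedding)
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]


def apply_mask(mixture_magnitude, mask_logits):
    """Squash logits to (0, 1) and scale each time-frequency bin."""
    return [[m / (1.0 + math.exp(-z)) for m, z in zip(mag_row, logit_row)]
            for mag_row, logit_row in zip(mixture_magnitude, mask_logits)]


# Hypothetical stand-in for a text embedding of a query such as "banjo".
query_embedding = [1.0, 0.0]

# Two feature channels with two frequency bins each (toy values).
features = [[1.0, 2.0], [3.0, 4.0]]

modulated = film(
    features, query_embedding,
    w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 0.0],
    w_beta=[[0.0, 0.0], [0.0, 0.0]], b_beta=[1.0, -1.0],
)

# Treat the modulated features as mask logits over the mixture spectrogram.
mixture_magnitude = [[0.8, 0.6], [0.4, 0.2]]
separated_magnitude = apply_mask(mixture_magnitude, modulated)
```

Changing the query embedding changes the per-channel scale and shift, and therefore which bins the mask keeps. In this sketch that is the only place where the text query influences the output.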

For bluegrass music applications, AudioSep suggests several practical uses. One could query "separate the banjo" or "separate the mandolin" from a bluegrass ensemble recording, even though the model was not specifically trained on bluegrass instruments. The language interface makes it accessible to non-technical users: a musician could type "extract the fiddle solo" to request that isolated track. The open-domain design means it could potentially separate crowd noise from live recordings, isolate individual instruments for practice, or extract vocals from old recordings. AudioSep substantially outperforms previous audio-queried and language-queried separation models, making it the state of the art for query-based separation at its release. The model (1,900 GitHub stars) is released as open source with pre-trained checkpoints.

Key Claims

First foundation model for open-domain language-queried audio source separation (LASS)
Uses CLAP (Contrastive Language-Audio Pretraining) for text-audio embedding alignment
ResUNet-based separation network, conditioned on CLAP text embeddings, operating in the frequency domain
Strong zero-shot generalization to unseen sound categories via descriptive text queries
Trained on large-scale multimodal datasets combining audio events, music, and speech
Evaluated on audio event separation, musical instrument separation, and speech enhancement
Substantially outperforms previous audio-queried and language-queried separation models
Open source with pre-trained models and evaluation benchmark
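
To make "outperforms" concrete, separation quality is commonly reported as an improvement in signal-to-distortion ratio (SDR) over the unprocessed mixture. This note does not name the metric used, so the sketch below is only an assumed illustration of how such a comparison is typically computed; the function names and toy signals are hypothetical.

```python
import math


def energy(signal):
    """Sum of squared samples."""
    return sum(x * x for x in signal)


def sdr_db(reference, estimate, eps=1e-12):
    """Signal-to-distortion ratio in dB: reference energy over error energy."""
    error = [r - e for r, e in zip(reference, estimate)]
    return 10.0 * math.log10((energy(reference) + eps) / (energy(error) + eps))


def sdr_improvement_db(reference, estimate, mixture):
    """How much closer the estimate is to the reference than the raw mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy signals: the mixture is the target plus a constant interferer.
target = [1.0, 0.0, -1.0, 0.0]
mixture = [1.5, 0.5, -0.5, 0.5]
estimate = [0.9, 0.0, -0.9, 0.0]

improvement = sdr_improvement_db(target, estimate, mixture)
```

A positive improvement means the separated output is closer to the isolated target than the mixture was; comparing models then amounts to comparing this number on the same test mixtures.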

Related

../concepts/query-based-source-separation: AudioSep is the canonical example of language-queried source separation
../concepts/spectrogram-unets: the ResUNet backbone architecture operating on spectrograms
../concepts/self-supervised-audio-representation: CLAP embeddings provide the self-supervised audio-text alignment
../concepts/synthetic-mixing-pipelines: training uses synthetic mixtures from isolated source datasets
../entities/audiosep: the model and codebase
../entities/clap: the Contrastive Language-Audio Pretraining model used as the query encoder
