AudioSep

Summary

Text-conditioned source separation system that accepts natural-language queries (e.g., "separate the banjo") and isolates the described sound source from a mixture. CLAP (Contrastive Language-Audio Pretraining) maps text and audio into a shared embedding space; a ResUNet then performs the separation, conditioned on the embedding of the text query. This enables open-vocabulary separation rather than a fixed set of stem categories. Posted to arXiv in August 2023 (1.9k GitHub stars).
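The conditioning step can be illustrated with a toy sketch. It assumes FiLM-style modulation (a per-channel scale and shift derived from the query embedding), which is one common way to inject a conditioning vector; this page does not state how the embedding enters the ResUNet. All names (`separate`, `PARAMS`, `QUERY_A`, `QUERY_B`) and all numbers are hypothetical, and the "spectrogram" is a two-channel list rather than real audio.

```python
import math


def linear(vec, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(features, gamma, beta):
    """FiLM-style modulation: scale and shift each channel (row)."""
    return [[gamma[c] * x + beta[c] for x in row]
            for c, row in enumerate(features)]


def separate(mixture, query_emb, params):
    """Mask a toy magnitude spectrogram, conditioned on a query embedding.

    mixture:   list of channels, each a list of non-negative magnitudes
    query_emb: stand-in for the text embedding of the query
    params:    hypothetical weights mapping the embedding to gamma/beta
    """
    gamma = linear(query_emb, params["w_gamma"], params["b_gamma"])
    beta = linear(query_emb, params["w_beta"], params["b_beta"])
    modulated = film(mixture, gamma, beta)
    # Sigmoid keeps every mask value strictly between 0 and 1.
    mask = [[1.0 / (1.0 + math.exp(-x)) for x in row] for row in modulated]
    return [[m * x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mask, mixture)]


# Two channels, three time frames; all values are made up.
MIXTURE = [[1.0, 2.0, 0.5], [0.8, 0.1, 1.5]]
PARAMS = {
    "w_gamma": [[2.0, -2.0], [-2.0, 2.0]],
    "b_gamma": [0.0, 0.0],
    "w_beta": [[1.0, -1.0], [-1.0, 1.0]],
    "b_beta": [0.0, 0.0],
}
QUERY_A = [1.0, 0.0]  # stands in for the embedding of one text query
QUERY_B = [0.0, 1.0]  # stands in for the embedding of a different query

out_a = separate(MIXTURE, QUERY_A, PARAMS)
out_b = separate(MIXTURE, QUERY_B, PARAMS)
```

With these hand-picked weights, `QUERY_A` keeps most of channel 0 and suppresses channel 1, while `QUERY_B` does the opposite: the same mixture and the same network weights yield different outputs depending only on the query vector.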

Key Claims

Text-conditional separation works for open-vocabulary queries
CLAP joint embedding provides effective conditioning signal for separation
Single model handles diverse sound types (music, environmental, speech)
Query-based separation is more flexible than fixed-stem approaches
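The joint-embedding claim can be made concrete with a small sketch: once text and audio live in the same vector space, a text query can be compared with an audio clip by cosine similarity. The three-dimensional vectors below are invented for illustration and are not real CLAP outputs; `rank_queries`, `TEXT_EMBS`, and `AUDIO_EMB` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_queries(audio_emb, text_embs):
    """Return query strings sorted from most to least similar to the audio."""
    return sorted(text_embs,
                  key=lambda q: cosine(audio_emb, text_embs[q]),
                  reverse=True)


# Made-up embeddings standing in for encoder outputs.
TEXT_EMBS = {
    "banjo": [0.9, 0.1, 0.0],
    "fiddle": [0.1, 0.9, 0.1],
    "speech": [0.0, 0.1, 0.9],
}
AUDIO_EMB = [0.8, 0.3, 0.1]  # stand-in for the embedding of a banjo-heavy clip

ranking = rank_queries(AUDIO_EMB, TEXT_EMBS)
```

The same comparison applied between two text embeddings gives a rough indication of how distinguishable two queries are to the model before any audio is involved.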

Relevance to Bluegrass

Most intriguing tool for bluegrass separation. In theory, querying "banjo," "mandolin," or "fiddle" could separate those instruments without needing instrument-specific models. However: (1) CLAP was trained on general audio, not bluegrass vocabulary; (2) separation quality for fine-grained instrument distinction on acoustic folk music is untested; (3) the model may conflate similar-timbre string instruments.

Worth testing on bluegrass recordings — if it works even moderately well, it bypasses the need for custom-trained bluegrass separation models.
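One way to make "moderately well" measurable is scale-invariant SDR (SI-SDR) against reference stems. The sketch below assumes multitrack recordings with isolated stems are available, and that the separator can be wrapped as a function `separate_fn(mixture, query)`; that wrapper is hypothetical and is not AudioSep's actual interface. The demo uses synthetic sine waves and a pass-through "separator" that returns the mixture unchanged, which gives the no-separation baseline any real model should beat.

```python
import math


def si_sdr(reference, estimate, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    # Project the estimate onto the reference to find the target component.
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def evaluate(separate_fn, mixture, stems):
    """Query the separator once per stem name and score each result."""
    return {name: si_sdr(ref, separate_fn(mixture, name))
            for name, ref in stems.items()}


# Synthetic stand-ins for two stems; real tests would load audio files.
N = 800
banjo = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]
fiddle = [0.5 * math.sin(2 * math.pi * 13 * n / N) for n in range(N)]
mixture = [b + f for b, f in zip(banjo, fiddle)]


def passthrough(mix, query):
    """Baseline 'separator' that ignores the query and returns the mixture."""
    return mix


baseline = evaluate(passthrough, mixture, {"banjo": banjo, "fiddle": fiddle})
```

Comparing a model's per-instrument scores with `baseline` shows whether a query such as "banjo" actually improves on the untouched mixture, and comparing scores across banjo, mandolin, and fiddle queries would expose conflation of similar-timbre instruments.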

GitHub: Audio-AGI/AudioSep (1.9k stars). LGPL license. Pretrained checkpoints available.

Related

../entities/audiosep: entity page
../entities/clap: underlying audio-language model
../concepts/query-based-source-separation: concept page
../entities/demucs: alternative (fixed-stem separation)
