AudioSep
Summary
Text-conditional source separation system that accepts natural-language queries (e.g., "separate the banjo") and isolates the described sound source from a mixture. Uses CLAP (Contrastive Language-Audio Pretraining), which maps text and audio into a shared embedding space, to encode the query; a ResUNet then performs the separation, conditioned on that embedding. Enables open-vocabulary separation rather than a fixed set of stem categories. Posted to arXiv in August 2023.
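To make the conditioning idea concrete, here is a toy sketch of how a query embedding could steer a separation mask. This note does not say how the embedding enters the ResUNet, so the sketch assumes feature-wise scale-and-shift modulation followed by a soft mask. The names `film`, `separate`, `gamma`, and `beta` are hypothetical and are not taken from the AudioSep codebase; in a real model `gamma` and `beta` would be learned projections of the text embedding.

```python
import math


def film(features, gamma, beta):
    """Scale and shift each feature channel by query-derived values (assumed mechanism)."""
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def separate(mixture_mag, features, gamma, beta):
    """Apply a query-conditioned soft mask to flattened mixture magnitudes.

    mixture_mag: one magnitude per time-frequency bin
    features:    list of channels, each holding one value per bin
    gamma, beta: one scale and one shift per channel (hypothetical query projections)
    """
    modulated = film(features, gamma, beta)
    # Collapse channels to one logit per bin, then squash to a mask in (0, 1).
    logits = [sum(channel[i] for channel in modulated)
              for i in range(len(mixture_mag))]
    mask = [sigmoid(z) for z in logits]
    return [m * x for m, x in zip(mask, mixture_mag)]
```

The same mixture and features yield different outputs for different `gamma`/`beta` pairs, which is the sense in which a single model can serve many queries.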
Key Claims
- Text-conditional separation works for open-vocabulary queries
- CLAP joint embedding provides effective conditioning signal for separation
- Single model handles diverse sound types (music, environmental, speech)
- Query-based separation is more flexible than fixed-stem approaches
Relevance to Bluegrass
Most intriguing tool for bluegrass separation. In theory, querying "banjo," "mandolin," or "fiddle" could separate those instruments without needing instrument-specific models. However: (1) CLAP was trained on general audio, not bluegrass vocabulary; (2) separation quality for fine-grained instrument distinction on acoustic folk music is untested; (3) the model may conflate similar-timbre string instruments.
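One cheap way to probe concern (3) before running any separation is to compare the text embeddings of the candidate queries: if "mandolin" and "banjo" sit very close together, the conditioning signal may not distinguish them well. The sketch below only ranks pairs by cosine similarity. It assumes the embeddings have already been obtained from a CLAP text encoder by some other means; `most_confusable` is a hypothetical helper name, and high embedding similarity is a hint, not proof, of confusion at separation time.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def most_confusable(embeddings):
    """Return (similarity, query_a, query_b) tuples, most similar pair first.

    embeddings: dict mapping a query string to its embedding vector
                (assumed to come from a CLAP text encoder).
    """
    names = sorted(embeddings)
    pairs = [
        (cosine_similarity(embeddings[a], embeddings[b]), a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
    ]
    return sorted(pairs, reverse=True)
```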
Worth testing on bluegrass recordings: if it works even moderately well, it would bypass the need for custom-trained bluegrass separation models.
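"Moderately well" needs a number to be testable. A common choice for separation quality is scale-invariant signal-to-distortion ratio (SI-SDR), which compares a separated track against a ground-truth isolated stem. The sketch below is a plain-Python version for illustration; it assumes multitrack bluegrass recordings with isolated stems are available as references, which this note does not confirm, and that both signals are time-aligned lists of samples.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between an estimated and a reference signal.

    Higher is better. The estimate is first projected onto the reference so
    that a pure gain difference does not count as distortion.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")

    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / (ref_energy + eps)

    # Part of the estimate explained by the reference, and the residual.
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))
```

Scoring the unprocessed mixture against the same reference gives a baseline; the gain over that baseline shows whether a query such as "banjo" actually isolates the instrument.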
GitHub: Audio-AGI/AudioSep (1.9k stars). LGPL license. Pretrained checkpoints available.
Related
- ../entities/audiosep — entity page
- ../entities/clap — underlying audio-language model
- ../concepts/query-based-source-separation — concept page
- ../entities/demucs — alternative: fixed-stem separation