AudioSep

Summary

Text-conditioned source separation system that accepts natural-language queries (e.g., "separate the banjo") and isolates the described sound source from a mixture. A CLAP (Contrastive Language-Audio Pretraining) encoder maps the text query into an embedding space shared with audio, and that embedding conditions a ResUNet separation network that extracts the target from the mixture. This enables open-vocabulary separation rather than a fixed set of stem categories. Posted to arXiv in Aug 2023 (1.9k GitHub stars).
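The conditioning step can be illustrated with a toy NumPy sketch. It assumes feature-wise linear modulation (FiLM), where the query embedding is projected to a per-channel scale and shift that are applied to the separator's feature maps. The summary above does not say which conditioning mechanism is used, so the mechanism, the shapes, and every name below are illustrative assumptions, with random weights standing in for trained ones.

```python
import numpy as np

def film_condition(features, query_embedding, w_scale, w_shift):
    """Modulate feature maps with a query embedding (FiLM-style, assumed).

    features:        (channels, freq, time) feature maps inside the separator
    query_embedding: (embed_dim,) vector, e.g. a text embedding of "banjo"
    w_scale/w_shift: (channels, embed_dim) projection matrices (hypothetical)
    """
    gamma = w_scale @ query_embedding  # per-channel scale, shape (channels,)
    beta = w_shift @ query_embedding   # per-channel shift, shape (channels,)
    # Broadcast the per-channel terms over the freq and time axes.
    return gamma[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(0)
channels, freq, time, embed_dim = 4, 8, 16, 32
features = rng.standard_normal((channels, freq, time))
query = rng.standard_normal(embed_dim)
query /= np.linalg.norm(query)  # unit-normalize the toy query embedding
w_scale = rng.standard_normal((channels, embed_dim))  # stand-in for trained weights
w_shift = rng.standard_normal((channels, embed_dim))  # stand-in for trained weights

conditioned = film_condition(features, query, w_scale, w_shift)
print(conditioned.shape)  # (4, 8, 16)
```

The point of the sketch is that the same separator weights produce different outputs for different queries, which is what allows one model to serve arbitrary text descriptions.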

Key Claims

  • Text-conditional separation works for open-vocabulary queries
  • CLAP joint embedding provides effective conditioning signal for separation
  • Single model handles diverse sound types (music, environmental, speech)
  • Query-based separation is more flexible than fixed-stem approaches

Relevance to Bluegrass

The most intriguing tool for bluegrass separation. In theory, querying "banjo," "mandolin," or "fiddle" could separate those instruments without needing instrument-specific models. However: (1) CLAP was trained on general audio, so bluegrass-specific vocabulary may be poorly represented; (2) separation quality for fine-grained instrument distinctions in acoustic folk music is untested; (3) the model may conflate string instruments with similar timbres.

Worth testing on bluegrass recordings: if it works even moderately well, it would remove the need for custom-trained bluegrass separation models.
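One way to make "moderately well" measurable is to score each separated stem against an isolated reference track. The sketch below uses scale-invariant SDR (SI-SDR), a common separation metric chosen here as an assumption rather than something this note prescribes. The sine tones are synthetic stand-ins for real stems, and the names `banjo` and `mandolin` are purely illustrative.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

# Synthetic stand-ins for real stems: two tones at different pitches.
sr = 16000
t = np.arange(sr) / sr
banjo = np.sin(2 * np.pi * 392.0 * t)     # hypothetical "banjo" reference stem
mandolin = np.sin(2 * np.pi * 587.3 * t)  # hypothetical "mandolin" reference stem
mixture = banjo + mandolin

baseline = si_sdr(mixture, banjo)             # score with no separation at all
good = si_sdr(banjo + 0.1 * mandolin, banjo)  # estimate with most interference removed
print(f"mixture: {baseline:.1f} dB, estimate: {good:.1f} dB")
```

With real data, `banjo` would be an isolated reference recording and the estimate would be the separator's output for the query "banjo". Comparing that score with the unprocessed mixture's score shows how much the separation actually helped, and running the same comparison for "mandolin" and "fiddle" would expose any confusion between similar-timbre instruments.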

GitHub: Audio-AGI/AudioSep (1.9k stars). LGPL license. Pretrained checkpoints available.