AudioSep

About

AudioSep is a text-conditioned source separation model from the University of Surrey and collaborators. It uses CLAP (Contrastive Language-Audio Pretraining) text embeddings to separate sources described by natural-language queries: a user types "banjo" or "separate the fiddle" and the model attempts to isolate that source. The separator is a ResUNet conditioned on the CLAP text embedding. Pretrained checkpoints are available.
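To make "conditioned on a text embedding" concrete, here is a minimal sketch of feature-wise linear modulation (FiLM), one common way to inject a query embedding into a separation network. It is an assumption that this matches AudioSep's actual layers; the note above only says the ResUNet is conditioned on CLAP text embeddings. The names `linear` and `film`, the toy weights, and the 2-dimensional stand-in for a CLAP embedding are all hypothetical.

```python
def linear(vec, weights, bias):
    """Dense layer: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(feature_maps, text_embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each channel using values predicted from the text embedding."""
    gamma = linear(text_embedding, w_gamma, b_gamma)  # per-channel scale
    beta = linear(text_embedding, w_beta, b_beta)     # per-channel shift
    return [[g * x + b for x in channel]
            for channel, g, b in zip(feature_maps, gamma, beta)]


# Toy example: 2 feature channels, 3 time frames, and a 2-dimensional
# stand-in for the CLAP text embedding of a query (hypothetical values).
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
query_embedding = [1.0, 0.0]
w_gamma = [[2.0, 0.0], [0.0, 1.0]]
b_gamma = [0.0, 0.0]
w_beta = [[0.0, 0.0], [1.0, 0.0]]
b_beta = [0.0, 0.0]

modulated = film(features, query_embedding, w_gamma, b_gamma, w_beta, b_beta)
print(modulated)
```

In this toy setup the query doubles the first channel and flattens the second to a constant, which illustrates how a different query could steer the same network toward a different source.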

Relevance

The most novel approach to bluegrass stem separation. Its open-vocabulary queries mean it could potentially separate "banjo" from a mix without banjo-specific training. However, CLAP was trained on general audio (AudioSet, not bluegrass vocabulary), so performance on acoustic string-band music is unknown. Worth testing: does "banjo" as a query actually isolate the banjo in a bluegrass recording?
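One way to turn the "worth testing" question into a number is to score the separated output against a known-clean banjo stem with scale-invariant SDR (SI-SDR). The sketch below assumes such a reference stem exists (for example from a multitrack session) and that both signals are equal-length lists of samples at the same sample rate. The function name `si_sdr` and the synthetic sine-wave "banjo" and "fiddle" are illustrative stand-ins, not real recordings or AudioSep output.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    # Remove any DC offset so the score ignores constant bias.
    mean_e = sum(estimate) / len(estimate)
    mean_r = sum(reference) / len(reference)
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]
    # Project the estimate onto the reference to find the "target" part.
    scale = sum(a * b for a, b in zip(e, r)) / (sum(b * b for b in r) + eps)
    target = [scale * b for b in r]
    noise = [a - t for a, t in zip(e, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10 * math.log10((target_energy + eps) / (noise_energy + eps))


# Synthetic stand-ins: one second at 8 kHz, two pure tones.
sr = 8000
banjo = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
fiddle = [math.sin(2 * math.pi * 660 * n / sr) for n in range(sr)]

mixture = [b + f for b, f in zip(banjo, fiddle)]
# Pretend separation result: banjo with 10% of the fiddle leaking through.
separated = [b + 0.1 * f for b, f in zip(banjo, fiddle)]

baseline_db = si_sdr(mixture, banjo)     # about 0 dB: equal-energy interference
separated_db = si_sdr(separated, banjo)  # about 20 dB: interference at 1/10 amplitude
print(f"mixture: {baseline_db:.2f} dB, separated: {separated_db:.2f} dB")
```

Comparing the separated score to the unprocessed-mixture baseline gives the improvement attributable to the query; a "banjo" query that yields little or no gain over the baseline would suggest the model is not isolating the instrument.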

Mentions

../sources/2023-08-10-audiosep — paper
../entities/clap — CLAP: underlying audio-text model
../concepts/query-based-source-separation — broader concept

Links

GitHub: https://github.com/Audio-AGI/AudioSep (1.9k stars)
License: MIT