Query-Based Source Separation

Definition

Source separation guided by a user-provided query (a text description such as "separate the violin", an audio example, or another conditioning signal) rather than by a fixed set of predefined stem categories.
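The difference is easiest to see at the interface level. The sketch below is only illustrative: the names `TextQuery`, `AudioQuery`, `separate_fixed_stems`, and `separate_by_query` are hypothetical, and the function bodies return silent placeholders rather than running any real model.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

Audio = List[float]  # toy stand-in for a mono waveform


@dataclass
class TextQuery:
    text: str  # e.g. "separate the violin"


@dataclass
class AudioQuery:
    example: Audio  # short recording of the target sound


Query = Union[TextQuery, AudioQuery]


def separate_fixed_stems(mixture: Audio) -> Dict[str, Audio]:
    """Fixed-stem interface: the output categories are decided at training time."""
    # Placeholder outputs (silence); a real model would estimate each stem.
    return {name: [0.0] * len(mixture) for name in ("vocals", "drums", "bass", "other")}


def separate_by_query(mixture: Audio, query: Query) -> Audio:
    """Query-based interface: one output, chosen by the query at inference time."""
    if not isinstance(query, (TextQuery, AudioQuery)):
        raise TypeError("unsupported query type")
    # Placeholder output (silence); a real model would condition on the query.
    return [0.0] * len(mixture)
```

The fixed-stem function can only ever return its four built-in keys, while the query-based function accepts any description or example the caller supplies.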

Key Ideas

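Contrasts with fixed-stem separation (e.g., Spleeter's 4-stem model: vocals/drums/bass/other), where the output categories are fixed at training time. Query-based separation is open-vocabulary: the target is specified at inference time.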
Text queries are enabled by joint audio-text embeddings such as CLAP: text and audio are encoded into a shared embedding space, so the query embedding is directly comparable to audio embeddings.
Key systems: AudioSep (text queries, encoded with CLAP) and SoundFilter (audio queries: a short example of the target sound conditions the separation network).
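As a rough illustration of why a shared embedding space matters, the toy sketch below matches a text query against audio items by cosine similarity. It assumes hand-written 3-dimensional vectors in place of real CLAP outputs, and the names `TEXT_EMBEDDINGS`, `AUDIO_EMBEDDINGS`, and `best_match` are hypothetical. It shows only the matching step; an actual separator would feed the query embedding into a separation network as a conditioning signal.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings in a shared text-audio space.
# These are hand-written for illustration, not outputs of a trained encoder.
TEXT_EMBEDDINGS: Dict[str, Vector] = {
    "violin": [0.9, 0.1, 0.0],
    "drums": [0.0, 0.2, 0.9],
}
AUDIO_EMBEDDINGS: Dict[str, Vector] = {
    "track_a": [0.8, 0.2, 0.1],   # imagined to sound like a violin
    "track_b": [0.1, 0.1, 0.95],  # imagined to sound like drums
}


def best_match(query: str) -> str:
    """Return the audio item whose embedding is closest to the text query."""
    q = TEXT_EMBEDDINGS[query]
    return max(
        AUDIO_EMBEDDINGS,
        key=lambda name: cosine_similarity(q, AUDIO_EMBEDDINGS[name]),
    )
```

Because both kinds of embeddings live in one space, no per-class output head is needed: adding a new query only requires encoding new text.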

Relationships

Builds on clap for text-audio alignment
Related to audiosef and [[soundfilter]]
Contrasts with ../concepts/informed-model-based-separation — query-based is open-vocabulary; informed separation uses structured side information (score, MIDI)

Sources

../sources/2023-08-10-audiosef — AudioSep: text-conditional separation using CLAP embeddings
../entities/audiosef — primary implementation