Mixture Invariant Training (MixIT)¶

Definition¶

Training paradigm for source separation that uses mixtures of mixtures (MoMs) as input instead of requiring isolated ground-truth sources. The model separates a MoM into a fixed number of estimated sources; these estimates are then remixed to approximate the known sub-mixtures, and the loss is computed against those sub-mixtures rather than against isolated sources.
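
The objective can be sketched as follows. The notation is chosen for this note and is an assumption, not taken from the text above: $x_1, x_2$ are the two reference mixtures, $\hat{s}$ stacks the $M$ estimated sources, $\mathbf{A}$ is a binary mixing matrix that assigns each estimate to exactly one mixture, and $\mathcal{L}$ is a per-mixture signal loss such as negative SNR.

```latex
\mathcal{L}_{\mathrm{MixIT}}(x_1, x_2, \hat{s})
  = \min_{\mathbf{A} \in \mathbb{B}^{2 \times M}}
    \sum_{i=1}^{2} \mathcal{L}\big(x_i, [\mathbf{A}\hat{s}]_i\big),
\qquad
\mathbb{B}^{2 \times M} = \big\{ \mathbf{A} \in \{0,1\}^{2 \times M} : \textstyle\sum_{i} A_{ij} = 1 \ \text{for all } j \big\}
```

The minimum over $\mathbf{A}$ is what makes the loss invariant to which estimate ends up in which mixture.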

Key Ideas¶

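Wisdom, Hershey et al. (Google). Sum two existing mixtures to form a MoM. The model separates the MoM into M estimated sources, and each estimate is assigned to one of the two original mixtures.
Loss: compare the remixed estimates to the original mixtures (not to isolated sources), taking the minimum over all assignments of estimates to mixtures.
Key advantage: does not require isolated stems for training. Any recordings of unlabeled mixtures can be used.
Works because the original mixtures are known exactly and serve as the supervision signal. The assignment of estimated sources to mixtures is not known in advance, so the loss searches over it; in this sense MixIT generalizes permutation invariant training rather than removing the permutation problem.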
Enables training on in-the-wild data at massive scale.
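
The ideas above can be illustrated with a minimal NumPy sketch. It is not the paper's or Asteroid's implementation: the function names `neg_snr` and `mixit_loss` are hypothetical, plain negative SNR stands in for whatever signal loss a real recipe uses, and the best assignment is found by brute-force enumeration of all 2^M options, which is only practical for small M.

```python
import itertools

import numpy as np


def neg_snr(reference, estimate, eps=1e-8):
    """Negative signal-to-noise ratio in dB (lower is better). Assumed loss."""
    noise = reference - estimate
    snr = 10.0 * np.log10((np.sum(reference**2) + eps) / (np.sum(noise**2) + eps))
    return -snr


def mixit_loss(mixtures, est_sources, loss_fn=neg_snr):
    """Brute-force MixIT loss.

    mixtures:    array of shape (n_mix, T), the known reference mixtures.
    est_sources: array of shape (n_src, T), the model outputs for the MoM.
    Returns (best_loss, best_assignment), where best_assignment[j] is the
    index of the mixture that estimated source j is assigned to.
    """
    n_mix, n_src = mixtures.shape[0], est_sources.shape[0]
    best_loss, best_assign = np.inf, None
    # Enumerate every way of assigning each estimate to exactly one mixture.
    for assign in itertools.product(range(n_mix), repeat=n_src):
        # Binary mixing matrix: column j is one-hot at row assign[j].
        mixing = np.zeros((n_mix, n_src))
        mixing[list(assign), np.arange(n_src)] = 1.0
        remixed = mixing @ est_sources  # shape (n_mix, T)
        loss = sum(loss_fn(mixtures[i], remixed[i]) for i in range(n_mix))
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign


# Toy check with a "perfect" separator: four random sources, two mixtures.
rng = np.random.default_rng(0)
sources = rng.standard_normal((4, 1000))
mixtures = np.stack([sources[0] + sources[2], sources[1] + sources[3]])
mom = mixtures.sum(axis=0)  # the mixture of mixtures a model would receive
best_loss, best_assign = mixit_loss(mixtures, sources)
print(best_assign, best_loss)
```

In this toy setup the true sources are passed in place of model outputs, so the search should recover the assignment that was used to build the two mixtures. In actual training the estimates would come from a separation network applied to `mom`, and the loss would be computed in a differentiable framework.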

Relationships¶

From john-hershey's group at Google
Related to ../concepts/permutation-invariant-training, ../concepts/deep-clustering-separation
Contrast with ../concepts/synthetic-mixing-pipelines — MixIT doesn't need isolated stems
Implemented in some ../entities/asteroid recipes

Sources¶

Wisdom, Hershey et al.: "Unsupervised Sound Separation Using Mixture Invariant Training" (NeurIPS 2020) — original MixIT paper
../entities/asteroid — includes MixIT training recipes