Self-Supervised Audio Representation Learning
Definition
Learning general-purpose audio representations without labeled data, using pretext tasks that derive supervision from the audio signal itself. Enables pretraining on massive unlabeled corpora, then fine-tuning on small labeled downstream tasks.
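As one illustration of a pretext task that derives its supervision from the signal itself, the sketch below hides a random subset of spectrogram patches and scores a reconstruction only on the hidden ones. It is a minimal pure-Python sketch, not any specific model's implementation: the function names, the `mask_ratio` default, and the representation of patches as flat lists of floats are assumptions made for illustration.

```python
import random


def make_masked_example(patches, mask_ratio=0.75, seed=0):
    """Split patches into visible inputs and masked reconstruction targets.

    `patches` is a list of equal-length lists of floats (hypothetical
    flattened spectrogram patches). Returns two dicts mapping the patch
    index to the patch: (visible, targets).
    """
    rng = random.Random(seed)
    n = len(patches)
    n_masked = int(round(n * mask_ratio))
    masked_idx = set(rng.sample(range(n), n_masked))
    visible = {i: p for i, p in enumerate(patches) if i not in masked_idx}
    targets = {i: p for i, p in enumerate(patches) if i in masked_idx}
    return visible, targets


def masked_mse(predictions, targets):
    """Mean squared error over the masked patches only.

    `predictions` maps patch index -> predicted patch and must contain
    every index present in `targets`.
    """
    total, count = 0.0, 0
    for i, target in targets.items():
        for p, t in zip(predictions[i], target):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0
```

No labels are involved: the targets are simply the parts of the input that were withheld, so any unlabeled recording can serve as a training example.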
Key Ideas
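- BYOL-A: Bootstrap Your Own Latent for Audio. Self-supervised learning without negative pairs (so not contrastive in the strict sense): an online network learns to predict a target network's representation of a differently augmented view of the same audio segment.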
- SSAST: Self-Supervised Audio Spectrogram Transformer. Joint discriminative (contrastive) and generative (masked reconstruction) objectives on spectrogram patches.
- MAE-AST: Masked Autoencoder AST. Masks a large fraction of the spectrogram patches and reconstructs the missing content. Follows the masked autoencoder (MAE) paradigm from computer vision.
- Key advantage: can outperform supervised pretraining when labeled data for the downstream task is scarce.
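The BYOL-style objective from the first bullet can be sketched in a few lines. In the BYOL framework the loss is a normalized mean squared error between the online network's prediction and the target network's projection (equal to 2 - 2 * cosine similarity), and the target network is updated as an exponential moving average of the online network instead of by gradient descent. The sketch below assumes plain Python lists in place of real network outputs and parameters; the function names and the `tau` default are illustrative assumptions, not BYOL-A's actual code.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def byol_loss(online_prediction, target_projection):
    """Normalized MSE between two vectors: 2 - 2 * cosine similarity.

    No negative pairs appear anywhere; the loss only pulls the online
    prediction toward the target's view of the same input.
    """
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return sum((a - b) ** 2 for a, b in zip(p, z))


def ema_update(target_params, online_params, tau=0.99):
    """Move target parameters toward the online ones (no gradient step)."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]
```

The loss is near 0 when the two vectors point in the same direction and reaches 4 when they point in opposite directions, regardless of their magnitudes. In the BYOL framework, the slowly moving target network is what keeps this objective from collapsing to a constant output despite the absence of negatives.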
Relationships
- Related to [[byol-a]], [[ssast]], [[mae-ast]]
- Connects to [[clap]], which adds language supervision on top of audio representations
Sources
None ingested yet — seed batch setup.