Self-Supervised Audio Representation Learning

Definition

Learning general-purpose audio representations without labeled data, using pretext tasks that derive the supervision signal from the audio itself. This enables pretraining on massive unlabeled corpora, followed by fine-tuning on small labeled datasets for downstream tasks.
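
As a rough illustration of a pretext task, the sketch below builds a masked-patch reconstruction problem from a spectrogram: the training target is the hidden part of the input, so no labels are involved. The function names, the 16x16 patch size, the 0.75 mask ratio, and the zero placeholder for masked patches are all assumptions chosen for illustration, not settings taken from any specific method.

```python
import numpy as np

def make_masked_patch_task(spec, patch=16, mask_ratio=0.75, seed=0):
    """Split a (freq, time) spectrogram into patches and hide a random subset.

    Returns (visible, targets, mask). A model would see `visible` and be
    trained to reconstruct `targets` at the masked positions.
    """
    rng = np.random.default_rng(seed)
    f, t = spec.shape
    f, t = f - f % patch, t - t % patch      # crop to whole patches
    spec = spec[:f, :t]
    # One row per non-overlapping patch: shape (n_patches, patch * patch)
    patches = (spec.reshape(f // patch, patch, t // patch, patch)
                   .transpose(0, 2, 1, 3)
                   .reshape(-1, patch * patch))
    n = patches.shape[0]
    n_masked = int(round(mask_ratio * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_masked, replace=False)] = True
    visible = patches.copy()
    visible[mask] = 0.0                      # stand-in for a learned mask token
    return visible, patches[mask], mask

def reconstruction_loss(pred, target):
    """Mean squared error over the masked patches only."""
    return float(np.mean((pred - target) ** 2))

# Stand-in for a log-mel spectrogram: 128 mel bins x 256 frames of noise.
spec = np.random.default_rng(1).standard_normal((128, 256)).astype(np.float32)
visible, targets, mask = make_masked_patch_task(spec)
```

Here `targets` comes straight from the input, which is the sense in which the supervision is derived from the audio signal itself.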

Key Ideas

BYOL-A: Bootstrap Your Own Latent for Audio. Self-supervised learning without negative pairs: an online network learns to predict a slowly updated target network's representation of a differently augmented view of the same input.
SSAST: Self-Supervised Audio Spectrogram Transformer. Joint discriminative (contrastive) and generative (masked reconstruction) objectives on spectrogram patches.
MAE-AST: Masked Autoencoder AST. Masks a large portion of the spectrogram patches and reconstructs the missing content, following the masked autoencoder (MAE) paradigm from images.
Key advantage: pretraining needs no labels, so it can outperform supervised pretraining when labeled data is scarce.
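
The BYOL-style objective described above can be sketched in a few lines. This is a minimal illustration, assuming the standard BYOL formulation (a normalized prediction loss plus an exponential moving average for the target network); the function names and the `tau` value are hypothetical, and the encoders, augmentations, and gradient handling are omitted.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def byol_loss(online_pred, target_proj):
    """2 - 2 * cosine similarity, averaged over the batch.

    `target_proj` is treated as a constant: in a real training loop no
    gradient flows into the target network (stop-gradient).
    """
    p = l2_normalize(online_pred)
    z = l2_normalize(target_proj)
    return float(np.mean(2.0 - 2.0 * np.sum(p * z, axis=-1)))

def ema_update(target_params, online_params, tau=0.99):
    """Move the target weights slowly toward the online weights."""
    return {k: tau * target_params[k] + (1.0 - tau) * online_params[k]
            for k in target_params}

# Toy batch: 4 embeddings of dimension 8 standing in for network outputs.
rng = np.random.default_rng(0)
pred = rng.standard_normal((4, 8))
loss_same = byol_loss(pred, pred)       # identical directions
loss_opposite = byol_loss(pred, -pred)  # opposite directions
```

The loss only pulls the two views of the same clip together; there is no term that pushes other clips apart, which is why no negative pairs are needed.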

Relationships

Related to [[byol-a]], [[ssast]], [[mae-ast]]
Connects to [[clap]], which adds language supervision on top of audio representations

Sources

../entities/byol-a — BYOL for Audio
../entities/ssast — Self-Supervised AST
../entities/clap — CLAP: contrastive language-audio pretraining
PESTO (Riou et al., 2025): self-supervised pitch estimation — mentioned in transcription survey