Self-Supervised Audio Representation Learning

Definition

Learning general-purpose audio representations without labeled data, using pretext tasks that derive supervision from the audio signal itself. Enables pretraining on massive unlabeled corpora, then fine-tuning on small labeled downstream tasks.
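
As a rough illustration of how a pretext task derives supervision from the audio signal itself, the sketch below turns one unlabeled spectrogram into an (input, target, mask) triple for masked reconstruction. Everything here is an assumption made for illustration: the function names `make_masked_example` and `masked_mse` are hypothetical, plain Python lists stand in for tensors, masking is done per time frame rather than per patch, and the 0.75 mask ratio is only an example value.

```python
import random


def make_masked_example(spectrogram, mask_ratio=0.75, seed=0):
    """Turn one unlabeled spectrogram into an (input, target, mask) triple.

    spectrogram: list of time frames, each a list of floats (frequency bins).
    Masked frames are zeroed in the input; the untouched spectrogram is the
    reconstruction target, so no human label is involved.
    """
    rng = random.Random(seed)
    n_frames = len(spectrogram)
    n_masked = int(round(mask_ratio * n_frames))
    masked_idx = set(rng.sample(range(n_frames), n_masked))

    model_input = [
        [0.0] * len(frame) if t in masked_idx else list(frame)
        for t, frame in enumerate(spectrogram)
    ]
    target = [list(frame) for frame in spectrogram]
    mask = [t in masked_idx for t in range(n_frames)]
    return model_input, target, mask


def masked_mse(prediction, target, mask):
    """Mean squared error computed over masked frames only."""
    total, count = 0.0, 0
    for pred_frame, tgt_frame, is_masked in zip(prediction, target, mask):
        if is_masked:
            for p, y in zip(pred_frame, tgt_frame):
                total += (p - y) ** 2
                count += 1
    return total / count if count else 0.0


# Toy 8-frame, 4-bin "spectrogram" (made-up values, not real audio).
toy_spec = [[float(t + f) + 1.0 for f in range(4)] for t in range(8)]
toy_input, toy_target, toy_mask = make_masked_example(toy_spec)
```

During pretraining, a model would receive `toy_input` and be scored with `masked_mse` against `toy_target`; the label comes entirely from the signal.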

Key Ideas

  • BYOL-A: Bootstrap Your Own Latent for Audio. Self-supervised learning without negative pairs: an online network learns to predict a slowly updated target network's output for a differently augmented view of the same input.
  • SSAST: Self-Supervised Audio Spectrogram Transformer. Joint discriminative (contrastive) and generative (masked reconstruction) objectives on spectrogram patches.
  • MAE-AST: Masked Autoencoder AST. Masks a large fraction of spectrogram patches and reconstructs the missing content, following the masked autoencoder (MAE) paradigm from image modeling.
  • Key advantage: can outperform supervised pretraining when labeled data for the downstream task is scarce.
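
The objectives named above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not any paper's reference implementation: `byol_loss` is the normalized squared-error loss used in BYOL-style training, `ema_update` is the moving-average target update, and `info_nce` is a generic InfoNCE-style discriminative loss (whether it matches SSAST's exact formulation, including the `temperature` parameter, is an assumption). All function names, the `tau` value, and the use of plain Python lists in place of tensors are illustrative.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def byol_loss(online_prediction, target_projection):
    """Squared distance between L2-normalized vectors.

    Equals 2 - 2 * cosine similarity, so it needs no negative pairs.
    """
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return sum((a - b) ** 2 for a, b in zip(p, z))


def ema_update(target_params, online_params, tau=0.99):
    """Move target weights slowly toward the online weights.

    The target network is not trained by gradient descent; it only
    tracks the online network through this moving average.
    """
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]


def info_nce(query, candidates, positive_index, temperature=0.1):
    """Discriminative loss: identify the true item among the candidates.

    Cross-entropy over dot-product similarities, computed with the
    log-sum-exp trick for numerical stability.
    """
    logits = [sum(q * c for q, c in zip(query, cand)) / temperature
              for cand in candidates]
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - logits[positive_index]
```

In a BYOL-style loop, `byol_loss` would be minimized with respect to the online network only, and `ema_update` would be applied to the target weights after each step. A joint discriminative and generative setup would add a weighted `info_nce` term to a masked reconstruction loss.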

Relationships

  • Related to [[byol-a]], [[ssast]], [[mae-ast]]
  • Connects to [[clap]], which adds language supervision on top of audio representations

Sources

None ingested yet — seed batch setup.