Topic: Audio Classification & Representation

Overview

This topic covers neural network models for audio tagging, sound event detection, and general-purpose audio representation learning. It spans supervised models (YAMNet, PANNs, AST), self-supervised approaches (BYOL-A, SSAST, MAE-AST), audio-language models (CLAP), and the datasets that drive them.

Sub-topics / Concepts

Key Entities

Models

  • ../entities/panns — PANNs: Large-scale pretrained audio neural networks. CNN-based architectures (CNN14, CNN10) pretrained on AudioSet. Strong audio tagging baselines.
  • ../entities/yamnet — YAMNet: Google MobileNet-based audio event classifier. Trained on AudioSet with 521 audio event classes. Lightweight, runs on-device.
  • ../entities/ast / ../entities/ssast / ../entities/mae-ast — Audio Spectrogram Transformer family. ViT applied to spectrogram patches. AST = supervised on AudioSet; SSAST = self-supervised (joint discriminative + generative); MAE-AST = masked autoencoder pretraining.
  • ../entities/byol-a — BYOL for Audio: Bootstrap Your Own Latent for audio representation learning. Self-supervised, no negative pairs.
  • ../entities/clap — Contrastive Language-Audio Pretraining. Joint embedding for audio and text. Enables text-queried audio tasks.
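To make the "joint embedding for audio and text" idea in the CLAP entry concrete, here is a minimal NumPy sketch of a CLIP-style symmetric contrastive objective and of text-queried ranking in the shared space. It is an illustration under stated assumptions, not the CLAP implementation: the function names are hypothetical, the temperature of 0.07 is an arbitrary illustrative value, and random vectors stand in for the outputs of real audio and text encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Contrastive loss over a batch of paired embeddings.

    Row i of audio_emb is assumed to be the match for row i of text_emb;
    every other row in the batch acts as a negative.
    The temperature is an illustrative assumption, not a published setting.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    logits = (a @ t.T) / temperature  # (batch, batch) similarity matrix

    def diagonal_cross_entropy(lg):
        # Numerically stable log-softmax per row, then pick the matching pair.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def rank_clips_by_text(text_query_emb, audio_embs):
    """Return clip indices sorted from most to least similar to the query."""
    scores = l2_normalize(audio_embs) @ l2_normalize(text_query_emb)
    order = np.argsort(-scores)
    return order, scores


# Stand-in embeddings (hypothetical): 4 clips, 32 dimensions each.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 32))
# Pretend the text encoder maps each caption close to its paired clip.
text = audio + 0.01 * rng.normal(size=(4, 32))

loss_matched = symmetric_contrastive_loss(audio, text)
loss_mismatched = symmetric_contrastive_loss(audio, text[::-1])
order, scores = rank_clips_by_text(text[2], audio)
```

With correctly paired rows the loss is lower than with the captions reversed, and querying with the caption embedding for clip 2 ranks clip 2 first. In a real system the two embedding matrices would come from trained audio and text encoders rather than from a random generator.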

Datasets (see also ../topics/datasets)

Sources

None ingested yet — seed batch setup.

Open Questions

  • How do SSAST/MAE-AST representations compare to BYOL-A on downstream music tasks?
  • Can CLAP embeddings usefully drive conditional source separation for arbitrary instrument queries?
  • What is the best representation for music transcription downstream — spectrogram, learned embedding, raw waveform?