Topic: Audio Classification & Representation

Overview

Neural network models for audio tagging, sound event detection, and general-purpose audio representation learning. Covers supervised models (YAMNet, PANNs, AST), self-supervised approaches (BYOL-A, SSAST, MAE-AST), audio-language models (CLAP), and the datasets they are trained and evaluated on.

Sub-topics / Concepts

../concepts/self-supervised-audio-representation — Learning audio embeddings without labeled data. Examples: BYOL-A and the self-supervised AST variants (SSAST, MAE-AST).
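
As a rough illustration of learning without labels or negative pairs, the sketch below shows a BYOL-style objective in plain Python. It assumes two embeddings of differently augmented views of the same clip already exist: one from an online network's predictor and one from a target network's projector. The function names (`byol_loss`, `ema_update`) and the default `tau` are illustrative choices, not taken from any BYOL-A codebase, and real implementations operate on batched tensors.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def byol_loss(online_prediction, target_projection):
    """BYOL-style loss: squared distance between unit-normalized vectors.

    Equals 2 - 2 * cosine_similarity, so it is 0 when the two views
    agree in direction and 4 when they point in opposite directions.
    No negative pairs are involved.
    """
    q = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    cosine = sum(a * b for a, b in zip(q, z))
    return 2.0 - 2.0 * cosine

def ema_update(target_weights, online_weights, tau=0.99):
    """Move target weights toward online weights by an exponential moving average."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_weights, online_weights)]
```

In this setup only the online network receives gradients; the target network is updated through `ema_update`, which is what lets the method avoid collapse without contrasting against other clips.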

Key Entities

Models

../entities/panns — PANNs: Large-scale pretrained audio neural networks. CNN-based architectures (CNN14, CNN10) pretrained on AudioSet. Strong audio tagging baselines.
../entities/yamnet — YAMNet: Google's MobileNet-based audio event classifier, trained on AudioSet to predict 521 audio event classes. Lightweight enough to run on-device.
../entities/ast / ../entities/ssast / ../entities/mae-ast — Audio Spectrogram Transformer family: a ViT-style Transformer applied to spectrogram patches. AST is trained with supervision on AudioSet; SSAST is self-supervised (joint discriminative and generative objectives); MAE-AST uses masked autoencoder pretraining.
../entities/byol-a — BYOL-A (BYOL for Audio): Bootstrap Your Own Latent adapted to audio representation learning. Self-supervised; requires no negative pairs.
../entities/clap — CLAP: Contrastive Language-Audio Pretraining. Learns a joint embedding space for audio and text, which enables text-queried audio tasks.
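
To make the idea of a joint audio-text embedding concrete, here is a minimal sketch of text-queried retrieval by cosine similarity. It does not call any CLAP library: the 3-dimensional vectors, file names, and the `rank_clips` helper are hypothetical stand-ins for the outputs of an audio encoder and a text encoder that share one embedding space.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_clips(text_embedding, clip_embeddings):
    """Return (clip_name, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(text_embedding, emb))
              for name, emb in clip_embeddings.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy, hand-written vectors standing in for audio-encoder outputs.
clip_embeddings = {
    "dog_bark.wav": [0.9, 0.1, 0.0],
    "piano.wav": [0.0, 0.2, 0.9],
    "rain.wav": [0.1, 0.9, 0.1],
}

# Toy vector standing in for the text-encoder output of a query
# such as "a piano playing".
query_embedding = [0.1, 0.1, 0.95]

ranking = rank_clips(query_embedding, clip_embeddings)
best_clip = ranking[0][0]  # "piano.wav" for these toy vectors
```

The same ranking step would apply to zero-shot classification, assuming class names are embedded as text prompts and each clip is assigned the closest prompt.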

Datasets (see also ../topics/datasets)

../entities/audioset — AudioSet: Google's large-scale, ontology-backed audio event dataset (~2M 10 s clips, 527 classes).
../entities/audiocaps — AudioCaps: ~50K audio clips with human-written captions.
../entities/clotho — Clotho: audio captioning dataset with 5 captions per clip.

Sources

None ingested yet — seed batch setup.

Open Questions

How do SSAST/MAE-AST representations compare to BYOL-A on downstream music tasks?
Can CLAP embeddings usefully drive conditional source separation for arbitrary instrument queries?
Which input representation works best for downstream music transcription: spectrogram, learned embedding, or raw waveform?