Topic: Audio Classification & Representation
Overview
Neural network models for audio tagging, sound event detection, and general-purpose audio representation learning. Covers supervised models (YAMNet, PANNs, AST), self-supervised approaches (BYOL-A, SSAST, MAE-AST), audio-language models (CLAP), and the datasets they are trained and evaluated on.
Sub-topics / Concepts
- ../concepts/self-supervised-audio-representation — Learning audio embeddings without labeled data. BYOL-A, AST variants.
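BYOL-style methods such as BYOL-A train without negative pairs: an online network predicts the target network's projection of a differently augmented view of the same clip, and the loss is the squared distance between the two L2-normalized vectors. A minimal sketch of that loss alone, using plain Python lists in place of real network outputs (the function names are illustrative, and the symmetrization, stop-gradient, and moving-average target update of the full method are omitted):

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def byol_loss(online_prediction, target_projection):
    """Squared distance between unit vectors, equal to 2 - 2 * cosine similarity."""
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    cosine = sum(a * b for a, b in zip(p, z))
    return 2.0 - 2.0 * cosine


# Toy vectors standing in for embeddings of two augmented views of one clip.
aligned = byol_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # same direction
orthogonal = byol_loss([1.0, 0.0], [0.0, 1.0])          # unrelated directions
opposite = byol_loss([1.0, 0.0], [-1.0, 0.0])           # opposite directions
```

The loss depends only on the angle between the two vectors, so it ranges from 0 (aligned) to 4 (opposite) and never compares a clip against other clips in the batch.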
Key Entities
Models
- ../entities/panns — PANNs: Large-scale pretrained audio neural networks. CNN-based architectures (CNN14, CNN10) pretrained on AudioSet. Strong audio tagging baselines.
- ../entities/yamnet — YAMNet: Google's MobileNet-based audio event classifier. Trained on AudioSet; predicts 521 audio event classes from the AudioSet ontology. Lightweight, runs on-device.
- ../entities/ast / ../entities/ssast / ../entities/mae-ast — Audio Spectrogram Transformer family. ViT applied to spectrogram patches. AST = supervised on AudioSet; SSAST = self-supervised (joint discriminative + generative); MAE-AST = masked autoencoder pretraining.
- ../entities/byol-a — BYOL for Audio: Bootstrap Your Own Latent for audio representation learning. Self-supervised, no negative pairs.
- ../entities/clap — Contrastive Language-Audio Pretraining. Joint embedding for audio and text. Enables text-queried audio tasks.
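Text-queried audio tasks with a joint embedding such as CLAP's reduce to a similarity search: embed the text query, embed each clip, and rank clips by cosine similarity. A minimal sketch of the ranking step only, assuming the embeddings have already been computed by some audio and text encoder. The three-dimensional vectors and file names below are made-up placeholders, not real model outputs:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_clips(text_embedding, audio_embeddings):
    """Return (clip_id, score) pairs sorted from most to least similar."""
    scored = [
        (clip_id, cosine_similarity(text_embedding, embedding))
        for clip_id, embedding in audio_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical precomputed audio embeddings (placeholders for illustration).
audio_embeddings = {
    "dog_bark.wav": [0.9, 0.1, 0.0],
    "piano.wav": [0.1, 0.9, 0.2],
    "rain.wav": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a text query such as "a piano playing".
query_embedding = [0.2, 0.8, 0.1]

ranking = rank_clips(query_embedding, audio_embeddings)
best_clip = ranking[0][0]
```

Because both modalities share one space, the same ranking function also works in the other direction (audio query against a set of text labels), which is how zero-shot classification is typically set up with such models.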
Datasets (see also ../topics/datasets)
- ../entities/audioset — AudioSet: Google's large-scale, ontology-backed audio event dataset (~2M 10-second clips, 527 classes).
- ../entities/audiocaps — AudioCaps: ~50K audio clips with human-written captions.
- ../entities/clotho — Clotho: audio captioning dataset with 5 captions per clip.
Sources
None ingested yet — seed batch setup.
Open Questions
- How do SSAST/MAE-AST representations compare to BYOL-A on downstream music tasks?
- Can CLAP embeddings usefully drive conditional source separation for arbitrary instrument queries?
- What is the best input representation for downstream music transcription: spectrogram, learned embedding, or raw waveform?