Topic: Transcription & Pitch Estimation
Overview
This topic covers automatic music transcription (AMT) and pitch estimation with neural networks. It spans polyphonic instrument transcription, fundamental frequency (f0) estimation, multi-instrument transformer models, and guitar-specific tablature transcription.
Sub-topics / Concepts
- ../concepts/spectrogram-unets — U-Net architectures applied to spectrograms for transcription
- ../concepts/musicxml-tab-notation — tab notation representation for guitar
- ../concepts/self-supervised-audio-representation — pretraining strategies relevant to transcription
Key Entities
Models & Systems
- ../entities/basic-pitch (Spotify) — Lightweight, instrument-agnostic polyphonic note transcription and multipitch estimation. Architecture: harmonic stacking + CNN. Fast, intended for consumer use.
- ../entities/mt3 (Google) — Multi-Task Multitrack Music Transcription. Transformer-based, token-based approach that treats transcription as a seq2seq task. Handles multiple instruments in a single model.
- ../entities/onsets-and-frames (Google Magenta) — Piano transcription combining onset detection with frame-level note prediction. BiLSTM + CNN architecture.
- ../entities/crepe — Convolutional Representation for Pitch Estimation. Deep CNN for monophonic pitch (f0) estimation. Frame-level predictions.
- ../entities/bytedance-piano-transcription — ByteDance high-resolution piano transcription system. Regresses precise onset and offset times and also transcribes sustain pedal.
- ../entities/tabcnn — TabCNN: CNN-based guitar tablature transcription. Predicts string/fret directly from CQT spectrograms.
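Several of the systems above emit frame-level pitch estimates that still have to be turned into note events. The following is a minimal sketch of that post-processing step, assuming a per-frame f0 track in Hz (as a monophonic estimator might produce) and A4 = 440 Hz tuning. The function names `hz_to_midi` and `frames_to_notes` are hypothetical and do not come from any of the listed libraries; real systems typically add smoothing, confidence thresholds, and minimum-duration rules.

```python
import math

A4_HZ = 440.0    # reference tuning (assumption of this sketch)
A4_MIDI = 69     # MIDI note number of A4


def hz_to_midi(f0_hz: float) -> float:
    """Map a frequency in Hz to a (possibly fractional) MIDI note number."""
    if f0_hz <= 0:
        raise ValueError("f0 must be positive")
    return A4_MIDI + 12.0 * math.log2(f0_hz / A4_HZ)


def frames_to_notes(f0_track, hop_seconds):
    """Group consecutive frames with the same rounded MIDI pitch into notes.

    f0_track: per-frame f0 values in Hz; None or 0 marks an unvoiced frame.
    Returns a list of (onset_seconds, offset_seconds, midi_pitch) tuples.
    """
    notes = []
    current_pitch = None
    start_frame = 0
    # The trailing None acts as a sentinel that closes the last open note.
    for i, f0 in enumerate(list(f0_track) + [None]):
        pitch = round(hz_to_midi(f0)) if f0 else None
        if pitch != current_pitch:
            if current_pitch is not None:
                notes.append((start_frame * hop_seconds, i * hop_seconds, current_pitch))
            current_pitch = pitch
            start_frame = i
    return notes
```

Rounding to the nearest semitone discards pitch bends and vibrato, which is one reason note-level and frame-level outputs are usually reported separately.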
Datasets
- ../entities/guitarset — Guitar dataset recorded with a hexaphonic pickup, annotated with per-string pitch contours and note events, chords, beats, and playing style (comping vs. soloing).
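String-level annotations matter because the same pitch can usually be played at several positions on the neck. The sketch below illustrates that ambiguity, assuming standard tuning (E2 A2 D3 G3 B3 E4) and a 19-fret limit. Both are assumptions of this example, and the helper names `fret_to_midi` and `candidate_positions` are hypothetical.

```python
# Open-string MIDI pitches for standard tuning, low E to high E
# (E2, A2, D3, G3, B3, E4). Assumed here; other tunings need other values.
STANDARD_TUNING = (40, 45, 50, 55, 59, 64)


def fret_to_midi(string_index: int, fret: int, tuning=STANDARD_TUNING) -> int:
    """Return the MIDI pitch sounded at (string, fret); string 0 is the low E."""
    if not 0 <= string_index < len(tuning):
        raise ValueError("string_index out of range")
    if fret < 0:
        raise ValueError("fret must be non-negative")
    return tuning[string_index] + fret


def candidate_positions(midi_pitch: int, max_fret: int = 19, tuning=STANDARD_TUNING):
    """List every (string_index, fret) pair that produces the given pitch."""
    return [
        (string_index, midi_pitch - open_pitch)
        for string_index, open_pitch in enumerate(tuning)
        if 0 <= midi_pitch - open_pitch <= max_fret
    ]
```

Under these assumptions, `candidate_positions(64)` (E4) yields five playable positions, so a pitch-only transcription cannot be converted to tablature without extra information such as per-string recordings or a fingering model.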
Sources
None ingested yet (seed batch setup).
Open Questions
- How do MT3 and Basic Pitch compare on polyphonic instrument mixtures vs. solo piano?
- What is the state of the art for guitar tab transcription — TabCNN vs. MT3 vs. newer approaches?
- Can transcription models trained on isolated stems generalize to mixture inputs?
- How much does source separation preprocessing improve downstream transcription accuracy?
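Answering the comparison questions above requires a shared accuracy measure. One common choice is a note-level precision, recall, and F-measure in which an estimated note counts as correct if its pitch matches a reference note and its onset falls within a small tolerance. The sketch below assumes a 50 ms tolerance and uses simple greedy matching. The function name `note_f_measure` is hypothetical, and established evaluation toolkits use more careful optimal matching.

```python
def note_f_measure(reference, estimated, onset_tolerance=0.05):
    """Greedy note-level precision, recall, and F-measure.

    reference, estimated: lists of (onset_seconds, midi_pitch) tuples.
    Each reference note can be matched by at most one estimated note.
    """
    unmatched = list(reference)
    true_positives = 0
    for onset, pitch in estimated:
        for ref in unmatched:
            ref_onset, ref_pitch = ref
            if ref_pitch == pitch and abs(ref_onset - onset) <= onset_tolerance:
                unmatched.remove(ref)   # consume this reference note
                true_positives += 1
                break
    precision = true_positives / len(estimated) if estimated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Running the same function on model output from isolated stems, from raw mixtures, and from separated mixtures would give directly comparable numbers for the generalization and source-separation questions.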