Topic: Transcription & Pitch Estimation

Overview

This topic covers automatic music transcription (AMT) and pitch estimation with neural networks and related automated methods. It spans polyphonic instrument transcription, fundamental frequency (f0) estimation, multi-instrument Transformer models, and guitar-specific tablature transcription.

Sub-topics / Concepts

../concepts/spectrogram-unets — U-Net architectures applied to spectrograms for transcription
../concepts/musicxml-tab-notation — MusicXML representation of guitar tablature notation
../concepts/self-supervised-audio-representation — pretraining strategies relevant to transcription

Key Entities

Models & Systems

../entities/basic-pitch (Spotify) — Lightweight, instrument-agnostic polyphonic pitch and note transcription. Architecture: harmonic stacking of a constant-Q transform followed by a shallow CNN. Fast enough for consumer-facing use.
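../entities/mt3 (Google Magenta) — Multi-Task Multitrack Music Transcription. Token-based approach that treats transcription as a seq2seq problem, mapping spectrogram frames to MIDI-like event tokens with an encoder-decoder Transformer. Handles multiple instruments in a single model.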
../entities/onsets-and-frames (Google Magenta) — Piano transcription that pairs an onset detector with a frame-level note predictor; the onset predictions condition the frame predictions. CNN + BiLSTM architecture.
../entities/crepe — Convolutional Representation for Pitch Estimation. Deep CNN for monophonic pitch (f0) estimation that operates directly on the time-domain waveform and produces frame-level predictions.
../entities/bytedance-piano-transcription (ByteDance) — High-resolution piano transcription system that regresses precise onset and offset times and also transcribes sustain-pedal events.
../entities/tabcnn — TabCNN: CNN-based guitar tablature transcription. Predicts string/fret directly from CQT spectrograms.
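
To make "frame-level predictions" concrete for f0 estimation, the following is a minimal sketch of decoding one frame of pitch-bin activations into a frequency. The bin layout (360 bins spaced 20 cents apart), the `CENTS_OFFSET` value, and all function names are assumptions chosen for illustration; they are not the API or exact constants of CREPE or any other system listed above.

```python
import math

# Assumed bin layout, for illustration only.
N_BINS = 360
CENTS_PER_BIN = 20.0
REF_HZ = 10.0           # reference frequency of the cents scale (assumption)
CENTS_OFFSET = 2000.0   # cents above REF_HZ for bin 0 (hypothetical value)


def bin_to_hz(bin_index: float) -> float:
    """Map a (possibly fractional) bin index to a frequency in Hz."""
    cents = CENTS_OFFSET + CENTS_PER_BIN * bin_index
    return REF_HZ * 2.0 ** (cents / 1200.0)


def decode_frame(activations, window: int = 4):
    """Return (f0_hz, confidence) for one frame of bin activations.

    Picks the peak bin, then refines it with an activation-weighted
    average over neighbouring bins so the estimate is not quantized
    to the bin spacing.
    """
    peak = max(range(len(activations)), key=activations.__getitem__)
    lo = max(0, peak - window)
    hi = min(len(activations), peak + window + 1)
    weights = activations[lo:hi]
    total = sum(weights)
    if total <= 0.0:
        # No energy in any bin: report "no pitch" with zero confidence.
        return 0.0, 0.0
    refined = sum(i * w for i, w in zip(range(lo, hi), weights)) / total
    return bin_to_hz(refined), activations[peak]


def hz_to_midi(f_hz: float) -> float:
    """Convert a frequency to a fractional MIDI note number (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * math.log2(f_hz / 440.0)
```

Running `decode_frame` on each frame yields an f0 contour; rounding `hz_to_midi` of that contour is one simple way to move from pitch estimation toward note-level transcription, although the systems above use learned onset and note heads rather than rounding.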

Datasets

../entities/guitarset — Acoustic guitar dataset recorded with a hexaphonic pickup, annotated with per-string note and pitch-contour transcriptions, chords, beats, and playing style (comping vs. soloing).
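
String-level annotations and string/fret predictions both rest on the same mapping between a tablature position and a pitch. The sketch below illustrates that mapping, assuming standard tuning (E2 A2 D3 G3 B3 E4) and a hypothetical string numbering in which 0 is the low E string. Neither convention is taken from GuitarSet or TabCNN, and the function names are illustrative.

```python
# Open-string MIDI pitches in standard tuning, low E to high E (assumed).
STANDARD_TUNING = (40, 45, 50, 55, 59, 64)  # E2 A2 D3 G3 B3 E4


def tab_to_midi(string: int, fret: int, tuning=STANDARD_TUNING) -> int:
    """Return the MIDI pitch sounded at (string, fret); fret 0 is the open string."""
    if not 0 <= string < len(tuning):
        raise ValueError(f"string index out of range: {string}")
    if fret < 0:
        raise ValueError(f"fret must be non-negative: {fret}")
    return tuning[string] + fret


def midi_to_tab_candidates(pitch: int, max_fret: int = 19, tuning=STANDARD_TUNING):
    """List every (string, fret) position that produces the given MIDI pitch."""
    return [
        (s, pitch - open_pitch)
        for s, open_pitch in enumerate(tuning)
        if 0 <= pitch - open_pitch <= max_fret
    ]
```

Because `midi_to_tab_candidates` usually returns several positions for one pitch, a tablature transcriber has to resolve an ambiguity that a pitch-only transcriber never faces.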

Sources

None ingested yet — seed batch setup.

Open Questions

How do MT3 and Basic Pitch compare on polyphonic instrument mixtures vs. solo piano?
What is the state of the art for guitar tab transcription — TabCNN vs. MT3 vs. newer approaches?
Can transcription models trained on isolated stems generalize to mixture inputs?
How much does source separation preprocessing improve downstream transcription accuracy?