ACM MM 2021 Demo Page

This demo page is for the paper ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource Real-World Data. Here, we want to demonstrate that our proposed semi-supervised framework for AMT generalizes better to unseen data than the fully supervised baseline. Neither pop music nor concerto grosso is present in the labelled data, but our semi-supervised framework can still be trained on these unseen genres without labels.

Source code: https://github.com/KinWaiCheuk/ReconVAT

Details of data splitting: https://github.com/KinWaiCheuk/ReconVAT/blob/gh-pages/supplementary.pdf

Three models are compared here:

1. Baseline: the fully supervised baseline model.
2. ReconVAT: our proposed semi-supervised framework trained with the existing unlabelled data.
3. ReconVAT+new data: ReconVAT continually trained on the new unlabelled data described below.

String

We downloaded Corelli's Concerto Grosso Op. 6 No. 1, No. 2, and No. 3 from IMSLP. As a preprocessing step, we break each concerto down into shorter pieces, movement by movement. These three compositions result in 17 different movements after the preprocessing, and we add them to our unlabelled data to train our proposed framework.
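For illustration, the movement-level splitting can be done with a few lines of Python. The sketch below assumes the pydub library and manually noted movement boundaries; the file names and timestamps are placeholders rather than the actual values we used.

```python
import os
from pydub import AudioSegment  # requires pydub and ffmpeg

# Placeholder movement boundaries (in seconds) for one downloaded recording.
movement_boundaries = [(0, 112), (112, 265), (265, 390), (390, 510)]

os.makedirs("unlabelled_data", exist_ok=True)
recording = AudioSegment.from_file("corelli_op6_no1.mp3")
for idx, (start, end) in enumerate(movement_boundaries, start=1):
    clip = recording[start * 1000:end * 1000]  # pydub slices use milliseconds
    # Each movement-level clip is saved and added to the unlabelled training set.
    clip.export(f"unlabelled_data/corelli_op6_no1_mvt{idx}.wav", format="wav")
```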

Corelli Op6 No1 mvt4 Allegro

Since this genre is not present in the labelled training data from MusicNet, the transcription produced by the supervised baseline model contains many missing notes. Our proposed ReconVAT is much better than the baseline, and the ReconVAT trained on the new unlabelled data is even better in terms of melodic and accompaniment details.

Ground Truth | ReconVAT+new data | ReconVAT | Baseline




Corelli Op6 No2 mvt4 Allegro

Again, our proposed ReconVAT is much better than the baseline, and the ReconVAT trained on the new unlabelled data is slightly better than the ReconVAT trained only on the existing data.

Ground Truth | ReconVAT+new data | ReconVAT | Baseline




Corelli Op6 No3 mvt5 Allegro

Again, our proposed ReconVAT is much better than the baseline. This time, however, the ReconVAT trained on the existing data captures the details of the accompaniment better, while the ReconVAT trained on the new data captures the melodic details better. We will explore ways to improve the continual training method in the future.

Ground Truth | ReconVAT+new data | ReconVAT | Baseline




Woodwind

We downloaded a few pop music covers from Youtube and added them to the unlabelled dataset to train our proposed framework.
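As a rough sketch, the audio track of each cover can be fetched with a downloader such as youtube-dl; the exact tool we used may differ, and the URL below is a placeholder.

```python
import subprocess

# -x extracts the audio track only; the output template drops each file
# into the unlabelled data folder (placeholder URL).
subprocess.run([
    "youtube-dl", "-x", "--audio-format", "wav",
    "-o", "unlabelled_data/%(title)s.%(ext)s",
    "https://www.youtube.com/watch?v=PLACEHOLDER",
], check=True)
```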

Lean On

A clarinet cover of the song Lean On downloaded from Youtube. The following excerpt is extracted from 1:19-1:30.
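For reference, such an excerpt can be cut out programmatically. The minimal sketch below assumes librosa and soundfile, a placeholder file name, and an assumed 16 kHz sampling rate.

```python
import librosa
import soundfile as sf

# Placeholder file name for the downloaded clarinet cover.
# offset/duration select the 1:19-1:30 excerpt (79 s to 90 s).
excerpt, sr = librosa.load("lean_on_clarinet_cover.wav", sr=16000,
                           offset=79.0, duration=11.0)
sf.write("lean_on_excerpt_1m19s-1m30s.wav", excerpt, sr)
```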

The music transcription produced by the ReconVAT model is more detailed than that of the baseline model, and the ReconVAT model continually trained on the new data is slightly more accurate than the one trained without the new data.

Ground Truth | ReconVAT+new data | ReconVAT | Baseline




Lemon

A clarinet cover of the song Lemon downloaded from Youtube. The following excerpt is extracted from 3:30-3:40. The blue notes highlight the main melody and the green notes highlight the accompaniment (piano and drums). Our ReconVAT is better than the baseline model in terms of both melodic and accompaniment details. The transcription produced by our ReconVAT trained on the new data is even finer in detail.

Ground Truth | ReconVAT+new data | ReconVAT | Baseline




Yoasobi

A clarinet cover of a Yoasobi song downloaded from Youtube. The following excerpt is extracted from 0:31-0:42. Again, our ReconVAT is much better than the baseline model, but our ReconVAT trained on the new data shows only a subtle improvement for this piece: the note durations at the beginning of the excerpt are slightly more accurate.

Ground Truth | ReconVAT+new data | ReconVAT | Baseline