This demo page is for the paper ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource Real-World Data. Here, we want to demonstrate that our proposed semi-supervised framework for AMT generalizes better to unseen data than the fully supervised baseline. Neither pop music nor concerto grosso is present in the labelled data, but our semi-supervised framework can still be trained on these unseen genres without labels.
Source code: https://github.com/KinWaiCheuk/ReconVAT
Details of data splitting: https://github.com/KinWaiCheuk/ReconVAT/blob/gh-pages/supplementary.pdf
Three models are compared here:
- Baseline: the fully supervised baseline model trained on the labelled data only.
- ReconVAT: our proposed semi-supervised framework trained on the existing data.
- ReconVAT+new data: our proposed framework continued training on the new unlabelled data described below.
We downloaded Corelli's Concerto Grosso Op.6 No.1, No.2, and No.3 from IMSLP. As a preprocessing step, we split each concerto into shorter pieces, movement by movement. These three compositions yield 17 different movements after preprocessing, which we add to our unlabelled data to train our proposed framework.
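The movement-by-movement split above can be sketched as follows. This is a minimal illustration, not the repository's actual preprocessing code: the movement boundary timestamps are assumed to be annotated by hand, and the function and file names are our own.

```python
import wave

def split_movements(src_path, boundaries, prefix):
    """Split a WAV recording into one file per movement.

    boundaries: list of (start_s, end_s) tuples in seconds, one per
    movement, assumed to be annotated by hand.
    Returns the list of paths written.
    """
    paths = []
    with wave.open(src_path, "rb") as src:
        sr = src.getframerate()
        for i, (start, end) in enumerate(boundaries, 1):
            # Seek to the movement's first frame and read its span.
            src.setpos(int(start * sr))
            frames = src.readframes(int((end - start) * sr))
            out = f"{prefix}_mov{i}.wav"
            with wave.open(out, "wb") as dst:
                dst.setnchannels(src.getnchannels())
                dst.setsampwidth(src.getsampwidth())
                dst.setframerate(sr)
                dst.writeframes(frames)
            paths.append(out)
    return paths

# Hypothetical usage: split one concerto into three movement files.
# split_movements("op6_no1.wav", [(0, 210), (210, 495), (495, 700)], "op6_no1")
```

The resulting movement files can then be added directly to the unlabelled pool, since the semi-supervised framework needs no labels for them.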
Since this genre is absent from the labelled training data in MusicNet, the transcription produced by the supervised baseline model misses many notes. Our proposed ReconVAT is much better than the baseline, and the ReconVAT trained on the new unlabelled data captures the melodic and accompaniment details even better.
| Ground Truth | ReconVAT+new data | ReconVAT | Baseline |
Again, our proposed ReconVAT is much better than the baseline, and the ReconVAT trained on the new unlabelled data is slightly better than the one trained only on the existing data.
| Ground Truth | ReconVAT+new data | ReconVAT | Baseline |
Again, our proposed ReconVAT is much better than the baseline. This time, however, the ReconVAT trained on the existing data captures the accompaniment details better, while the ReconVAT trained on the new data captures the melodic details better. We will explore ways to improve the continuous training method in the future.
| Ground Truth | ReconVAT+new data | ReconVAT | Baseline |
We downloaded a few pop music covers from YouTube and added them to the unlabelled dataset to train our proposed framework.
A clarinet cover of the song Lean On, downloaded from YouTube. The following excerpt is taken from 1:19-1:30.
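Cutting a timestamped excerpt such as 1:19-1:30 out of a downloaded recording can be sketched with Python's standard `wave` module. This is a minimal illustration under the assumption that the cover has already been converted to WAV; the filenames are hypothetical.

```python
import wave

def mmss_to_seconds(ts):
    """Convert a 'mm:ss' timestamp string into seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + float(seconds)

def extract_excerpt(src_path, dst_path, start, end):
    """Copy the [start, end] span (both 'mm:ss' strings) of a WAV file."""
    with wave.open(src_path, "rb") as src:
        sr = src.getframerate()
        # Seek to the excerpt's first frame, then read its duration in frames.
        src.setpos(int(mmss_to_seconds(start) * sr))
        n_frames = int((mmss_to_seconds(end) - mmss_to_seconds(start)) * sr)
        frames = src.readframes(n_frames)
        with wave.open(dst_path, "wb") as dst:
            dst.setnchannels(src.getnchannels())
            dst.setsampwidth(src.getsampwidth())
            dst.setframerate(sr)
            dst.writeframes(frames)

# Hypothetical usage: cut the 1:19-1:30 excerpt shown on this page.
# extract_excerpt("lean_on_cover.wav", "lean_on_excerpt.wav", "1:19", "1:30")
```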
The music transcription produced by the ReconVAT model is more detailed than the baseline model's, and the ReconVAT continued training on the new data is slightly more accurate than the one trained without it.
| Ground Truth | ReconVAT+new data | ReconVAT | Baseline |
A clarinet cover of the song Lemon, downloaded from YouTube. The following excerpt is taken from 3:30-3:40. The blue notes highlight the main melody and the green notes highlight the accompaniment (piano and drums). Our ReconVAT is better than the baseline model in both melodic and accompaniment details, and the transcription produced by the ReconVAT trained on the new data is even finer in detail.
| Ground Truth | ReconVAT+new data | ReconVAT | Baseline |
A clarinet cover of the song Lemon, downloaded from YouTube. The following excerpt is taken from 0:31-0:42. Again, our ReconVAT is much better than the baseline model, but the ReconVAT trained on the new data shows only a subtle improvement on this piece: the note durations at the beginning are slightly more accurate.
| Ground Truth | ReconVAT+new data | ReconVAT | Baseline |