Abstract
A challenging task in developing real-time Automatic Music Transcription (AMT) methods is directly leveraging multichannel raw audio as input, without any handcrafted signal transformation or feature extraction steps. The crucial problems are that raw audio contains only an amplitude value at each timestep, and that the left- and right-channel signals differ in amplitude intensity and onset time. This study addresses these issues by proposing IRawNet, a method with fused feature layers that merge the differing amplitudes of multichannel raw audio. IRawNet aims to transcribe Indonesian classical music notes and was validated on a Gamelan music dataset, whose class imbalance was overcome with the Synthetic Minority Oversampling Technique (SMOTE). Under various experimental scenarios, the performance effects of oversampled data, hyperparameter tuning, and the fused feature layers are analyzed. Furthermore, the performance of the proposed method is compared with the Temporal Convolutional Network (TCN), Deep WaveNet, and a monochannel IRawNet. The results show that the proposed method achieves superior results on nearly all metrics, with an accuracy of 0.871, AUC of 0.988, precision of 0.927, recall of 0.896, and F1-score of 0.896.
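The abstract's core architectural idea, extracting features from each raw-audio channel separately and merging them in a fused feature layer, can be illustrated with a minimal Keras sketch. This is not the authors' IRawNet implementation; the frame length, layer sizes, kernel widths, and number of Gamelan note classes below are illustrative assumptions.

```python
# Minimal sketch (not the paper's released code): a dual-branch 1D-CNN
# over raw stereo audio with a fused feature layer. All hyperparameters
# are assumed for illustration only.
import tensorflow as tf
from tensorflow.keras import layers, models

FRAME_LEN = 2048      # assumed raw-audio frame length (samples)
NUM_CLASSES = 10      # assumed number of Gamelan note classes

def channel_branch(name: str) -> models.Sequential:
    """Feature extractor applied independently to one raw channel."""
    return models.Sequential([
        layers.Input(shape=(FRAME_LEN, 1)),
        layers.Conv1D(32, 64, strides=4, activation="relu"),
        layers.MaxPooling1D(4),
        layers.Conv1D(64, 16, strides=2, activation="relu"),
        layers.GlobalAveragePooling1D(),
    ], name=name)

left_in = layers.Input(shape=(FRAME_LEN, 1), name="left_channel")
right_in = layers.Input(shape=(FRAME_LEN, 1), name="right_channel")

left_feat = channel_branch("left_branch")(left_in)
right_feat = channel_branch("right_branch")(right_in)

# Fused feature layer: concatenate the per-channel embeddings so the
# classifier sees both amplitude profiles despite inter-channel
# differences in intensity and onset time.
fused = layers.Concatenate(name="fused_features")([left_feat, right_feat])
fused = layers.Dense(128, activation="relu")(fused)
out = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

model = models.Model([left_in, right_in], out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

For the SMOTE step, imbalanced-learn's implementation operates on 2-D feature matrices, so one plausible way to apply it to stereo frames, sketched below with synthetic placeholder arrays, is to flatten each frame, resample, and reshape back.

```python
# Hedged sketch of SMOTE oversampling on stereo raw-audio frames.
# The arrays are synthetic placeholders, not the Gamelan dataset.
import numpy as np
from imblearn.over_sampling import SMOTE

X = np.random.randn(200, 2048, 2)        # (frames, samples, L/R channels)
y = np.random.randint(0, 10, size=200)   # class labels (imbalanced in practice)

X_flat = X.reshape(len(X), -1)           # flatten to (frames, features)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_flat, y)
X_res = X_res.reshape(-1, 2048, 2)       # back to stereo frames
```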
Funder
Lembaga Pengelola Dana Pendidikan
Publisher
EMITTER International Journal of Engineering Technology
References (31 articles)
1. E. Benetos, S. Dixon, Z. Duan, and S. Ewert, "Automatic music transcription: An overview," IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 20–30, 2018.
2. A. van den Oord et al., "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016. [Online]. Available: http://arxiv.org/abs/1609.03499
3. S. Bai, J. Z. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," arXiv preprint arXiv:1803.01271, 2018.
4. M. E. P. Davies and S. Böck, "Temporal convolutional networks for musical audio beat tracking," in 2019 27th European Signal Processing Conference (EUSIPCO), pp. 1–5, 2019.
5. L. S. Martak, M. Sajgalik, and W. Benesova, "Polyphonic note transcription of time-domain audio signal with deep WaveNet architecture," in 2018 25th International Conference on Systems, Signals and Image Processing (IWSSIP), pp. 1–5, 2018.