Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing

Published: 2024-05-16, Issue: 7, Volume: 20, Pages: 1-29
ISSN: 1551-6857
Container-title: ACM Transactions on Multimedia Computing, Communications, and Applications
Language: en
Short-container-title: ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Gu Xiangming¹, Ou Longshen², Zeng Wei², Zhang Jianan², Wong Nicholas², Wang Ye²

Affiliation:

1. National University of Singapore, Singapore, Singapore

2. National University of Singapore, Singapore, Singapore

Abstract

Automatic lyric transcription (ALT) refers to transcribing singing voices into lyrics, while automatic music transcription (AMT) refers to transcribing singing voices into note events, i.e., musical MIDI notes. Despite these two tasks having significant potential for practical application, they are still nascent. This is because the transcription of lyrics and note events solely from singing audio is notoriously difficult due to the presence of noise contamination, e.g., musical accompaniment, resulting in a degradation of both the intelligibility of sung lyrics and the recognizability of sung notes. To address this challenge, we propose a general framework for implementing multimodal ALT and AMT systems. Additionally, we curate the first multimodal singing dataset, comprising N20EMv1 and N20EMv2, which encompasses audio recordings and videos of lip movements, together with ground truth for lyrics and note events. For model construction, we propose adapting self-supervised learning models from the speech domain as acoustic encoders and visual encoders to alleviate the scarcity of labeled data. We also introduce a residual cross-attention mechanism to effectively integrate features from the audio and video modalities. Through extensive experiments, we demonstrate that our single-modal systems exhibit state-of-the-art performance on both ALT and AMT tasks. Subsequently, through single-modal experiments, we also explore the individual contributions of each modality to the multimodal system. Finally, we combine these and demonstrate the effectiveness of our proposed multimodal systems, particularly in terms of their noise robustness.

Funder

Ministry of Education in Singapore

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3651310

References: 80 articles.

1. Triantafyllos Afouras et al. 2018. LRS3-TED: A large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496.

2. Víctor Arroyo, Jose J. Valero-Mas, Jorge Calvo-Zaragoza, and Antonio Pertusa. 2022. Neural audio-to-score music transcription for unconstrained polyphony using compact output representations. In 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’22). IEEE, 4603–4607.

3. Alexei Baevski et al. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems.

4. Sakya Basak, Shrutina Agarwal, Sriram Ganapathy, and Naoya Takahashi. 2021. End-to-end lyrics recognition with voice to singing style transfer. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 266–270.

5. Ke Chen, Shuai Yu, Cheng-i Wang, Wei Li, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. 2022. Tonet: Tone-octave network for singing melody extraction from polyphonic music. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 621–625.