Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing

Author:

Gu Xiangming¹, Ou Longshen², Zeng Wei², Zhang Jianan², Wong Nicholas², Wang Ye²

Affiliation:

1. National University of Singapore, Singapore, Singapore

2. National University of Singapore, Singapore, Singapore

Abstract

Automatic lyric transcription (ALT) refers to transcribing singing voices into lyrics, while automatic music transcription (AMT) refers to transcribing singing voices into note events, i.e., musical MIDI notes. Despite their significant potential for practical application, both tasks remain nascent. This is because transcribing lyrics and note events solely from singing audio is notoriously difficult: noise contamination, e.g., musical accompaniment, degrades both the intelligibility of sung lyrics and the recognizability of sung notes. To address this challenge, we propose a general framework for building multimodal ALT and AMT systems. Additionally, we curate the first multimodal singing dataset, comprising N20EMv1 and N20EMv2, which contains audio recordings and videos of lip movements together with ground-truth lyrics and note events. For model construction, we propose adapting self-supervised learning models from the speech domain as acoustic and visual encoders to alleviate the scarcity of labeled data. We also introduce a residual cross-attention mechanism to effectively integrate features from the audio and video modalities. Through extensive experiments, we demonstrate that our single-modal systems achieve state-of-the-art performance on both ALT and AMT. Through these single-modal experiments, we also assess the individual contribution of each modality to the multimodal system. Finally, we combine the two modalities and demonstrate the effectiveness of the resulting multimodal systems, particularly their robustness to noise.
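The residual cross-attention fusion described in the abstract can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration under our own assumptions, not the paper's released implementation: the class name ResidualCrossAttentionFusion, the shared feature dimension, and the symmetric audio-to-video and video-to-audio attention are all hypothetical. It shows the core idea: each modality attends to the other, and the attended features are added back to the original stream as a residual.

```python
# A minimal sketch of residual cross-attention fusion between audio and
# video feature sequences, assuming both encoders emit time-aligned
# features of the same dimension. Names are illustrative, not taken
# from the paper's code.
import torch
import torch.nn as nn


class ResidualCrossAttentionFusion(nn.Module):
    """Fuse audio and video features: each stream attends to the other,
    and the attended features are added back as a residual."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Audio queries attend to video keys/values, and vice versa.
        self.audio_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio, video: (batch, time, dim), assumed frame-synchronous.
        attn_a, _ = self.audio_to_video(query=audio, key=video, value=video)
        attn_v, _ = self.video_to_audio(query=video, key=audio, value=audio)
        # Residual connections keep each stream's original features intact,
        # so a clean modality still passes through when the other is noisy.
        fused_a = self.norm_a(audio + attn_a)
        fused_v = self.norm_v(video + attn_v)
        return fused_a + fused_v  # joint representation for the decoder


if __name__ == "__main__":
    fusion = ResidualCrossAttentionFusion(dim=512, num_heads=8)
    a = torch.randn(2, 100, 512)  # e.g., features from a wav2vec 2.0-style acoustic encoder
    v = torch.randn(2, 100, 512)  # e.g., features from a lip-movement visual encoder
    print(fusion(a, v).shape)  # torch.Size([2, 100, 512])
```

The residual path is the design point that plausibly matters for noise robustness: a heavily corrupted modality contributes little attended signal, while the uncorrupted stream passes through largely unchanged.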

Funder

Ministry of Education in Singapore

Publisher

Association for Computing Machinery (ACM)

