KMSAV: Korean multi‐speaker spontaneous audiovisual dataset

Published:2024-02 Issue:1 Volume:46 Page:71-81
ISSN:1225-6463
Container-title:ETRI Journal
language:en
Short-container-title:ETRI Journal

Author:

Park Kiyoung¹², Oh Changhan¹², Dong Sunghee¹

Affiliation:

1. Superintelligence Creative Research Laboratory, Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea

2. Integrated Intelligence Research Laboratory, University of Science and Technology, Daejeon, Republic of Korea

Abstract

Recent advances in deep learning for speech and visual recognition have accelerated the development of multimodal speech recognition, yielding many innovative results. We introduce a Korean audiovisual speech recognition corpus. This dataset comprises approximately 150 h of manually transcribed and annotated audiovisual data, supplemented with an additional 2000 h of untranscribed videos collected from YouTube under the Creative Commons License. The dataset is intended to be freely accessible for unrestricted research purposes. Along with the corpus, we propose an open‐source framework for automatic speech recognition (ASR) and audiovisual speech recognition (AVSR). We validate the effectiveness of the corpus with evaluations using state‐of‐the‐art ASR and AVSR techniques, capitalizing on both pretrained models and fine‐tuning processes. After fine‐tuning, ASR and AVSR achieve character error rates of 11.1% and 18.9%, respectively. This error difference highlights the need for improvement in AVSR techniques. We expect that our corpus will be an instrumental resource to support improvements in AVSR.

Publisher

Wiley

Link

https://onlinelibrary.wiley.com/doi/pdf/10.4218/etrij.2023-0352

Reference: 30 articles.

1. Deep Audio-visual Speech Recognition

2. S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic, End‐to‐end audiovisual speech recognition, (IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), Calgary, Canada), 2018, pp. 6548–6552.

3. An Overview of Noise-Robust Automatic Speech Recognition

4. J. Chung, A. Senior, O. Vinyals, and A. Zisserman, Lip reading sentences in the wild, (IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), Honolulu, HI, USA), 2017, pp. 3444–3453.

5. P. Ma, A. Haliassos, A. Fernandez‐Lopez, H. Chen, S. Petridis, and M. Pantic, Auto‐AVSR: audio‐visual speech recognition with automatic labels, (IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), Rhodes Island, Greece), 2023, pp. 1–5.

Cited by 1 article.

1. Special issue on speech and language AI technologies; ETRI Journal; 2024-02