Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications

Published:2022-10-12 Issue:20 Volume:22 Page:7738
ISSN:1424-8220
Container-title:Sensors
language:en
Short-container-title:Sensors

Author:

Jeon Sanghun, Kim Mun Sang

Abstract

Speech is a commonly used interaction-recognition technique in edutainment-based systems and is a key technology for smooth educational learning and user–system interaction. However, its application in real environments is limited by the various noise disruptions found there. In this study, a multimodal interaction system based on audio and visual information is proposed that makes speech-driven virtual aquarium systems robust to ambient noise. For audio-based speech recognition, a list of words recognized by a speech API is expressed as word vectors using a pretrained model. Meanwhile, vision-based speech recognition uses a composite end-to-end deep neural network. Subsequently, the vectors derived from the API and vision are concatenated and then classified. The signal-to-noise ratio for the proposed system was determined from data collected in four types of noise environments. Furthermore, the system was tested for accuracy and efficiency against existing single-mode strategies for visual feature extraction and audio speech recognition. Its average recognition rate was 91.42% when only speech was used, and improved by 6.7 percentage points to 98.12% when audio and visual information were combined. This method can be helpful in various real-world settings where speech recognition is regularly used, such as cafés, museums, music halls, and kiosks.
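
The fusion step described in the abstract (concatenating the audio-derived word vector with the vision-derived vector, then classifying the result) can be illustrated with a minimal sketch. This is not the authors' implementation: the vector sizes, the linear softmax classifier, the weights, and the command labels "feed" and "swim" are all hypothetical values chosen only to show the data flow.

```python
import math

def concat_features(audio_vec, visual_vec):
    """Fuse two modality vectors by simple concatenation."""
    return list(audio_vec) + list(visual_vec)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(fused, weights, biases, labels):
    """Apply a linear layer plus softmax to the fused vector (assumed classifier)."""
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Hypothetical toy inputs: a 2-d audio word vector and a 3-d visual feature vector.
audio_vec = [0.9, 0.1]
visual_vec = [0.2, 0.8, 0.1]
fused = concat_features(audio_vec, visual_vec)

# Hypothetical classifier parameters: one weight row per command label.
weights = [
    [1.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0, 0.0],
]
biases = [0.0, 0.0]
labels = ["feed", "swim"]

label, probs = classify(fused, weights, biases, labels)
print(label, probs)
```

In a real system the two input vectors would come from a pretrained word-embedding model and a visual speech network, and the classifier would be trained rather than hand-set.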

Funder

National Research Foundation of Korea (NRF) grant funded by the Korea government

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

Link

https://www.mdpi.com/1424-8220/22/20/7738/pdf

Cited by 5 articles.

1. Evaluation Method for Virtual Museum Interface Integrating Layout Aesthetics and Visual Cognitive Characteristics Based on Improved Gray H-Convex Correlation Model;Applied Sciences;2024-08-09

2. Multimodal audiovisual speech recognition architecture using a three‐feature multi‐fusion method for noise‐robust systems;ETRI Journal;2024-02

3. Audio-Visual Self-Supervised Representation Learning: A Survey;2024

4. Audio–Visual Fusion Based on Interactive Attention for Person Verification;Sensors;2023-12-15

5. The Use of Correlation Features in the Problem of Speech Recognition;Algorithms;2023-02-07