Affiliation:
1. Electronics and Telecommunications Research Institute, Daejeon, Republic of Korea
2. School of Integrated Technology, Gwangju Institute of Science and Technology, Gwangju, Republic of Korea
Abstract
Exposure to varied noisy environments impairs the recognition performance of artificial intelligence-based speech recognition technologies. Services with degraded performance can still be deployed as limited systems that guarantee good accuracy in specific environments, but this undermines the overall quality of speech recognition services. This study introduces an audiovisual speech recognition (AVSR) model that is robust to various noise settings and mimics the elements of human dialogue recognition. For audio-based recognition, the model converts word embeddings and log-Mel spectrograms into feature vectors. A dense spatial-temporal convolutional neural network extracts features from the log-Mel spectrograms, which are transformed for visual-based recognition. This approach improves both aural and visual recognition capabilities. We evaluate the model across nine synthesized noise environments at varying signal-to-noise ratios, where it achieves lower average error rates: 1.711% with the proposed three-feature multi-fusion method, compared with the general rate of 3.939%. Owing to its improved stability and recognition rate, the model is applicable in noise-affected environments.
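The abstract's audio front end is built on log-Mel spectrograms. As a hedged illustration only (not the authors' implementation, whose exact parameters are not given here), the sketch below computes a log-Mel spectrogram in plain NumPy, assuming a 16 kHz sample rate with a typical 25 ms window, 10 ms hop, and 40 mel bands:

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale conversion (O'Shaughnessy formula)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(x, sr=16000, n_fft=400, hop=160, n_mels=40):
    """Minimal log-Mel sketch: frame, window, power spectrum, mel filterbank, log.
    Parameter values are illustrative assumptions, not the paper's settings."""
    # Frame the signal and apply a Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] for i in range(n_frames)])
    frames = frames * np.hanning(n_fft)
    # Power spectrum of each frame
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft//2 + 1)
    # Triangular mel filterbank spanning 0 Hz to Nyquist
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    # Log compression with a small floor for numerical stability
    return np.log(spec @ fb.T + 1e-10)  # (n_frames, n_mels)
```

In an AVSR pipeline of the kind the abstract describes, each frame's mel vector would then be fed to the downstream feature extractor; the fusion and network details are specific to the paper and not reproduced here.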
Cited by: 1 article.