EMOLIPS: Towards Reliable Emotional Speech Lip-Reading

Published: 2023-11-27 Issue: 23 Volume: 11 Page: 4787
ISSN: 2227-7390
Container-title: Mathematics
Language: en
Short-container-title: Mathematics

Author:

Ryumin Dmitry¹, Ryumina Elena¹, Ivanko Denis¹

Affiliation:

1. St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), 199178 St. Petersburg, Russia

Abstract

In this article, we present a novel approach for emotional speech lip-reading (EMOLIPS). This two-level approach to emotional speech-to-text recognition based on visual data processing is motivated by human perception and by recent developments in multimodal deep learning. The proposed approach first uses visual speech data to determine the type of speech emotion. The speech data are then processed by one of several emotional lip-reading models, each trained from scratch. This essentially resolves the multi-emotional lip-reading issue associated with most real-life scenarios. We implemented these models as a combination of the EMO-3DCNN-GRU architecture for emotion recognition and the 3DCNN-BiLSTM architecture for automatic lip-reading. We evaluated the models on the CREMA-D and RAVDESS emotional speech corpora. In addition, this article provides a detailed review of advances in automated lip-reading and emotion recognition made over the last 5 years (2018–2023). In contrast to existing reviews, we focus mainly on the progress brought by the introduction of deep learning to the field and skip the description of traditional approaches. By taking into account the emotional features of the pronounced audio-visual speech, the EMOLIPS approach significantly improves the state-of-the-art accuracy for phrase recognition, reaching 91.9% and 90.9% for RAVDESS and CREMA-D, respectively. Moreover, we present an extensive experimental investigation that demonstrates how different emotions (happiness, anger, disgust, fear, sadness, and neutral), valence categories (positive, neutral, and negative), and a binary split (emotional and neutral) affect automatic lip-reading.
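The two-level idea described in the abstract (first classify the emotion from the visual speech data, then hand the same data to a lip-reading model specific to that emotion) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the stub models, and the fallback to a "neutral" model for unknown emotion labels are all assumptions made here for illustration only.

```python
from typing import Callable, Dict, Sequence

# Placeholder type for a clip of mouth-region video frames (assumption).
Frames = Sequence[object]


def make_two_level_recognizer(
    emotion_classifier: Callable[[Frames], str],
    lip_readers: Dict[str, Callable[[Frames], str]],
    fallback_emotion: str = "neutral",
) -> Callable[[Frames], str]:
    """Build a recognizer that routes a clip to an emotion-specific lip-reader.

    Level 1: emotion_classifier predicts an emotion label from the frames.
    Level 2: the lip-reading model registered for that label decodes the phrase.
    The fallback to `fallback_emotion` is a hypothetical design choice,
    not something stated in the abstract.
    """
    def recognize(frames: Frames) -> str:
        emotion = emotion_classifier(frames)
        reader = lip_readers.get(emotion, lip_readers[fallback_emotion])
        return reader(frames)

    return recognize


# Stub models standing in for trained networks (hypothetical).
def stub_emotion_classifier(frames: Frames) -> str:
    return "anger" if len(frames) > 3 else "neutral"


stub_lip_readers = {
    "neutral": lambda frames: "phrase decoded by neutral model",
    "anger": lambda frames: "phrase decoded by anger model",
}

recognize = make_two_level_recognizer(stub_emotion_classifier, stub_lip_readers)
```

In a real system the stubs would be replaced by trained networks; the routing logic itself stays the same regardless of how many emotion classes are used.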

Funder

Russian Science Foundation

Publisher

MDPI AG

Subject

General Mathematics, Engineering (miscellaneous), Computer Science (miscellaneous)

Link

https://www.mdpi.com/2227-7390/11/23/4787/pdf

Cited by 1 article.

1. Script Generation for Silent Speech in E-Learning;Advances in Educational Technologies and Instructional Design;2024-06-03