Author:
Ivanko Denis, Ryumin Dmitry, Markitantov Maxim
Abstract
In this paper, we present a novel method for word-level visual speech recognition intended for use in human-robot interaction. The ability of robots to understand natural human speech will significantly improve the quality of human-machine interaction. Despite the outstanding breakthroughs achieved in this field in recent years, this challenge remains unresolved. In the current research, we focus mainly on the visual part of human speech, the so-called automated lip-reading task, which becomes crucial for human-robot interaction in acoustically noisy environments. The developed method is based on state-of-the-art artificial intelligence technologies and achieves 85.03% speech recognition accuracy using only video data. It is worth noting that training and testing of the method were carried out on the benchmark LRW database recorded in-the-wild, and the presented results surpass many existing results reported by the world speech recognition research community.
Publisher
Krasnoyarsk Science and Technology City Hall
Cited by
1 article.