Speech-Section Extraction Using Lip Movement and Voice Information in Japanese
-
Published:2023-01-20
Issue:1
Volume:27
Page:54-63
-
ISSN:1883-8014
-
Container-title:Journal of Advanced Computational Intelligence and Intelligent Informatics
-
language:en
-
Short-container-title:JACIII
Author:
Nakamura Etsuro 1, Kageyama Yoichi 1 (ORCID), Hirose Satoshi 2
Affiliation:
1. Graduate School of Engineering Science, Akita University, 1-1 Tegata Gakuen-Machi, Akita 010-8502, Japan
2. Japan Business Systems, Inc., 1-23-1 Toranomon, Minato-ku, Tokyo 105-6316, Japan
Abstract
In recent years, several Japanese companies have attempted to improve the efficiency of their meetings, which has been a significant challenge. For instance, voice recognition technology is used to considerably improve the creation of meeting minutes. In an automatic minutes-creation system, identifying the speaker and adding speaker information to the text would substantially improve the overall efficiency of the process. Therefore, a few companies and research groups have proposed speaker estimation methods; however, these methods involve challenges such as the need for advance preparation, special equipment, and multiple microphones. These problems can be solved by using speech sections extracted from lip movements and voice information. When a person speaks, voice and lip movements occur simultaneously; therefore, the speaker’s speech section can be extracted from videos using lip movement and voice information. However, when a speech section is extracted from voice information alone, the voiceprint of each meeting participant is required for speaker identification. When lip movements are used, the speech section and speaker position can be extracted without voiceprint information. Therefore, in this study, we propose a speech-section extraction method for speaker identification that uses image and voice information in Japanese. The proposed method consists of three processes: i) extraction of speech frames using lip movements, ii) extraction of speech frames using voices, and iii) classification of speech sections using these extraction results. We evaluated the functionality of the method using video data and compared it with conventional methods based on state-of-the-art techniques. The average F-measure of the proposed method was higher than that of the conventional methods, and the evaluation results showed that the proposed method achieves state-of-the-art performance with a simpler process than the conventional methods.
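To make the three-step pipeline in the abstract concrete, the following Python sketch assumes that steps i) and ii) each produce a per-frame boolean activity sequence, and that step iii) combines them with a logical AND plus a minimum-length threshold. The function name, the combination rule, and the threshold are illustrative assumptions, not the authors' published algorithm.

```python
# Minimal sketch of the three-step pipeline described in the abstract.
# Assumption: steps i) and ii) yield per-frame booleans; the AND rule and
# the minimum-length filter below stand in for the paper's actual step iii).
from typing import List, Tuple


def classify_speech_sections(
    lip_active: List[bool],       # step i): per-frame lip-movement result
    voice_active: List[bool],     # step ii): per-frame voice result
    min_section_frames: int = 5,  # assumed threshold to drop very short sections
) -> List[Tuple[int, int]]:
    """Step iii): merge the two per-frame results into (start, end) speech sections.

    Returns frame index pairs with the end index exclusive.
    """
    assert len(lip_active) == len(voice_active)
    # Assumed rule: a frame counts as speech when both cues agree.
    speech = [l and v for l, v in zip(lip_active, voice_active)]

    sections: List[Tuple[int, int]] = []
    start = None
    for i, is_speech in enumerate(speech + [False]):  # sentinel closes a trailing section
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            if i - start >= min_section_frames:
                sections.append((start, i))
            start = None
    return sections


if __name__ == "__main__":
    lips = [False, True, True, True, True, True, False, False]
    voice = [False, False, True, True, True, True, True, False]
    print(classify_speech_sections(lips, voice, min_section_frames=3))
    # -> [(2, 6)]
```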
Funder
Japan Society for the Promotion of Science
Publisher
Fuji Technology Press Ltd.
Subject
Artificial Intelligence, Computer Vision and Pattern Recognition, Human-Computer Interaction