Speaker Localization Based on Audio-Visual Bimodal Fusion

Published: 2021-05-20 Volume: 25 Issue: 3 Pages: 375-382
ISSN: 1883-8014
Container-title: Journal of Advanced Computational Intelligence and Intelligent Informatics
Language: en
Short-container-title: JACIII

Author:

Zhu Ying-Xin, Jin Hao-Ran

Abstract

Demand for fluent human–computer interaction is increasing globally, so active localization of the speaker by the machine has become a problem worth exploring. Single-modal localization methods have low stability and accuracy, whereas multimodal methods can exploit redundant information to improve accuracy and resistance to interference. A speaker localization method based on the multimodal fusion of voice and image is therefore proposed. First, a voice localization method based on time differences of arrival (TDOA) in a microphone array and a face detection method based on the AdaBoost algorithm are presented. Second, a multimodal fusion method based on the spatiotemporal fusion of speech and image is proposed; it uses a coordinate system converter and a frame rate tracker. The proposed method was tested with the speaker standing at 15 different points, with 50 trials at each point. The experimental results demonstrate high accuracy when the speaker stands in front of the positioning system within a certain range.
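
As a rough illustration of the TDOA idea mentioned in the abstract, the sketch below estimates the delay between two microphone signals by plain time-domain cross-correlation and converts it to a far-field bearing. It is not the authors' implementation: the function names, the two-microphone geometry, the 343 m/s speed of sound, and the use of unweighted cross-correlation are all assumptions made for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value, not from the paper


def estimate_delay_samples(ref, sig, max_lag):
    """Return the lag (in samples) at which sig best matches ref.

    A positive lag means sig is a delayed copy of ref.
    Hypothetical helper using unweighted cross-correlation.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(ref):
            m = n + lag
            if 0 <= m < len(sig):
                score += r * sig[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def bearing_from_tdoa(delay_s, mic_spacing_m):
    """Far-field bearing in degrees from broadside for one microphone pair."""
    ratio = SPEED_OF_SOUND * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp so noise cannot push asin out of range
    return math.degrees(math.asin(ratio))


if __name__ == "__main__":
    sample_rate = 16000  # Hz; assumed
    # Synthetic pulse that reaches the second microphone 3 samples later.
    ref = [0.0] * 10 + [1.0, 2.0, 1.0] + [0.0] * 10
    sig = [0.0] * 13 + [1.0, 2.0, 1.0] + [0.0] * 7
    lag = estimate_delay_samples(ref, sig, max_lag=8)
    angle = bearing_from_tdoa(lag / sample_rate, mic_spacing_m=0.2)
    print(f"lag = {lag} samples, bearing = {angle:.1f} degrees")
```

A single microphone pair only yields a bearing; recovering a position, as a microphone array does, would require combining delays from several pairs.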

Publisher

Fuji Technology Press Ltd.

Subject

Artificial Intelligence, Computer Vision and Pattern Recognition, Human-Computer Interaction

References: 28 articles.

1. G. Deak, K. Curran, and J. Condell, “A survey of active and passive indoor localisation systems,” Computer Communications, Vol.35, pp. 1939-1954, doi: 10.1016/j.comcom.2012.06.004, 2012.

2. H. Lim, I. C. Yoo, Y. Cho, and D. Yook, “Speaker localization in noisy environments using steered response voice power,” IEEE Trans. on Consumer Electronics, Vol.61, No.1, pp. 112-118, doi: 10.1109/TCE.2015.7064118, 2015.

3. A. Sepas-Moghaddam, F. M. Pereira, and P. L. Correia, “Face recognition: a novel multi-level taxonomy based survey,” IET Biometrics, Vol.9, No.2, pp. 58-67, doi: 10.1049/iet-bmt.2019.0001, 2020.

4. J. Qu, H. Shi, N. Qiao, C. Wu, C. Su, and A. Razi, “New three-dimensional positioning algorithm through integrating TDOA and Newton’s method,” EURASIP J. on Wireless Communications and Networking, Article No.77, doi: 10.1186/s13638-020-01684-7, 2020.

5. A. Pourmohammad and S. M. Ahadi, “Real Time High Accuracy 3-D PHAT-Based Sound Source Localization Using a Simple 4-Microphone Arrangement,” IEEE Systems J., Vol.6, No.3, pp. 455-468, doi: 10.1109/JSYST.2011.2176766, 2012.

Cited by 5 articles.

1. Audio-Visual Bimodal Combination-Based Speaker Tracking Method for Mobile Robot;Journal of Advanced Computational Intelligence and Intelligent Informatics;2024-01-20

2. Direction Estimation of Instrumental Sound Sources Using Regression Analysis by Convolutional Neural Network;Circuits, Systems, and Signal Processing;2023-06-28

3. Somatosensory Dance Interaction System Based on AdaBoost Algorithm;Application of Big Data, Blockchain, and Internet of Things for Education Informatization;2023

4. Emotional representation of music in multi-source data by the Internet of Things and deep learning;The Journal of Supercomputing;2022-07-09

5. A Multi-modal Panoramic Speaker Localization Method;2021 China Automation Congress (CAC);2021-10-22