
Audio–Visual Fusion Based on Interactive Attention for Person Verification

Published:2023-12-15 Issue:24 Volume:23 Page:9845
ISSN:1424-8220
Container-title:Sensors
language:en
Short-container-title:Sensors

Author:

Jing Xuebin¹², He Liang¹²³, Song Zhida¹², Wang Shaolei¹²

Affiliation:

1. School of Computer Science and Technology, Xinjiang University, Urumqi 830017, China

2. Xinjiang Key Laboratory of Signal Detection and Processing, Urumqi 830017, China

3. Department of Electronic Engineering, and Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China

Abstract

With the rapid development of multimedia technology, person verification systems have become increasingly important in security and identity verification. However, unimodal verification systems hit performance bottlenecks in complex scenarios, which motivates multimodal feature fusion. The main problem in audio–visual feature fusion is how to integrate information from the two modalities effectively so that the system verifies identity more accurately and robustly. In this paper, we study how to combine audio and visual features to improve multimodal person verification. We use pretrained models to extract an embedding from each modality and then run fusion experiments on these embeddings. The baseline passes the fused feature through a fully connected (FC) layer. Building on this baseline, we propose three fusion models based on attention mechanisms: attention, gated, and inter–attention. The fusion models are trained on the VoxCeleb1 development set and tested on the evaluation sets of the VoxCeleb1, NIST SRE19, and CNC-AV datasets. On VoxCeleb1, the best system achieved an equal error rate (EER) of 0.23% and a minimum detection cost function (minDCF) of 0.011. On the NIST SRE19 evaluation set, the EER was 2.60% and the minDCF was 0.283. On the CNC-AV evaluation set, the EER was 11.30% and the minDCF was 0.443. These results show that the proposed fusion methods can significantly improve the performance of multimodal person verification systems.
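The abstract reports its results as equal error rates. For readers unfamiliar with the metric, the following is a minimal sketch of how an EER is commonly estimated from verification trial scores. It assumes that a higher score means "more likely the same person". The function name `equal_error_rate` and the toy scores are illustrative only and are not taken from the paper, whose exact scoring pipeline is not described on this page.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: scores of genuine (same-person) trials.
    nontarget_scores: scores of impostor (different-person) trials.
    Assumption: higher score = stronger evidence for "same person".
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: genuine trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: impostor trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # The EER is read off where the two error rates are closest.
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy example (made-up scores): one impostor scores above one genuine trial.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
print(f"EER = {equal_error_rate(genuine, impostor):.2%}")
```

With real evaluation sets the sweep is usually done on a sorted score array and interpolated between neighbouring thresholds. The discrete version above only illustrates the definition.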

Funder

National Key R&D Program of China

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

Link

https://www.mdpi.com/1424-8220/23/24/9845/pdf

References: 51 articles.

1. Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. (2018, April 15–20). X-Vectors: Robust DNN Embeddings for Speaker Recognition. Proceedings of the ICASSP 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada.

2. Phoneme recognition using time-delay neural networks; Waibel; Readings Speech Recognit., 1990

3. Desplanques, B., Thienpondt, J., and Demuynck, K. (2020, October 25–29). ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. Proceedings of the Interspeech 2020, Shanghai, China.

4. Deng, J., Guo, J., Xue, N., and Zafeiriou, S. (2019, June 15–20). ArcFace: Additive Angular Margin Loss for Deep Face Recognition. Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.

5. Zhang, C., and Koishida, K. (2017, August 20–24). End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances. Proceedings of the Interspeech 2017, Stockholm, Sweden.

Cited by 1 article.

1. One Model to Rule Them all: A Universal Transformer for Biometric Matching; IEEE Access; 2024