AS-Net: active speaker detection using deep audio-visual attention-Reference-Cited by-同舟云学术

AS-Net: active speaker detection using deep audio-visual attention

Published:2024-02-05 Issue:28 Volume:83 Page:72027-72042
ISSN:1573-7721
Container-title:Multimedia Tools and Applications
language:en
Short-container-title:Multimed Tools Appl

Author:

Radman Abduljalil^ORCID,Laaksonen Jorma

Abstract

AbstractActive Speaker Detection (ASD) aims at identifying the active speaker among multiple speakers in a video scene. Previous ASD models often seek audio and visual features from long video clips with a complex 3D Convolutional Neural Network (CNN) architecture. However, models based on 3D CNNs can generate discriminative spatial-temporal features, but this comes at the expense of computational complexity, and they frequently face challenges in detecting active speakers in short video clips. This work proposes the Active Speaker Network (AS-Net) model, a simple yet effective ASD method tailored for detecting active speakers in relatively short video clips without relying on 3D CNNs. Instead, it incorporates the Temporal Shift Module (TSM) into 2D CNNs, facilitating the extraction of dense temporal visual features without the need for additional computations. Moreover, self-attention and cross-attention schemes are introduced to enhance long-term temporal audio-visual synchronization, thereby improving ASD performance. Experimental results demonstrate that AS-Net outperforms state-of-the-art 2D CNN-based methods on the AVA-ActiveSpeaker dataset and remains competitive with the methods utilizing more complex architectures.

Funder

Aalto University

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s11042-024-18457-9.pdf

Reference47 articles.

1. Chung S-W, Kang HG, Chung JS (2020) Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision. In: INTERSPEECH. pp 3486–3490

2. Qian X, Brutti A, Lanz O, Omologo M, Cavallaro A (2021) Audio-visual tracking of concurrent speakers. IEEE Trans Multimed 24:942–954

3. Pibre L, Madrigal F, Equoy C, Lerasle F, Pellegrini T, Pinquier J, Ferrané I (2023) Audio-video fusion strategies for active speaker detection in meetings. Multimed Tools Appl 82(9):13667–13688

4. Wang Q, Downey C, Wan L, Mansfield PA, Moreno IL (2018) Speaker diarization with LSTM. In: ICASSP. pp 5239–5243

5. Cabañas-Molero P, Lucena M, Fuertes JM, Vera-Candeas P, Ruiz-Reyes N (2018) Multimodal speaker diarization for meetings using volume-evaluated SRP-PHAT and video analysis. Multimed Tools Appl 77:27685–27707