Affiliation:
1. Macau University of Science and Technology
2. Guangdong Bohua Ultra HD Innovation Center Co., Ltd.
Abstract
Audio-visual event localization (AVEL) uses the audio and visual features of a video to perceive both the correlation between the audio and visual streams and the category of the event. Previous methods have mostly focused on aligning the two modalities along the temporal dimension, ignoring both the high-order feature representation obtained after audio-visual fusion and the role of cross-attention. To address this issue, we propose a bimodal feature cross-concatenation fusion network (BiCCF Net) that aligns visual and audio features in latent space using a spatiotemporal correlation (STC) module. An audio-visual cross-attention (AVCA) module extracts cross-attention, while a Factorized Bilinear Coding (FBC) based audio-visual fusion (AVF) module produces the fused high-order feature representation. Finally, the fused features are combined with the cross-attention and processed by a background-suppression classification module to predict the event category and the correlation between the audio and visual features. Experiments on the AVE dataset show a significant improvement over baseline models.
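To make the pipeline described above concrete, the following is a minimal sketch of the module composition (align, cross-attend, fuse, classify) in PyTorch. All internals here are illustrative placeholders under stated assumptions, not the authors' implementation: the STC stand-in is a pair of linear projections, the AVCA stand-in is standard multi-head cross-attention, and the AVF stand-in approximates the high-order FBC interaction with a bilinear layer. Feature dimension, class count, and head names are hypothetical.

```python
# Minimal sketch of the abstract's pipeline, assuming segment-level
# (batch, time, dim) audio and visual features. The STC, AVCA, AVF,
# and classifier bodies are illustrative stand-ins only.
import torch
import torch.nn as nn


class BiCCFNetSketch(nn.Module):
    """Illustrative composition: align -> cross-attend -> fuse -> classify."""

    def __init__(self, dim=256, num_classes=28):
        super().__init__()
        # STC stand-in: project both modalities into a shared latent space.
        self.stc_audio = nn.Linear(dim, dim)
        self.stc_visual = nn.Linear(dim, dim)
        # AVCA stand-in: multi-head cross-attention between modalities.
        self.avca = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # AVF stand-in: a bilinear layer approximating the high-order
        # (FBC-style) interaction between the two modalities.
        self.avf = nn.Bilinear(dim, dim, dim)
        # Two heads: event category and audio-visual relevance.
        self.event_head = nn.Linear(dim, num_classes)
        self.relevance_head = nn.Linear(dim, 1)

    def forward(self, audio, visual):
        a = self.stc_audio(audio)    # (B, T, dim) aligned audio
        v = self.stc_visual(visual)  # (B, T, dim) aligned visual
        # Visual queries attend over audio keys/values.
        cross, _ = self.avca(v, a, a)
        fused = self.avf(a, v)       # high-order fused features
        combined = fused + cross     # combine fusion with cross-attention
        return self.event_head(combined), self.relevance_head(combined)


if __name__ == "__main__":
    net = BiCCFNetSketch()
    a = torch.randn(2, 10, 256)  # e.g. 10 one-second segments
    v = torch.randn(2, 10, 256)
    logits, relevance = net(a, v)
    print(logits.shape, relevance.shape)  # (2, 10, 28) (2, 10, 1)
```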
Publisher
Research Square Platform LLC