Author:
Liu Xian, Qian Rui, Zhou Hang, Hu Di, Lin Weiyao, Liu Ziwei, Zhou Bolei, Zhou Xiaowei
Abstract
The task of audiovisual sound source localization has been well studied under constrained scenes, where the audio recordings are clean. However, in real-world scenarios, audio is usually contaminated by off-screen sound and background noise. These interfere with identifying the desired sources and building visual-sound connections, making previous studies inapplicable. In this work, we propose the Interference Eraser (IEr) framework, which tackles the problem of audiovisual sound source localization in the wild. The key idea is to eliminate the interference by redefining and carving discriminative audio representations. Specifically, we observe that the previous practice of learning only a single audio representation is insufficient due to the additive nature of audio signals. We thus extend the audio representation with our Audio-Instance-Identifier module, which clearly distinguishes sounding instances when audio signals of different volumes are unevenly mixed. Then we erase the influence of the audible but off-screen sounds and the silent but visible objects with a Cross-modal Referrer module using cross-modality distillation. Quantitative and qualitative evaluations demonstrate that our framework achieves superior results on sound localization tasks, especially under real-world scenarios.
Publisher
Association for the Advancement of Artificial Intelligence (AAAI)
Cited by
11 articles.
1. Cross Pseudo-Labeling for Semi-Supervised Audio-Visual Source Localization;ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP);2024-04-14
2. Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics;Proceedings of the 31st ACM International Conference on Multimedia;2023-10-26
3. An improved TF-GSC for dual-microphone interference suppression in the specific direction;Multimedia Tools and Applications;2023-06-22
4. Egocentric Auditory Attention Localization in Conversations;2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR);2023-06
5. Complementary Cues from Audio Help Combat Noise in Weakly-Supervised Object Detection;2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV);2023-01