Affiliation:
1. Univ Rennes, INSA Rennes, CNRS, IETR (UMR 6164), France
2. Trinity College Dublin, Ireland
3. Univ Rennes, INSA Rennes, CNRS, IETR (UMR 6164), France
Abstract
Object-level audiovisual saliency detection in 360° panoramic real-life dynamic scenes is important for exploring and modeling human perception in immersive environments, and for aiding the development of virtual, augmented, and mixed reality applications in fields such as education, social networking, entertainment, and training. To this end, we propose a new task, panoramic audiovisual salient object detection (PAV-SOD), which aims to segment the objects that attract most human attention in 360° panoramic videos reflecting real-life daily scenes. To support the task, we collect PAVS10K, the first panoramic video dataset for audiovisual salient object detection. It consists of 67 4K-resolution equirectangular videos with per-video labels, including hierarchical scene categories and associated attributes depicting specific challenges for conducting PAV-SOD, and 10,465 uniformly sampled video frames with manually annotated object-level and instance-level pixel-wise masks. These coarse-to-fine annotations enable multi-perspective analysis of PAV-SOD modeling. We further systematically benchmark 13 state-of-the-art salient object detection (SOD) and video object segmentation (VOS) methods on PAVS10K. In addition, we propose a new baseline network that exploits both the visual and audio cues of 360° video frames via a new conditional variational auto-encoder (CVAE). Our CVAE-based audiovisual network, CAV-Net, consists of a spatial-temporal visual segmentation network, a convolutional audio-encoding network, and audiovisual distribution estimation modules. As a result, CAV-Net outperforms all competing models and is able to estimate the aleatoric uncertainties within PAVS10K. Through extensive experiments, we report several findings about the challenges of PAV-SOD and insights towards PAV-SOD model interpretability. We hope that our work can serve as a starting point for advancing SOD towards immersive media.
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications, Hardware and Architecture
References: 145 articles.
Cited by
2 articles.