Affiliation:
1. Univ Rennes, INSA Rennes, CNRS, IETR (UMR 6164), France
2. Trinity College Dublin, Ireland
3. Univ Rennes, INSA Rennes, CNRS, IETR (UMR 6164), France
Abstract
Object-level audiovisual saliency detection in 360° panoramic real-life dynamic scenes is important for exploring and modeling human perception in immersive environments, and for aiding the development of virtual, augmented, and mixed reality applications in fields such as education, social networking, entertainment, and training. To this end, we propose a new task, panoramic audiovisual salient object detection (PAV-SOD), which aims to segment the objects that attract most human attention in 360° panoramic videos reflecting real-life daily scenes. To support the task, we collect PAVS10K, the first panoramic video dataset for audiovisual salient object detection. It consists of 67 4K-resolution equirectangular videos with per-video labels, including hierarchical scene categories and associated attributes depicting specific challenges for conducting PAV-SOD, and 10,465 uniformly sampled video frames with manually annotated object-level and instance-level pixel-wise masks. These coarse-to-fine annotations enable multi-perspective analysis of PAV-SOD modeling. We further systematically benchmark 13 state-of-the-art salient object detection (SOD) and video object segmentation (VOS) methods on PAVS10K. In addition, we propose a new baseline network that exploits both the visual and audio cues of 360° video frames via a new conditional variational auto-encoder (CVAE). Our CVAE-based audiovisual network, CAV-Net, consists of a spatial-temporal visual segmentation network, a convolutional audio-encoding network, and audiovisual distribution estimation modules. As a result, CAV-Net outperforms all competing models and is able to estimate the aleatoric uncertainties within PAVS10K. Through extensive experiments, we report several findings about the challenges of PAV-SOD and insights towards PAV-SOD model interpretability. We hope that our work can serve as a starting point for advancing SOD towards immersive media.
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications, Hardware and Architecture
References: 145 articles.
Cited by
2 articles.