Fixation Prediction through Multimodal Analysis

Published:2017-01-17 Issue:1 Volume:13 Page:1-23
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Min Xiongkuo¹,Zhai Guangtao¹,Gu Ke¹,Yang Xiaokang¹

Affiliation:

1. Shanghai Jiao Tong University, Shanghai, China

Abstract

In this article, we propose to predict human eye fixation by incorporating both audio and visual cues. Traditional visual attention models generally make the most of a stimulus's visual features, yet they ignore all audio information. In the real world, however, we direct our gaze not only according to visual saliency but also toward salient audio cues. Psychological experiments show that audio influences visual attention, and that subjects tend to be attracted to sound sources. Therefore, we propose fusing audio and visual information to predict eye fixation. In our proposed framework, we first localize the moving, sound-generating objects through multimodal analysis and generate an audio attention map. Then, we calculate the spatial and temporal attention maps using the visual modality. Finally, the audio, spatial, and temporal attention maps are fused to generate the final audiovisual saliency map. The proposed method is applicable to scenes containing moving, sound-generating objects. We gather a set of video sequences and collect eye-tracking data under an audiovisual test condition. Experimental results show that we achieve better eye fixation prediction performance when taking both audio and visual cues into consideration, especially in typical scenes in which object motion and audio are highly correlated.
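
The final fusion step described in the abstract can be illustrated with a small sketch. The abstract does not state the fusion rule, so everything here is an assumption made for illustration: the maps are min-max normalized and then combined by a weighted average, and the names `normalize`, `fuse_maps`, and `weights` are hypothetical. The article's actual fusion scheme may differ.

```python
from typing import List, Sequence, Tuple

Map = List[List[float]]


def normalize(m: Sequence[Sequence[float]]) -> Map:
    """Min-max scale a 2D map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in m]
    return [[(v - lo) / span for v in row] for row in m]


def fuse_maps(
    audio: Sequence[Sequence[float]],
    spatial: Sequence[Sequence[float]],
    temporal: Sequence[Sequence[float]],
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> Map:
    """Combine three attention maps into one saliency map.

    ASSUMPTION: a weighted average of normalized maps. This is a
    placeholder rule, not the fusion method of the article.
    """
    maps = [normalize(m) for m in (audio, spatial, temporal)]
    rows, cols = len(maps[0]), len(maps[0][0])
    for m in maps:
        if len(m) != rows or any(len(r) != cols for r in m):
            raise ValueError("all attention maps must have the same shape")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        [
            sum(w * m[i][j] for w, m in zip(weights, maps)) / total
            for j in range(cols)
        ]
        for i in range(rows)
    ]
```

With equal weights, a location that is salient in both the audio map and the spatial map scores higher than one that is salient in only one of them, which reflects the abstract's point that correlated motion and sound should attract more attention.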

Funder

National High-Tech R&D Program of China

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/2996463

References: 53 articles.

1. Frequency-tuned salient region detection

2. Measuring the Objectness of Image Windows

3. MoVi

4. Harmony in Motion

5. Ali Borji, Ming-Ming Cheng, Huaizu Jiang, and Jia Li. 2014. Salient object detection: A survey. ArXiv Preprint.

Cited by 72 articles.

1. CoSTA: Co-training spatial–temporal attention for blind video quality assessment;Expert Systems with Applications;2024-12

2. A dense video caption dataset of student classroom behaviors and a baseline model with boundary semantic awareness;Displays;2024-09

3. ADS-VQA: Adaptive sampling model for video quality assessment;Displays;2024-09

4. AFNet: Asymmetric fusion network for monocular panorama depth estimation;Displays;2024-09

5. Multi-stage coarse-to-fine progressive enhancement network for single-image HDR reconstruction;Displays;2024-09