Hierarchical Domain-Adapted Feature Learning for Video Saliency Prediction


Bellitto G., Proietto Salanitri F., Palazzo S., Rundo F., Giordano D., Spampinato C.


Abstract: In this work, we propose a 3D fully convolutional architecture for video saliency prediction that employs hierarchical supervision on intermediate maps (referred to as conspicuity maps) generated using features extracted at different abstraction levels. We extend the base hierarchical learning mechanism with two techniques, one for domain adaptation and one for domain-specific learning. For domain adaptation, we encourage the model to learn general hierarchical features in an unsupervised manner, using gradient reversal at multiple scales, to enhance generalization on datasets for which no annotations are provided during training. For domain specialization, we employ domain-specific operations (namely, priors, smoothing and batch normalization) that specialize the learned features on individual datasets in order to maximize performance. The results of our experiments show that the proposed model yields state-of-the-art accuracy on supervised saliency prediction. When the base hierarchical model is equipped with domain-specific modules, performance improves: the model outperforms state-of-the-art models on three out of five metrics on the DHF1K benchmark and reaches the second-best results on the other two. When we instead test it in an unsupervised domain adaptation setting, by enabling hierarchical gradient reversal layers, we obtain performance comparable to that of supervised state-of-the-art models. Source code, trained models and example outputs are publicly available at https://github.com/perceivelab/hd2s.
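The gradient reversal idea mentioned in the abstract can be illustrated with a minimal sketch in plain Python, without any deep learning framework. It only shows the general principle: the layer is the identity in the forward pass and flips the sign of the gradient (scaled by a strength factor) in the backward pass, applied here independently at several feature scales. The names `GradientReversal`, `lam` and `reversed_gradients_per_scale` are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam in the backward pass."""

    lam: float = 1.0  # hypothetical reversal strength, not a value from the paper

    def forward(self, features: List[float]) -> List[float]:
        # Features pass through unchanged.
        return list(features)

    def backward(self, grad_output: List[float]) -> List[float]:
        # The gradient coming from a domain classifier is negated, so the
        # feature extractor is pushed toward domain-invariant features.
        return [-self.lam * g for g in grad_output]


def reversed_gradients_per_scale(
    grads_by_scale: List[List[float]], lam: float = 1.0
) -> List[List[float]]:
    """Apply gradient reversal independently to the gradients of each scale."""
    layer = GradientReversal(lam)
    return [layer.backward(grads) for grads in grads_by_scale]
```

In a real framework the same behavior would typically be implemented as a custom autograd operation, with one such layer placed before a domain classifier at each abstraction level.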


Università degli Studi di Catania


Springer Science and Business Media LLC


Artificial Intelligence, Computer Vision and Pattern Recognition, Software

Cited by 25 articles.

1. Joint Learning of Audio–Visual Saliency Prediction and Sound Source Localization on Multi-face Videos. International Journal of Computer Vision, 2023-12-29.

2. Transformer-Based Multi-Scale Feature Integration Network for Video Saliency Prediction. IEEE Transactions on Circuits and Systems for Video Technology, 2023-12.

3. NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic Videos. Proceedings of the 31st ACM International Conference on Multimedia, 2023-10-26.

4. In the Eye of Transformer: Global–Local Correlation for Egocentric Gaze Estimation and Beyond. International Journal of Computer Vision, 2023-10-18.

5. Spatio-Temporal Feature Pyramid Interactive Attention Network for Egocentric Gaze Prediction. IEEE Transactions on Circuits and Systems for Video Technology, 2023-10.







