Black-box Attack against Self-supervised Video Object Segmentation Models with Contrastive Loss

Published:2023-10-18 Issue:2 Volume:20 Page:1-21
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Chen Ying¹, Yao Rui¹, Zhou Yong¹, Zhao Jiaqi¹, Liu Bing¹, Saddik Abdulmotaleb El²

Affiliation:

1. School of Computer Sciences and Technology, China University of Mining and Technology, and Engineering Research Center of Mine Digitization of Ministry of Education of the People's Republic of China, China

2. School of Electrical Engineering and Computer Science, Multimedia Communications Research Laboratory, University of Ottawa, Canada

Abstract

Deep learning models have been proven to be susceptible to malicious adversarial attacks, which manipulate input images to deceive the model into making erroneous decisions. Consequently, the threat posed to these models serves as a poignant reminder of the necessity to focus on the model security of object segmentation algorithms based on deep learning. However, the current landscape of research on adversarial attacks primarily centers around static images, resulting in a dearth of studies on adversarial attacks targeting Video Object Segmentation (VOS) models. Given that a majority of self-supervised VOS models rely on affinity matrices to learn feature representations of video sequences and achieve robust pixel correspondence, our investigation has delved into the impact of adversarial attacks on self-supervised VOS models. In response, we propose an innovative black-box attack method incorporating contrastive loss. This method induces segmentation errors in the model through perturbations in the feature space and the application of a pixel-level loss function. Diverging from conventional gradient-based attack techniques, we adopt an iterative black-box attack strategy that incorporates contrastive loss across the current frame, any two consecutive frames, and multiple frames. Through extensive experimentation conducted on the DAVIS 2016 and DAVIS 2017 datasets using three self-supervised VOS models and one unsupervised VOS model, we unequivocally demonstrate the potent attack efficiency of the black-box approach. Remarkably, the J&F metric value experiences a significant decline of up to 50.08% post-attack.

Funder

National Natural Science Foundation of China

Xuzhou Key Research and Development Program

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3617502

Reference: 63 articles.

1. Learning to See by Moving

2. S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. 2017. One-shot video object segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). 5320–5329.

3. Shixing Chen, Xiaohan Nie, David Fan, Dongqing Zhang, Vimal Bhat, and Raffay Hamid. 2021. Shot contrastive self-supervised learning for scene boundary detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9796–9805.

4. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning. PMLR, 1597–1607.

5. Zedu Chen, Bineng Zhong, Guorong Li, Shengping Zhang, Rongrong Ji, Zhenjun Tang, and Xianxian Li. 2022. SiamBAN: Target-aware tracking with siamese box adaptive network. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 4 (2023), 5158–5173.