Mitigating Distractor Challenges in Video Object Segmentation through Shape and Motion Cues
Published: 2024-02-28
Issue: 5
Volume: 14
Page: 2002
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences
Author:
Peng Jidong (1), Zhao Yibing (2), Zhang Dingwei (2), Chen Yadang (2)
Affiliation:
1. Nanjing Research Institute of Electronic Engineering, Nanjing 210001, China
2. School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China
Abstract
Semi-supervised video object segmentation (VOS) aims to predict object masks in subsequent video frames given the object mask of the initial frame. Mainstream methods currently leverage historical frame information to enhance network performance, but this approach faces two issues: (1) it often overlooks important shape information, reducing accuracy along object edges; and (2) it often relies on pixel-level motion estimation to guide matching and suppress distractor objects, which incurs heavy computation and struggles under occlusion or fast/blurry motion. For the first problem, this paper introduces an object shape extraction module that exploits both high-level and low-level features to obtain object shape information, which is then used to refine the predicted masks. For the second problem, this paper introduces a novel object-level motion prediction module that stores representative motion features during the training stage and predicts object motion by retrieving them during the inference stage. We evaluate our method against recent state-of-the-art methods on benchmark datasets, and the results demonstrate its effectiveness.
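The abstract does not give implementation details for the shape extraction module, so the following PyTorch sketch shows only one plausible reading: fine, edge-rich low-level features are fused with upsampled, semantic high-level features to predict a shape map that additively refines the mask logits. The channel sizes (low_ch, high_ch, mid_ch) and the additive refinement are illustrative assumptions, not the authors' design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShapeExtractionModule(nn.Module):
    """Minimal sketch of a shape extraction head (assumed design):
    fuse low-level and high-level features into a 1-channel shape map,
    then use it to refine predicted mask logits."""

    def __init__(self, low_ch=256, high_ch=1024, mid_ch=64):
        super().__init__()
        self.low_proj = nn.Conv2d(low_ch, mid_ch, kernel_size=1)
        self.high_proj = nn.Conv2d(high_ch, mid_ch, kernel_size=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * mid_ch, mid_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, 1, kernel_size=1),  # shape/edge logits
        )

    def forward(self, low_feat, high_feat, mask_logits):
        # Upsample the coarse semantic features to the low-level resolution.
        high_up = F.interpolate(self.high_proj(high_feat),
                                size=low_feat.shape[-2:],
                                mode="bilinear", align_corners=False)
        shape_logits = self.fuse(
            torch.cat([self.low_proj(low_feat), high_up], dim=1))
        # Refine the mask with the shape map (simple additive fusion; assumption).
        mask_up = F.interpolate(mask_logits, size=shape_logits.shape[-2:],
                                mode="bilinear", align_corners=False)
        return mask_up + shape_logits, shape_logits
```

In such a design, the shape branch is typically supervised with edge maps derived from the ground-truth masks, so the fused features learn to sharpen object boundaries that the coarse mask decoder blurs.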
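The motion prediction module is described only at the level of "store representative motion features during training, retrieve them at inference." One way to realize that idea is a fixed-size key-value memory bank, sketched below: appearance keys index motion values, slots are updated with momentum during training, and inference predicts object motion as a similarity-weighted average over the nearest slots. The slot count, key dimension, momentum update, and 2-D center-offset motion target are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionMemory(nn.Module):
    """Minimal sketch of an object-level motion memory bank (assumed design):
    write (key, motion) pairs during training, read by nearest-neighbor
    retrieval at inference."""

    def __init__(self, num_slots=128, key_dim=256, motion_dim=2):
        super().__init__()
        self.register_buffer("keys", torch.randn(num_slots, key_dim))
        self.register_buffer("values", torch.zeros(num_slots, motion_dim))

    @torch.no_grad()
    def write(self, key, motion, momentum=0.99):
        # Training: pull the most similar slot toward the observed object
        # feature and its measured frame-to-frame motion.
        sims = F.normalize(self.keys, dim=1).mv(F.normalize(key, dim=0))
        idx = sims.argmax()
        self.keys[idx] = momentum * self.keys[idx] + (1 - momentum) * key
        self.values[idx] = momentum * self.values[idx] + (1 - momentum) * motion

    def read(self, key, topk=4):
        # Inference: predict object motion as a softmax-weighted average of
        # the motions stored in the top-k most similar slots.
        sims = F.normalize(self.keys, dim=1).mv(F.normalize(key, dim=0))
        w, idx = sims.topk(topk)
        w = torch.softmax(w, dim=0)
        return (w.unsqueeze(1) * self.values[idx]).sum(dim=0)
```

Because retrieval operates on one compact feature per object rather than a dense per-pixel flow field, this kind of object-level lookup avoids the heavy computation of pixel-level motion estimation and degrades more gracefully under occlusion or motion blur, which matches the motivation stated in the abstract.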
Funder
National Natural Science Foundation of China