Weakly Supervised Video Object Segmentation via Dual-attention Cross-branch Fusion

Published:2022-03-03 Issue:3 Volume:13 Page:1-20
ISSN:2157-6904
Container-title:ACM Transactions on Intelligent Systems and Technology
language:en
Short-container-title:ACM Trans. Intell. Syst. Technol.

Author:

Wei Lili¹, Lang Congyan¹, Liang Liqian¹, Feng Songhe¹, Wang Tao¹, Chen Shidi¹

Affiliation:

1. Beijing Key Laboratory of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China

Abstract

Recently, given the difficulty of collecting large-scale, explicitly annotated videos, weakly supervised video object segmentation (WSVOS) using video tags has attracted much attention. Existing WSVOS approaches follow a general two-phase pipeline: a pseudo-mask generation phase and a refinement phase. To explore the intrinsic properties and correlations buried in the video frames, most of them focus on the latter phase, introducing optical flow as temporal information to provide more supervision. However, these optical flow-based methods are strongly affected by illumination and distortion, and they do not consider the discriminative capacity of multi-level deep features. In this article, with the goal of capturing more effective temporal information and investigating a corresponding temporal information fusion strategy, we propose a unified WSVOS model that adopts a two-branch architecture with a multi-level cross-branch fusion strategy, named the dual-attention cross-branch fusion network (DACF-Net). Concretely, the two branches of DACF-Net, i.e., a temporal prediction subnetwork (TPN) and a spatial segmentation subnetwork (SSN), are used for extracting temporal information and generating predicted segmentation masks, respectively. To perform the cross-branch fusion between TPN and SSN, we propose a dual-attention fusion module that can be plugged into the SSN flexibly. We also propose a cross-frame coherence loss (CFCL) to achieve smooth segmentation results by exploiting the coherence of masks produced by TPN and SSN. Extensive experiments demonstrate the effectiveness of the proposed approach compared with state-of-the-art methods on two challenging datasets, i.e., DAVIS-2016 and YouTube-Objects.
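
The abstract does not give the exact formulation of the dual-attention fusion module or of CFCL, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes that temporal-branch features gate spatial-branch features through one channel gate and one spatial gate, and it uses a plain mean squared difference between the two branches' masks as a stand-in for the coherence loss. The function names `dual_attention_fuse` and `coherence_loss`, the sigmoid gating, and the residual connection are all hypothetical choices made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def dual_attention_fuse(spatial_feat, temporal_feat):
    """Fuse temporal features into spatial features, both of shape (C, H, W).

    Hypothetical design (not taken from the paper): both attention gates are
    computed from the temporal branch and applied to the spatial branch.
    """
    # Channel attention: one weight per channel from global average pooling.
    channel_gate = sigmoid(temporal_feat.mean(axis=(1, 2)))[:, None, None]
    # Spatial attention: one weight per location from the mean over channels.
    spatial_gate = sigmoid(temporal_feat.mean(axis=0))[None, :, :]
    # Residual connection keeps the original spatial features intact.
    return spatial_feat + spatial_feat * channel_gate * spatial_gate


def coherence_loss(tpn_mask, ssn_mask):
    """Assumed stand-in for CFCL: mean squared difference between two masks."""
    return float(np.mean((tpn_mask - ssn_mask) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spatial = rng.standard_normal((8, 16, 16))   # toy SSN feature map
    temporal = rng.standard_normal((8, 16, 16))  # toy TPN feature map
    fused = dual_attention_fuse(spatial, temporal)
    print(fused.shape)
    print(coherence_loss(sigmoid(spatial[0]), sigmoid(temporal[0])))
```

In a multi-level setting such as the one the abstract describes, a module of this kind would be applied at several feature resolutions of the SSN; the sketch shows a single level only.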

Funder

Beijing Natural Science Foundation

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Subject

Artificial Intelligence, Theoretical Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3506716
