TRACL: Temporal reconstruction and adaptive consistency loss for semi‐supervised video semantic segmentation

Published:2023-10-11 Issue:2 Volume:18 Page:348-361
ISSN:1751-9659
Container-title:IET Image Processing
language:en

Author:

Liang Zhixue¹², Dong Wenyong¹³, Zhang Bo¹

Affiliation:

1. School of Computer Science, Wuhan University, Wuhan, China

2. School of Computer and Software, Nanyang Institute of Technology, Nanyang, China

3. School of Information Network Security, Xinjiang University of Political Science and Law, Tumushuke, China

Abstract

While existing supervised semantic segmentation methods have shown significant performance improvements, they rely heavily on large‐scale pixel‐level annotated data. To reduce this dependence, recent research has proposed semi‐supervised learning‐based methods that have achieved great success. However, almost all of these works focus on image semantic segmentation, while semi‐supervised video semantic segmentation (SVSS) has barely been explored. Because video data differ significantly from images, simply adapting semi‐supervised image semantic segmentation approaches to SVSS may neglect the inherent temporal correlations between video frames. This paper presents a novel method (named TRACL) with temporal reconstruction (TR) and adaptive consistency loss (ACL) for SVSS, aiming to fully utilize the temporal relations between the internal frames of a video clip. The authors' TR method performs reconstruction at both the feature level and the output level to narrow the distribution gap between internal video frames. Specifically, considering the underlying data distribution, the authors construct a Gaussian model for each category and use its probability density function to obtain the similarity between different feature maps for temporal feature reconstruction. The authors' ACL adaptively selects between two pixel‐wise consistency losses, Flow Consistency Loss and Reconstruction Consistency Loss, providing stronger supervision signals for unlabelled frames during model training. Additionally, the authors extend their method to unlabelled videos, gaining more training data, by employing a mean‐teacher structure. Extensive experiments on three datasets (Cityscapes, CamVid and VSPW) demonstrate that the authors' proposed method outperforms previous state‐of‐the‐art methods.
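The mean‐teacher structure mentioned in the abstract is commonly implemented by keeping the teacher's weights as an exponential moving average (EMA) of the student's weights. The sketch below illustrates only that generic update; it is not taken from the paper. The function name `ema_update`, the `decay` value, and the use of plain floats in place of weight tensors are all assumptions made for illustration.

```python
def ema_update(teacher, student, decay=0.99):
    """Generic mean-teacher EMA step (illustrative, not the authors' code).

    teacher, student: dicts mapping a parameter name to a float weight
    (floats stand in for tensors here). `decay` is a hypothetical value.
    Returns new teacher weights: decay * teacher + (1 - decay) * student.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy weights: the teacher moves a small step towards the student.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.9)
```

In this kind of setup the student is trained by gradient descent while the teacher is updated only through the EMA step, so the teacher's predictions on unlabelled data change more smoothly than the student's.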

Publisher

Institution of Engineering and Technology (IET)

Subject

Electrical and Electronic Engineering,Computer Vision and Pattern Recognition,Signal Processing,Software

Reference: 65 articles.

1. Long J., Shelhamer E., Darrell T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)

2. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

3. Zhao H., Shi J., Qi X., Wang X., Jia J.: Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890 (2017)

4. Wu S., Wu T., Lin F., Tian S., Guo G.: Fully transformer networks for semantic image segmentation. arXiv preprint arXiv:2106.04108 (2021)

5. Xie E. et al.: SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. (2021)