A closer look at referring expressions for video object segmentation

Published:2022-07-27 Issue:3 Volume:82 Page:4419-4438
ISSN:1380-7501
Container-title:Multimedia Tools and Applications
language:en
Short-container-title:Multimed Tools Appl

Author:

Bellver Miriam, Ventura Carles, Silberer Carina, Kazakos Ioannis, Torres Jordi, Giro-i-Nieto Xavier

Abstract

The task of Language-guided Video Object Segmentation (LVOS) aims at generating binary masks for an object referred to by a linguistic expression. When this expression unambiguously describes an object in the scene, it is called a referring expression (RE). Our work argues that existing benchmarks used for LVOS are mainly composed of trivial cases, in which referents can be identified with simple phrases. Our analysis relies on a new categorization of the referring expressions in the DAVIS-2017 and Actor-Action datasets into trivial and non-trivial REs, where the non-trivial REs are further annotated with seven RE semantic categories. We leverage these data to analyze the performance of RefVOS, a novel neural network that obtains competitive results for the task of language-guided image segmentation and state-of-the-art results for LVOS. Our study indicates that the major challenges for the task are related to understanding motion and static actions.

Funder

Ministerio de Ciencia, Innovación y Universidades

Ministerio de Economía y Competitividad

Departament d’Universitats, Recerca i Societat de la Informació

Universitat Oberta de Catalunya

Publisher

Springer Science and Business Media LLC

Subject

Computer Networks and Communications, Hardware and Architecture, Media Technology, Software

Link

https://link.springer.com/content/pdf/10.1007/s11042-022-13413-x.pdf

References: 52 articles.

1. Anayurt H, Ozyegin SA, Cetin U, Aktas U, Kalkan S (2019) Searching for ambiguous objects in videos using relational referring expressions. In: Proceedings of the British Machine Vision Conference (BMVC)

2. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision, pp 213–229. Springer

3. Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6299–6308

4. Chen DJ, Jia S, Lo YC, Chen HT, Liu TL (2019) See-through-text grouping for referring image segmentation. In: Proceedings of the IEEE international conference on computer vision, pp 7454–7463

5. Chen LC, Papandreou G, Schroff F, Adam H (2017) Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587

Cited by 8 articles.

1. Improving visual grounding with multi-modal interaction and auto-regressive vertex generation;Neurocomputing;2024-09

2. Temporal Context Enhanced Referring Video Object Segmentation;2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV);2024-01-03

3. Adversarial Attacks on Video Object Segmentation with Hard Region Discovery;IEEE Transactions on Circuits and Systems for Video Technology;2024

4. Self-supervised Meta Auxiliary Learning for Actor and Action Video Segmentation from Natural Language;Lecture Notes in Computer Science;2024

5. Improving Visual Grounding with Multi-Modal Interaction and Auto-Regressive Vertex Generation;2024