Affiliation:
1. Shanghai University, Shanghai, China
2. JD AI Research, Beijing, China
Abstract
Visual and spatial relationship detection in images is a fast-developing research topic in the multimedia field: it learns to recognize the semantic and spatial interactions between objects in an image, aiming to compose a structured semantic understanding of the scene. Most existing techniques directly combine the holistic image feature with the semantic and spatial features of the two given objects to predict the relationship, but leave the inherent supervision derived from such structured and thorough image understanding under-exploited. Specifically, the inherent supervision among objects and relations within an image spans multiple granularities, from simple to comprehensive: (1) object-based supervision, which captures the interaction between the semantic and spatial features of each individual object; (2) inter-object supervision, which characterizes the dependency within a relationship triplet (<subject-predicate-object>); and (3) inter-relation supervision, which exploits contextual information among all relationship triplets in an image. These inherent multi-granular supervisions offer a fertile ground for building self-supervised proxy tasks. In this article, we explore this multi-granular supervision as a trilogy, proceeding from the object-based, to the inter-object, to the inter-relation perspective. We integrate the standard relationship detection objective with a series of proposed self-supervised proxy tasks, and name the resulting framework Multi-Granular Self-Supervised learning (MGS). MGS is appealing in that it can be plugged into any neural relationship detection model by simply including the proxy tasks during training, without increasing the computational cost at inference. Extensive experiments on the SpatialSense and VRD datasets demonstrate the superiority of MGS for both spatial and visual relationship detection.
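To make the multi-task training described above more concrete, the sketch below illustrates one possible way to attach three self-supervised proxy heads (object-based, inter-object, inter-relation) to a relationship detector and fold them into the training objective. It is a minimal, hypothetical sketch in PyTorch: the head designs, feature dimensions, proxy targets, and loss weights are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of an MGS-style training objective (assumptions, not the paper's code).
import torch
import torch.nn as nn


class MGSHeads(nn.Module):
    """Hypothetical self-supervised proxy heads at three granularities."""

    def __init__(self, feat_dim: int = 512, num_predicates: int = 9):
        super().__init__()
        # (1) Object-based: predict a coarse spatial property (here, 4 classes)
        #     from the fused semantic + spatial feature of a single object.
        self.object_head = nn.Linear(feat_dim, 4)
        # (2) Inter-object: predict the predicate from the concatenated
        #     subject/object features, modelling dependency inside a triplet.
        self.inter_object_head = nn.Linear(2 * feat_dim, num_predicates)
        # (3) Inter-relation: score the consistency of two triplet embeddings
        #     drawn from the same image (contrastive-style context modelling).
        self.inter_relation_head = nn.Bilinear(feat_dim, feat_dim, 1)

    def forward(self, obj_feat, subj_feat, rel_feat_a, rel_feat_b):
        return (
            self.object_head(obj_feat),
            self.inter_object_head(torch.cat([subj_feat, obj_feat], dim=-1)),
            self.inter_relation_head(rel_feat_a, rel_feat_b),
        )


def mgs_loss(det_loss, proxy_losses, weights=(0.1, 0.1, 0.1)):
    """Total objective: detection loss plus weighted self-supervised proxy losses."""
    return det_loss + sum(w * l for w, l in zip(weights, proxy_losses))


if __name__ == "__main__":
    heads = MGSHeads()
    obj, subj, rel_a, rel_b = (torch.randn(8, 512) for _ in range(4))
    obj_logits, pred_logits, pair_score = heads(obj, subj, rel_a, rel_b)

    ce = nn.CrossEntropyLoss()
    proxy = [
        ce(obj_logits, torch.randint(0, 4, (8,))),          # object-based proxy
        ce(pred_logits, torch.randint(0, 9, (8,))),         # inter-object proxy
        nn.functional.binary_cross_entropy_with_logits(      # inter-relation proxy
            pair_score, torch.ones(8, 1)
        ),
    ]
    total = mgs_loss(det_loss=torch.tensor(0.7), proxy_losses=proxy)
    print(total)
```

Because the proxy heads contribute only to the training loss, they can be discarded at inference time, which is consistent with the abstract's claim that MGS adds no computational cost at inference.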
Funder
National Key R&D Program of China
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications, Hardware and Architecture
Cited by
1 article.