Affiliation:
1. Shanghai University, Shanghai, China
2. JD AI Research, Beijing, China
Abstract
Visual and spatial relationship detection in images is a fast-developing research topic in the multimedia field: it learns to recognize the semantic and spatial interactions between objects in an image, aiming to compose a structured semantic understanding of the scene. Most existing techniques directly combine the holistic image feature with the semantic and spatial features of the two given objects to predict the relationship, but leave the inherent supervision derived from such structured and thorough image understanding under-exploited. Specifically, the inherent supervision among objects and relations within an image spans multiple granularities, from simple to comprehensive: (1) object-based supervision, which captures the interaction between the semantic and spatial features of each individual object; (2) inter-object supervision, which characterizes the dependency within a relationship triplet (<subject-predicate-object>); and (3) inter-relation supervision, which exploits contextual information among all relationship triplets in an image. These inherent multi-granular supervisions offer a fertile ground for building self-supervised proxy tasks. In this article, we explore this multi-granular supervision as a trilogy, proceeding from the object-based, to the inter-object, to the inter-relation perspective. We integrate the standard relationship detection objective with a series of proposed self-supervised proxy tasks, and name the resulting framework Multi-Granular Self-Supervised learning (MGS). MGS is appealing in that it can be plugged into any neural relationship detection model by simply including the proxy tasks during training, without increasing the computational cost at inference. Extensive experiments on the SpatialSense and VRD datasets demonstrate the superiority of MGS for both spatial and visual relationship detection.
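To make the multi-task training described above more concrete, the sketch below illustrates one possible way to attach three self-supervised proxy heads (object-based, inter-object, inter-relation) to a relationship detector and fold them into the training objective. It is a minimal, hypothetical sketch in PyTorch: the head designs, feature dimensions, proxy targets, and loss weights are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of an MGS-style training objective (assumptions, not the paper's code).
import torch
import torch.nn as nn


class MGSHeads(nn.Module):
    """Hypothetical self-supervised proxy heads at three granularities."""

    def __init__(self, feat_dim: int = 512, num_predicates: int = 9):
        super().__init__()
        # (1) Object-based: predict a coarse spatial property (here, 4 classes)
        #     from the fused semantic + spatial feature of a single object.
        self.object_head = nn.Linear(feat_dim, 4)
        # (2) Inter-object: predict the predicate from the concatenated
        #     subject/object features, modelling dependency inside a triplet.
        self.inter_object_head = nn.Linear(2 * feat_dim, num_predicates)
        # (3) Inter-relation: score the consistency of two triplet embeddings
        #     drawn from the same image (contrastive-style context modelling).
        self.inter_relation_head = nn.Bilinear(feat_dim, feat_dim, 1)

    def forward(self, obj_feat, subj_feat, rel_feat_a, rel_feat_b):
        return (
            self.object_head(obj_feat),
            self.inter_object_head(torch.cat([subj_feat, obj_feat], dim=-1)),
            self.inter_relation_head(rel_feat_a, rel_feat_b),
        )


def mgs_loss(det_loss, proxy_losses, weights=(0.1, 0.1, 0.1)):
    """Total objective: detection loss plus weighted self-supervised proxy losses."""
    return det_loss + sum(w * l for w, l in zip(weights, proxy_losses))


if __name__ == "__main__":
    heads = MGSHeads()
    obj, subj, rel_a, rel_b = (torch.randn(8, 512) for _ in range(4))
    obj_logits, pred_logits, pair_score = heads(obj, subj, rel_a, rel_b)

    ce = nn.CrossEntropyLoss()
    proxy = [
        ce(obj_logits, torch.randint(0, 4, (8,))),          # object-based proxy
        ce(pred_logits, torch.randint(0, 9, (8,))),         # inter-object proxy
        nn.functional.binary_cross_entropy_with_logits(      # inter-relation proxy
            pair_score, torch.ones(8, 1)
        ),
    ]
    total = mgs_loss(det_loss=torch.tensor(0.7), proxy_losses=proxy)
    print(total)
```

Because the proxy heads contribute only to the training loss, they can be discarded at inference time, which is consistent with the abstract's claim that MGS adds no computational cost at inference.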
Funder
National Key R&D Program of China
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications, Hardware and Architecture
Cited by
1 article.