Author:
Lee Donghyeon, Kim Joonyoung, Jung Kyomin
Abstract
Fully convolutional structures produce feature maps that capture the local context of an image simply by stacking convolutional layers. These structures are effective in modern state-of-the-art object detectors such as Faster R-CNN and SSD, which find objects from local context. However, the quality of object detectors can be further improved by incorporating global context, particularly when ambiguous objects must be identified from surrounding objects or background. In this paper, we introduce a self-attention module that allows object detectors to incorporate global context. More specifically, our self-attention module enables the feature extractor to compute feature maps with global context via the self-attention mechanism: it computes relationships among all elements in the feature maps and then blends the feature maps according to the computed relationships. The module can therefore capture long-range relationships among objects or background regions, which is difficult for fully convolutional structures. Furthermore, the proposed module is not limited to any specific object detector; it can be applied to any CNN-based model for any computer vision task. In experiments on object detection, our method shows remarkable gains in average precision (AP) over popular models with fully convolutional structures. In particular, compared to Faster R-CNN with a ResNet-50 backbone, our module applied to the same backbone achieved a +4.0 AP gain without bells and whistles. On image semantic segmentation and panoptic segmentation tasks, our module improved performance on all metrics used for each task.
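To illustrate the idea described in the abstract, below is a minimal sketch of a 2D self-attention block for CNN feature maps. It is not the authors' exact module (the paper's precise formulation is not given here); it is a common non-local-style construction in which 1x1 convolutions produce query, key, and value maps, pairwise affinities are computed among all spatial positions, and the values are blended back into the input through a learnable residual scale. The class name `SelfAttention2d` and the `reduction` parameter are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Illustrative 2D self-attention block (non-local-style sketch, not the
    paper's exact module): computes affinities among all spatial positions
    of a feature map and blends the value map accordingly."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        inter = max(channels // reduction, 1)  # reduced embedding dimension
        self.query = nn.Conv2d(channels, inter, kernel_size=1)
        self.key = nn.Conv2d(channels, inter, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (b, h*w, c')
        k = self.key(x).flatten(2)                    # (b, c', h*w)
        v = self.value(x).flatten(2).transpose(1, 2)  # (b, h*w, c)
        attn = F.softmax(q @ k, dim=-1)               # (b, h*w, h*w) all-pairs affinities
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + self.gamma * out                   # blend global context into input

# Usage: insert into a backbone so downstream detection heads see
# feature maps enriched with global context.
feat = torch.randn(2, 256, 32, 32)
module = SelfAttention2d(channels=256)
out = module(feat)  # same shape as input: (2, 256, 32, 32)
```

Because the attention matrix relates every position to every other position, a single such block gives each feature-map element a global receptive field, which is the long-range capability the abstract contrasts with purely convolutional stacks.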
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering
Cited by
6 articles.