Combining Swin Transformer and Attention-Weighted Fusion for Scene Text Detection-Reference-Cited by-同舟云学术

Combining Swin Transformer and Attention-Weighted Fusion for Scene Text Detection

Published:2024-02-17 Issue:2 Volume:56 Page:
ISSN:1573-773X
Container-title:Neural Processing Letters
language:en
Short-container-title:Neural Process Lett

Author:

Li Xianguo,Yao Xingchen,Liu Yi

Abstract

AbstractThe existing text detection algorithms based on Convolutional Neural Networks (CNN) commonly have the problems of insufficient receptive fields and inadequate extraction of spatial positional information, which limit their ability to detect large-scale variation text instances, long-distance and wide-spaced text instances as well as effectively distinguish complex background textures. To address the above problems, in this paper, a scene text detection algorithm combining Swin Transformer and attention-weighted fusion is proposed. Firstly, an attention-weighted fusion (AWF) module is proposed, which embeds a modified coordinate attention module (CAM) in the feature pyramid network (FPN). This module learns spatial positional weights of foreground information in different-scale features while suppressing redundant background information. As a result, the fused features are more focused on the text regions, enhancing the localization ability for text regions and boundaries. Secondly, the window-based self-attention mechanism of the Swin Transformer is utilized to achieve global feature perception on the fused features of the pyramid network. This compensates for the insufficient receptive fields of CNN and enhances the representation capability of global contextual features, thereby further improving the performance of text detection. Experimental results demonstrate that the proposed algorithm achieves competitive performance on three public datasets, namely ICDAR2015, MSRA-TD500, and Total-Text, with F-measure reaching 87.9%, 91.4%, and 86.7%, respectively. Code is available at: https://github.com/xgli411/ST-AWFNet.

Funder

Tianjin "Project+Team" Key Training Special Project

Science and Technology Support of Tianjin Key Research and the Development Plan Project

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s11063-024-11501-7.pdf

Reference48 articles.

1. Tian Z, Huang W, He T, He P, Qiao Y (2016) Detecting text in natural image with connectionist text proposal network. Computer vision—ECCV 2016. Springer International Publishing, Cham, pp 56–72

2. Ren S, He K, Girshick R, Sun J (2017) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149

3. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) Ssd: single shot multibox detector. Computer vision—ECCV 2016. Springer International Publishing, Cham, pp 21–37

4. Liao M, Wan Z, Yao C, Chen K, Bai X (2020) Real-time scene text detection with differentiable binarization. In: Proceedings of the AAAI conference on artificial intelligence, pp 11474–11481

5. Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141