Rethinking Multilingual Scene Text Spotting: A Novel Benchmark and a Character-Level Feature Based Approach

Published:2024-09-06 Issue:3 Volume:7 Page:71-81
ISSN:2640-012X
Container-title:American Journal of Computer Science and Technology
language:en
Short-container-title:AJCST

Author:

Ma Siliang¹, Xu Yong²

Affiliation:

1. School of Computer Science and Engineering, South China University of Technology, Guangzhou, China

2. School of Computer Science and Engineering, South China University of Technology, Guangzhou, China; Pengcheng Laboratory, Shenzhen, China

Abstract

End-to-end multilingual scene text spotting aims to integrate scene text detection and recognition into a unified framework. In practice, the accuracy of text recognition largely depends on the accuracy of text detection. Because benchmarks with adequate, high-quality character-level annotations for multilingual scene text spotting are lacking, most existing methods train on benchmarks with only word-level annotations. However, multilingual scene text spotting trained on the existing benchmarks performs unsatisfactorily, especially on images with special layouts or out-of-vocabulary words. In this paper, we propose CMSTR, a simple YOLO-like baseline for simultaneous and efficient character-level multilingual scene text spotting. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing through a single decoder, the point queries encode the requisite text semantics and locations, and can therefore be decoded into the center line, boundary, script, and confidence of the text by very simple prediction heads in parallel. Furthermore, we show the surprisingly good extensibility of our method in terms of character class, language type, and task. On the one hand, DeepSolo not only performs well in English scenes but also handles Chinese transcription, with its complex font structures and character classes numbering in the thousands. On the other hand, building on the extensibility of DeepSolo, we introduce DeepSolo++ for multilingual text spotting, taking a further step toward letting a Transformer decoder with explicit points alone perform multilingual text detection, recognition, and script identification all at once.

Publisher

Science Publishing Group

Link

https://article.sciencepublishinggroup.com/pdf/j.ajcst.20240703.12

References: 35 articles.

1. Baek Y, Shin S, Baek J, et al. Character region attention for text spotting [C]//Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX 16. Springer International Publishing, 2020: 504-521. https://doi.org/10.1007/978-3-030-58526-6_30

2. Bochkovskiy A. YOLOv4: Optimal speed and accuracy of object detection [J]. arXiv preprint arXiv:2004.10934, 2020.

3. Bušta M, Patel Y, Matas J. E2E-MLT: an unconstrained end-to-end method for multi-language scene text [C]//Computer Vision–ACCV 2018 Workshops: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers 14. Springer International Publishing, 2019: 127-143. https://doi.org/10.1007/978-3-030-21074-8_11

4. Ch’ng C K, Chan C S, Liu C L. Total-text: toward orientation robustness in scene text detection [J]. International Journal on Document Analysis and Recognition (IJDAR), 2020, 23(1): 31-52. https://doi.org/10.1007/s10032-019-00334-z

5. Yao C, Bai X, Liu W, et al. Detecting texts of arbitrary orientations in natural images [C]//2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012: 1083-1090. https://doi.org/10.1109/CVPR.2012.6247787