ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition
Published: 2023-12-13
Issue: 12
Volume: 9
Page: 276
ISSN: 2313-433X
Container-title: Journal of Imaging
Language: en
Short-container-title: J. Imaging
Author:
Buoy Rina 1, Iwamura Masakazu 1, Srun Sovila 2, Kise Koichi 1
Affiliation:
1. Department of Core Informatics, Graduate School of Informatics, Osaka Metropolitan University, Osaka 599-8531, Japan
2. Department of Information Technology Engineering, Faculty of Engineering, Royal University of Phnom Penh, Phnom Penh 12156, Cambodia
Abstract
Attention-based encoder–decoder scene text recognition (STR) architectures have proven effective in recognizing text in the real world, thanks to their ability to learn an internal language model. Nevertheless, the cross-attention operation that is used to align visual and linguistic features during decoding is computationally expensive, especially in low-resource environments. To address this bottleneck, we propose a cross-attention-free STR framework that still learns a language model. The framework we propose is ViTSTR-Transducer, which draws inspiration from ViTSTR, a vision transformer (ViT)-based method designed for STR, and from the recurrent neural network transducer (RNN-T), which was initially introduced for speech recognition. The experimental results show that our ViTSTR-Transducer models outperform the baseline attention-based models in terms of the required decoding floating point operations (FLOPs) and latency while achieving a comparable level of recognition accuracy. Compared with the baseline context-free ViTSTR models, our proposed models achieve superior recognition accuracy. Furthermore, compared with the recent state-of-the-art (SOTA) methods, our proposed models deliver competitive results.
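To make the abstract's central idea concrete, the sketch below shows an RNN-T-style joint network, the mechanism the paper names as its inspiration, fusing visual and linguistic features without cross-attention. This is a minimal illustrative sketch in PyTorch, not the authors' implementation; the class name, dimensions, and additive fusion follow the standard RNN-T formulation and are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TransducerJoiner(nn.Module):
    """Illustrative RNN-T-style joint network: fuses per-frame visual
    features with per-step linguistic features by broadcast addition
    and a small output projection, with no cross-attention."""

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, visual: torch.Tensor, linguistic: torch.Tensor) -> torch.Tensor:
        # visual:     (B, T, D) features from a ViT-style encoder
        # linguistic: (B, U, D) outputs of a prediction (language) network
        # Broadcast-add into a (B, T, U, D) lattice, then score the vocabulary.
        joint = visual.unsqueeze(2) + linguistic.unsqueeze(1)
        return self.proj(torch.tanh(joint))  # (B, T, U, vocab_size)
```

In greedy decoding, only a single (t, u) cell of this lattice is evaluated per step, so each step avoids attending over all encoder positions, which is the cross-attention cost the abstract identifies as the decoding bottleneck.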
Funder
JSPS Kakenhi; RUPP-OMU/HEIP
Subject
Electrical and Electronic Engineering; Computer Graphics and Computer-Aided Design; Computer Vision and Pattern Recognition; Radiology, Nuclear Medicine and Imaging