Affiliation:
1. University of Science and Technology of China, Hefei, China
2. Ping An Technology Co., Ltd, Beijing, China
Abstract
Scene Text Recognition (STR), a critical step in OCR systems, has attracted much attention in computer vision. Recent research on modeling textual semantics with a Language Model (LM) has witnessed remarkable progress. However, the LM only optimizes the joint probability of the character sequence estimated by the Vision Model (VM) within a single language modality, ignoring visual-semantic relations across modalities. Thus, LM-based methods generalize poorly to challenging conditions, such as text with weak or multiple semantics, arbitrary shapes, and so on. To mitigate the above issue, in this paper we propose the Multimodal Visual-Semantic Representations Learning for Text Recognition Network (MVSTRN) to reason about and combine multimodal visual-semantic information for accurate scene text recognition. Specifically, our MVSTRN builds a bridge between vision and language through its unified architecture and can reason about visual semantics by guiding the network to reconstruct the original image from the latent text representation, bridging the structural gap between vision and language. Finally, a tailored Multimodal Fusion (MMF) module combines the visual and textual semantics from the VM and LM to make the final predictions. Extensive experiments demonstrate that our MVSTRN achieves state-of-the-art performance on several benchmarks.
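The abstract does not specify the form of the MMF module, so the following is only a minimal sketch of one common way to fuse per-character visual and semantic features: a learned gate that blends the VM and LM outputs. The module name `GatedFusion`, the gating formulation, and all dimensions are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of a multimodal fusion step: a gated, per-dimension
# blend of vision-model (VM) and language-model (LM) features. The gating
# form and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Combine aligned visual and semantic features with a learned gate."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        # Gate computed from the concatenated visual/semantic features.
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, f_vis: torch.Tensor, f_sem: torch.Tensor) -> torch.Tensor:
        # f_vis, f_sem: (batch, seq_len, d_model), aligned per character slot.
        g = torch.sigmoid(self.gate(torch.cat([f_vis, f_sem], dim=-1)))
        # Convex per-dimension combination of the two modalities.
        return g * f_vis + (1.0 - g) * f_sem

# Usage: the fused features would feed a classifier over the character set.
fusion = GatedFusion(d_model=512)
f_vis = torch.randn(2, 25, 512)   # visual features from the VM
f_sem = torch.randn(2, 25, 512)   # semantic features from the LM
fused = fusion(f_vis, f_sem)      # shape: (2, 25, 512)
```

A gate of this kind lets the network weight the visual branch more heavily when the language prior is unreliable (e.g., weak-semantic text), which matches the motivation stated in the abstract, though the actual MMF design may differ.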
Funder
Natural Science Foundation of China
National Aviation Science Foundation
CAAI-Huawei MindSpore Open Fund
Anhui Province Key Research and Development Program
Dreams Foundation of Jianghuai Advance Technology Center
USTC-IAT Application Sci. & Tech. Achievement Cultivation Program
Sci. & Tech. Innovation Special Zone
Publisher
Association for Computing Machinery (ACM)