Affiliation:
1. University of Science and Technology of China, Hefei, China
2. Ping An Technology Co., Ltd, Beijing, China
Abstract
Scene Text Recognition (STR), a critical step in OCR systems, has attracted much attention in computer vision. Recent research on modeling textual semantics with a Language Model (LM) has witnessed remarkable progress. However, the LM only optimizes the joint probability of the character sequence estimated by the Vision Model (VM) within a single language modality, ignoring visual-semantic relations across modalities. Thus, LM-based methods generalize poorly to challenging conditions, such as text with weak or multiple semantics, arbitrary shapes, and so on. To mitigate the above issue, in this paper we propose the Multimodal Visual-Semantic Representations Learning for Text Recognition Network (MVSTRN) to reason about and combine multimodal visual-semantic information for accurate scene text recognition. Specifically, our MVSTRN builds a bridge between vision and language through its unified architecture and can reason about visual semantics by guiding the network to reconstruct the original image from the latent text representation, bridging the structural gap between vision and language. Finally, a tailored Multimodal Fusion (MMF) module combines the visual and textual semantics from the VM and LM to make the final predictions. Extensive experiments demonstrate that our MVSTRN achieves state-of-the-art performance on several benchmarks.
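The abstract does not specify the form of the MMF module, so the following is only a minimal sketch of one common way to fuse per-character visual and semantic features: a learned gate that blends the VM and LM outputs. The module name `GatedFusion`, the gating formulation, and all dimensions are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of a multimodal fusion step: a gated, per-dimension
# blend of vision-model (VM) and language-model (LM) features. The gating
# form and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Combine aligned visual and semantic features with a learned gate."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        # Gate computed from the concatenated visual/semantic features.
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, f_vis: torch.Tensor, f_sem: torch.Tensor) -> torch.Tensor:
        # f_vis, f_sem: (batch, seq_len, d_model), aligned per character slot.
        g = torch.sigmoid(self.gate(torch.cat([f_vis, f_sem], dim=-1)))
        # Convex per-dimension combination of the two modalities.
        return g * f_vis + (1.0 - g) * f_sem

# Usage: the fused features would feed a classifier over the character set.
fusion = GatedFusion(d_model=512)
f_vis = torch.randn(2, 25, 512)   # visual features from the VM
f_sem = torch.randn(2, 25, 512)   # semantic features from the LM
fused = fusion(f_vis, f_sem)      # shape: (2, 25, 512)
```

A gate of this kind lets the network weight the visual branch more heavily when the language prior is unreliable (e.g., weak-semantic text), which matches the motivation stated in the abstract, though the actual MMF design may differ.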
Funder
Natural Science Foundation of China
National Aviation Science Foundation
CAAI-Huawei MindSpore Open Fund
Anhui Province Key Research and Development Program
Dreams Foundation of Jianghuai Advance Technology Center
USTC-IAT Application Sci. & Tech. Achievement Cultivation Program
Sci. & Tech. Innovation Special Zone
Publisher
Association for Computing Machinery (ACM)