Multimodal Visual-Semantic Representations Learning for Scene Text Recognition

Author:

Gao Xinjian (1), Pang Ye (2), Liu Yuyu (2), Han Maokun (2), Yu Jun (1), Wang Wei (2), Chen Yuanxu (2)

Affiliation:

1. University of Science and Technology of China, Hefei, China

2. Ping An Technology Co., Ltd, Beijing, China

Abstract

Scene Text Recognition (STR), a critical step in OCR systems, has attracted much attention in computer vision. Recent research on modeling textual semantics with a Language Model (LM) has made remarkable progress. However, the LM only optimizes the joint probability of the characters estimated by the Vision Model (VM) within a single language modality, ignoring the visual-semantic relations across modalities. As a result, LM-based methods generalize poorly to challenging conditions in which the text has weak or multiple semantics, an arbitrary shape, and so on. To mitigate this issue, we propose the Multimodal Visual-Semantic Representations Learning for Text Recognition Network (MVSTRN), which reasons over and combines multimodal visual-semantic information for accurate scene text recognition. Specifically, MVSTRN builds a bridge between vision and language through its unified architecture and learns to reason about visual semantics by guiding the network to reconstruct the original image from the latent text representation, closing the structural gap between vision and language. Finally, a tailored Multimodal Fusion (MMF) module combines the multimodal visual and textual semantics from the VM and LM to make the final predictions. Extensive experiments demonstrate that MVSTRN achieves state-of-the-art performance on several benchmarks.
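
The abstract does not specify how the MMF module combines the two streams. As a rough illustration only, the sketch below assumes a simple gated fusion, a common way to merge visual and textual features at one character position. The names `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical and do not come from the paper.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    visual: Vector,
    textual: Vector,
    gate_weights: Vector,
    gate_bias: float = 0.0,
) -> Vector:
    """Fuse one position's visual and textual features with a scalar gate.

    ASSUMPTION: this is a generic gated fusion, not the paper's MMF module.
    The gate is computed from the concatenated features; the output is a
    convex combination of the two feature vectors.
    """
    if len(visual) != len(textual):
        raise ValueError("visual and textual features must have equal length")
    concat = visual + textual  # list concatenation: [visual; textual]
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated length")
    score = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    gate = sigmoid(score)
    # gate near 1 trusts the visual stream, gate near 0 trusts the textual one
    return [gate * v + (1.0 - gate) * t for v, t in zip(visual, textual)]
```

With all-zero gate weights and zero bias the gate equals 0.5, so the fused vector is the plain average of the two inputs; a learned gate would instead shift weight toward whichever modality is more reliable at that position.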

Funder

Natural Science Foundation of China

National Aviation Science Foundation

CAAI-Huawei MindSpore Open Fund

Anhui Province Key Research and Development Program

Dreams Foundation of Jianghuai Advance Technology Center

USTC-IAT Application Sci. & Tech. Achievement Cultivation Program

Sci. & Tech. Innovation Special Zone

Publisher

Association for Computing Machinery (ACM)
