Cross‐modal knowledge learning with scene text for fine‐grained image classification

Published: 2024-02-19 Issue: 6 Volume: 18 Page: 1447-1459
ISSN: 1751-9659
Container-title: IET Image Processing
Language: en
Short-container-title: IET Image Processing

Author:

Xiong Li¹², Mao Yingchi¹², Wang Zicheng¹³, Nie Bingbing⁴, Li Chang¹²

Affiliation:

1. School of Computer and Information, Hohai University, Nanjing, China

2. Key Laboratory of Water Big Data Technology of Ministry of Water Resources, Hohai University, Nanjing, China

3. Power China Kunming Engineering Corporation Limited, Kunming, Yunnan, China

4. Huaneng Lancang River Hydropower Corporation Limited, Kunming, Yunnan, China

Abstract

Scene text in natural images carries additional semantic information that can aid image classification. Existing methods do not fully consider deep understanding of the text or the visual‐text relationship, which makes it difficult to judge the semantic accuracy of the text and the relevance between the visual and text content. This paper proposes an image classification method based on Cross‐modal Knowledge Learning of Scene Text (CKLST). CKLST consists of three stages: cross‐modal scene text recognition, text semantic enhancement, and visual‐text feature alignment. In the first stage, multi‐attention is used to extract features layer by layer, and a self‐mask‐based iterative correction strategy is used to improve scene text recognition accuracy. In the second stage, knowledge features are extracted using external knowledge and fused with text features to enhance text semantic information. In the third stage, CKLST aligns visual and text features across attention mechanisms with a similarity matrix, so that the correlation between images and text can be captured to improve image classification accuracy. On the Con‐Text, Crowd Activity, Drink Bottle, and Synth Text datasets, CKLST performs significantly better than the baselines on fine‐grained image classification, with mAP improvements of 3.54%, 5.37%, 3.28%, and 2.81% over the best baseline, respectively.
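
The third stage can be illustrated with a minimal sketch of similarity-matrix-based alignment. This is not the authors' implementation: the abstract does not specify the alignment module, so the cosine similarity, the softmax weighting, and all function and variable names below are assumptions chosen only to show the general idea of attending from visual features to text features.

```python
import math


def cosine_similarity_matrix(visual, text):
    """Return S where S[i][j] is the cosine similarity of visual[i] and text[j]."""
    def norm(vec):
        # Fall back to 1.0 for an all-zero vector to avoid division by zero.
        return math.sqrt(sum(x * x for x in vec)) or 1.0

    return [
        [sum(a * b for a, b in zip(v, t)) / (norm(v) * norm(t)) for t in text]
        for v in visual
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def align_text_to_visual(visual, text):
    """Hypothetical alignment step: weight text features by similarity.

    Returns the similarity matrix, the per-row attention weights, and one
    text-derived feature vector per visual feature.
    """
    sim = cosine_similarity_matrix(visual, text)
    weights = [softmax(row) for row in sim]
    dim = len(text[0])
    aligned = [
        [sum(w * t[k] for w, t in zip(row, text)) for k in range(dim)]
        for row in weights
    ]
    return sim, weights, aligned


# Toy features (assumed values): 2 visual regions and 3 text tokens, dimension 2.
visual_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sim, weights, aligned = align_text_to_visual(visual_feats, text_feats)
```

In this toy example each visual region places the most weight on the text token it is most similar to. A real model would compute the features with learned encoders and would typically apply learned projections before the similarity step.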

Publisher

Institution of Engineering and Technology (IET)
