Chinese image captioning with fusion encoder and visual keyword search

Published:2024-06-09 Issue:11 Volume:18 Page:3055-3069
ISSN:1751-9659
Container-title:IET Image Processing
language:en

Author:

Zou Yang¹, Liao Shiyu¹, Wang Qifei¹

Affiliation:

1. Institute of Intelligence Science and Technology, College of Computer and Information, Hohai University, Nanjing, China

Abstract

Automatic generation of image captions is essentially a cross‐modal conversion from image to text. Owing to the differences in linguistic characteristics between Chinese and English, quite a few Chinese image captioning methods have recently been proposed. Nevertheless, the existing Chinese image captioning models usually lack attention to local details of images or tend to produce general descriptions. To address these challenges, a Chinese image captioning method is proposed that incorporates a fusion encoder, visual keyword search, and reinforcement learning. The fusion encoder can simultaneously extract local and global features of the input image to enrich the semantic information in the decoding stage, visual keyword search can pursue potential visual words associated with the image content, and the reinforcement learning mechanism can optimize the evaluation metric CIDEr at sentence level to promote the lexical diversity of image description. The results of extensive experiments demonstrate that the proposed model outperforms the state‐of‐the‐art models and delivers expressive and informative Chinese image captions.

Publisher

Institution of Engineering and Technology (IET)

References: 33 articles.

1. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3156–3164. IEEE, Piscataway, NJ (2015)

2. Li, P., Ma, J., Gao, S.: Learning to summarize web image and text mutually. In: Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, pp. 28–36. Association for Computing Machinery, New York, NY (2012)

3. Ordonez, V., Kulkarni, G., Berg, T.L.: Im2Text: Describing images using 1 million captioned photographs. In: Proceedings of the 25th Annual Conference on Neural Information Processing Systems, pp. 1143–1151. Curran Associates Inc., Red Hook, NY (2011)

4. Mason, R., Charniak, E.: Nonparametric method for data‐driven image captioning. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 592–598. Association for Computational Linguistics, Stroudsburg, Pennsylvania (2014)

5. Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1778–1785. IEEE, Piscataway, NJ (2009)