A Unified Visual and Linguistic Semantics Method for Enhanced Image Captioning
Published: 2024-03-21
Issue: 6
Volume: 14
Page: 2657
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Author:
Peng Jiajia1, Tang Tianbing1
Affiliation:
1. School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China
Abstract
Image captioning, also recognized as the challenge of transforming visual data into coherent natural language descriptions, has persisted as a complex problem. Traditional approaches often suffer from semantic gaps, wherein the generated textual descriptions lack depth, context, or the nuanced relationships contained within the images. In an effort to overcome these limitations, we introduce a novel encoder–decoder framework called A Unified Visual and Linguistic Semantics Method. Our method comprises three key components: an encoder, a mapping network, and a decoder. The encoder employs a fusion of CLIP (Contrastive Language–Image Pre-training) and SegmentCLIP to process and extract salient image features. SegmentCLIP builds upon CLIP’s foundational architecture by employing a clustering mechanism, thereby enhancing the semantic relationships between textual and visual elements in the image. The extracted features are then transformed by a mapping network into a fixed-length prefix. A GPT-2-based decoder subsequently generates a corresponding Chinese language description for the image. This framework aims to harmonize feature extraction and semantic enrichment, thereby producing more contextually accurate and comprehensive image descriptions. Our quantitative assessment reveals that our model exhibits notable enhancements across the intricate AIC-ICC, Flickr8k-CN, and COCO-CN datasets, evidenced by a 2% improvement in BLEU@4 and a 10% uplift in CIDEr scores. Additionally, it demonstrates acceptable efficiency in terms of simplicity, speed, and reduction in computational burden.
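The abstract outlines a ClipCap-style prefix pipeline: CLIP image features are projected by a mapping network into a fixed-length sequence of prefix embeddings, which a GPT-2 decoder then continues as a caption. The sketch below is a minimal illustration of that prefix mechanism only, assuming PyTorch and Hugging Face Transformers; it omits SegmentCLIP's clustering step, and the checkpoints, dimensions, prefix length, and greedy decoding loop are illustrative assumptions rather than the authors' exact configuration. For the Chinese captions described in the paper, a Chinese GPT-2 checkpoint and tokenizer would replace the English "gpt2" model used here.

```python
# Minimal sketch of a prefix-based image-captioning pipeline (assumed setup,
# not the authors' released implementation): CLIP encodes the image, an MLP
# mapping network turns the feature into a fixed-length prefix, and GPT-2
# continues the prefix token by token.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, GPT2LMHeadModel, GPT2Tokenizer


class MappingNetwork(nn.Module):
    """Map one CLIP image embedding to a fixed-length prefix of GPT-2 input embeddings."""

    def __init__(self, clip_dim=512, gpt_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim * prefix_len // 2),
            nn.Tanh(),
            nn.Linear(gpt_dim * prefix_len // 2, gpt_dim * prefix_len),
        )

    def forward(self, clip_features):
        # (batch, clip_dim) -> (batch, prefix_len, gpt_dim)
        return self.mlp(clip_features).view(-1, self.prefix_len, self.gpt_dim)


@torch.no_grad()
def caption_image(image, clip_model, clip_processor, mapper, gpt2, tokenizer, max_new_tokens=30):
    # 1. Encode the image with CLIP and project it to a fixed-length prefix.
    pixels = clip_processor(images=image, return_tensors="pt")
    image_features = clip_model.get_image_features(**pixels)    # (1, clip_dim)
    embeds = mapper(image_features)                              # (1, prefix_len, gpt_dim)

    # 2. Greedy decoding: GPT-2 continues from the prefix one token at a time.
    generated_ids = []
    for _ in range(max_new_tokens):
        logits = gpt2(inputs_embeds=embeds).logits[:, -1, :]
        next_id = logits.argmax(dim=-1)                          # (1,)
        if next_id.item() == tokenizer.eos_token_id:
            break
        generated_ids.append(next_id.item())
        next_embed = gpt2.transformer.wte(next_id).unsqueeze(1)  # (1, 1, gpt_dim)
        embeds = torch.cat([embeds, next_embed], dim=1)
    return tokenizer.decode(generated_ids)


if __name__ == "__main__":
    # Illustrative checkpoints; the mapping network would be trained on
    # (image, caption) pairs before producing meaningful captions.
    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    mapper = MappingNetwork()

    image = Image.new("RGB", (224, 224))  # placeholder; use Image.open(path) in practice
    print(caption_image(image, clip_model, clip_processor, mapper, gpt2, tokenizer))
```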
Funder
National Natural Science Foundation of China