Explicit Image Caption Reasoning: Generating Accurate and Informative Captions for Complex Scenes with LMM

Published: 2024-06-13 Issue: 12 Volume: 24 Page: 3820
ISSN: 1424-8220
Container-title: Sensors
Language: en

Author:

Cui Mingzhang¹, Li Caihong², Yang Yi¹

Affiliation:

1. School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China

2. Key Laboratory of Artificial Intelligence and Computing Power Technology, Lanzhou 730000, China

Abstract

Rapid progress in sensor technologies and deep learning has significantly advanced image captioning, especially for complex scenes. Traditional image captioning methods often cannot handle the intricate, detailed relationships within complex scenes. To overcome these limitations, this paper introduces Explicit Image Caption Reasoning (ECR), a novel approach that generates accurate and informative captions for complex scenes captured by advanced sensors. ECR employs an enhanced inference chain to analyze sensor-derived images, examining object relationships and interactions to achieve deeper semantic understanding. We implement ECR using the optimized ICICD dataset, a subset of the sensor-oriented Flickr30K-EE dataset containing comprehensive inference chain information. This dataset enhances training efficiency and caption quality by leveraging rich sensor data. We create the Explicit Image Caption Reasoning Multimodal Model (ECRMM) by fine-tuning TinyLLaVA with the ICICD dataset. Experiments demonstrate ECR’s effectiveness and robustness in processing sensor data, outperforming traditional methods.

Publisher

MDPI AG

Link

https://www.mdpi.com/1424-8220/24/12/3820/pdf

References (63 articles)

1. Oh (2014). Multimedia event detection with multimodal feature fusion and temporal concept localization. Mach. Vis. Appl.

2. Gurari, D., Zhao, Y., Zhang, M., and Bhattacharya, N. (2020). Captioning images taken by people who are blind. Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Springer. Proceedings, Part XVII 16.

3. Johnson, J., Karpathy, A., and Fei-Fei, L. (2016, January 27–30). Densecap: Fully convolutional localization networks for dense captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.

4. Thomason, J., Gordon, D., and Bisk, Y. (2018). Shifting the baseline: Single modality performance on visual navigation & qa. arXiv.

5. Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015, January 7–12). Show and tell: A neural image caption generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.