Style-Enhanced Transformer for Image Captioning in Construction Scenes
Author:
Song Kani 1, Chen Linlin 1, Wang Hengyou 1
Affiliation:
1. School of Science, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
Abstract
Image captioning is important for improving the intelligence of construction projects and for helping managers keep track of construction site activities. However, few image-captioning models target construction scenes, and existing methods perform poorly in complex construction scenes. Guided by the characteristics of construction scenes, we annotate a text description dataset based on the MOCS dataset and propose a style-enhanced Transformer for image captioning in construction scenes, called SETCAP. Specifically, we extract grid features using the Swin Transformer. To enhance style information, we not only use the grid features as the initial detailed semantic features but also extract style information with a style encoder. In the decoder, we integrate the style information into the text features; the image semantic information then interacts with the text features to generate content-appropriate sentences word by word. Finally, we add a sentence style loss to the total loss function so that the style of the generated sentences is closer to that of the training set. Experimental results show that the proposed method achieves encouraging results on both the MSCOCO and MOCS datasets. In particular, SETCAP outperforms state-of-the-art methods by 4.2% in CIDEr score on the MOCS dataset and by 3.9% on the MSCOCO dataset.
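The abstract describes three mechanisms: grid features from a Swin Transformer, a style encoder whose output is fused into the text features in the decoder, and a sentence style loss added to the total loss. As a rough illustration only, and not the authors' implementation, the following PyTorch sketch shows one way such a style-enhanced decoding step and combined loss could look; all module names, tensor dimensions, and the cosine-distance form of the style loss are assumptions made for this sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleEnhancedDecoderStep(nn.Module):
    """Illustrative decoding step: fuse style features into text features,
    then cross-attend to the grid (detail) features from the image encoder."""
    def __init__(self, d_model=512, vocab_size=10000, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.style_proj = nn.Linear(d_model, d_model)    # injects style information
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, prev_tokens, grid_feats, style_feats):
        # prev_tokens: (B, T) token ids; grid_feats: (B, N, d); style_feats: (B, d)
        txt = self.embed(prev_tokens)                          # text features
        txt = txt + self.style_proj(style_feats).unsqueeze(1)  # style-enhanced text features
        ctx, _ = self.cross_attn(txt, grid_feats, grid_feats)  # interact with image semantics
        return self.out(ctx)                                   # (B, T, vocab) logits

def total_loss(logits, targets, pred_style, ref_style, lam=0.1):
    """Word-level cross-entropy plus an assumed sentence-style term that pulls
    the generated sentence style toward the training-set style."""
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    style = 1.0 - F.cosine_similarity(pred_style, ref_style, dim=-1).mean()
    return ce + lam * style

if __name__ == "__main__":
    # Tiny usage example with random tensors, just to show the shapes involved.
    B, T, N, d, V = 2, 5, 49, 512, 10000
    step = StyleEnhancedDecoderStep(d, V)
    logits = step(torch.randint(0, V, (B, T)), torch.randn(B, N, d), torch.randn(B, d))
    loss = total_loss(logits, torch.randint(0, V, (B, T)), torch.randn(B, d), torch.randn(B, d))
    print(loss.item())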
Funder
National Natural Science Foundation of China; the Outstanding Youth Program of Beijing University of Civil Engineering and Architecture; BUCEA Post Graduate Innovation Project