Abstract
Image captioning is a challenging multimodal task for which deep learning has brought significant improvements. Yet captions written by humans are still considered superior, which makes image captioning an interesting application for interactive machine learning and explainable artificial intelligence methods. In this work, we aim to improve the performance and explainability of the state-of-the-art method Show, Attend and Tell by augmenting its attention mechanism with additional bottom-up features. We compute visual attention on the joint embedding space formed by the union of high-level features and low-level features obtained from object-specific salient regions of the input image, where the low-level features embed the content of bounding boxes predicted by a pre-trained Mask R-CNN model. This delivers state-of-the-art performance while providing explanatory features. Further, we discuss how interactive model improvement can be realized by re-ranking caption candidates obtained from a beam search decoder using these explanatory features. We show that interactive re-ranking of beam search candidates has the potential to outperform the state of the art in image captioning.
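To make the attention formulation concrete, below is a minimal PyTorch sketch of soft attention computed over a joint feature space (grid features concatenated with embedded Mask R-CNN region features). This is not the authors' implementation: the module name `JointAttention`, all dimensions, and the assumption that grid and region features share one feature dimension are illustrative.

```python
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    """Soft attention over the union of grid and region features (sketch)."""

    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)      # project image features
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)                 # scalar attention score

    def forward(self, grid_feats, region_feats, hidden):
        # grid_feats:   (B, N_grid, feat_dim)  high-level CNN features
        # region_feats: (B, N_box,  feat_dim)  embedded Mask R-CNN box contents
        # hidden:       (B, hidden_dim)        decoder LSTM state
        feats = torch.cat([grid_feats, region_feats], dim=1)  # joint space
        scores = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)       # weights over all locations
        context = (alpha * feats).sum(dim=1)       # attended context vector
        return context, alpha.squeeze(-1)          # alpha doubles as explanation
```

The attention weights `alpha` over grid cells and object regions are what make the model's decisions inspectable, which is the explanatory feature the abstract refers to.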
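The re-ranking idea can be sketched in the same spirit. The function `rerank` and the `external_score` callable below are hypothetical stand-ins for the paper's interactive feedback signal (e.g., a user rating or a judge over explanatory features), not its actual procedure.

```python
from typing import Callable, List, Tuple

def rerank(candidates: List[Tuple[str, float]],
           external_score: Callable[[str], float],
           weight: float = 1.0) -> str:
    # candidates: (caption, log_prob) pairs from a beam search decoder.
    # external_score: hypothetical scorer, e.g. interactive user feedback;
    # weight balances the decoder's confidence against that signal.
    rescored = sorted(candidates,
                      key=lambda c: c[1] + weight * external_score(c[0]),
                      reverse=True)
    return rescored[0][0]  # best caption after re-ranking

# Toy usage with a stand-in scorer that rewards longer captions.
beams = [("a dog on grass", -1.2),
         ("a brown dog running on green grass", -2.0)]
print(rerank(beams, external_score=lambda cap: 0.2 * len(cap.split())))
```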
Publisher
Springer Science and Business Media LLC