Abstract
In recent years, significant progress has been made in visual-linguistic multi-modality research, leading to advancements in visual comprehension and its applications in computer vision tasks. One fundamental task in visual-linguistic understanding is image captioning, which involves generating a human-understandable textual description of an input image. This paper introduces a referring expression image captioning model that is supervised by objects of interest. Our model uses user-specified object keywords as a prefix to generate captions specific to the target object. The model consists of three modules: (i) visual grounding, (ii) referring object selection, and (iii) image captioning. To evaluate its performance, we conducted experiments on the RefCOCO and COCO captioning datasets. The experimental results demonstrate that the proposed method effectively generates meaningful captions aligned with users’ specific interests.
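The data flow through the three modules named above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not describe how the modules are built, so the grounding and captioning steps are passed in as plain callables, and every name here (`Region`, `select_referred_object`, `caption_with_prefix`, `toy_ground`, `toy_captioner`) is hypothetical. The selection rule (highest-scoring region whose label matches the keyword) is also an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Region:
    """A candidate object region (hypothetical structure, not from the paper)."""
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2)
    label: str
    score: float


def select_referred_object(regions: List[Region], keyword: str) -> Optional[Region]:
    """Assumed selection rule: best-scoring region whose label matches the keyword."""
    matches = [r for r in regions if r.label.lower() == keyword.lower()]
    return max(matches, key=lambda r: r.score) if matches else None


def caption_with_prefix(
    image: object,
    keyword: str,
    ground: Callable[[object, str], List[Region]],
    captioner: Callable[[object, Region, str], str],
) -> Optional[str]:
    """Chain the three stages: grounding, object selection, captioning."""
    regions = ground(image, keyword)                   # (i) visual grounding
    target = select_referred_object(regions, keyword)  # (ii) referring object selection
    if target is None:
        return None                                    # keyword not found in the image
    return captioner(image, target, keyword)           # (iii) captioning with keyword prefix


# Toy stand-ins so the sketch runs without any model; real modules would be learned.
def toy_ground(image: object, keyword: str) -> List[Region]:
    return [
        Region((0, 0, 10, 10), "dog", 0.6),
        Region((5, 5, 20, 20), "dog", 0.9),
        Region((1, 1, 4, 4), "cat", 0.8),
    ]


def toy_captioner(image: object, region: Region, prefix: str) -> str:
    return f"{prefix} at {region.box}"


if __name__ == "__main__":
    print(caption_with_prefix(None, "dog", toy_ground, toy_captioner))
```

Passing the modules in as callables keeps the sketch independent of any particular detector or language model; only the order of the stages and the use of the keyword as a caption prefix are taken from the abstract.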
Publisher
Springer Science and Business Media LLC