CAST: Cross-Modal Retrieval and Visual Conditioning for image captioning

Published: 2024-09
Volume: 153
Page: 110555
ISSN: 0031-3203
Container-title: Pattern Recognition
Language: en

Author:

Cao Shan, An Gaoyun, Cen Yigang, Yang Zhaoqilin, Lin Weisi

Funder

China Scholarship Council

National Key Research and Development Program of China

National Natural Science Foundation of China

Publisher

Elsevier BV

References: 49 articles.

1. A. Karpathy, F.F. Li, Deep visual-semantic alignments for generating image descriptions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.

2. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the International Conference on Neural Information Processing Systems, 2017, pp. 6000–6010.

3. G. Li, L. Zhu, P. Liu, Y. Yang, Entangled Transformer for image captioning, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8927–8936.

4. M. Cornia, M. Stefanini, L. Baraldi, R. Cucchiara, Meshed-memory transformer for image captioning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 10575–10584.

5. Y. Luo, J. Ji, X. Sun, L. Cao, Y. Wu, F. Huang, C.-W. Lin, R. Ji, Dual-Level Collaborative Transformer for Image Captioning, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2021, pp. 2286–2293.

Cited by 3 articles.

1. M3ixup: A multi-modal data augmentation approach for image captioning;Pattern Recognition;2025-02

2. Vision-language pre-training via modal interaction;Pattern Recognition;2024-12

3. A novel key point based ROI segmentation and image captioning using guidance information;Machine Vision and Applications;2024-09-12