CropCap: Embedding Visual Cross-Partition Dependency for Image Captioning

Published: 2023-10-26
Container-title: Proceedings of the 31st ACM International Conference on Multimedia

Author:

Wang Bo¹, Zhang Zhao¹, Zhao Suiyi¹, Zhang Haijun², Hong Richang¹, Wang Meng¹

Affiliation:

1. Hefei University of Technology, Hefei, China

2. Harbin Institute of Technology, Shenzhen, Shenzhen, China

Funder

Anhui Provincial Natural Science Fund for the Distinguished Young Scholars

National Natural Science Foundation of China

CAAI-Huawei MindSpore Open Fund

Publisher

ACM

Link

https://dl.acm.org/doi/pdf/10.1145/3581783.3612245

References: 47 articles.

1. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

2. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. In Proc. International Conference on Learning Representations, Vol. abs/1409.0473.

3. Meshed-Memory Transformer for Image Captioning

4. Long-term recurrent convolutional networks for visual recognition and description

5. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. International Conference on Learning Representations, Vol. abs/2010.11929.