Contextual Object Detection with Multimodal Large Language Models-Reference-Cited by-同舟云学术

Contextual Object Detection with Multimodal Large Language Models

Published:2024-08-20 Issue: Volume: Page:
ISSN:0920-5691
Container-title:International Journal of Computer Vision
language:en
Short-container-title:Int J Comput Vis

Author:

Zang Yuhang^ORCID,Li Wei,Han Jun,Zhou Kaiyang,Loy Chen Change

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s11263-024-02214-4.pdf

Reference87 articles.

1. Alayrac, J. B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., & Simonyan, K. (2022). Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35, 23716–23736.

2. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition.

3. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, CL., & Parikh, D. (2015). VQA: Visual question answering. In Proceedings of the IEEE international conference on computer vision.

4. Bansal, A., Sikka, K., Sharma, G., Chellappa, R., & Divakaran, A. (2018). Zero-shot object detection. In Proceedings of the European conference on computer vision (ECCV).

5. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., & Agarwal, S. (2020). Language models are few-shot learners. In Advances in neural information processing systems.