1. Grounding language models to images for multimodal inputs and outputs;Koh
2. LAMM: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark;Yin;Advances in Neural Information Processing Systems,2024
3. A Survey on Image-text Multimodal Models;Guo,2023
4. What you see is what you read? Improving text-image alignment evaluation;Yarom;Advances in Neural Information Processing Systems,2024
5. InstructBLIP: Towards general-purpose vision-language models with instruction tuning;Dai;Advances in Neural Information Processing Systems,2024