1. VQA: Visual Question Answering
2. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, Vol. 33 (2020), 1877--1901.
3. Revisiting Parameter-Efficient Tuning: Are We Really There Yet?
4. Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. 2021. Unifying vision-and-language tasks via text generation. In International Conference on Machine Learning. PMLR, 1931--1942.
5. An Empirical Study of Training End-to-End Vision-and-Language Transformers