1. Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. (2022). “Flamingo: A Visual Language Model for Few-shot Learning.” arXiv preprint arXiv:2204.14198.
2. Alikhani, M., Nag Chowdhury, S., de Melo, G., and Stone, M. (2019). “CITE: A Corpus of Image-Text Discourse Relations.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 570–575.
3. Balntas, V., Riba, E., Ponsa, D., and Mikolajczyk, K. (2016). “Learning Local Feature Descriptors with Triplets and Shallow Convolutional Neural Networks.” In Proceedings of the British Machine Vision Conference, pp. 119.1–119.11.
4. Bosselut, A., Levy, O., Holtzman, A., Ennis, C., Fox, D., and Choi, Y. (2018). “Simulating Action Dynamics with Neural Process Networks.” In Proceedings of the 6th International Conference on Learning Representations.
5. Dalvi, B., Huang, L., Tandon, N., Yih, W.-t., and Clark, P. (2018). “Tracking State Changes in Procedural Text: a Challenge Dataset and Models for Process Paragraph Comprehension.” In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1595–1604.