1. Grad-CAM: Visual explanations from deep networks via gradient-based localization;selvaraju;Proceedings of the IEEE International Conference on Computer Vision,2017
2. Representation learning with contrastive predictive coding;van den oord;ArXiv Preprint,2018
3. 12-in-1: Multi-Task Vision and Language Representation Learning;lu;Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,2020
4. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks;lu;ArXiv Preprint,2019