1. Microsoft coco: Common objects in context;lin;Proc of European Conference on Computer Vision (ECCV),2014
2. Clipcap: Clip prefix for image captioning;mokady;CoRR preprint,2021
3. Efficientnet: Rethinking model scaling for convo-lutional neural networks;tan;Proc of the International Conference on Machine Learning (ICML),2019
4. An image is worth 16×16 words: Transformers for image recognition at scale;dosovitskiy;Proc of International Conference on Learning Representations (ICLR),2020
5. Adam: A method for stochastic optimization;kingma;CoRR preprint,2014