1. Arandjelovic, R., Gronat, P., Torii, A., et al.: NetVLAD: CNN architecture for weakly supervised place recognition. In: CVPR, pp. 5297–5307. IEEE (2016)
2. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al.: An image is worth $16\times 16$ words: Transformers for image recognition at scale. In: ICLR (2021)
3. Droste, R.: Lecture Notes in Computer Science (2020)
4. El-Nouby, A., Neverova, N., Laptev, I., Jégou, H.: Training vision transformers for image retrieval. arXiv preprint arXiv:2102.05644 (2021)
5. Grimwood, A.: Lecture Notes in Computer Science (2020)