1. Bai, S., Zheng, Z., Wang, X., Lin, J., Zhang, Z., Zhou, C., Yang, H., Yang, Y.: Connecting language and vision for natural language-based vehicle retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4034–4043 (2021)
2. Clark, K., Luong, M., Le, Q., Manning, C.: ELECTRA: pre-training text encoders as discriminators rather than generators (2020). arXiv preprint arXiv:2003.10555
3. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2018). arXiv preprint arXiv:1810.04805
4. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16×16 words: transformers for image recognition at scale (2020). arXiv preprint arXiv:2010.11929
5. Feng, Q., Ablavsky, V., Sclaroff, S.: CityFlow-NL: tracking and retrieval of vehicles at city scale by natural language descriptions (2021). arXiv preprint arXiv:2101.04741