1. Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. In: Proceedings of the 31st international conference on neural information processing systems, pp 6000–6010
2. Devlin J, Chang MW, Lee K, et al (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
3. Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning. PMLR, pp 8748–8763, arXiv:2103.00020
4. Liu S, Fan H, Qian S, et al (2021) HiT: hierarchical transformer with momentum contrast for video-text retrieval. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 11915–11925, https://doi.org/10.1109/ICCV48922.2021.01170
5. Gabeur V, Sun C, Alahari K, et al (2020) Multi-modal transformer for video retrieval. In: European conference on computer vision. Springer, pp 214–229, https://doi.org/10.1007/978-3-030-58548-8_13