1. Agarwal, S., Krueger, G., Clark, J., et al. (2021). Evaluating CLIP: Towards characterization of broader capabilities and downstream implications. arXiv preprint arXiv:2108.02818.
2. Akbari, H., Yuan, L., Qian, R., et al. (2021). VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems, 34, 24206–24221.
3. Arnab, A., Dehghani, M., Heigold, G., et al. (2021). ViViT: A video vision transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6836–6846.
4. Awad, G., Butt, A.A., Curtis, K., et al. (2020). TRECVID 2019: An evaluation campaign to benchmark video activity detection, video captioning and matching, and video search & retrieval. arXiv preprint arXiv:2009.09984.
5. Beaumont, R. (2022). CLIP retrieval: Easily compute CLIP embeddings and build a CLIP retrieval system with them. https://github.com/rom1504/clip-retrieval.