1. Bao, H., Dong, L., Piao, S., & Wei, F. (2022). BEiT: BERT pre-training of image transformers. In Proceedings of the International Conference on Learning Representations.
2. Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., Manassra, W., Dhariwal, P., Chu, C., & Jiao, Y. (2023). Improving image generation with better captions. OpenAI blog.
3. Bo, Y., & Fowlkes, C. C. (2011). Shape-based pedestrian parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 2265–2272).
4. Borràs, A., Tous, F., Lladós, J., & Vanrell, M. (2003). High-level clothes description based on colour-texture and structural features. In Iberian Conference on Pattern Recognition and Image Analysis, (pp. 108–116).
5. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.