RCAT: Retentive CLIP Adapter Tuning for Improved Video Recognition
Published: 2024-03-02
Issue: 5
Volume: 13
Page: 965
ISSN: 2079-9292
Container-title: Electronics
Language: en
Short-container-title: Electronics
Author:
Xie Zexun 1, Xu Min 1, Zhang Shudong 2, Zhou Lijuan 2
Affiliation:
1. Information Engineering College, Capital Normal University, 56 West 3rd Ring North Road, Beijing 100048, China
2. School of Cyberspace Security, Hainan University, 58 Renmin Avenue, Haikou 570228, China
Abstract
The advent of Contrastive Language-Image Pre-training (CLIP) models has revolutionized the integration of textual and visual representations, significantly enhancing the interpretation of static images. Applying CLIP to video recognition, however, poses unique challenges: video content is inherently dynamic and multimodal, with temporal changes and spatial details beyond the capabilities of traditional CLIP models. Meeting these challenges requires an approach that can capture the complex interplay between the spatial and temporal dimensions of video data. To this end, this study introduces Retentive CLIP Adapter Tuning (RCAT), which combines the foundational strengths of CLIP with the dynamic sequence processing of a Retentive Network (RetNet). Designed specifically to adapt CLIP to video recognition, RCAT builds a nuanced understanding of video sequences through temporal analysis. At the core of RCAT is a specialized adapter tuning mechanism that modifies the CLIP model to better align with the temporal intricacies and spatial details of video content, thereby improving predictive accuracy and interpretive depth. Comprehensive evaluations on the UCF101, HMDB51, and MSR-VTT benchmark datasets underscore the effectiveness of RCAT: the proposed approach improves accuracy by 1.4% on UCF101, 2.6% on HMDB51, and 1.1% on MSR-VTT over existing models, demonstrating its strong performance and adaptability in video recognition tasks.
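As a rough illustration of the adapter-tuning idea the abstract describes, the sketch below applies a single-head, retention-style temporal mixer (in the spirit of RetNet's decay-masked retention) on top of frozen per-frame CLIP image embeddings. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name RetentionAdapter, the decay factor gamma, the 512-dimensional feature size, the number of frames, and the mean-pooling readout are all illustrative choices.

# Minimal sketch: retention-style temporal adapter over frozen per-frame CLIP features.
# All names and hyperparameters here are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class RetentionAdapter(nn.Module):
    """Single-head parallel retention over a sequence of frame embeddings."""

    def __init__(self, dim: int, gamma: float = 0.9):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(dim, dim)
        self.gamma = gamma  # exponential decay applied to past frames

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) per-frame CLIP image embeddings
        b, t, d = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Decay mask D[n, m] = gamma**(n - m) for n >= m, else 0 (causal retention)
        idx = torch.arange(t, device=x.device)
        decay = self.gamma ** (idx[:, None] - idx[None, :]).clamp(min=0).float()
        decay = decay * (idx[:, None] >= idx[None, :]).float()
        retention = (q @ k.transpose(-2, -1)) * decay / (d ** 0.5)  # (b, t, t)
        mixed = retention @ v  # temporally mixed frame features
        # Residual connection keeps the frozen CLIP representation intact
        return x + self.out(mixed)


# Usage: pool the temporally mixed features into a single video-level embedding.
frames = torch.randn(2, 8, 512)              # e.g., 8 frames of 512-d CLIP features
adapter = RetentionAdapter(dim=512)
video_feature = adapter(frames).mean(dim=1)  # (2, 512)

In an adapter-tuning setup of this kind, the frozen CLIP backbone would supply the per-frame embeddings and only the adapter's parameters would be trained for the downstream video task.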
Funder
National Natural Science Foundation of China