Towards Human-Interactive Controllable Video Captioning with Efficient Modeling
Published: 2024-06-30
Issue: 13
Volume: 12
Page: 2037
ISSN: 2227-7390
Container-title: Mathematics
Language: en
Short-container-title: Mathematics
Author:
Heo Yoonseok (1), Kim Taehoon (2), Kim Seunghwan (2), Seo Jungyun (2), Kim Juae (3)
Affiliation:
1. Department of Computer Science and Engineering, Sogang University, Seoul 04107, Republic of Korea
2. LG AI Research, Seoul 07796, Republic of Korea
3. Department of English Linguistics and Language Technology, Division of Language & AI, Hankuk University of Foreign Studies, Seoul 02450, Republic of Korea
Abstract
Video captioning is the task of describing the visual scene of a given video in natural language. Several lines of research have focused on developing large-scale models in a transfer learning paradigm, with a major challenge being the tradeoff between scalability and performance in limited environments. To address this problem, we propose a simple yet effective encoder–decoder-based video captioning model integrating transformers and CLIP, both of which are widely adopted in the vision and language domains, together with appropriate temporal feature embedding modules. Taking this proposal a step further, we also address the challenge of human-interactive video captioning, where the captions are tailored to specific information desired by humans. To design a human-interactive environment, we assume that a human offers an object or action in the video as a short prompt; in turn, the system provides a detailed explanation regarding the prompt. We embed human prompts with an LSTM-based prompt encoder and leverage soft prompting to tune the model effectively. We extensively evaluated our model on benchmark datasets, demonstrating comparable results, particularly on the MSR-VTT dataset, where we achieve state-of-the-art performance with a 4% improvement. In addition, we show the potential for human-interactive video captioning through quantitative and qualitative analyses.
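A minimal PyTorch sketch of the pipeline the abstract describes, assuming pre-extracted per-frame CLIP features, a learned temporal (frame-position) embedding, a transformer encoder–decoder for caption generation, and an LSTM prompt encoder combined with learnable soft prompts. All module names, dimensions, and the way the prompt representation is injected into the decoder memory are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only; hyperparameters and wiring are assumptions.
import torch
import torch.nn as nn

class PromptedVideoCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_heads=8, n_layers=4,
                 max_frames=32, n_soft_prompts=8):
        super().__init__()
        # Temporal feature embedding: learned positional embedding per frame.
        self.frame_proj = nn.Linear(512, d_model)   # 512 = CLIP ViT-B/32 feature dim (assumed)
        self.frame_pos = nn.Embedding(max_frames, d_model)

        # Transformer encoder over frame features, decoder over caption tokens.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)

        # Human-interactive part: LSTM prompt encoder plus learnable soft prompts.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.prompt_lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.soft_prompts = nn.Parameter(torch.randn(n_soft_prompts, d_model))
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, clip_feats, prompt_ids, caption_ids):
        # clip_feats: (B, T, 512) frame features; prompt_ids, caption_ids: token ids.
        B, T, _ = clip_feats.shape
        pos = self.frame_pos(torch.arange(T, device=clip_feats.device))
        memory = self.encoder(self.frame_proj(clip_feats) + pos)

        # Encode the short human prompt; keep the final LSTM hidden state.
        _, (h_n, _) = self.prompt_lstm(self.token_emb(prompt_ids))
        prompt_vec = h_n[-1].unsqueeze(1)                        # (B, 1, d_model)
        soft = self.soft_prompts.unsqueeze(0).expand(B, -1, -1)  # (B, P, d_model)
        memory = torch.cat([soft, prompt_vec, memory], dim=1)    # condition decoder on prompts

        # Autoregressive caption decoding with a causal mask.
        L = caption_ids.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"),
                                       device=clip_feats.device), diagonal=1)
        out = self.decoder(self.token_emb(caption_ids), memory, tgt_mask=causal)
        return self.lm_head(out)                                 # (B, L, vocab_size)
```

For soft prompting as described in the abstract, one would typically freeze the captioning backbone and update only the soft prompt parameters and the prompt encoder; that training detail is likewise an assumption of this sketch.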
Funder
Hankuk University of Foreign Studies Research Fund