Towards Human-Interactive Controllable Video Captioning with Efficient Modeling
Published: 2024-06-30
Issue: 13
Volume: 12
Page: 2037
ISSN: 2227-7390
Container-title: Mathematics
Language: en
Short-container-title: Mathematics
Author:
Heo Yoonseok (1), Kim Taehoon (2), Kim Seunghwan (2), Seo Jungyun (2), Kim Juae (3)
Affiliation:
1. Department of Computer Science and Engineering, Sogang University, Seoul 04107, Republic of Korea
2. LG AI Research, Seoul 07796, Republic of Korea
3. Department of English Linguistics and Language Technology, Division of Language & AI, Hankuk University of Foreign Studies, Seoul 02450, Republic of Korea
Abstract
Video captioning is the task of describing the visual scene of a given video in natural language. Several lines of research have focused on developing large-scale models in a transfer learning paradigm, with a major challenge being the tradeoff between scalability and performance in limited environments. To address this problem, we propose a simple yet effective encoder–decoder-based video captioning model integrating transformers and CLIP, both of which are widely adopted in the vision and language domains, together with appropriate temporal feature embedding modules. Taking this proposal a step further, we also address the challenge of human-interactive video captioning, where the captions are tailored to specific information desired by humans. To design a human-interactive environment, we assume that a human offers an object or action in the video as a short prompt; in turn, the system provides a detailed explanation regarding the prompt. We embed human prompts with an LSTM-based prompt encoder and leverage soft prompting to tune the model effectively. We extensively evaluated our model on benchmark datasets, demonstrating comparable results, particularly on the MSR-VTT dataset, where we achieve state-of-the-art performance with a 4% improvement. In addition, we show the potential for human-interactive video captioning through quantitative and qualitative analyses.
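A minimal PyTorch sketch of the pipeline the abstract describes, assuming pre-extracted per-frame CLIP features, a learned temporal (frame-position) embedding, a transformer encoder–decoder for caption generation, and an LSTM prompt encoder combined with learnable soft prompts. All module names, dimensions, and the way the prompt representation is injected into the decoder memory are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only; hyperparameters and wiring are assumptions.
import torch
import torch.nn as nn

class PromptedVideoCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_heads=8, n_layers=4,
                 max_frames=32, n_soft_prompts=8):
        super().__init__()
        # Temporal feature embedding: learned positional embedding per frame.
        self.frame_proj = nn.Linear(512, d_model)   # 512 = CLIP ViT-B/32 feature dim (assumed)
        self.frame_pos = nn.Embedding(max_frames, d_model)

        # Transformer encoder over frame features, decoder over caption tokens.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=n_layers)

        # Human-interactive part: LSTM prompt encoder plus learnable soft prompts.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.prompt_lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.soft_prompts = nn.Parameter(torch.randn(n_soft_prompts, d_model))
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, clip_feats, prompt_ids, caption_ids):
        # clip_feats: (B, T, 512) frame features; prompt_ids, caption_ids: token ids.
        B, T, _ = clip_feats.shape
        pos = self.frame_pos(torch.arange(T, device=clip_feats.device))
        memory = self.encoder(self.frame_proj(clip_feats) + pos)

        # Encode the short human prompt; keep the final LSTM hidden state.
        _, (h_n, _) = self.prompt_lstm(self.token_emb(prompt_ids))
        prompt_vec = h_n[-1].unsqueeze(1)                        # (B, 1, d_model)
        soft = self.soft_prompts.unsqueeze(0).expand(B, -1, -1)  # (B, P, d_model)
        memory = torch.cat([soft, prompt_vec, memory], dim=1)    # condition decoder on prompts

        # Autoregressive caption decoding with a causal mask.
        L = caption_ids.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf"),
                                       device=clip_feats.device), diagonal=1)
        out = self.decoder(self.token_emb(caption_ids), memory, tgt_mask=causal)
        return self.lm_head(out)                                 # (B, L, vocab_size)
```

For soft prompting as described in the abstract, one would typically freeze the captioning backbone and update only the soft prompt parameters and the prompt encoder; that training detail is likewise an assumption of this sketch.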
Funder
Hankuk University of Foreign Studies Research Fund