Abstract
Dense video captioning (DVC) aims to generate a description for each scene in a video. Despite attractive progress on this task, previous works usually concentrate only on exploiting visual features while neglecting audio information in the video, resulting in inaccurate localization of scene events. In this article, we propose a novel DVC model named CMCR, which is mainly composed of a cross-modal processing (CM) module and a commonsense reasoning (CR) module. CM utilizes a cross-modal attention mechanism to encode data in different modalities. An event refactoring algorithm is proposed to deal with inaccurate event localization caused by overlapping events. Besides, a shared encoder is utilized to reduce model redundancy. CR optimizes the logic of generated captions with both heterogeneous prior knowledge and entity-association reasoning, achieved by building a knowledge-enhanced unbiased scene graph. Extensive experiments are conducted on the ActivityNet Captions dataset, and the results demonstrate that our model achieves better performance than state-of-the-art methods. To better understand the performance achieved by CMCR, we also conduct ablation experiments to analyze the contributions of the different modules.
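To make the cross-modal attention idea concrete, the following is a minimal sketch of how visual and audio feature streams can attend to each other and be fused. The paper's code is not reproduced here, so the class name, feature dimensions, pooling step, and fusion scheme are illustrative assumptions rather than the actual CMCR implementation.

```python
# Hypothetical sketch of cross-modal attention between visual and audio
# features. All names and dimensions are illustrative assumptions; this
# is not the released CMCR code.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Each modality queries the other; the attended streams are fused."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (batch, T_v, dim); audio: (batch, T_a, dim)
        v_att, _ = self.v2a(query=visual, key=audio, value=audio)   # visual attends to audio
        a_att, _ = self.a2v(query=audio, key=visual, value=visual)  # audio attends to visual
        # Mean-pool the audio stream over time so both streams share length T_v
        a_pooled = a_att.mean(dim=1, keepdim=True).expand(-1, v_att.size(1), -1)
        return self.proj(torch.cat([v_att, a_pooled], dim=-1))     # (batch, T_v, dim)

# Usage with random tensors standing in for real video/audio encodings
fusion = CrossModalAttention()
fused = fusion(torch.randn(2, 100, 512), torch.randn(2, 40, 512))
print(fused.shape)  # torch.Size([2, 100, 512])
```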
Funder
Major Research plan of the National Social Science Foundation of China
Publisher
Springer Science and Business Media LLC
Subject
Computational Mathematics, Engineering (miscellaneous), Information Systems, Artificial Intelligence
Cited by
16 articles.