Local feature‐based video captioning with multiple classifier and CARU‐attention

Published: 2024-04-17 Issue: 9 Volume: 18 Pages: 2304-2317
ISSN: 1751-9659
Container-title: IET Image Processing
Language: en
Short-container-title: IET Image Processing

Author:

Im Sio‐Kei¹², Chan Ka‐Hou¹²

Affiliation:

1. Faculty of Applied Sciences Macao Polytechnic University Macau China

2. Engineering Research Centre of Applied Technology on Machine Translation and Artificial Intelligence Macao Polytechnic University Macau China

Abstract

Video captioning aims to identify multiple objects and their behaviours in a video event and to generate captions for the current scene. The goal is a detailed, real‐time description of the current video in natural language, which requires deep learning to analyze and determine the relationships between objects of interest in the frame sequence. In practice, existing methods typically detect objects in the frame sequence and then generate captions from features extracted at the locations covered by those objects. The quality of the generated captions is therefore highly dependent on the performance of object detection and identification. This work proposes a video captioning approach that works adaptively and effectively addresses the interdependence between event proposals and captions. In addition, an attention‐based multimodal framework is introduced to capture the main context from the frames and sound in the video scene. An intermediate model is also presented to collect the hidden states captured from the input sequence; it extracts the main features and implicitly produces multiple event proposals. For caption prediction, the proposed method employs a CARU layer with attention as the primary RNN layer for decoding. Experimental results show that the proposed method improves on the baseline and outperforms other state‐of‐the‐art models on the ActivityNet dataset, presenting competitive results in video captioning.

Publisher

Institution of Engineering and Technology (IET)

Cited by 1 article.

1. Parallel Spatio-Temporal Attention Transformer for Video Frame Interpolation; Electronics; 2024-05-18