End-to-End Dual-Stream Transformer with a Parallel Encoder for Video Captioning
Published: 2023-09-07
ISSN: 0218-1266
Container title: Journal of Circuits, Systems and Computers
Language: en
Short container title: J CIRCUIT SYST COMP
Authors:
Ran Yuting (1), Fang Bin (1), Chen Lei (1), Wei Xuekai (1), Xian Weizhi (1, 2), Zhou Mingliang (1)
Affiliations:
1. College of Computer Science, Chongqing University, Chongqing 400044, P. R. China
2. Chongqing Research Institute, Harbin Institute of Technology, Chongqing 401151, P. R. China
Abstract
In this paper, we propose an end-to-end dual-stream transformer with a parallel encoder (DST-PE) for video captioning, which combines multimodal features and global–local representations to generate coherent captions. First, we design a parallel encoder, consisting of a local visual encoder and a bridge module, that simultaneously produces refined local and global visual features. Second, we devise a multimodal encoder to enhance the representation ability of our model. Finally, we adopt a transformer decoder that takes the multimodal features as input and fuses the local visual features with textual features via a cross-attention block. Extensive experimental results demonstrate that our model achieves state-of-the-art performance with low training costs on several widely used datasets.
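The decoder described above fuses local visual features with textual features through cross-attention. As a rough illustration only (not the authors' implementation, which is not given here), the following is a minimal pure-Python sketch of single-head scaled dot-product cross-attention, where textual features act as queries and local visual features act as keys and values; all names and shapes are illustrative assumptions:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (single head, illustrative).

    queries: textual features, shape (n_q, d)
    keys/values: local visual features, shape (n_k, d) / (n_k, d_v)
    Returns fused features of shape (n_q, d_v); each output row is a
    softmax-weighted mix of the value rows.
    """
    d = len(queries[0])
    fused_out = []
    for q in queries:
        # Similarity of this textual query to every visual key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of visual value vectors.
        fused = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        fused_out.append(fused)
    return fused_out
```

In practice this fusion would be a multi-head attention layer with learned projections; the sketch only shows the weighting mechanism by which each textual token attends to visual features.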
Funder
Natural Science Foundation of Shanghai
Joint Equipment Pre Research and Key Fund Project of the Ministry of Education
Human Resources and Social Security Bureau Project of Chongqing
Guangdong Oppo Mobile Telecommunications Corporation Ltd.
Publisher
World Scientific Pub Co Pte Ltd
Subject
Electrical and Electronic Engineering, Hardware and Architecture