End-to-End Dual-Stream Transformer with a Parallel Encoder for Video Captioning

Published:2023-09-07
ISSN:0218-1266
Container-title:Journal of Circuits, Systems and Computers
language:en
Short-container-title:J CIRCUIT SYST COMP

Author:

Ran Yuting¹, Fang Bin¹, Chen Lei¹, Wei Xuekai¹, Xian Weizhi¹,², Zhou Mingliang¹

Affiliation:

1. College of Computer Science, Chongqing University, Chongqing 400044, P. R. China

2. Chongqing Research Institute, Harbin Institute of Technology, Chongqing 401151, P. R. China

Abstract

In this paper, we propose an end-to-end dual-stream transformer with a parallel encoder (DST-PE) for video captioning, which combines multimodal features and global–local representations to generate coherent captions. First, we design a parallel encoder, comprising a local visual encoder and a bridge module, that simultaneously generates refined local and global visual features. Second, we devise a multimodal encoder to enhance the representation ability of our model. Finally, we adopt a transformer decoder that takes multimodal features as input and fuses local visual features with textual features through a cross-attention block. Extensive experimental results demonstrate that our model achieves state-of-the-art performance with low training costs on several widely used datasets.
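
The fusion step described in the abstract, in which textual features attend to local visual features through cross-attention, can be illustrated with a minimal sketch. This is not the DST-PE implementation: it assumes a single attention head, omits the learned query/key/value projections, and uses hypothetical toy inputs. The names `cross_attention`, `text_feats`, and `visual_feats` are illustrative and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_feats, visual_feats):
    """Single-head scaled dot-product cross-attention (illustrative only).

    text_feats:   query vectors, one per caption token
    visual_feats: key/value vectors, one per local visual feature
    Assumption: learned projections are omitted, so keys double as values.
    Returns the fused vectors and the attention weights.
    """
    dim = len(visual_feats[0])
    fused, weights = [], []
    for query in text_feats:
        # Similarity of this text token to every local visual feature.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visual_feats
        ]
        attn = softmax(scores)
        weights.append(attn)
        # Weighted sum of visual features: the visually grounded token vector.
        fused.append([
            sum(a * value[d] for a, value in zip(attn, visual_feats))
            for d in range(dim)
        ])
    return fused, weights


# Hypothetical toy inputs: 2 caption tokens, 3 local visual features, width 4.
text_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
visual_feats = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
]
fused, weights = cross_attention(text_feats, visual_feats)
```

In this toy setup each text token places most of its attention on the visual feature it aligns with, so the fused vector for a token is dominated by that feature. A full decoder would add learned projections, multiple heads, residual connections, and normalization around this step.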

Funder

Natural Science Foundation of Shanghai

Joint Equipment Pre Research and Key Fund Project of the Ministry of Education

Human Resources and Social Security Bureau Project of Chongqing

Guangdong Oppo Mobile Telecommunications Corporation Ltd.

Publisher

World Scientific Pub Co Pte Ltd

Subject

Electrical and Electronic Engineering,Hardware and Architecture

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0218126624500749