Affiliation:
1. College of Data Science and Application, Inner Mongolia University of Technology, Huhhot, China
2. Inner Mongolia Autonomous Region Engineering & Technology Research Center of Big Data Based Software Service, Huhhot, China
Abstract
Multimodal abstractive summarisation (MAS) aims to generate a textual summary from a multimodal data collection, such as video‐text pairs. Despite the success of recent work, existing methods lack a thorough analysis of consistency across multimodal data. In addition, previous work relies on fusion methods to extract multimodal semantics, neglecting constraints on the complementary semantics of each modality. To address these issues, a multilayer cross‐fusion model with a reconstructor is proposed for the MAS task. The model thoroughly conducts cross‐fusion for each modality via layers of cross‐modal transformer blocks, producing cross‐modal fusion representations that are consistent across modalities. A reconstructor is then employed to reproduce the source modalities from the cross‐modal fusion representations. The reconstruction process constrains the fusion representations to retain the complementary semantics of each modality. Comprehensive comparison and ablation experiments are conducted on the open‐domain multimodal dataset How2. The results empirically verify the effectiveness of the multilayer cross‐fusion with the reconstructor structure in the proposed model.
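The core mechanism the abstract describes, cross‐modal fusion followed by reconstruction, can be illustrated with a minimal numpy sketch. This is not the authors' implementation; it assumes a single scaled dot‐product cross‐attention layer per direction (text attends to video and vice versa), a hypothetical linear reconstructor `W_rec`, and a mean‐squared‐error reconstruction objective standing in for the paper's reconstruction constraint.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # scaled dot-product attention: one modality queries the other
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

# toy features: 4 text tokens and 6 video frames, hidden size 8
text = rng.standard_normal((4, 8))
video = rng.standard_normal((6, 8))

# one cross-fusion layer: each modality attends to the other
# (the paper stacks several such cross-modal transformer blocks)
fused_text = cross_attention(text, video, video)
fused_video = cross_attention(video, text, text)

# reconstructor (hypothetical linear map): reproduce the source
# modality from its fused representation
W_rec = rng.standard_normal((8, 8)) * 0.1
recon_text = fused_text @ W_rec

# MSE reconstruction loss constrains the fused representation
# to keep the text modality's own semantics
recon_loss = np.mean((recon_text - text) ** 2)
```

In the full model this loss would be summed over both modalities and optimised jointly with the summarisation objective, so the fused representations stay both consistent across modalities and faithful to each source.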
Funder
Natural Science Foundation of Inner Mongolia
National Natural Science Foundation of China
Department of Science and Technology of Inner Mongolia
Publisher
Institution of Engineering and Technology (IET)
Subject
Computer Vision and Pattern Recognition,Software
Cited by
2 articles.