Multi-modal Circulant Fusion for Video-to-Language and Backward

Published: 2018-07
Container-title: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence

Author:

Wu Aming¹, Han Yahong¹

Affiliation:

1. School of Computer Science and Technology, Tianjin University, Tianjin, China

Abstract

Multi-modal fusion is central to much of modern artificial intelligence research, e.g., translating visual content to language and back. Commonly used multi-modal fusion methods mainly include element-wise product, element-wise sum, or even simple concatenation of different types of features; these are straightforward but lack in-depth analysis. Recent studies have shown that fully exploiting interactions among elements of multi-modal features leads to further performance gains. In this paper, we put forward a new approach to multi-modal fusion, namely Multi-modal Circulant Fusion (MCF). Specifically, after reshaping feature vectors into circulant matrices, we define two types of interaction operations between vectors and matrices. Because each row of the circulant matrix is shifted by one element, the newly defined interaction operations explore almost all possible interactions between vectors of different modalities. Moreover, as only regular operations are involved and they are defined a priori, MCF avoids increasing the parameters or computational cost of multi-modal fusion. We evaluate MCF on the tasks of video captioning and temporal activity localization via language (TALL). Experiments on MSVD and MSRVTT show that our method achieves state-of-the-art results for video captioning. For TALL, plugging in MCF yields a performance gain of roughly 4.2% on TACoS.
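
The circulant construction mentioned in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not spell out the two interaction operations, so the matrix-vector product and the row-wise element-wise product below are illustrative assumptions, not the authors' definitions. The helper name `circulant_rows` and the toy vectors `visual` and `text` are hypothetical as well.

```python
import numpy as np

def circulant_rows(v):
    """Build a d x d matrix whose i-th row is v cyclically shifted right by i."""
    v = np.asarray(v, dtype=float)
    return np.stack([np.roll(v, i) for i in range(v.shape[0])])

# Toy feature vectors standing in for two modalities (hypothetical values).
visual = np.array([1.0, 2.0, 3.0])
text = np.array([4.0, 5.0, 6.0])

# Each row of A is the previous row shifted by one element.
A = circulant_rows(visual)

# Assumed interaction 1: matrix-vector product between the circulant
# matrix of one modality and the feature vector of the other.
matvec = A @ text

# Assumed interaction 2: element-wise product of every row of A with text.
# Entry (i, k) equals visual[(k - i) % d] * text[k], so across all rows
# every pair (visual[j], text[k]) appears exactly once.
pairwise = A * text

print(A)
print(matvec)
print(pairwise)
```

Neither operation introduces learnable parameters, which is consistent with the abstract's remark that the operations are fixed a priori. For this toy input, the entries of `pairwise` are a rearrangement of the outer product of the two vectors, which is one way to read the claim that almost all cross-modal element interactions are explored.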

Publisher

International Joint Conferences on Artificial Intelligence Organization

Cited by 30 articles.

1. SCT-CR: A synergistic convolution-transformer modeling method using SAR-optical data fusion for cloud removal;International Journal of Applied Earth Observation and Geoinformation;2024-06

2. Object Centered Video Captioning using Spatio-temporal Graphs;2024 IEEE International Conference on Interdisciplinary Approaches in Technology and Management for Social Innovation (IATMSI);2024-03-14

3. GLMDriveNet: Global–local Multimodal Fusion Driving Behavior Classification Network;Engineering Applications of Artificial Intelligence;2024-03

4. Dynamic Pathway for Query-Aware Feature Learning in Language-Driven Action Localization;IEEE Transactions on Multimedia;2024

5. MARN: Multi-level Attentional Reconstruction Networks for Weakly Supervised Video Temporal Grounding;Neurocomputing;2023-10