Author:
Fang Zhiwei, Liu Jing, Liu Xueliang, Tang Qu, Li Yong, Lu Hanqing
Abstract
Bilinear models are powerful tools for multimodal fusion tasks such as Visual Question Answering (VQA). The predominant bilinear methods can all be viewed as tensor-based decomposition operations built around a key kernel called the “core tensor.” Current approaches usually focus on reducing computational complexity by applying a low-rank constraint to the core tensor. In this article, we propose a novel bilinear architecture called Block Term Decomposition Pooling (BTDP), which not only retains the advantages of previous bilinear methods but also conducts sparse bilinear interactions between modalities. Our method is based on the Block Term Decomposition theory of tensors, which yields a sparse and learnable block-diagonal core tensor for multimodal fusion. We prove that using such a block-diagonal core tensor is equivalent to conducting many “tiny” bilinear operations in different feature spaces. Introducing sparsity into the bilinear operation in this way can significantly improve feature fusion and, in turn, VQA models. Moreover, BTDP is very flexible in design: we develop several variants of BTDP and discuss the effects of the diagonal blocks of the core tensor. Extensive experiments on two challenging datasets, VQA-v1 and VQA-v2, show that BTDP outperforms current bilinear models and achieves state-of-the-art performance.
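The equivalence stated in the abstract can be checked numerically on a toy case. The following is a minimal sketch, not the authors' implementation: it assumes a three-mode core tensor contracted with a question vector `q` and an image vector `v`, omits any learned projection matrices, and uses hypothetical block sizes `a`, `b`, `c` and block count `R`. All function names are illustrative.

```python
import numpy as np

def block_diagonal_core(blocks):
    """Place R small cores of shape (a, b, c) on the diagonal of a
    (R*a, R*b, R*c) tensor; every other entry stays zero."""
    R = len(blocks)
    a, b, c = blocks[0].shape
    core = np.zeros((R * a, R * b, R * c))
    for r, blk in enumerate(blocks):
        core[r * a:(r + 1) * a, r * b:(r + 1) * b, r * c:(r + 1) * c] = blk
    return core

def full_bilinear(core, q, v):
    """y[k] = sum_ij core[i, j, k] * q[i] * v[j] over the whole core."""
    return np.einsum("ijk,i,j->k", core, q, v)

def blockwise_bilinear(blocks, q, v):
    """Run one tiny bilinear operation per block on the matching
    slices of q and v, then concatenate the per-block outputs."""
    R = len(blocks)
    a, b, _ = blocks[0].shape
    q_parts = q.reshape(R, a)
    v_parts = v.reshape(R, b)
    outputs = [
        np.einsum("ijk,i,j->k", blk, qp, vp)
        for blk, qp, vp in zip(blocks, q_parts, v_parts)
    ]
    return np.concatenate(outputs)

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
R, a, b, c = 4, 3, 5, 2
blocks = [rng.standard_normal((a, b, c)) for _ in range(R)]
q = rng.standard_normal(R * a)
v = rng.standard_normal(R * b)

core = block_diagonal_core(blocks)
y_full = full_bilinear(core, q, v)
y_blocks = blockwise_bilinear(blocks, q, v)

# The two computations agree, while the block-wise form touches only
# R*a*b*c of the (R*a)*(R*b)*(R*c) core entries.
assert np.allclose(y_full, y_blocks)
```

In this toy setting only a 1/R² fraction of the core tensor is nonzero, which is the sense in which the bilinear interaction is sparse.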
Funder
National Natural Science Foundation of China
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications, Hardware and Architecture
Cited by
5 articles.
1. Diverse Visual Question Generation based on Multiple Objects Selection;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-01-15
2. Visual Paraphrase Generation with Key Information Retained;ACM Transactions on Multimedia Computing, Communications, and Applications;2023-05-30
3. Answer Questions with Right Image Regions: A Visual Attention Regularization Approach;ACM Transactions on Multimedia Computing, Communications, and Applications;2022-03-04
4. Cross-Modal Hybrid Feature Fusion for Image-Sentence Matching;ACM Transactions on Multimedia Computing, Communications, and Applications;2021-11-30
5. Study on the Connotation and Framework of Regional Integrated Energy System;IOP Conference Series: Earth and Environmental Science;2020-02-01