Recurrent Attention Network with Reinforced Generator for Visual Dialog

Published:2020-08-31 Issue:3 Volume:16 Page:1-16
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Fan Hehe¹, Zhu Linchao², Yang Yi², Wu Fei³

Affiliation:

1. Center for Artificial Intelligence, University of Technology Sydney and Baidu Research, Beijing, China

2. Center for Artificial Intelligence, University of Technology Sydney, Sydney, NSW, Australia

3. College of Computer Science, Zhejiang University, Zhejiang, China

Abstract

In Visual Dialog, an agent has to parse temporal context in the dialog history and spatial context in the image to hold a meaningful dialog with humans. For example, to answer “what is the man on her left wearing?” the agent needs to (1) analyze the temporal context in the dialog history to infer who is being referred to as “her,” (2) parse the image to attend to “her,” and (3) uncover the spatial context to shift the attention to “her left” and check the apparel of the man. In this article, we use a dialog network to memorize the temporal context and an attention processor to parse the spatial context. Because the question and the image are usually very complex, it is difficult to ground the question with a single glimpse; the attention processor therefore attends to the image multiple times to better collect visual information. In the Visual Dialog task, the generative decoder (G) is trained under the word-by-word paradigm, which suffers from a lack of sentence-level training. To ameliorate this problem, we propose to reinforce G at the sentence level using the discriminative model (D), which aims to select the right answer from a few candidates. Experimental results on the VisDial dataset demonstrate the effectiveness of our approach.
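
The multi-glimpse idea in the abstract can be illustrated with a minimal sketch. This is not the paper's architecture: the function names, the dot-product scoring, and the additive query update are all assumptions chosen for illustration, and the region features are toy vectors rather than real image features.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_glimpse_attention(query, regions, num_glimpses=2):
    """Attend over image-region features several times.

    Hypothetical sketch: after each glimpse the attended context is
    added to the query, so the next glimpse is conditioned on what
    was seen so far. The update rule is an assumption, not the
    authors' method.
    """
    q = list(query)
    history = []  # attention weights produced by each glimpse
    for _ in range(num_glimpses):
        # Dot-product relevance of every region to the current query.
        scores = [sum(qi * ri for qi, ri in zip(q, r)) for r in regions]
        weights = softmax(scores)
        # Weighted sum of region features (the attended context).
        context = [
            sum(w * r[i] for w, r in zip(weights, regions))
            for i in range(len(q))
        ]
        # Assumed refinement step: fold the context back into the query.
        q = [qi + ci for qi, ci in zip(q, context)]
        history.append(weights)
    return q, history

# Toy usage: two regions, a query aligned with the first one.
regions = [[1.0, 0.0], [0.0, 1.0]]
final_query, history = multi_glimpse_attention([1.0, 0.0], regions, num_glimpses=2)
```

In this toy setting the second glimpse puts more weight on the first region than the first glimpse does, which shows how repeated attention can sharpen grounding under these assumptions.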

Funder

Australian Research Council

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3390891

Reference: 48 articles.

1. Neural Module Networks

2. VQA: Visual Question Answering

3. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473.

4. Deep Attention Neural Tensor Network for Visual Question Answering

5. Visual Dialog

Cited by 41 articles.

1. Noise-Tolerant Hybrid Prototypical Learning with Noisy Web Data;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-07-08

2. Multimodal Dialogue Systems via Capturing Context-aware Dependencies and Ordinal Information of Semantic Elements;ACM Transactions on Intelligent Systems and Technology;2024-04-15

3. Attention-Aware Meta-Reweighted Optimization for Enhanced Intelligent Fault Diagnosis;IEEE Access;2024

4. Counterfactual Visual Dialog: Robust Commonsense Knowledge Learning From Unbiased Training;IEEE Transactions on Multimedia;2024

5. Parallel encoder–decoder framework for image captioning;Knowledge-Based Systems;2023-12