TASTA: Text‐Assisted Spatial and Temporal Attention Network for Video Question Answering

Published: 2023-02-22
Volume: 5, Issue: 4
ISSN: 2640-4567
Journal: Advanced Intelligent Systems
Language: en

Author:

Wang Tian¹, Hou Boyao², Li Jiakun², Shi Peng³, Zhang Baochang¹, Snoussi Hichem⁴

Affiliation:

1. Institute of Artificial Intelligence, Beihang University, Beijing 100083, China

2. School of Automation Science and Electrical Engineering, Beihang University, Beijing 100083, China

3. College of Computer and Cyber Security, Fujian Normal University, Fuzhou, Fujian 350117, China

4. Institute Charles Delaunay, University of Technology of Troyes, 10004 Troyes, France

Abstract

Video question answering (VideoQA) is a typical task that integrates language and vision. The key to VideoQA is extracting the visual information that is relevant and effective for answering a specific question. Information selection is believed to be necessary for this task because videos contain a large amount of irrelevant information, and explicitly learning an attention model can be a reasonable and effective way to perform the selection. Herein, a novel VideoQA model called Text‐Assisted Spatial and Temporal Attention Network (TASTA) is proposed, which shows the great potential of explicitly modeling attention. TASTA is designed to be simple, small, clean, and efficient, so that its performance can be clearly justified and the model can be easily extended. Its success stems mainly from two new strategies for making better use of the textual information. Experimental results on TGIF‐QA, a large and representative dataset, show that TASTA significantly outperforms the state of the art, and ablation studies demonstrate the effectiveness of its key components.

Funder

National Natural Science Foundation of China

Natural Science Foundation of Beijing

Publisher

Wiley

Subject

General Medicine

Link

https://onlinelibrary.wiley.com/doi/pdf/10.1002/aisy.202200131
