Abstract
Video Question Answering (VideoQA) concerns the design of models that can analyze a video and produce a meaningful answer to questions about its visual content. To encode the given question, word embedding techniques are used to compute token representations suitable for neural networks. Yet almost all works in the literature rely on the same technique, even though recent advances in NLP have produced better alternatives. This lack of analysis is a major shortcoming. To address it, this paper presents a twofold contribution on word embeddings and their relation to question encoding. First, we integrate four of the most popular word embedding techniques into three recent VideoQA architectures and investigate how they influence performance on two public datasets: EgoVQA and PororoQA. We show that, through the learning process, the embeddings acquire question type-dependent characteristics. Second, to leverage this result, we propose a simple yet effective multi-task learning protocol that uses an auxiliary task defined on the question types. With the proposed learning strategy, significant improvements are observed in most combinations of network architecture and embedding under analysis.
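The abstract does not specify how the auxiliary question-type task is combined with the main objective; a minimal sketch of one common formulation (a weighted sum of two cross-entropy losses, where the function names and the weighting scheme are assumptions, not the paper's actual implementation) could look like:

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class given a probability vector."""
    return -math.log(probs[target])

def multitask_loss(answer_probs, answer_target,
                   qtype_probs, qtype_target, lam=0.5):
    """Main VideoQA answer loss plus a weighted auxiliary
    question-type classification loss (weight lam is a hyperparameter)."""
    main_loss = cross_entropy(answer_probs, answer_target)
    aux_loss = cross_entropy(qtype_probs, qtype_target)
    return main_loss + lam * aux_loss
```

Under this kind of protocol, both heads share the question encoder, so gradients from the question-type head can shape the learned embeddings, which is consistent with the paper's observation that embeddings carry question type-dependent characteristics.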
Funder
Università degli Studi di Udine
Publisher
Springer Science and Business Media LLC
Subject
Computer Networks and Communications, Hardware and Architecture, Media Technology, Software
Cited by 1 article.