Abstract
Video Question Answering (VideoQA) concerns the design of models that can analyze a video and produce a meaningful answer to questions about its visual content. To encode the given question, word embedding techniques are used to compute token representations suitable for neural networks. Yet almost all works in the literature rely on the same technique, even though recent advances in NLP have produced better alternatives. This lack of analysis is a major shortcoming. To address it, this paper presents a twofold contribution on word embeddings and their relation to question encoding. First, we integrate four of the most popular word embedding techniques into three recent VideoQA architectures and investigate how they influence performance on two public datasets: EgoVQA and PororoQA. We show that, through the learning process, the embeddings acquire question type-dependent characteristics. Second, to leverage this result, we propose a simple yet effective multi-task learning protocol that uses an auxiliary task defined on the question types. With the proposed learning strategy, significant improvements are observed in most combinations of network architecture and embedding under analysis.
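The abstract does not specify how the auxiliary question-type task is combined with the main objective; a minimal sketch of one common formulation (a weighted sum of two cross-entropy losses, where the function names and the weighting scheme are assumptions, not the paper's actual implementation) could look like:

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class given a probability vector."""
    return -math.log(probs[target])

def multitask_loss(answer_probs, answer_target,
                   qtype_probs, qtype_target, lam=0.5):
    """Main VideoQA answer loss plus a weighted auxiliary
    question-type classification loss (weight lam is a hyperparameter)."""
    main_loss = cross_entropy(answer_probs, answer_target)
    aux_loss = cross_entropy(qtype_probs, qtype_target)
    return main_loss + lam * aux_loss
```

Under this kind of protocol, both heads share the question encoder, so gradients from the question-type head can shape the learned embeddings, which is consistent with the paper's observation that embeddings carry question type-dependent characteristics.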
Funder
Università degli Studi di Udine
Publisher
Springer Science and Business Media LLC
Subject
Computer Networks and Communications, Hardware and Architecture, Media Technology, Software
Cited by 1 article.