INTEGRATING IMAGE FEATURES WITH CONVOLUTIONAL SEQUENCE-TO-SEQUENCE NETWORK FOR MULTILINGUAL VISUAL QUESTION ANSWERING

Published: 2024-03-16 Issue: 2 Volume: 40 Page: 117-134
ISSN: 2815-5939
Container-title: Journal of Computer Science and Cybernetics
Short-container-title: JCC

Author:

Thai Triet, Luu Son T.

Abstract

Visual question answering (VQA) is a task that requires a computer to answer questions correctly based on the content of an image. Humans solve this task with ease, but it remains challenging for computers. The VLSP2022-EVJVQA shared task poses visual question answering in a multilingual setting on the newly released UIT-EVJVQA dataset, in which the questions and answers are written in three languages: English, Vietnamese, and Japanese. We approached the challenge as a sequence-to-sequence learning task, integrating hints from pre-trained state-of-the-art VQA models and image features with a convolutional sequence-to-sequence network to generate the desired answers. Our approach achieved F1 scores of up to 0.3442 on the public test set and 0.4210 on the private test set.

Publisher

Publishing House for Science and Technology, Vietnam Academy of Science and Technology (Publications)
