Counting in Visual Question Answering: Methods, Datasets, and Future Work

Published: 2023-10-20
ISSN: 0219-4678
Container-title: International Journal of Image and Graphics
Language: en
Short-container-title: Int. J. Image Grap.

Author:

Welde Tesfayee Meshu¹, Liao Lejian¹

Affiliation:

1. Department of Computer Science and Information Technology, Beijing Institute of Technology, Beijing 100081, China

Abstract

Visual Question Answering (VQA) is a language-based method for analyzing images, which is highly helpful in assisting people with visual impairment. A VQA system must demonstrate holistic image understanding and perform basic reasoning about the image, in contrast to task-specific models that simply classify objects into categories. Thus, VQA systems contribute to the growth of Artificial Intelligence (AI) technology by answering open-ended, arbitrary questions about a given image. In addition, VQA is used to assess a system’s ability through the Visual Turing Test (VTT). However, because the essential datasets cannot yet be generated and evaluation is hampered by flaws and bias, VQA cannot reliably assess a system’s overall efficiency. This is a significant limitation, and it slows progress in the performance of VQA algorithms. Currently, research on VQA addresses more specific sub-problems, including counting. Counting is among the more difficult of these, particularly for complex counting questions that demand object identification together with detection of object attributes and positional reasoning. The pooling operation used to implement attention in VQA has been found to degrade counting performance, and a number of algorithms have been developed to address this issue. In this paper, we provide a comprehensive survey of counting techniques in VQA developed specifically for answering “How many?” questions. However, the progress achieved so far is still not satisfactory, owing to dataset bias arising from the way questions are phrased and to weak evaluation metrics.
Future work will pursue fully fledged architectures, large-scale datasets with complex counting questions and a detailed breakdown by category, and strong evaluation metrics for assessing a system’s ability to answer complex counting questions, such as those involving positional and comparative reasoning.

Funder

China Scholarship Council

Publisher

World Scientific Pub Co Pte Ltd

Subject

Computer Graphics and Computer-Aided Design, Computer Science Applications, Computer Vision and Pattern Recognition

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0219467825500445

Cited by 1 article.

1. Overcoming the Limitations of Learning-Based VQA for Counting Questions with Zero-Shot Learning. International Journal on Artificial Intelligence Tools, 2024-08-20.