Diverse Visual Question Generation Based on Multiple Objects Selection

Published: 2024-03-08 Volume: 20 Issue: 6 Pages: 1-22
ISSN: 1551-6857
Container title: ACM Transactions on Multimedia Computing, Communications, and Applications
Language: English
Short container title: ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Fang Wenhao¹, Xie Jiayuan¹, Liu Hongfei¹, Chen Jiali¹, Cai Yi²

Affiliation:

1. School of Software Engineering, South China University of Technology, Guangzhou, China and Key Laboratory of Big Data and Intelligent Robot (SCUT), MOE of China

2. School of Software Engineering, South China University of Technology, Guangzhou, China and Key Laboratory of Big Data and Intelligent Robot (SCUT), MOE of China and Peng Cheng Laboratory, Shenzhen, China

Abstract

The visual question generation task aims to generate high-quality questions about a given image. To make this task applicable to various scenarios, e.g., the growing demand for exams, it is important to generate diverse questions. Existing methods for this task control question diversity through different question types, e.g., “what” and “when.” Although different question types lead to diversity in wording, they cannot guarantee semantic diversity when the questions ask about the same objects. Research in psychology shows that humans pay attention to different objects in an image based on their preferences, which is beneficial for constructing semantically diverse questions. Motivated by this finding, we propose a multi-selector visual question generation (MS-VQG) model that focuses on different objects to generate diverse questions. Specifically, our MS-VQG model employs multiple selectors that imitate different humans by selecting different objects in a given image. Based on these selected objects, our MS-VQG model generates a distinct question for each selector. Extensive experiments on two datasets show that our proposed model outperforms the baselines in generating diverse questions.
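
The multi-selector idea in the abstract can be illustrated with a toy sketch. This is not the authors' MS-VQG implementation: the object names, feature vectors, preference vectors, dot-product scoring, and template-based question function below are all hypothetical stand-ins, assumed only to show how several selectors with different preferences might pick different object subsets from the same image and thereby yield different questions.

```python
from typing import Dict, List

# Hypothetical detected objects with toy 3-d feature vectors (not from the paper).
OBJECTS: Dict[str, List[float]] = {
    "dog": [0.9, 0.1, 0.0],
    "frisbee": [0.2, 0.8, 0.1],
    "tree": [0.0, 0.2, 0.9],
    "person": [0.6, 0.5, 0.2],
}

# One hypothetical preference vector per selector; each stands in for a
# different human viewer who attends to different objects.
SELECTORS: List[List[float]] = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]


def score(preference: List[float], feature: List[float]) -> float:
    """Dot product between a selector's preference and an object's feature."""
    return sum(p * f for p, f in zip(preference, feature))


def select_objects(
    preference: List[float], objects: Dict[str, List[float]], k: int = 2
) -> List[str]:
    """Return the k object names this selector scores highest."""
    ranked = sorted(
        objects, key=lambda name: score(preference, objects[name]), reverse=True
    )
    return ranked[:k]


def generate_question(selected: List[str]) -> str:
    """Template placeholder for what would be a learned question decoder."""
    return "What is happening with the " + " and the ".join(selected) + "?"


def diverse_questions(
    objects: Dict[str, List[float]], selectors: List[List[float]], k: int = 2
) -> List[str]:
    """Generate one question per selector from that selector's chosen objects."""
    return [generate_question(select_objects(p, objects, k)) for p in selectors]


if __name__ == "__main__":
    for question in diverse_questions(OBJECTS, SELECTORS):
        print(question)
```

In this sketch the diversity comes entirely from the selectors choosing different objects, rather than from varying the question type, which mirrors the distinction the abstract draws.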

Funder

National Natural Science Foundation of China

Fundamental Research Funds for the Central Universities, SCUT

Science and Technology Planning Project of Guangdong Province

CAAI-Huawei MindSpore Open Fund

CCF-Zhipu AI Large Model Fund

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3640014
