Affiliation:
1. The Johns Hopkins University, USA
Abstract
Speech interfaces, such as personal assistants and screen readers, read image captions to users. Typically, however, only one caption is available per image, which may not be adequate for all situations (e.g., browsing large quantities of images). Long captions provide a deeper understanding of an image but require more time to listen to, whereas shorter captions may not allow for such thorough comprehension yet have the advantage of being faster to consume. We explore how to effectively collect both thumbnail captions—succinct image descriptions meant to be consumed quickly—and comprehensive captions—which allow individuals to understand visual content in greater detail. We consider text-based instructions and time-constrained methods to collect descriptions at these two levels of detail and find that a time-constrained method is the most effective for collecting thumbnail captions while preserving caption accuracy. Additionally, we verify that caption authors using this time-constrained method are still able to focus on the most important regions of an image by tracking their eye gaze. We evaluate our collected captions along human-rated axes—correctness, fluency, amount of detail, and mentions of important concepts—and discuss the potential for model-based metrics to perform large-scale automatic evaluations in the future.
Funder
Malone Center for Engineering in Healthcare
Publisher
Association for Computing Machinery (ACM)
Subject
Artificial Intelligence, Human-Computer Interaction
Cited by
1 article.