Dual modality prompt learning for visual question-grounded answering in robotic surgery-Reference-Cited by-同舟云学术

Dual modality prompt learning for visual question-grounded answering in robotic surgery

Published:2024-04-22 Issue:1 Volume:7 Page:
ISSN:2524-4442
Container-title:Visual Computing for Industry, Biomedicine, and Art
language:en
Short-container-title:Vis. Comput. Ind. Biomed. Art

Author:

Zhang Yue,Fan Wanshu^ORCID,Peng Peixi,Yang Xin,Zhou Dongsheng,Wei Xiaopeng

Abstract

AbstractWith recent advancements in robotic surgery, notable strides have been made in visual question answering (VQA). Existing VQA systems typically generate textual answers to questions but fail to indicate the location of the relevant content within the image. This limitation restricts the interpretative capacity of the VQA models and their ability to explore specific image regions. To address this issue, this study proposes a grounded VQA model for robotic surgery, capable of localizing a specific region during answer prediction. Drawing inspiration from prompt learning in language models, a dual-modality prompt model was developed to enhance precise multimodal information interactions. Specifically, two complementary prompters were introduced to effectively integrate visual and textual prompts into the encoding process of the model. A visual complementary prompter merges visual prompt knowledge with visual information features to guide accurate localization. The textual complementary prompter aligns visual information with textual prompt knowledge and textual information, guiding textual information towards a more accurate inference of the answer. Additionally, a multiple iterative fusion strategy was adopted for comprehensive answer reasoning, to ensure high-quality generation of textual and grounded answers. The experimental results validate the effectiveness of the model, demonstrating its superiority over existing methods on the EndoVis-18 and EndoVis-17 datasets.

Funder

111 Project

National Key Research and Development Program of China

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1186/s42492-024-00160-z.pdf

Reference40 articles.

1. Wu DC, Wang YH, Ma HM, Ai LY, Yang JL, Zhang SJ et al (2023) Adaptive feature extraction method for capsule endoscopy images. Vis Comput Ind Biomed Art 6(1):24. https://doi.org/10.1186/S42492-023-00151-6

2. Pan J, Lv RJ, Wang Q, Zhao XB, Liu JG, Ai L (2023) Discrimination between leucine-rich glioma-inactivated 1 antibody encephalitis and gamma-aminobutyric acid B receptor antibody encephalitis based on ResNet18. Vis Comput Ind Biomed Art 6(1):17. https://doi.org/10.1186/S42492-023-00144-5

3. Sarmah M, Neelima A, Singh HR (2023) Survey of methods and principles in three-dimensional reconstruction from two-dimensional medical images. Vis Comput Ind Biomed Art 6(1):15. https://doi.org/10.1186/S42492-023-00142-7

4. Khan AU, Kuehne H, Duarte K, Gan C, Lobo N, Shah M (2021) Found a reason for me? Weakly-supervised grounded visual question answering using capsules. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE, Nashville. https://doi.org/10.1109/CVPR46437.2021.00836

5. Anderson P, He XD, Buehler C, Teney D, Johnson M, Gould S et al (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the 2018 IEEE/CVF conference on computer vision and pattern recognition. IEEE, Salt Lake City. https://doi.org/10.1109/CVPR.2018.00636