Affiliations:
1. Experimental Ophthalmology, The Hong Kong Polytechnic University, Hong Kong, People's Republic of China
2. State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science
3. The Hong Kong Polytechnic University, China
4. The Hong Kong Polytechnic University
5. Zhongshan Ophthalmic Center, Sun Yat-sen University
6. School of Optometry, The Hong Kong Polytechnic University, China
Abstract
Background: While large language models (LLMs) have demonstrated impressive capabilities in question-answering (QA) tasks, their application to ocular imaging data remains limited. We aimed to develop an interactive system that harnesses LLMs for report generation and visual question answering in the context of fundus fluorescein angiography (FFA).

Methods: Our system comprises two components: an image-text alignment module for report generation and an LLM-based module (Llama 2) for interactive QA. To assess the system's performance comprehensively, we conducted both automatic and manual evaluations. The automatic evaluation used language-based metrics (BLEU, CIDEr, ROUGE, SPICE) and classification-based metrics (accuracy, sensitivity, specificity, precision, F1-score). In addition, three ophthalmologists manually assessed the completeness and correctness of generated reports, as well as the accuracy, completeness, and potential harm of generated answers.

Results: Model development leveraged a dataset of 654,343 FFA images from 9,392 participants. In the automatic evaluation of generated reports, the system achieved BLEU-1 = 0.48, BLEU-2 = 0.42, BLEU-3 = 0.38, BLEU-4 = 0.34, CIDEr = 0.33, ROUGE = 0.36, and SPICE = 0.18. Notably, the five most common conditions showed strong specificity (≥ 0.94) and accuracy (0.88 to 0.91), with F1-scores ranging from 0.66 to 0.82. In the manual assessment, the generated reports were comparable to the ground-truth reports, with 68.3% rated highly accurate and 62.3% rated highly complete. In the manual QA evaluation, the three ophthalmologists judged the majority of answers to be accurate, complete, and safe (70.7% error-free, 84.0% complete, and 93.7% harmless). Agreement among the ophthalmologists was substantial for both report and answer evaluations, with kappa values ranging from 0.739 to 0.834.

Conclusions: This study introduces a framework that combines multi-modal transformers and LLMs to improve ophthalmic image interpretation. The system's interactive capabilities also support dynamic communication between ophthalmologists and patients, pointing toward more collaborative diagnostic processes.
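As a rough illustration of the language-based metrics reported above, the sketch below computes smoothed BLEU-1 through BLEU-4 for a single sentence pair using NLTK. This is not the authors' evaluation code; the tokenized reference and candidate reports are hypothetical placeholders, not data from the study.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical tokenized reports (ground truth vs. generated), for illustration only.
reference = [["fluorescein", "leakage", "is", "noted", "in", "the", "macular", "region"]]
candidate = ["fluorescein", "leakage", "seen", "in", "the", "macular", "region"]

smooth = SmoothingFunction().method1  # avoids zero scores for short sentences
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)  # uniform weights over 1..n-gram precisions
    score = sentence_bleu(reference, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n} = {score:.2f}")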
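The classification-based metrics (accuracy, sensitivity, specificity, precision, F1-score) can likewise be reproduced per condition from binary labels indicating whether a condition appears in the ground-truth versus the generated report. A minimal sketch, again with hypothetical label arrays rather than study data:

from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # condition present in ground-truth report
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # condition mentioned in generated report

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)            # recall for the positive class
specificity = tn / (tn + fp)
precision   = tp / (tp + fp)
f1          = f1_score(y_true, y_pred)  # harmonic mean of precision and recall

print(f"acc={accuracy:.2f} sens={sensitivity:.2f} "
      f"spec={specificity:.2f} prec={precision:.2f} f1={f1:.2f}")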
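Inter-rater agreement of the kind reported (kappa values of 0.739 to 0.834) is conventionally measured with Cohen's kappa between pairs of graders. A minimal sketch with hypothetical ratings from two ophthalmologists:

from sklearn.metrics import cohen_kappa_score

rater_a = ["correct", "correct", "error", "correct", "error", "correct"]
rater_b = ["correct", "error",   "error", "correct", "error", "correct"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.3f}")  # values above ~0.6 indicate substantial agreement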
Publisher
Research Square Platform LLC