Abstract
Background
Large language models (LLMs) have shown promise in answering medical licensing examination-style questions. However, there is limited research on the performance of multimodal LLMs on subspecialty medical examinations. Our study benchmarks the performance of multimodal LLMs enhanced by model prompting strategies on gastroenterology subspecialty examination-style questions and examines how these prompting strategies incrementally improve overall performance.
Methods
We used the 2022 American College of Gastroenterology (ACG) self-assessment examination (N=300). This test is typically completed by gastroenterology fellows and established gastroenterologists preparing for the gastroenterology subspecialty board examination. We sequentially implemented four model prompting strategies: prompt engineering, retrieval-augmented generation (RAG), five-shot learning, and an LLM-powered answer validation revision model (AVRM). GPT-4 and Gemini Pro were tested.
Results
Implementing all prompting strategies improved the overall score of GPT-4 from 60.3% to 80.7% and of Gemini Pro from 48.0% to 54.3%. GPT-4's score, unlike Gemini Pro's, surpassed both the 70% passing threshold and the 75% average human test-taker score. When questions were stratified by difficulty, the accuracy of both LLMs mirrored that of human examinees, increasing as human test-taker accuracy increased. Adding the AVRM to prompt engineering, RAG, and five-shot learning increased GPT-4's accuracy by 4.4%. The incremental addition of model prompting strategies improved the accuracy of GPT-4, but not Gemini Pro, on both non-image (57.2% to 80.4%) and image-based (63.0% to 80.9%) questions.
Conclusions
Our results underscore the value of model prompting strategies in improving LLM performance on subspecialty-level licensing examination questions. We also present a novel implementation of an LLM-powered reviewer model in the context of subspecialty medicine, which further improved model performance when combined with other prompting strategies. Our findings highlight the potential future role of multimodal LLMs, particularly with the implementation of multiple model prompting strategies, as clinical decision support systems for healthcare providers in subspecialty care.
Publisher
Cold Spring Harbor Laboratory