Abstract
Objective
We aimed to evaluate the performance of two publicly available large language models, ChatGPT and Google Gemini, in response to multiple-choice questions related to vestibular rehabilitation.

Methods
The study was conducted among 30 physical therapists experienced in vestibular rehabilitation (VR) and 30 physical therapy students. They were asked to complete a Vestibular Knowledge Test (VKT) consisting of 20 multiple-choice questions divided into three categories: (1) Clinical Knowledge, (2) Basic Clinical Practice, and (3) Clinical Reasoning. ChatGPT and Google Gemini were tasked with answering the same 20 VKT questions. Three board-certified otoneurologists independently evaluated the accuracy of each response using a 4-level scale, ranging from comprehensive to completely incorrect.

Results
ChatGPT outperformed Google Gemini on the VKT, scoring 70% versus 60%. Both models achieved a perfect score of 100% in Clinical Knowledge but struggled in Clinical Reasoning, with ChatGPT scoring 50% and Gemini 25%. According to the three otoneurology experts, ChatGPT's responses were comprehensive for 45% of the 20 questions, while 25% were completely incorrect. ChatGPT provided comprehensive responses to 50% of the Clinical Knowledge and Basic Clinical Practice questions, but only 25% of the Clinical Reasoning questions.

Conclusion
Caution is advised when using ChatGPT and Google Gemini because of their limited accuracy in clinical reasoning. While they provide accurate responses concerning Clinical Knowledge, their reliance on web information may lead to inconsistencies. ChatGPT performed better than Gemini. Healthcare professionals should carefully formulate questions and be aware that the online prevalence of information may influence ChatGPT's and Google Gemini's responses. Combining clinical expertise and clinical guidelines with ChatGPT and Google Gemini can maximize their benefits while mitigating their limitations.

Impact Statement
This study highlights the potential utility of large language models such as ChatGPT in supplementing clinical knowledge for physical therapists, while underscoring the need for caution in domains requiring complex clinical reasoning. The findings emphasize the importance of carefully integrating technological tools with human expertise to enhance patient care and rehabilitation outcomes.
Publisher
Cold Spring Harbor Laboratory