ChatGPT Versus Consultants: Blinded Evaluation on Answering Otorhinolaryngology Case–Based Questions

Published:2023-12-05 Issue: Volume:9 Page:e49183
ISSN:2369-3762
Container-title:JMIR Medical Education
language:en
Short-container-title:JMIR Med Educ

Author:

Buhr Christoph Raphael, Smith Harry, Huppertz Tilman, Bahr-Hamm Katharina, Matthias Christoph, Blaikie Andrew, Kelsey Tom, Kuhn Sebastian, Eckrich Jonas

Abstract

Background Large language models (LLMs), such as ChatGPT (OpenAI), are increasingly used in medicine and supplement standard search engines as information sources. This leads to more “consultations” of LLMs about personal medical symptoms.

Objective This study aims to evaluate ChatGPT’s performance in answering clinical case–based questions in otorhinolaryngology (ORL) in comparison to ORL consultants’ answers.

Methods We used 41 case-based questions from established ORL study books and past German state examinations for doctors. The questions were answered by both ORL consultants and ChatGPT 3. ORL consultants rated all responses, except their own, on medical adequacy, conciseness, coherence, and comprehensibility using a 6-point Likert scale. They also identified (in a blinded setting) whether each answer was created by an ORL consultant or ChatGPT. Additionally, the character counts of the answers were compared. Because the technology is evolving rapidly, a comparison between responses generated by ChatGPT 3 and ChatGPT 4 was included to give an insight into the evolving potential of LLMs.

Results Ratings in all categories were significantly higher for ORL consultants (P<.001). Although inferior to the scores of the ORL consultants, ChatGPT’s scores were relatively higher in the semantic categories (conciseness, coherence, and comprehensibility) than in medical adequacy. ORL consultants correctly identified ChatGPT as the source in 98.4% (121/123) of cases. ChatGPT’s answers had a significantly higher character count than those of ORL consultants (P<.001). The comparison between responses generated by ChatGPT 3 and ChatGPT 4 showed a slight improvement in medical accuracy as well as better coherence of the answers provided. In contrast, neither conciseness (P=.06) nor comprehensibility (P=.08) improved significantly, despite a significant increase in the mean number of characters by 52.5% ((1470-964)/964; P<.001).

Conclusions While ChatGPT provided longer answers to medical problems, medical adequacy and conciseness were significantly lower than in ORL consultants’ answers. LLMs have potential as augmentative tools for medical care, but their “consultation” for medical problems carries a high risk of misinformation, as their high semantic quality may mask contextual deficits.
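
The two percentages reported in the Results can be checked from the counts given in the abstract. The short Python sketch below is illustrative only: the helper name `percent` and the variable names are hypothetical and not part of the study, and it assumes that 964 and 1470 are the mean character counts of the ChatGPT 3 and ChatGPT 4 answers, respectively.

```python
def percent(numerator: float, denominator: float) -> float:
    """Return numerator/denominator as a percentage, rounded to one decimal."""
    return round(100 * numerator / denominator, 1)


# Share of cases in which raters correctly identified ChatGPT as the source
# (121 of 123, as stated in the abstract).
identification_rate = percent(121, 123)

# Relative increase in mean character count, assuming 964 characters for
# ChatGPT 3 and 1470 characters for ChatGPT 4.
length_increase = percent(1470 - 964, 964)

print(identification_rate, length_increase)
```

Both values agree with the figures reported above (98.4% and 52.5%) after rounding to one decimal place.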

Publisher

JMIR Publications Inc.

Subject

Education

Cited by 11 articles.

1. Performance of Large Language Models in Patient Complaint Resolution: Web-Based Cross-Sectional Survey;Journal of Medical Internet Research;2024-08-09

2. Capability of chatbots powered by large language models to support the screening process of scoping reviews: a feasibility study;2024-07-31

3. ChatGPT and Health Communication;International Journal of E-Health and Medical Communications;2024-07-26

4. Large Language Models take on the AAMC Situational Judgment Test: Evaluating Dilemma-Based Scenarios;2024-07-01

5. The future of patient education: A study on AI‐driven responses to urinary incontinence inquiries;International Journal of Gynecology & Obstetrics;2024-06-30