A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology–Head and Neck Surgery Certification Examinations: Performance Study

Published: 2024-01-16 Volume: 10 Page: e49970
ISSN: 2369-3762
Container-title: JMIR Medical Education
Language: en
Short-container-title: JMIR Med Educ

Author:

Long Cai, Lowe Kayle, Zhang Jessica, Santos André dos, Alanazi Alaa, O'Brien Daniel, Wright Erin D, Cote David

Abstract

Background: ChatGPT is among the most popular large language models (LLMs), exhibiting proficiency in various standardized tests, including multiple-choice medical board examinations. However, its performance on otolaryngology–head and neck surgery (OHNS) certification examinations and open-ended medical board certification examinations has not been reported.

Objective: We aimed to evaluate the performance of ChatGPT on OHNS board examinations and propose a novel method to assess an AI model's performance on open-ended medical board examination questions.

Methods: Twenty-one open-ended questions were adopted from the Royal College of Physicians and Surgeons of Canada's sample examination to query ChatGPT on April 11, 2023, with and without prompts. A new model, named Concordance, Validity, Safety, Competency (CVSC), was developed to evaluate its performance.

Results: In an open-ended question assessment, ChatGPT achieved a passing mark (an average of 75% across 3 trials) in the attempts and demonstrated higher accuracy with prompts. The model demonstrated high concordance (92.06%) and satisfactory validity. While demonstrating considerable consistency in regenerating answers, it often provided only partially correct responses. Notably, concerning features such as hallucinations and self-conflicting answers were observed.

Conclusions: ChatGPT achieved a passing score in the sample examination and demonstrated the potential to pass the OHNS certification examination of the Royal College of Physicians and Surgeons of Canada. Some concerns remain due to its hallucinations, which could pose risks to patient safety. Further adjustments are necessary to yield safer and more accurate answers for clinical implementation.

Publisher

JMIR Publications Inc.

Subject

Education
