Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis

Published:2024-01-05 Issue: Volume:10 Page:e51148
ISSN:2369-3762
Container-title:JMIR Medical Education
language:en
Short-container-title:JMIR Med Educ

Author:

Knoedler Leonard, Alfertshofer Michael, Knoedler Samuel, Hoch Cosima C, Funk Paul F, Cotofana Sebastian, Maheta Bhagvat, Frank Konstantin, Brébant Vanessa, Prantl Lukas, Lamby Philipp

Abstract

Background: The United States Medical Licensing Examination (USMLE) has been critical in medical education since 1992, testing various aspects of a medical student’s knowledge and skills through different steps, based on their training level. Artificial intelligence (AI) tools, including chatbots like ChatGPT, are emerging technologies with potential applications in medicine. However, comprehensive studies analyzing ChatGPT’s performance on USMLE Step 3 in large-scale scenarios and comparing different versions of ChatGPT are limited.

Objective: This paper aimed to analyze ChatGPT’s performance on USMLE Step 3 practice test questions to better elucidate the strengths and weaknesses of AI use in medical education and deduce evidence-based strategies to counteract AI cheating.

Methods: A total of 2069 USMLE Step 3 practice questions were extracted from the AMBOSS study platform. After excluding 229 image-based questions, a total of 1840 text-based questions were further categorized and entered into ChatGPT 3.5, while a subset of 229 questions was entered into ChatGPT 4. Responses were recorded, and the accuracy of ChatGPT answers as well as its performance in different test question categories and for different difficulty levels were compared between both versions.

Results: Overall, ChatGPT 4 performed statistically significantly better than ChatGPT 3.5, with accuracies of 84.7% (194/229) and 56.9% (1047/1840), respectively. A weak but statistically significant negative correlation was observed between the length of test questions and the performance of ChatGPT 3.5 (ρ=–0.069; P=.003), which was absent in ChatGPT 4 (P=.87). Additionally, the difficulty of test questions, as categorized by AMBOSS hammer ratings, showed a statistically significant correlation with performance for both ChatGPT versions, with ρ=–0.289 for ChatGPT 3.5 and ρ=–0.344 for ChatGPT 4. ChatGPT 4 surpassed ChatGPT 3.5 in all levels of test question difficulty, except for the 2 highest difficulty tiers (4 and 5 hammers), where statistical significance was not reached.

Conclusions: In this study, ChatGPT 4 demonstrated remarkable proficiency in taking the USMLE Step 3, with an accuracy rate of 84.7% (194/229), outperforming ChatGPT 3.5, which had an accuracy rate of 56.9% (1047/1840). Although ChatGPT 4 performed exceptionally, it encountered difficulties in questions requiring the application of theoretical concepts, particularly in cardiology and neurology. These insights are pivotal for the development of examination strategies that are resilient to AI and underline the promising role of AI in the realm of medical education and diagnostics.
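
The counts reported in the abstract can be checked with a few lines of arithmetic. The sketch below recomputes the two accuracy figures and the text-based question count, then runs a pooled two-proportion z-test on them. The z-test is an assumption chosen for illustration: the abstract does not state which test the authors used, and the function names `accuracy` and `two_proportion_z` are hypothetical.

```python
import math


def accuracy(correct: int, total: int) -> float:
    """Return accuracy as a percentage."""
    return 100.0 * correct / total


def two_proportion_z(c1: int, n1: int, c2: int, n2: int) -> tuple[float, float]:
    """Pooled two-proportion z-test (illustrative; not necessarily the paper's test).

    Returns the z statistic and the two-sided P value.
    """
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Counts taken from the abstract.
TOTAL_QUESTIONS = 2069
IMAGE_BASED = 229
GPT4_CORRECT, GPT4_TOTAL = 194, 229
GPT35_CORRECT, GPT35_TOTAL = 1047, 1840

text_based = TOTAL_QUESTIONS - IMAGE_BASED
acc_gpt4 = accuracy(GPT4_CORRECT, GPT4_TOTAL)
acc_gpt35 = accuracy(GPT35_CORRECT, GPT35_TOTAL)
z, p = two_proportion_z(GPT4_CORRECT, GPT4_TOTAL, GPT35_CORRECT, GPT35_TOTAL)

print(f"text-based questions: {text_based}")
print(f"ChatGPT 4 accuracy:   {acc_gpt4:.1f}%")
print(f"ChatGPT 3.5 accuracy: {acc_gpt35:.1f}%")
print(f"z = {z:.2f}, two-sided P = {p:.2e}")
```

Under this assumed test the difference between the two versions is far beyond conventional significance thresholds, which is consistent with the abstract's statement that ChatGPT 4 performed significantly better.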

Publisher

JMIR Publications Inc.

Cited by 12 articles.

1. Transforming Health Care Landscapes: The Lever of Radiology Research and Innovation on Emerging Markets Poised for Aggressive Growth;Journal of the American College of Radiology;2024-08

2. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis;Journal of Medical Internet Research;2024-07-25

3. Performance of Advanced Large Language Models (GPT-4o, GPT-4, Gemini 1.5 Pro, Claude 3 Opus) on Japanese Medical Licensing Examination: A Comparative Study;2024-07-09

4. “Pseudo” Intelligence or Misguided or Mis-sourced Intelligence?;The Annals of Thoracic Surgery;2024-07

5. The performance of large language models in managing abnormal results of cervical cancer screening: Comparative Study (Preprint);2024-06-25