Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study

Published: 2024-02-08
Volume: 10
Page: e50965
ISSN: 2369-3762
Container-title: JMIR Medical Education
Language: en
Short-container-title: JMIR Med Educ

Author:

Meyer Annika, Riese Janik, Streichert Thomas

Abstract

Background: The potential of artificial intelligence (AI)–based large language models, such as ChatGPT, has gained significant attention in the medical field. This enthusiasm is driven not only by recent breakthroughs and improved accessibility, but also by the prospect of democratizing medical knowledge and promoting equitable health care. However, the performance of ChatGPT is substantially influenced by the input language, and given the growing public trust in this AI tool compared to that in traditional sources of information, investigating its medical accuracy across different languages is of particular importance.

Objective: This study aimed to compare the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination.

Methods: To assess the medical proficiency of GPT-3.5 and GPT-4, we used 937 original multiple-choice questions from 3 written German medical licensing examinations held in October 2021, April 2022, and October 2022.

Results: GPT-4 achieved an average score of 85% and ranked in the 92.8th, 99.5th, and 92.6th percentiles among medical students who took the same examinations in October 2021, April 2022, and October 2022, respectively. This represents a substantial improvement of 27% compared to GPT-3.5, which only passed 1 out of the 3 examinations. While GPT-3.5 performed well in psychiatry questions, GPT-4 exhibited strengths in internal medicine and surgery but showed weakness in academic research.

Conclusions: The study results highlight ChatGPT’s remarkable improvement from moderate (GPT-3.5) to high competency (GPT-4) in answering medical licensing examination questions in German. While GPT-4’s predecessor (GPT-3.5) was imprecise and inconsistent, GPT-4 demonstrates considerable potential to improve medical education and patient care, provided that medically trained users critically evaluate its results. As the replacement of search engines by AI tools seems possible in the future, further studies with nonprofessional questions are needed to assess the safety and accuracy of ChatGPT for the general population.

Publisher

JMIR Publications Inc.

Cited by 16 articles.

1. From GPT-3.5 to GPT-4.o: A Leap in AI’s Medical Exam Performance; Information; 2024-09-05

2. Understanding model power in social AI; AI & SOCIETY; 2024-08-14

3. Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis; Journal of Medical Internet Research; 2024-07-25

4. The potential of ChatGPT in medicine: an example analysis of nephrology specialty exams in Poland; Clinical Kidney Journal; 2024-06-22

5. Comparison of ChatGPT, Gemini, and Le Chat with physician interpretations of medical laboratory questions from an online health forum; Clinical Chemistry and Laboratory Medicine (CCLM); 2024-05-29