Artificial Intelligence in Medical Education: Comparative Analysis of ChatGPT, Bing, and Medical Students in Germany

Published: 2023-09-04 Volume: 9 Page: e46482
ISSN: 2369-3762
Container-title: JMIR Medical Education
Language: en
Short-container-title: JMIR Med Educ

Author:

Roos Jonas, Kasapovic Adnan, Jansen Tom, Kaczmarczyk Robert

Abstract

Background: Large language models (LLMs) have demonstrated significant potential in diverse domains, including medicine. Nonetheless, there is a scarcity of studies examining their performance in medical examinations, especially those conducted in languages other than English, and in direct comparison with medical students. Analyzing the performance of LLMs in state medical examinations can provide insights into their capabilities and limitations and help evaluate their potential role in medical education and examination preparation.

Objective: This study aimed to assess and compare the performance of 3 LLMs, GPT-4, Bing, and GPT-3.5-Turbo, in the German Medical State Examinations of 2022 and to evaluate their performance relative to that of medical students.

Methods: The LLMs were assessed on a total of 630 questions from the spring and fall German Medical State Examinations of 2022. The performance was evaluated with and without media-related questions. Statistical analyses included 1-way ANOVA and independent samples t tests for pairwise comparisons. The relative strength of the LLMs in comparison with that of the students was also evaluated.

Results: GPT-4 achieved the highest overall performance, correctly answering 88.1% of questions, closely followed by Bing (86.0%), with GPT-3.5-Turbo well behind at 65.7%. The students had an average correct answer rate of 74.6%. Both GPT-4 and Bing significantly outperformed the students in both examinations. When media questions were excluded, Bing achieved the highest performance of 90.7%, closely followed by GPT-4 (90.4%), while GPT-3.5-Turbo lagged (68.2%). There was a significant decline in the performance of GPT-4 and Bing in the fall 2022 examination, which was attributed to a higher proportion of media-related questions and a potential increase in question difficulty.

Conclusions: LLMs, particularly GPT-4 and Bing, demonstrate potential as valuable tools in medical education and for pretesting examination questions. Their high performance, even relative to that of medical students, indicates promising avenues for further development and integration into the educational and clinical landscape.

Publisher

JMIR Publications Inc.

Subject

Education

References: 26 articles.

1. Medical education in Germany

2. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns

3. Foundation models for generalist medical artificial intelligence

4. Introducing ChatGPT; OpenAI; 2023-05-08; https://openai.com/blog/chatgpt

5. ArXiv

Cited by 34 articles.

1. Effectiveness of ChatGPT in remote learning environments: An empirical study with medical students in Saudi Arabia; Nutrition and Health; 2024-08-16

2. Large Language Models in Pediatric Education: Current Uses and Future Potential; Pediatrics; 2024-08-07

3. Generative artificial intelligence in healthcare: A scoping review on benefits, challenges and applications; International Journal of Medical Informatics; 2024-08

4. Comparison of the Performance of ChatGPT, Claude and Bard in Support of Myopia Prevention and Control; Journal of Multidisciplinary Healthcare; 2024-08

5. Current Research and Future Directions for Off-Site Construction through LangChain with a Large Language Model; Buildings; 2024-08-01