In-depth analysis of ChatGPT’s performance based on specific signaling words and phrases in the question stem of 2377 USMLE step 1 style questions

Published: 2024-06-12 Issue: 1 Volume: 14
ISSN: 2045-2322
Container-title: Scientific Reports
Language: en
Short-container-title: Sci Rep

Author:

Knoedler Leonard, Knoedler Samuel, Hoch Cosima C., Prantl Lukas, Frank Konstantin, Soiderer Laura, Cotofana Sebastian, Dorafshar Amir H., Schenck Thilo, Vollbach Felix, Sofo Giuseppe, Alfertshofer Michael

Abstract

ChatGPT has garnered attention as a multifaceted AI chatbot with potential applications in medicine. Despite intriguing preliminary findings in areas such as clinical management and patient education, there remains a substantial knowledge gap regarding the potential and limitations of ChatGPT’s capabilities, especially in medical test-taking and education. A total of n = 2,729 USMLE Step 1 practice questions were extracted from the Amboss question bank. After excluding 352 image-based questions, the remaining 2,377 text-based questions were categorized and entered manually into ChatGPT, and its responses were recorded. ChatGPT’s performance was analyzed by question difficulty, category, and content with regard to specific signal words and phrases. ChatGPT achieved an overall accuracy of 55.8% across the 2,377 questions. Question difficulty and performance showed a significant inverse correlation (rs = -0.306; p < 0.001), and ChatGPT maintained accuracy comparable to that of the human user peer group across difficulty levels. Notably, ChatGPT performed better on serology-related questions (61.1% vs. 53.8%; p = 0.005) but struggled with ECG-related content (42.9% vs. 55.6%; p = 0.021). ChatGPT performed significantly worse on pathophysiology-related question stems (signal phrase: “what is the most likely/probable cause”). Otherwise, ChatGPT performed consistently across question categories and difficulty levels. These findings emphasize the need for further investigation of the potential and limitations of ChatGPT in medical examination and education.
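The counts quoted in the abstract can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' analysis code: the function and variable names are hypothetical, and the number of correct answers is only an approximation implied by the rounded 55.8% figure, since the abstract does not state it.

```python
# Figures quoted in the abstract.
TOTAL_EXTRACTED = 2729        # questions pulled from the question bank
IMAGE_BASED_EXCLUDED = 352    # image-based questions removed
REPORTED_INCLUDED = 2377      # text-based questions analyzed
REPORTED_ACCURACY = 0.558     # overall accuracy, rounded to one decimal place


def included_questions(total: int, excluded: int) -> int:
    """Number of questions left after the exclusion step."""
    return total - excluded


def implied_correct(n_questions: int, accuracy: float) -> int:
    """Approximate correct-answer count implied by a rounded accuracy.

    Assumption: this is an estimate only; the abstract does not
    report the exact number of correct answers.
    """
    return round(n_questions * accuracy)


n_included = included_questions(TOTAL_EXTRACTED, IMAGE_BASED_EXCLUDED)
approx_correct = implied_correct(n_included, REPORTED_ACCURACY)

print(f"Included questions: {n_included}")
print(f"Matches reported count: {n_included == REPORTED_INCLUDED}")
print(f"Approximate correct answers: {approx_correct}")
```

The exclusion arithmetic reproduces the reported 2,377 questions exactly; the correct-answer count should be read as a rough figure because it is back-calculated from a rounded percentage.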

Funder

Technische Universität München

Publisher

Springer Science and Business Media LLC

Link

https://www.nature.com/articles/s41598-024-63997-7.pdf

References: 24 articles.

1. Dave, T., Athaluri, S. A. & Singh, S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front. Artif. Intell. https://doi.org/10.3389/frai.2023.1169595 (2023).

2. Sallam, M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare https://doi.org/10.3390/healthcare11060887 (2023).

3. Knoedler, L. et al. A ready-to-use grading tool for facial palsy examiners—Automated grading system in facial palsy patients made easy. J. Pers. Med. https://doi.org/10.3390/jpm12101739 (2022).

4. Knoedler, L. et al. Diagnosing lagophthalmos using artificial intelligence. Sci. Rep. https://doi.org/10.1038/s41598-023-49006-3 (2023).

5. Dave, M. & Patel, N. Artificial intelligence in healthcare and education. Br. Dent. J. 234(10), 761–764. https://doi.org/10.1038/s41415-023-5845-2 (2023).

Cited by 1 article.

1. Assessment Study of ChatGPT-3.5’s Performance on the Final Polish Medical Examination: Accuracy in Answering 980 Questions. Healthcare (2024-08-16).