Can ChatGPT-4o really pass medical science exams? A pragmatic analysis using novel questions

Authors:

Newton, Philip M.; Summers, Christopher J.; Zaheer, Uzman; Xiromeriti, Maira; Stokes, Jemima R.; Bhangu, Jaskaran Singh; Roome, Elis G.; Roberts-Phillips, Alanna; Mazaheri-Asadi, Darius; Jones, Cameron D.; Hughes, Stuart; Gilbert, Dominic; Jones, Ewan; Essex, Keioni; Ellis, Emily C.; Davey, Ross; Cox, Adrienne A.; Bassett, Jessica A.

Abstract

ChatGPT apparently shows excellent performance on high-level professional exams, such as those used in medical assessment and licensing. This has raised concerns that ChatGPT could be used for academic misconduct, especially in unproctored online exams. However, ChatGPT has also shown weaker performance on questions containing pictures, and there have been concerns that its performance may be artificially inflated by the public nature of the sample questions tested, which likely formed part of ChatGPT's training materials. This has led to suggestions that cheating could be mitigated by using novel questions for every sitting of an exam and by making extensive use of picture-based questions. These approaches remain untested.

Here we tested the performance of ChatGPT-4o on existing medical licensing exams in the UK and USA, and on novel questions based on those exams.

ChatGPT-4o scored 94% on the United Kingdom Medical Licensing Assessment Applied Knowledge Test and 89.9% on the United States Medical Licensing Examination Step 1. Performance was not diminished when the questions were rewritten into novel versions, or on completely novel questions that were not based on any existing questions. ChatGPT did show slightly reduced performance on questions containing images, particularly when the answer options were added to an image as text labels.

These data demonstrate that the performance of ChatGPT continues to improve, and that unproctored online exams are an invalid way to assess the foundational knowledge needed for higher-order learning.
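For readers who want to probe this behaviour themselves, the sketch below shows one way to submit a multiple-choice question to GPT-4o programmatically. It is a minimal illustration under stated assumptions, not the authors' published method: the model name, prompt wording, and example question are illustrative, and it assumes the official `openai` Python package with an `OPENAI_API_KEY` set in the environment.

```python
# Minimal sketch of submitting a multiple-choice question to GPT-4o via the
# OpenAI Chat Completions API. Requires the `openai` package (>= 1.0) and an
# OPENAI_API_KEY environment variable. The example question is hypothetical.
from openai import OpenAI

client = OpenAI()

question = (
    "A 45-year-old man presents with crushing chest pain radiating to the "
    "left arm. Occlusion of which artery is the most likely cause?\n"
    "A. Left anterior descending artery\n"
    "B. Right coronary artery\n"
    "C. Left circumflex artery\n"
    "D. Posterior descending artery"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Answer the multiple-choice question with a single letter."},
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)  # e.g. "A"

# Picture-based questions (e.g. answer options embedded as labels in an
# image, as studied in the paper) can be sent to the same endpoint as
# mixed content parts:
#   {"role": "user", "content": [
#       {"type": "text", "text": "Which labelled structure is occluded?"},
#       {"type": "image_url", "image_url": {"url": "https://example.com/scan.png"}},
#   ]}
```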

Publisher

Cold Spring Harbor Laboratory

