Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions

Published: 2023-12-28
Volume: 99
Issue: 5
Pages: 508-512
ISSN: 1040-2446
Container title: Academic Medicine
Language: en
Short container title: Acad Med

Author:

Laupichler Matthias Carl, Rother Johanna Flora, Grunwald Kadow Ilona C., Ahmadi Seifollah, Raupach Tobias

Abstract

Problem: Creating medical exam questions is time consuming, but well-written questions can be used for test-enhanced learning, which has been shown to have a positive effect on student learning. The automated generation of high-quality questions using large language models (LLMs), such as ChatGPT, would therefore be desirable. However, there are no current studies that compare students’ performance on LLM-generated questions to questions developed by humans.

Approach: The authors compared student performance on questions generated by ChatGPT (LLM questions) with questions created by medical educators (human questions). Two sets of 25 multiple-choice questions (MCQs) were created, each with 5 answer options, 1 of which was correct. The first set of questions was written by an experienced medical educator, and the second set was created by ChatGPT 3.5 after the authors identified learning objectives and extracted some specifications from the human questions. Students answered all questions in random order in a formative paper-and-pencil test that was offered leading up to the final summative neurophysiology exam (summer 2023). For each question, students also indicated whether they thought it had been written by a human or ChatGPT.

Outcomes: The final data set consisted of 161 participants and 46 MCQs (25 human and 21 LLM questions). There was no statistically significant difference in item difficulty between the 2 question sets, but discriminatory power was statistically significantly higher in human than LLM questions (mean = .36, standard deviation [SD] = .09 vs mean = .24, SD = .14; P = .001). On average, students identified 57% of question sources (human or LLM) correctly.

Next Steps: Future research should replicate the study procedure in other contexts (e.g., other medical subjects, semesters, countries, and languages). In addition, the question of whether LLMs are suitable for generating different question types, such as key feature questions, should be investigated.
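The two item statistics named in the abstract, item difficulty and discriminatory power, can be illustrated with a small sketch. The abstract does not say which discrimination index the authors used, so the code below assumes one common choice: the corrected item-total correlation, that is, the Pearson correlation between an item's 0/1 score and each student's total on the remaining items. The function names and the toy response matrix are hypothetical and are not taken from the study.

```python
from statistics import mean, pstdev


def item_difficulty(responses):
    """Proportion of students who answered each item correctly."""
    n_items = len(responses[0])
    return [mean(row[i] for row in responses) for i in range(n_items)]


def pearson(x, y):
    """Pearson correlation; NaN if either variable has no variance."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def item_discrimination(responses):
    """Corrected item-total correlation for each item (an assumed index)."""
    n_items = len(responses[0])
    result = []
    for i in range(n_items):
        item_scores = [row[i] for row in responses]
        # Total score on all other items, so the item is not correlated with itself.
        rest_scores = [sum(row) - row[i] for row in responses]
        result.append(pearson(item_scores, rest_scores))
    return result


# Toy data (invented): 4 students x 3 items, 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

difficulty = item_difficulty(responses)          # [0.75, 0.5, 0.25]
discrimination = item_discrimination(responses)

for i, (p, r) in enumerate(zip(difficulty, discrimination), start=1):
    print(f"item {i}: difficulty={p:.2f}, discrimination={r:.2f}")
```

Under this convention a higher difficulty value means an easier item, and a higher discrimination value means the item separates stronger from weaker students more clearly. Comparing the two question sets would then amount to comparing these per-item values between groups.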

Publisher

Ovid Technologies (Wolters Kluwer Health)

Cited by 12 articles.

1. Awareness and Attitudes of Chinese Medical Students Towards the Application of Large Language Models in Medicine: A Cross-Sectional Survey Study (Preprint);2024-09-11

2. Can a large language model create acceptable dental board-style examination questions? A cross-sectional prospective study;Journal of Dental Sciences;2024-09

3. Utility of large language models for creating clinical assessment items;Medical Teacher;2024-08-26

4. Beginner-Level Tips for Medical Educators: Guidance on Selection, Prompt Engineering, and the Use of Artificial Intelligence Chatbots;Medical Science Educator;2024-08-17

5. Using a hybrid of artificial intelligence and template-based method in automatic item generation to create multiple-choice questions in medical education: Hybrid AIG;2024-07-15