Abstract
Introduction
Large language models (LLMs) such as GPT-4 are increasingly used in medicine and medical education. However, these models are prone to “hallucinations” – outputs that sound convincing while being factually incorrect. It is currently unknown how these errors by LLMs relate to the different cognitive levels defined in Bloom’s Taxonomy.

Methods
We used a large dataset of psychosomatic medicine multiple-choice questions (MCQs) (N = 307) with real-world results derived from medical school exams. GPT-4 answered the MCQs using two distinct prompt versions – detailed and short. The answers were analysed using a quantitative and a qualitative approach. We focussed on incorrectly answered questions, categorizing reasoning errors according to Bloom’s Taxonomy.

Results
GPT-4 answered the exam questions with a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significantly higher difficulty than questions that GPT-4 answered incorrectly (p = 0.002 for the detailed prompt and p < 0.001 for the short prompt). Independent of the prompt, GPT-4’s lowest exam performance was 78.9%, always surpassing the pass threshold. Our qualitative analysis of incorrect answers, based on Bloom’s Taxonomy, showed errors mainly at the “remember” (29/68) and “understand” (23/68) cognitive levels. Specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines.

Discussion
GPT-4 displayed a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated against Bloom’s hierarchical framework, our data revealed that GPT-4 occasionally ignored specific facts (“remember”), provided illogical reasoning (“understand”), or failed to apply concepts to a new situation (“apply”). These errors, though confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.

Conclusion
While GPT-4 mostly excels at medical exam questions, discerning its occasional cognitive errors is crucial.
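To make the evaluation pipeline described in the Methods and Results concrete, the following is a minimal sketch of one way to reproduce it: prompting GPT-4 on each MCQ with a detailed or short system prompt, computing accuracy, and comparing the real-world item difficulty of correctly versus incorrectly answered questions. The prompt wording, the data layout, and the choice of a Mann-Whitney U test are assumptions made for illustration; the abstract reports only the two prompt versions and the resulting p-values, not the exact code or statistical procedure.

```python
# Hypothetical sketch (not the authors' published pipeline):
# query GPT-4 on MCQs, compute accuracy, and compare item difficulty
# between correctly and incorrectly answered questions.

from openai import OpenAI            # official OpenAI Python client
from scipy.stats import mannwhitneyu  # assumed test; the paper reports only p-values

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prompt texts are placeholders, not the study's actual prompts.
DETAILED_PROMPT = (
    "You are taking a psychosomatic medicine exam. Read the question and all "
    "answer options carefully, then reply with the letter of the single best answer."
)
SHORT_PROMPT = "Answer the multiple-choice question with one letter."


def ask_gpt4(question_text: str, system_prompt: str) -> str:
    """Return GPT-4's answer letter for one MCQ (illustrative only)."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question_text},
        ],
    )
    return response.choices[0].message.content.strip()[:1].upper()


def evaluate(questions: list[dict], system_prompt: str) -> None:
    """questions: [{'text': ..., 'correct': 'B', 'difficulty': 0.74}, ...]
    where 'difficulty' is the real-world item difficulty from exam statistics."""
    correct_diff, incorrect_diff = [], []
    for q in questions:
        answer = ask_gpt4(q["text"], system_prompt)
        (correct_diff if answer == q["correct"] else incorrect_diff).append(q["difficulty"])

    accuracy = len(correct_diff) / len(questions)
    # Compare item difficulty between correctly and incorrectly answered questions.
    stat, p_value = mannwhitneyu(correct_diff, incorrect_diff)
    print(f"accuracy={accuracy:.1%}, U={stat:.1f}, p={p_value:.3f}")
```

Running evaluate() once with DETAILED_PROMPT and once with SHORT_PROMPT would yield the two accuracy figures and difficulty comparisons analogous to those reported above; any such reproduction would naturally depend on the actual question set and prompts used in the study.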
Publisher
Cold Spring Harbor Laboratory