Abstract
Introduction
Large language models (LLMs) such as GPT-4 are increasingly used in medicine and medical education. However, these models are prone to “hallucinations” – outputs that sound convincing while being factually incorrect. It is currently unknown how these errors by LLMs relate to the different cognitive levels defined in Bloom’s Taxonomy.

Methods
We used a large dataset of psychosomatic medicine multiple-choice questions (MCQs) (N = 307) with real-world results derived from medical school exams. GPT-4 answered the MCQs using two distinct prompt versions – detailed and short. The answers were analysed using a quantitative and a qualitative approach. We focussed on incorrectly answered questions, categorizing reasoning errors according to Bloom’s Taxonomy.

Results
GPT-4 answered the exam questions with a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significantly higher difficulty than questions that GPT-4 answered incorrectly (p = 0.002 for the detailed prompt and p < 0.001 for the short prompt). Independent of the prompt, GPT-4’s lowest exam performance was 78.9%, always surpassing the pass threshold. Our qualitative analysis of incorrect answers, based on Bloom’s Taxonomy, showed errors mainly at the “remember” (29/68) and “understand” (23/68) cognitive levels. Specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines.

Discussion
GPT-4 displayed a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated against Bloom’s hierarchical framework, our data revealed that GPT-4 occasionally ignored specific facts (“remember”), provided illogical reasoning (“understand”), or failed to apply concepts to a new situation (“apply”). These errors, though confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.

Conclusion
While GPT-4 mostly excels at medical exam questions, discerning its occasional cognitive errors is crucial.
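To make the evaluation pipeline described in the Methods and Results concrete, the following is a minimal sketch of one way to reproduce it: prompting GPT-4 on each MCQ with a detailed or short system prompt, computing accuracy, and comparing the real-world item difficulty of correctly versus incorrectly answered questions. The prompt wording, the data layout, and the choice of a Mann-Whitney U test are assumptions made for illustration; the abstract reports only the two prompt versions and the resulting p-values, not the exact code or statistical procedure.

```python
# Hypothetical sketch (not the authors' published pipeline):
# query GPT-4 on MCQs, compute accuracy, and compare item difficulty
# between correctly and incorrectly answered questions.

from openai import OpenAI            # official OpenAI Python client
from scipy.stats import mannwhitneyu  # assumed test; the paper reports only p-values

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prompt texts are placeholders, not the study's actual prompts.
DETAILED_PROMPT = (
    "You are taking a psychosomatic medicine exam. Read the question and all "
    "answer options carefully, then reply with the letter of the single best answer."
)
SHORT_PROMPT = "Answer the multiple-choice question with one letter."


def ask_gpt4(question_text: str, system_prompt: str) -> str:
    """Return GPT-4's answer letter for one MCQ (illustrative only)."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question_text},
        ],
    )
    return response.choices[0].message.content.strip()[:1].upper()


def evaluate(questions: list[dict], system_prompt: str) -> None:
    """questions: [{'text': ..., 'correct': 'B', 'difficulty': 0.74}, ...]
    where 'difficulty' is the real-world item difficulty from exam statistics."""
    correct_diff, incorrect_diff = [], []
    for q in questions:
        answer = ask_gpt4(q["text"], system_prompt)
        (correct_diff if answer == q["correct"] else incorrect_diff).append(q["difficulty"])

    accuracy = len(correct_diff) / len(questions)
    # Compare item difficulty between correctly and incorrectly answered questions.
    stat, p_value = mannwhitneyu(correct_diff, incorrect_diff)
    print(f"accuracy={accuracy:.1%}, U={stat:.1f}, p={p_value:.3f}")
```

Running evaluate() once with DETAILED_PROMPT and once with SHORT_PROMPT would yield the two accuracy figures and difficulty comparisons analogous to those reported above; any such reproduction would naturally depend on the actual question set and prompts used in the study.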
Publisher
Cold Spring Harbor Laboratory