Assessing ChatGPT’s Mastery of Bloom’s Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study

Published: 2024-01-23 Volume: 26 Page: e52113
ISSN: 1438-8871
Container-title: Journal of Medical Internet Research
Language: en
Short-container-title: J Med Internet Res

Author:

Herrmann-Werner Anne, Festl-Wietek Teresa, Holderried Friederike, Herschbach Lea, Griewatz Jan, Masters Ken, Zipfel Stephan, Mahling Moritz

Abstract

Background: Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are being increasingly used in medicine and medical education. However, these models are prone to “hallucinations” (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom’s taxonomy.

Objective: This study aims to explore how GPT-4 performs in terms of Bloom’s taxonomy using psychosomatic medicine exam questions.

Methods: We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using a quantitative approach and a qualitative approach. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom’s taxonomy.

Results: GPT-4’s performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significantly higher difficulty than questions answered incorrectly (P=.002 for the detailed prompt and P<.001 for the short prompt). Independent of the prompt, GPT-4’s lowest exam performance was 78.9% (15/19), thereby always surpassing the “pass” threshold. Our qualitative analysis of incorrect answers, based on Bloom’s taxonomy, showed that errors were primarily in the “remember” (29/68) and “understand” (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines.

Conclusions: GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated through Bloom’s taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.
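The proportions reported in the abstract can be recomputed from the stated counts. The following is a minimal sketch for readers who want to check the arithmetic, not the authors' analysis code; the helper name `percent` and the dictionary labels are illustrative assumptions, and only the counts themselves come from the abstract.

```python
# Counts copied from the abstract; labels are illustrative, not the authors' terms.
REPORTED_COUNTS = {
    "detailed prompt, correct answers": (284, 307),
    "short prompt, correct answers": (278, 307),
    "lowest single-exam score": (15, 19),
    "errors at the 'remember' level": (29, 68),
    "errors at the 'understand' level": (23, 68),
}


def percent(numerator: int, denominator: int, digits: int = 1) -> float:
    """Return numerator/denominator as a percentage rounded to `digits` places."""
    return round(100 * numerator / denominator, digits)


if __name__ == "__main__":
    for label, (num, den) in REPORTED_COUNTS.items():
        print(f"{label}: {num}/{den} = {percent(num, den)}%")
```

Rounded to whole percentages, the first two values match the 93% and 91% success rates given in the abstract, and the third matches the reported 78.9%.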

Publisher

JMIR Publications Inc.

Cited by 16 articles.

1. Category Mapping of Emergency Supplies Classification Standard Based on BERT-TextCNN;Systems;2024-09-10

2. From GPT-3.5 to GPT-4.o: A Leap in AI’s Medical Exam Performance;Information;2024-09-05

3. A Language Model–Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study;JMIR Medical Education;2024-08-16

4. Optimizing Human–AI Collaboration in Chemistry: A Case Study on Enhancing Generative AI Responses through Prompt Engineering;Chemistry;2024-08-11

5. The Impact of Aligning Artificial Intelligence Large Language Models With Bloom's Taxonomy in Healthcare Education;Advances in Business Information Systems and Analytics;2024-06-30