Docimological Quality Analysis of LLM-Generated Multiple Choice Questions in Computer Science and Medicine

Published: 2024-06-10 Issue: 5 Volume: 5
ISSN:2661-8907
Container-title:SN Computer Science
language:en
Short-container-title:SN COMPUT. SCI.

Author:

Grévisse Christian, Pavlou Maria Angeliki S., Schneider Jochen G.

Abstract

Assessment is an essential part of education, both for teachers who assess their students and for learners who evaluate themselves. Multiple-choice questions (MCQs) are one of the most popular types of knowledge assessment, e.g., in medical education, as they can be graded automatically and can cover a wide range of learning items. However, creating high-quality MCQ items is time-consuming. The recent advent of Large Language Models (LLMs), such as the Generative Pre-trained Transformer (GPT), has given new momentum to automatic question generation (AQG) solutions. Still, generated questions need to be evaluated against the best practices for MCQ item writing to ensure docimological quality. In this article, we analyze the quality of LLM-generated MCQs. We employ zero-shot approaches in two domains, namely computer science and medicine. In the former, we use three GPT-based services to generate MCQs. In the latter, we developed a plugin for the Moodle learning management system that generates MCQs based on learning material. We compare the generated MCQs against common multiple-choice item writing guidelines. Among the major challenges, we found that while LLMs are certainly useful in generating MCQs more efficiently, they sometimes create broad items with ambiguous keys or implausible distractors. Human oversight is also necessary to ensure instructional alignment between generated items and course contents. Finally, we propose solutions for AQG developers.

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s42979-024-02963-6.pdf
