Evaluation of LLM Tools for Feedback Generation in a Course on Concurrent Programming

Published: 2024-05-15
ISSN: 1560-4292
Container-title: International Journal of Artificial Intelligence in Education
Language: en
Short-container-title: Int J Artif Intell Educ

Author:

Estévez-Ayres Iria, Callejo Patricia, Hombrados-Herrera Miguel Ángel, Alario-Hoyos Carlos, Delgado Kloos Carlos

Abstract

The emergence of Large Language Models (LLMs) has marked a significant change in education. These LLMs and their associated chatbots offer several advantages for both students and educators, including their use as teaching assistants for content creation or summarisation. This paper evaluates the capacity of LLM chatbots to provide feedback on student exercises in a university programming course. The complexity of the programming topic in this study (concurrency) makes feedback to students even more important. The authors first assessed the exercises submitted by students. Then, ChatGPT (from OpenAI) and Bard (from Google) were used to evaluate each exercise, looking for typical concurrency errors such as starvation, deadlocks, or race conditions. Compared with the ground-truth evaluations performed by expert teachers, neither tool can accurately assess the exercises, despite the generally positive reception of LLMs within the educational sector. All attempts result in an accuracy rate of 50%, meaning that both tools are limited in their ability to evaluate these particular exercises effectively, specifically in finding typical concurrency errors.

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s40593-024-00406-0.pdf
