Affiliation:
1. Institute for Psychology of Learning and Instruction, Kiel University
2. University of Hildesheim
3. University of Applied Sciences and Arts Northwestern Switzerland
4. Zurich University of Teacher Education
5. Leibniz Institute for Science and Mathematics Education
Abstract
Recent research on automated essay scoring suggests that hybrid models, which combine feature engineering with deep neural networks (DNNs), achieve state-of-the-art performance. However, most of these findings come from holistic scoring tasks. In the present study, we use a total of four prompts from two different corpora comprising both L1 and L2 learner essays annotated with three trait scores (content, organization, and language quality). In our main experiments, we compare three variants of trait-specific models using different inputs: (1) models based on 220 linguistic features, (2) models using essay-level contextual embeddings from the distilled version of the pre-trained transformer BERT (DistilBERT), and (3) a hybrid model using both types of features. Results indicate that when trait-specific models are trained on a single resource, the feature-based models slightly outperform the embedding-based models. These differences are most prominent for the organization traits. The hybrid models outperform the single-resource models, indicating that linguistic features and embeddings capture partially different aspects relevant to the assessment of essay traits. To gain more insight into the interplay between the two feature types, we run ablation tests for individual feature groups. Trait-specific ablation tests across prompts indicate that the embedding-based models are most consistently enhanced in content assessment when combined with morphological complexity features. The most consistent performance gains for the organization traits are achieved when embeddings are combined with length features, and for the language traits when combined with lexical complexity, error, and occurrence features. Cross-prompt scoring again reveals slight advantages for the feature-based models.
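A minimal sketch of the hybrid setup described in the abstract is shown below: an essay-level DistilBERT embedding is concatenated with engineered linguistic features before fitting a simple trait regressor. This is an illustration only, not the authors' implementation; the mean-pooling strategy, the Ridge regressor, and the placeholder feature matrix are assumptions.

```python
# Sketch (not the authors' code): combine a 220-dim engineered feature vector
# with an essay-level DistilBERT embedding and fit a simple trait-score regressor.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import Ridge

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

def essay_embedding(text: str) -> np.ndarray:
    """Mean-pool the last hidden states into one essay-level vector (assumed pooling)."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # shape: (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)       # shape: (1, seq_len, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # masked mean pooling
    return pooled.squeeze(0).numpy()

# Placeholder data: real experiments would use the annotated essays,
# the 220 linguistic features, and the human trait scores.
essays = ["First example essay ...", "Second example essay ..."]
linguistic_features = np.random.rand(len(essays), 220)   # placeholder feature matrix
trait_scores = np.array([3.0, 4.0])                      # e.g., content scores

embeddings = np.vstack([essay_embedding(e) for e in essays])
hybrid_input = np.hstack([linguistic_features, embeddings])  # simple concatenation

model = Ridge().fit(hybrid_input, trait_scores)
```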
Publisher
Research Square Platform LLC