Affiliation:
1. Educational Testing Service, USA
Abstract
Automated essay scoring can produce reliable scores that are highly correlated with human scores, but it is limited in its evaluation of content and other higher-order aspects of writing. The increased use of automated essay scoring in high-stakes testing underscores the need for human scoring that is focused on higher-order aspects of writing. This study experimentally evaluated several alternative procedures for eliciting distinct human scores and improving their reliability. Essays written in response to the argument and issue tasks of the Analytical Writing measure of the GRE General Test were scored by experienced raters under different conditions. Criteria for evaluation included inter-rater agreement, agreement with machine scores, and cross-task reliability. First, the use of a modified scoring rubric that focused on higher-order writing skills increased reliability for one task type but decreased it for the other. Second, scoring in batches of similar-length essays had no effect on scores. Third, scoring with automated essay scores available increased the reliability of human scores, but it also increased their similarity to the automated scores. Finally, the use of a more refined 18-point scoring scale significantly increased reliability.
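The abstract does not state which agreement and reliability statistics were used. As an illustrative sketch only, the snippet below shows how inter-rater agreement and human-machine agreement of the kind described could be quantified, assuming quadratic-weighted kappa and Pearson correlation as stand-in metrics and using hypothetical scores on a 6-point rubric; none of these values or metric choices come from the paper.

# Illustrative sketch, not the authors' analysis code.
# Assumes two raters scored the same essays on a 6-point rubric and that
# automated scores are available for the same essays (all values hypothetical).
from sklearn.metrics import cohen_kappa_score
from scipy.stats import pearsonr

rater_a = [4, 5, 3, 6, 2, 4, 5, 3]   # hypothetical scores from rater A
rater_b = [4, 4, 3, 5, 2, 5, 5, 3]   # hypothetical scores from rater B
machine = [4.2, 4.6, 3.1, 5.4, 2.3, 4.4, 4.9, 3.2]  # hypothetical machine scores

# Inter-rater agreement: quadratic-weighted kappa penalizes large
# disagreements more heavily than adjacent-score disagreements.
qwk = cohen_kappa_score(rater_a, rater_b, weights="quadratic")

# Agreement with machine scores: correlation between the averaged human
# score and the automated score for each essay.
human_mean = [(a + b) / 2 for a, b in zip(rater_a, rater_b)]
r, _ = pearsonr(human_mean, machine)

print(f"Quadratic-weighted kappa (rater A vs. rater B): {qwk:.2f}")
print(f"Human-machine correlation: {r:.2f}")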
Subject
Linguistics and Language, Social Sciences (miscellaneous), Language and Linguistics
Cited by
36 articles.