Affiliation:
1. Educational Testing Service, Princeton, NJ, USA
Abstract
From their earliest origins, automated essay scoring systems have striven to emulate human essay scores, treating them as the ultimate validity criterion. Consequently, the importance (or weight), and even the identity, of the computed essay features in the composite machine score have been determined by statistical techniques that seek to optimally predict human scores from essay features. However, machine evaluation of essays is fundamentally different from human evaluation and is therefore unlikely to measure the same set of writing skills. As a result, the feature weights of human-prediction machine scores (which express each feature's importance in the composite) are bound to embody statistical artifacts. This article proposes alternative feature weighting schemes based on the premise of maximizing the reliability and internal consistency of the composite score. In the context of a large-scale writing assessment, it shows that these alternative weighting schemes differ significantly from human-prediction weights and yield comparable or even superior reliability and validity coefficients.
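The abstract does not specify how the reliability-based weights are computed, but one plausible instantiation of the contrast it draws is regression (human-prediction) weights versus weights that maximize the weighted coefficient alpha of the feature composite. The sketch below illustrates that contrast on synthetic data; the data-generating setup, the alpha-maximization objective, and the helper name neg_alpha are illustrative assumptions, not the article's actual method.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical data: n essays, k machine-computed features, one human score.
# Features and human scores share a common "true skill" signal (an assumption).
n, k = 500, 6
true_skill = rng.normal(size=n)
X = true_skill[:, None] + rng.normal(scale=1.0, size=(n, k))
human = true_skill + rng.normal(scale=0.7, size=n)

# 1) Human-prediction weights: OLS regression of human scores on features,
#    the traditional way composite machine scores are calibrated.
Xc = X - X.mean(axis=0)
hc = human - human.mean()
w_ols, *_ = np.linalg.lstsq(Xc, hc, rcond=None)

# 2) Reliability-maximizing weights: choose w to maximize the weighted
#    coefficient alpha of the composite X @ w. Alpha here is ordinary
#    Cronbach's alpha applied to the rescaled features w_i * x_i.
S = np.cov(X, rowvar=False)   # k x k feature covariance matrix
item_var = np.diag(S)

def neg_alpha(w):
    comp_var = w @ S @ w                       # variance of the composite
    alpha = (k / (k - 1)) * (1 - (w**2 @ item_var) / comp_var)
    return -alpha                              # minimize the negative

res = minimize(neg_alpha, x0=np.ones(k), method="Nelder-Mead")
w_alpha = res.x / np.abs(res.x).sum()          # normalize for comparison

print("OLS (human-prediction) weights:", np.round(w_ols / np.abs(w_ols).sum(), 3))
print("Alpha-maximizing weights:      ", np.round(w_alpha, 3))
```

Note that weighted alpha is invariant to rescaling of w, so the optimization needs no constraint; the weights are normalized afterward only so the two schemes can be compared on a common scale.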
Subject
Psychology (miscellaneous), Social Sciences (miscellaneous)
Cited by
8 articles.