Authors:
Helbling Laura A., Berger Stéphanie, Verschoor Angela
Abstract
Multistage test (MST) designs promise efficient student ability estimates, an indispensable asset for individual diagnostics in high-stakes educational assessments. In high-stakes testing, annually changing test forms are required because publicly known test items impair accurate student ability estimation, and items with poor model fit must be continually replaced to guarantee test quality. This requires a large, continually refreshed item pool as the basis for high-stakes MST. In practice, calibrating newly developed items to feed annually changing tests is highly resource intensive. Piloting on a representative sample of students is often not feasible, given that participation in the actual high-stakes assessments already demands considerable organizational effort from schools. Hence, under practical constraints, newly developed items may be calibrated on the fly, in the form of a concurrent calibration embedded in operational MST designs. Based on a simulation approach, this paper examines how well Rasch versus 2PL modeling recovers item parameters when items are, for practical reasons, placed non-optimally in multistage tests. Overall, the results suggest that the 2PL model performs worse than the Rasch model at recovering item parameters under non-optimal item assembly in the MST, especially in recovering parameters at the margins. The greater flexibility of the 2PL model, in which item discrimination is allowed to vary, seems to come at the cost of increased volatility in parameter estimation. Although the overall bias may be modest, individual items can be severely biased when a 2PL model is used for item calibration under non-optimal item placement.
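For readers less familiar with the two models being compared, the item response functions below are the standard IRT definitions (textbook formulas, not equations reproduced from this paper):

```latex
% Probability that a person with ability \theta answers item i correctly.
\begin{align}
  \text{2PL:}   \quad & P_i(\theta) = \frac{\exp\bigl(a_i(\theta - b_i)\bigr)}{1 + \exp\bigl(a_i(\theta - b_i)\bigr)}
    && \text{(item discrimination $a_i$ and difficulty $b_i$ both estimated)}\\
  \text{Rasch:} \quad & P_i(\theta) = \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)}
    && \text{(the special case with a common discrimination, $a_i = 1$)}
\end{align}
```

The additional discrimination parameter $a_i$ is the "higher flexibility" the abstract refers to; it also means the 2PL model has more to estimate per item, which is consistent with the reported volatility when non-optimal item placement leaves some items with sparse or poorly targeted response data.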