Using Differential Item Functioning to Test for Interrater Reliability in Constructed Response Items-Reference-Cited by-同舟云学术

Using Differential Item Functioning to Test for Interrater Reliability in Constructed Response Items

Published:2020-01-20 Issue:4 Volume:80 Page:808-820
ISSN:0013-1644
Container-title:Educational and Psychological Measurement
language:en
Short-container-title:Educational and Psychological Measurement

Author:

Walker Cindy M.¹^ORCID,Göçer Şahin Sakine²

Affiliation:

1. Duquesne University, Pittsburgh, PA, USA

2. World-Class Instructional Design and Assessment (WIDA) at Wisconsin Center for Educational Research (WCER), Madison, WI, USA

Abstract

The purpose of this study was to investigate a new way of evaluating interrater reliability that can allow one to determine if two raters differ with respect to their rating on a polytomous rating scale or constructed response item. Specifically, differential item functioning (DIF) analyses were used to assess interrater reliability and compared with traditional interrater reliability measures. Three different procedures that can be used as measures of interrater reliability were compared: (1) intraclass correlation coefficient (ICC), (2) Cohen’s kappa statistic, and (3) DIF statistic obtained from Poly-SIBTEST. The results of this investigation indicated that DIF procedures appear to be a promising alternative to assess the interrater reliability of constructed response items, or other polytomous types of items, such as rating scales. Furthermore, using DIF to assess interrater reliability does not require a fully crossed design and allows one to determine if a rater is either more severe, or more lenient, in their scoring of each individual polytomous item on a test or rating scale.

Publisher

SAGE Publications

Subject

Applied Mathematics,Applied Psychology,Developmental and Educational Psychology,Education

Link

http://journals.sagepub.com/doi/pdf/10.1177/0013164419899731

Reference18 articles.

1. A Monte Carlo Comparison of Parametric and Nonparametric Polytomous DIF Detection Methods

2. Detecting DIF for Polytomously Scored Items: An Adaptation of the SIBTEST Procedure

3. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology.

4. A Coefficient of Agreement for Nominal Scales

Cited by 4 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. Machine learning and deep learning systems for automated measurement of “advanced” theory of mind: Reliability and validity in children and adolescents.;Psychological Assessment;2023-02

2. Implementing a Standardized Effect Size in the POLYSIBTEST Procedure;Educational and Psychological Measurement;2022-02-28

3. Investigating the Distractors to Explain DIF Effects Across Gender in Large-Scale Tests With Non-Linear Logistic Regression Models;Frontiers in Education;2022-01-18

4. Assessing the validity of an IAU General English Achievement Test through hybridizing differential item functioning and differential distractor functioning;Language Testing in Asia;2021-06-16