Measures of Agreement with Multiple Raters: Fréchet Variances and Inference

Published: 2024-01-08 Issue: 2 Volume: 89 Page: 517-541
ISSN: 0033-3123
Container-title: Psychometrika
Language: en
Short-container-title: Psychometrika

Author:

Moss, Jonas

Abstract

Most measures of agreement are chance-corrected. They differ in three dimensions: their definition of chance agreement, their choice of disagreement function, and how they handle multiple raters. Chance agreement is usually defined in a pairwise manner, following either Cohen’s kappa or Fleiss’s kappa. The disagreement function is usually a nominal, quadratic, or absolute value function. But how to handle multiple raters is contentious, with the main contenders being Fleiss’s kappa, Conger’s kappa, and Hubert’s kappa, the variant of Fleiss’s kappa where agreement is said to occur only if every rater agrees. More generally, multi-rater agreement coefficients can be defined in a g-wise way, where the disagreement weighting function uses g raters instead of two. This paper contains two main contributions. (a) We propose using Fréchet variances to handle the case of multiple raters. The Fréchet variances are intuitive disagreement measures and turn out to generalize the nominal, quadratic, and absolute value functions to the case of more than two raters. (b) We derive the limit theory of g-wise weighted agreement coefficients, with chance agreement of the Cohen-type or Fleiss-type, for the case where every item is rated by the same number of raters. Trying out three confidence interval constructions, we end up recommending calculating confidence intervals using the arcsine transform or the Fisher transform.
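
To make the ingredients named in the abstract concrete, the following is a minimal Python sketch of a chance-corrected coefficient for the pairwise (two-rater) case only, written as one minus the ratio of observed to chance disagreement. It is not the paper's implementation: the function names (`agreement`, `fisher_interval`), the toy ratings, and the standard error passed to the interval helper are all assumptions introduced for illustration. The Fisher-transform interval uses the usual delta-method form; the paper's exact construction and its asymptotic variance are not reproduced here, and the g-wise Fréchet-variance generalization is not attempted.

```python
import math
from itertools import product


# Three common pairwise disagreement functions named in the abstract.
def nominal(x, y):
    return 0.0 if x == y else 1.0


def quadratic(x, y):
    return float((x - y) ** 2)


def absolute(x, y):
    return float(abs(x - y))


def agreement(ratings, disagreement=nominal, chance="cohen"):
    """Chance-corrected agreement for two raters (illustrative sketch).

    ratings: list of (rater_1, rater_2) pairs, one pair per item.
    chance:  "cohen"  -> each rater keeps their own marginal distribution,
             "fleiss" -> both raters share the pooled marginal distribution.
    """
    n = len(ratings)
    observed = sum(disagreement(a, b) for a, b in ratings) / n

    first = [a for a, _ in ratings]
    second = [b for _, b in ratings]
    if chance == "cohen":
        pairs = list(product(first, second))
    elif chance == "fleiss":
        pooled = first + second
        pairs = list(product(pooled, pooled))
    else:
        raise ValueError("chance must be 'cohen' or 'fleiss'")

    # Expected disagreement when the two ratings are drawn independently.
    expected = sum(disagreement(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - observed / expected


def fisher_interval(estimate, std_error, z=1.96):
    """Delta-method interval on the atanh scale, mapped back with tanh.

    std_error is assumed to be supplied by the caller; it is NOT derived here.
    """
    center = math.atanh(estimate)
    half_width = z * std_error / (1.0 - estimate ** 2)
    return math.tanh(center - half_width), math.tanh(center + half_width)


# Hypothetical toy data: four items, two raters, categories 1 and 2.
ratings = [(1, 1), (1, 1), (1, 2), (2, 2)]
kappa_cohen = agreement(ratings, nominal, chance="cohen")
kappa_fleiss = agreement(ratings, nominal, chance="fleiss")
# The standard error 0.1 below is a made-up placeholder, not an estimate.
low, high = fisher_interval(kappa_cohen, std_error=0.1)
```

With the nominal function, the Cohen-type option reduces to Cohen's kappa and the Fleiss-type option to the two-rater form of Fleiss's kappa; swapping in `quadratic` or `absolute` gives the corresponding weighted coefficients. Because the interval is built on the transformed scale and mapped back with `tanh`, its endpoints always stay inside (-1, 1).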

Funder

Norwegian Business School

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s11336-023-09945-2.pdf

References (46 articles)

1. Berry, K. J., Johnston, J. E., & Mielke, P. W., Jr. (2008). Weighted kappa for multiple raters. Perceptual and Motor Skills, 107(3), 837–848. https://doi.org/10.2466/pms.107.3.837-848

2. Berry, K. J., & Mielke, P. W. (1988). A generalization of Cohen’s kappa agreement measure to interval measurement and multiple raters. Educational and Psychological Measurement, 48(4), 921–933. https://doi.org/10.1177/0013164488484007

3. Carrasco, J. L., & Jover, L. (2003). Estimating the generalized concordance correlation coefficient through variance components. Biometrics, 59(4), 849–858. https://doi.org/10.1111/j.0006-341x.2003.00099.x

4. Cicchetti, D. V., & Feinstein, A. R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43(6), 551–558. https://doi.org/10.1016/0895-4356(90)90159-m

5. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104