Abstract
Background: Many methods under the umbrella of inter-rater agreement (IRA) have been proposed to evaluate how well two or more medical experts agree on a set of outcomes. The objective of this work was to assess key IRA statistics in the context of multiple raters with binary outcomes.
Methods: We simulated the responses of several raters (2–5) with 20, 50, 300, and 500 observations. For each combination of raters and observations, we estimated the expected value and variance of four commonly used inter-rater agreement statistics (Fleiss’ Kappa, Light’s Kappa, Conger’s Kappa, and Gwet’s AC1).
Results: In the case of equal outcome prevalence (symmetric), the estimated expected values of all four statistics were equal. In the asymmetric case, only the estimated expected values of the three Kappa statistics were equal. In the symmetric case, Fleiss’ Kappa yielded a higher estimated variance than the other three statistics. In the asymmetric case, Gwet’s AC1 yielded a lower estimated variance than the three Kappa statistics for each scenario.
Conclusion: Since the population-level prevalence of a set of outcomes may not be known a priori, Gwet’s AC1 statistic should be favored over the three Kappa statistics. For meaningful direct comparisons between IRA measures, transformations between statistics should be conducted.
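Fleiss’ Kappa and Gwet’s AC1, the two statistics the abstract contrasts most directly, share the same observed-agreement term and differ only in how chance agreement is computed from the marginal category proportions. A minimal illustrative sketch (not the authors’ simulation code; the example ratings matrix is hypothetical), taking an n-subjects × n-categories matrix of rating counts:

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa from an n_subjects x n_categories matrix of rating counts.

    Assumes every subject is rated by the same number of raters.
    """
    n = len(counts)            # number of subjects
    r = sum(counts[0])         # raters per subject
    k = len(counts[0])         # number of categories
    # Observed agreement: mean pairwise agreement across subjects.
    p_bar = sum(sum(c * (c - 1) for c in row) / (r * (r - 1))
                for row in counts) / n
    # Marginal proportion of ratings falling in each category.
    p = [sum(row[j] for row in counts) / (n * r) for j in range(k)]
    # Chance agreement: sum of squared marginal proportions.
    p_e = sum(pj ** 2 for pj in p)
    return (p_bar - p_e) / (1 - p_e)


def gwet_ac1(counts):
    """Gwet's AC1: same observed agreement, different chance-agreement term."""
    n = len(counts)
    r = sum(counts[0])
    k = len(counts[0])
    p_bar = sum(sum(c * (c - 1) for c in row) / (r * (r - 1))
                for row in counts) / n
    p = [sum(row[j] for row in counts) / (n * r) for j in range(k)]
    # Chance agreement penalizes categories near 50% prevalence,
    # which is what stabilizes AC1 under asymmetric prevalence.
    p_e = sum(pj * (1 - pj) for pj in p) / (k - 1)
    return (p_bar - p_e) / (1 - p_e)


# Three raters, four subjects, binary outcomes (columns: category counts).
ratings = [[3, 0], [0, 3], [2, 1], [3, 0]]
print(fleiss_kappa(ratings))  # 0.625
print(gwet_ac1(ratings))      # 0.7
```

With asymmetric prevalence the two chance terms diverge, which is why, as the Results note, AC1’s variance behaves differently from the Kappa family.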
Funder
Natural Sciences and Engineering Research Council
Subject
Physics and Astronomy (miscellaneous),General Mathematics,Chemistry (miscellaneous),Computer Science (miscellaneous)
Cited by
17 articles.