What Does Affect the Correlation Among Evaluation Measures?

Author:

Nicola Ferro 1 (ORCID)

Affiliation:

1. University of Padua, Padova, Italy

Abstract

Information Retrieval (IR) is well known for the large number of evaluation measures it has adopted, with new ones appearing ever more frequently. In this context, correlation analysis is the tool used to study evaluation measures: it lets us understand whether two measures rank systems similarly, whether they capture different aspects of system performance or reflect different user models, and whether a new measure is well motivated. To this end, the two most commonly used correlation coefficients are Kendall's τ and the AP correlation τ_AP. The goal of this article is to investigate the properties of the tool itself, that is, of the correlation analysis we use to study evaluation measures. In particular, we investigate three research questions about these two correlation coefficients: (i) what is the effect of the number of systems and topics? (ii) what is the effect of removing low-performing systems? (iii) what is the effect of the experimental collections? To answer these questions, we propose a methodology based on a General Linear Mixed Model (GLMM) and ANalysis Of VAriance (ANOVA) that isolates the effects of the number of topics, the number of systems, and the experimental collections, letting us observe expected correlation values, net of these effects, that are stable and reliable. We learned that the effect of the number of topics is more prominent than the effect of the number of systems. Although it produces different absolute values, removing low-performing systems does not seem to provide information substantially different from not removing them, especially when comparing a whole set of evaluation measures. Finally, we found that both document corpora and topic sets affect the correlation among evaluation measures, with the effect of the latter being more prominent. Moreover, there is a substantial interaction between evaluation measures, corpora, and topic sets, meaning that the correlation between two evaluation measures can substantially increase or decrease depending on the corpora and topics at hand.
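The two coefficients named in the abstract can be sketched in a few lines of code. The following is a minimal illustration, not the paper's implementation: Kendall's τ comes from SciPy, while τ_AP (the AP correlation of Yilmaz, Aslam, and Robertson, SIGIR 2008) is coded directly from its definition; the MAP and nDCG scores for the five hypothetical systems are invented for the example.

```python
# Minimal sketch of the two rank-correlation coefficients discussed
# in the article. The scores below are invented for illustration.
import numpy as np
from scipy.stats import kendalltau

def tau_ap(ref_scores, est_scores):
    """AP correlation tau_AP of the ranking induced by `est_scores`
    against the reference ranking induced by `ref_scores`."""
    ref = np.asarray(ref_scores, dtype=float)
    est = np.asarray(est_scores, dtype=float)
    n = len(ref)
    order = np.argsort(-est)  # systems in decreasing order of `est`
    total = 0.0
    for i in range(1, n):  # skip the top-ranked system
        above = order[:i]  # systems that `est` ranks above system order[i]
        # How many of them does the reference measure also rank above it?
        # (Ties count as incorrectly ordered in this simple sketch.)
        correct = np.sum(ref[above] > ref[order[i]])
        total += correct / i
    return 2.0 * total / (n - 1) - 1.0

# Hypothetical MAP and nDCG scores for five systems.
map_scores  = [0.31, 0.28, 0.35, 0.22, 0.30]
ndcg_scores = [0.45, 0.41, 0.44, 0.33, 0.46]

tau, _ = kendalltau(map_scores, ndcg_scores)
print(f"Kendall's tau: {tau:.3f}")
print(f"tau_AP:        {tau_ap(map_scores, ndcg_scores):.3f}")
```

Unlike Kendall's τ, which weights all discordant pairs equally, τ_AP discounts disagreements by rank, so swaps among top-ranked systems cost more than swaps near the bottom.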

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Science Applications; General Business, Management and Accounting; Information Systems


Cited by 8 articles (five shown):

1. Evaluating Fairness in Argument Retrieval; Proceedings of the 30th ACM International Conference on Information & Knowledge Management; 2021-10-26

2. Towards Meaningful Statements in IR Evaluation: Mapping Evaluation Measures to Interval Scales; IEEE Access; 2021

3. How to Measure the Reproducibility of System-oriented IR Experiments; Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval; 2020-07-25

4. How do interval scales help us with better understanding IR evaluation measures?; Information Retrieval Journal; 2019-09-04

5. Learning to Adaptively Rank Document Retrieval System Configurations; ACM Transactions on Information Systems; 2019-01-31
