On Crowdsourcing Relevance Magnitudes for Information Retrieval Evaluation

Published: 2017-07-31 Issue: 3 Volume: 35 Page: 1-32
ISSN: 1046-8188
Container-title: ACM Transactions on Information Systems
Language: en
Short-container-title: ACM Trans. Inf. Syst.

Author:

Maddalena Eddy¹, Mizzaro Stefano¹, Scholer Falk², Turpin Andrew³

Affiliation:

1. University of Udine, Italy

2. RMIT University, Australia

3. University of Melbourne, Australia

Abstract

Magnitude estimation is a psychophysical scaling technique for the measurement of sensation, where observers assign numbers to stimuli in response to their perceived intensity. We investigate the use of magnitude estimation for judging the relevance of documents for information retrieval evaluation, carrying out a large-scale user study across 18 TREC topics and collecting over 50,000 magnitude estimation judgments using crowdsourcing. Our analysis shows that magnitude estimation judgments can be reliably collected using crowdsourcing, are competitive in terms of assessor cost, and are, on average, rank-aligned with ordinal judgments made by expert relevance assessors. We explore the application of magnitude estimation for IR evaluation, calibrating two gain-based effectiveness metrics, nDCG and ERR, directly from user-reported perceptions of relevance. A comparison of TREC system effectiveness rankings based on binary, ordinal, and magnitude estimation relevance shows substantial variation; in particular, the top systems ranked using magnitude estimation and ordinal judgments differ substantially. Analysis of the magnitude estimation scores shows that this effect is due in part to varying perceptions of relevance: different users have different perceptions of the impact of relative differences in document relevance. These results have direct implications for IR evaluation, suggesting that current assumptions about a single view of relevance being sufficient to represent a population of users are unlikely to hold.
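The abstract refers to two gain-based metrics, nDCG and ERR, whose gains can be set from real-valued relevance scores rather than fixed ordinal grades. The following is a minimal Python sketch of that general idea, not the authors' procedure: the function names, the example scores, the simplification of computing the ideal ranking only over the retrieved documents, and the linear mapping from gain to stopping probability in ERR are all assumptions made for illustration.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains, k=None):
    """nDCG@k. Simplification (assumption): the ideal ranking is a reordering
    of the retrieved documents only, not of all judged documents."""
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

def err(gains, max_gain):
    """Expected reciprocal rank. Assumption: the probability that a user stops
    at a document is taken as gain / max_gain (a linear mapping)."""
    score, p_continue = 0.0, 1.0
    for rank, g in enumerate(gains, start=1):
        p_stop = g / max_gain
        score += p_continue * p_stop / rank
        p_continue *= 1.0 - p_stop
    return score

# Hypothetical per-document magnitude scores in ranked order (made-up values).
scores = [40.0, 5.0, 0.0, 12.0]
print(ndcg(scores), err(scores, max_gain=max(scores)))
```

Because the gains are passed in directly, the same functions accept binary, ordinal, or magnitude-style scores, which is what makes it possible to compare system rankings under each kind of judgment.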

Funder

Australian Research Council

Google Faculty Research Award

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Science Applications; General Business, Management and Accounting; Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/3002172

References: 44 articles.

1. Using crowdsourcing for TREC relevance assessment

2. An analysis of crowd workers mistakes for specific and complex relevance assessment task

3. Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation

4. Magnitude Estimation of Linguistic Acceptability

5. Learning to rank using gradient descent

Cited by 41 articles.

1. Evaluating Generative Ad Hoc Information Retrieval;Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval;2024-07-10

2. Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs;Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval;2024-07-10

3. Reliable Information Retrieval Systems Performance Evaluation: A Review;IEEE Access;2024

4. Optimizing Task Distribution Systems: A Comparative Study of Micro-Task Job Replication, Accuracy, and Budget Constraints;2023 International Conference on Network, Multimedia and Information Technology (NMITCON);2023-09-01

5. How Many Crowd Workers Do I Need? On Statistical Power when Crowdsourcing Relevance Judgments;ACM Transactions on Information Systems;2023-08-18