Affiliation:
1. University of Delaware, Newark, DE
Abstract
High-quality reusable test collections and formal statistical hypothesis testing together support a rigorous experimental environment for information retrieval research. But as Armstrong et al. [2009b] recently argued, global analysis of experiments suggests that there has actually been little real improvement in ad hoc retrieval effectiveness over time. We investigate this phenomenon in the context of simultaneous testing of many hypotheses using a fixed set of data. We argue that the most common approaches to significance testing ignore a great deal of information about the world. Taking into account even a fairly small amount of this information can lead to very different conclusions about systems than those that have appeared in published literature. We demonstrate how to model a set of IR experiments for analysis both mathematically and practically, and show that doing so can cause p-values from statistical hypothesis tests to increase by orders of magnitude. This has major consequences on the interpretation of experimental results using reusable test collections: it is very difficult to conclude that anything is significant once we have modeled many of the sources of randomness in experimental design and analysis.
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Science Applications, General Business, Management and Accounting, Information Systems
Cited by
87 articles.
1. Uncontextualized significance considered dangerous;Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval;2024-07-10
2. Multiple Testing for IR and Recommendation System Experiments;Lecture Notes in Computer Science;2024
3. An Intrinsic Framework of Information Retrieval Evaluation Measures;Lecture Notes in Networks and Systems;2024
4. Chuweb21D: A Deduped English Document Collection for Web Search Tasks;Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region;2023-11-26
5. How Discriminative Are Your Qrels? How To Study the Statistical Significance of Document Adjudication Methods;Proceedings of the 32nd ACM International Conference on Information and Knowledge Management;2023-10-21