Affiliation:
1. The University of Melbourne, Victoria, Australia
2. Microsoft, Barton, ACT, Australia
3. RMIT University, Victoria, Australia
Abstract
Information retrieval systems aim to help users satisfy information needs. We argue that the goal of the person using the system, and the pattern of behavior that they exhibit as they proceed to attain that goal, should be incorporated into the methods and techniques used to evaluate the effectiveness of IR systems, so that the resulting effectiveness scores have a useful interpretation that corresponds to the users’ search experience. In particular, we investigate the role of search task complexity, and show that it has a direct bearing on the number of relevant answer documents sought by users in response to an information need, suggesting that useful effectiveness metrics must be goal sensitive. We further suggest that user behavior while scanning results listings is affected by the rate at which their goal is being realized, and hence that appropriate effectiveness metrics must be adaptive to the presence (or not) of relevant documents in the ranking. In response to these two observations, we present a new effectiveness metric, INST, that has both of the desired properties: INST employs a parameter T, a direct measure of the user’s search goal that adjusts the top-weightedness of the evaluation score; moreover, as progress towards the target T is made, the modeled user behavior is adapted to reflect the remaining expectations. INST is experimentally compared to previous effectiveness metrics, including Average Precision (AP), Normalized Discounted Cumulative Gain (NDCG), and Rank-Biased Precision (RBP), demonstrating our claims as to INST’s usefulness. Like RBP, INST is a weighted-precision metric, meaning that each score can be accompanied by a residual that quantifies the extent of the score uncertainty caused by unjudged documents. As part of our experimentation, we use crowd-sourced data and score residuals to demonstrate that a wide range of queries arise for even quite specific information needs, and that these variant queries introduce significant levels of residual uncertainty into typical experimental evaluations. These causes of variability have wide-reaching implications for experiment design, and for the construction of test collections.
Funder
Australian Research Council's Discovery Projects Scheme
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Science Applications; General Business, Management and Accounting; Information Systems
Cited by
71 articles.
1. Query Variability and Experimental Consistency: A Concerning Case Study;Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval;2024-08-02
2. Evaluating Generative Ad Hoc Information Retrieval;Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval;2024-07-10
3. The Treatment of Ties in Rank-Biased Overlap;Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval;2024-07-10
4. Understanding users' dynamic perceptions of search gain and cost in sessions: An expectation confirmation model;Journal of the Association for Information Science and Technology;2024-06-17
5. Tutorial on User Simulation for Evaluating Information Access Systems on the Web;Companion Proceedings of the ACM Web Conference 2024;2024-05-13