Affiliation:
1. University of Nottingham, Nottingham, UK
2. University of Nottingham & Nokia Bell Labs, Cambridge, UK
Abstract
Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems over multiple rounds of natural language dialogue. Evaluating such systems is challenging: any natural language response could be generated, and users commonly interact over multiple semantically coherent rounds to accomplish a search task. Although prior studies have proposed many evaluation metrics, the extent to which those measures effectively capture user preference remains to be investigated. In this article, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect “actual” performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture any property deemed important: adequacy, informativeness, and fluency in the context of conversational search. By conducting experiments on two test collections, we find that the performance of different metrics varies significantly across different scenarios, whereas, consistent with prior studies, existing metrics achieve only weak correlation with ultimate user preference and satisfaction. METEOR is, comparatively speaking, the best existing single-turn metric considering all three perspectives. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation for conversational search to date.
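To make two of these perspectives concrete, below is a minimal Python sketch (not the authors' implementation) of how they can be operationalized: fidelity as Kendall's tau between per-response METEOR scores and user preference ratings, and reliability as a paired randomization test between two systems. The data shapes (parallel lists of responses, references, and user ratings) are illustrative assumptions.

```python
# A minimal sketch under assumed data shapes; illustrative only, not the
# paper's code. Requires nltk (with its 'wordnet' corpus downloaded) and scipy.
import random

from nltk.translate.meteor_score import meteor_score
from scipy.stats import kendalltau


def meteor(reference: str, hypothesis: str) -> float:
    """Single-reference METEOR; NLTK >= 3.6.6 expects pre-tokenized input."""
    return meteor_score([reference.split()], hypothesis.split())


def fidelity(metric_scores, user_ratings):
    """Fidelity: rank correlation between metric scores and user preference."""
    tau, p_value = kendalltau(metric_scores, user_ratings)
    return tau, p_value


def reliability_p_value(scores_a, scores_b, trials=10_000, seed=0):
    """Reliability: paired randomization test between systems A and B.

    Estimates how often a mean score difference at least as large as the
    observed one arises by chance under random per-query label swaps.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a  # randomly swap the paired scores
            diff += a - b
        if abs(diff) / n >= observed:
            hits += 1
    return hits / trials  # small p-value: difference unlikely due to chance
```

Intuitiveness, in the paper's sense, asks whether a metric's pairwise system preferences agree with measures targeting a single property (adequacy, informativeness, or fluency); checking that agreement follows the same paired-comparison pattern as the test above.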
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Science Applications, General Business, Management and Accounting, Information Systems
Cited by
12 articles.
1. What Matters in a Measure? A Perspective from Large-Scale Search Evaluation;Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval;2024-07-10
2. The Eighth Workshop on Search-Oriented Conversational Artificial Intelligence (SCAI’24);Proceedings of the 2024 ACM SIGIR Conference on Human Information Interaction and Retrieval;2024-03-10
3. User Simulation for Evaluating Information Access Systems;Foundations and Trends® in Information Retrieval;2024
4. Understanding Users’ Confidence in Spoken Queries for Conversational Search Systems;Communications in Computer and Information Science;2024
5. A Survey on Review-Aware Recommendation Systems;Proceedings of the 29th Brazilian Symposium on Multimedia and the Web;2023-10-23