Online and Offline Evaluation in Search Clarification

Published: 2024-07-25
ISSN: 1046-8188
Container-title: ACM Transactions on Information Systems
Language: en
Short-container-title: ACM Trans. Inf. Syst.

Author:

Tavakoli Leila¹, Trippas Johanne R.¹, Zamani Hamed², Scholer Falk¹, Sanderson Mark¹

Affiliation:

1. RMIT University, Australia

2. University of Massachusetts Amherst, United States

Abstract

The effectiveness of clarification question models in engaging users within search systems is currently constrained, casting doubt on their overall usefulness. To improve the performance of these models, it is crucial to employ assessment approaches that encompass both real-time feedback from users (online evaluation) and the characteristics of clarification questions evaluated through human assessment (offline evaluation). However, the relationship between online and offline evaluations has been debated in information retrieval. This study aims to investigate how this discordance holds in search clarification. We use user engagement as ground truth and employ several offline labels to investigate to what extent the offline ranked lists of clarification resemble the ideal ranked lists based on online user engagement. Contrary to the current understanding that offline evaluations fall short of supporting online evaluations, we indicate that when identifying the most engaging clarification questions from the user’s perspective, online and offline evaluations correspond with each other. We show that the query length does not influence the relationship between online and offline evaluations, and reducing uncertainty in online evaluation strengthens this relationship. We illustrate that an engaging clarification needs to excel from multiple perspectives, and SERP quality and characteristics of the clarification are equally important. We also investigate if human labels can enhance the performance of Large Language Models (LLMs) and Learning-to-Rank (LTR) models in identifying the most engaging clarification questions from the user’s perspective by incorporating offline evaluations as input features. Our results indicate that Learning-to-Rank models do not perform better than individual offline labels. However, GPT, an LLM, emerges as the standout performer, surpassing all Learning-to-Rank models and offline labels.
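The comparison the abstract describes, checking whether a ranking induced by an offline label resembles the ideal ranking induced by online engagement, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the candidate ids (`c1` to `c3`), the engagement values, and the 1-5 quality labels are all hypothetical, and Kendall's tau is used here only as one common rank-agreement measure.

```python
from itertools import combinations


def rank_by(scores):
    """Return item ids sorted by descending score (ties broken by id)."""
    return sorted(scores, key=lambda item: (-scores[item], item))


def top1_agreement(online, offline):
    """True if the offline label selects the same top item as online engagement."""
    return rank_by(online)[0] == rank_by(offline)[0]


def kendall_tau(online, offline):
    """Kendall tau-a over all item pairs; tied pairs count as neither."""
    items = sorted(online)
    concordant = discordant = 0
    for a, b in combinations(items, 2):
        sign = (online[a] - online[b]) * (offline[a] - offline[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Hypothetical data for one query with three candidate clarifications.
engagement = {"c1": 0.12, "c2": 0.30, "c3": 0.05}  # online signal (assumed)
offline_quality = {"c1": 3, "c2": 5, "c3": 4}      # human label, 1-5 (assumed)

same_top = top1_agreement(engagement, offline_quality)
tau = kendall_tau(engagement, offline_quality)
```

In this toy case the two rankings agree on the most engaging clarification (`c2`) even though they disagree on the order of the other two, which is the kind of distinction between top-item agreement and full-list agreement that the abstract draws.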

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3681786

References (82 articles)

1. Generating labels from clicks

2. The effect of user characteristics on search effectiveness in information retrieval

3. Mohammad Aliannejadi, Julia Kiseleva, Aleksandr Chuklin, Jeff Dalton, and Mikhail Burtsev. 2020. ConvAI3: Generating Clarifying Questions for Open-Domain Dialogue Systems (ClariQ). arXiv preprint arXiv:2009.11352 (2020).

4. Mohammad Aliannejadi, Julia Kiseleva, Aleksandr Chuklin, Jeff Dalton, and Mikhail Burtsev. 2021. Building and Evaluating Open-Domain Dialogue Corpora with Clarifying Questions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 4473–4484.

5. Asking Clarifying Questions in Open-Domain Information-Seeking Conversations