Author:
Schuff Hendrik, Vanderlyn Lindsey, Adel Heike, Vu Ngoc Thang
Abstract
Many research topics in natural language processing (NLP), such as explanation generation, dialog modeling, or machine translation, require evaluation that goes beyond standard metrics like accuracy or F1 score toward a more human-centered approach. Therefore, understanding how to design user studies becomes increasingly important. However, few comprehensive resources exist on planning, conducting, and evaluating user studies for NLP, making it hard to get started for researchers without prior experience in the field of human evaluation. In this paper, we summarize the most important aspects of user studies and their design and evaluation, providing direct links to NLP tasks and NLP-specific challenges where appropriate. We (i) outline general study design, ethical considerations, and factors to consider for crowdsourcing, (ii) discuss the particularities of user studies in NLP, and provide starting points to select questionnaires, experimental designs, and evaluation methods that are tailored to the specific NLP tasks. Additionally, we offer examples with accompanying statistical evaluation code, to bridge the gap between theoretical guidelines and practical applications.
Publisher
Cambridge University Press (CUP)
Subject
Artificial Intelligence, Linguistics and Language, Language and Linguistics, Software
Reference
111 articles.
Cited by
6 articles.