How to do human evaluation: A brief introduction to user studies in NLP

Published:2023-02-06 Issue:5 Volume:29 Page:1199-1222
ISSN:1351-3249
Container-title:Natural Language Engineering
Language:en
Short-container-title:Nat. Lang. Eng.

Author:

Schuff Hendrik,Vanderlyn Lindsey,Adel Heike,Vu Ngoc Thang

Abstract

Many research topics in natural language processing (NLP), such as explanation generation, dialog modeling, or machine translation, require evaluation that goes beyond standard metrics like accuracy or F1 score toward a more human-centered approach. Therefore, understanding how to design user studies becomes increasingly important. However, few comprehensive resources exist on planning, conducting, and evaluating user studies for NLP, making it hard to get started for researchers without prior experience in the field of human evaluation. In this paper, we summarize the most important aspects of user studies and their design and evaluation, providing direct links to NLP tasks and NLP-specific challenges where appropriate. We (i) outline general study design, ethical considerations, and factors to consider for crowdsourcing, (ii) discuss the particularities of user studies in NLP, and provide starting points to select questionnaires, experimental designs, and evaluation methods that are tailored to the specific NLP tasks. Additionally, we offer examples with accompanying statistical evaluation code, to bridge the gap between theoretical guidelines and practical applications.

Publisher

Cambridge University Press (CUP)

Subject

Artificial Intelligence,Linguistics and Language,Language and Linguistics,Software

References: 111 articles.

1. Howcroft, D.M. and Rieser, V. (2021). What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics, pp. 8932–8939. doi: 10.18653/v1/2021.emnlp-main.703. Available at https://aclanthology.org/2021.emnlp-main.703.

2. Bojar, O., Federmann, C., Haddow, B., Koehn, P., Post, M. and Specia, L. (2016). Ten years of WMT evaluation campaigns: Lessons learnt. In Proceedings of the LREC 2016 Workshop “Translation Evaluation–From Fragmented Tools and Data Sets to an Integrated Ecosystem”, pp. 27–34.

3. Generalized Linear Mixed Models

4. Crowdworker Economics in the Gig Economy

5. The Nuremberg Code;Nuremberg;Trials of War Criminals Before the Nuremberg Military Tribunals Under Control Council Law, 1949

Cited by 6 articles.

1. Dialogue agents 101: a beginner’s guide to critical ingredients for designing effective conversational systems;Natural Language Processing;2024-09-09

2. Thought flow nets: From single predictions to trains of model thought;Natural Language Processing;2024-09-06

3. LLMChain: Blockchain-Based Reputation System for Sharing and Evaluating Large Language Models;2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC);2024-07-02

4. Enhancing Language Learning Through Human-Computer Interaction and Generative AI: LATILL Platform;Lecture Notes in Computer Science;2024

5. Improving and Understanding Clarifying Question Generation in Conversational Search;Lecture Notes in Computer Science;2024