Common Flaws in Running Human Evaluation Experiments in NLP

Published: 2024 Issue: 2 Volume: 50 Page: 795-805
ISSN: 0891-2017
Container-title: Computational Linguistics
Language: en

Author:

Craig Thomson¹, Ehud Reiter², Anya Belz³

Affiliation:

1. ADAPT, Dublin City University and University of Aberdeen. c.thomson.nlp@gmail.com

2. Department of Computing Science, University of Aberdeen. e.reiter@abdn.ac.uk

3. ADAPT, Dublin City University and University of Aberdeen. anya.belz@adaptcentre.ie

Abstract

While conducting a coordinated set of repeat runs of human evaluation experiments in NLP, we discovered flaws in every single experiment we selected for inclusion via a systematic process. In this squib, we describe the types of flaws we discovered, which include coding errors (e.g., loading the wrong system outputs to evaluate), failure to follow standard scientific practice (e.g., ad hoc exclusion of participants and responses), and mistakes in reported numerical results (e.g., reported numbers not matching experimental data). If these problems are widespread, it would have worrying implications for the rigor of NLP evaluation experiments as currently conducted. We discuss what researchers can do to reduce the occurrence of such flaws, including pre-registration, better code development practices, increased testing and piloting, and post-publication addressing of errors.

Publisher

MIT Press

Link

https://direct.mit.edu/coli/article-pdf/50/2/795/2456321/coli_a_00508.pdf
