Affiliation:
1. School of Computing, Dublin City University, Dublin, Ireland
Abstract
Evaluation of narrative text generated by machines has traditionally been a challenge, particularly when attempting to evaluate subjective elements such as interest or believability. Recent improvements in narrative machine text generation have been largely driven by the emergence of transformer‐based language models, trained on massive quantities of data, resulting in higher quality text generation. In this study, a corpus of stories is generated using the pre‐trained GPT‐Neo transformer model, with human‐written prompts as inputs upon which to base the narrative text. The stories generated through this process are subsequently evaluated through both human evaluation and two automated metrics: BERTScore and BERT Next Sentence Prediction, with the aim of determining whether there is a correlation between the automatic scores and the human judgements. The results show that the human judgements diverge from the scores produced by modern automated metrics, suggesting further work is required to train automated metrics to identify text that humans regard as interesting.
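A minimal sketch of the kind of pipeline the abstract describes is given below: generating a story continuation with a pre-trained GPT-Neo model from a human-written prompt, then scoring it with BERTScore and BERT Next Sentence Prediction via the Hugging Face transformers and bert-score libraries. The model sizes, prompt, and reference text are illustrative assumptions, not the authors' exact setup.

```python
import torch
from transformers import pipeline, BertTokenizer, BertForNextSentencePrediction
from bert_score import score as bertscore

# Illustrative human-written prompt (assumption, not from the study's prompt set).
prompt = "The old lighthouse keeper heard a knock at midnight."

# 1) Generate narrative text with a pre-trained GPT-Neo model
#    (the 125M checkpoint is used here only to keep the sketch lightweight).
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M")
story = generator(prompt, max_length=120, do_sample=True, temperature=0.9)[0]["generated_text"]

# 2) BERTScore: embedding-based similarity between the generated story
#    and a reference text (reference here is a made-up placeholder).
reference = "At midnight someone knocked on the lighthouse keeper's door."
P, R, F1 = bertscore([story], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.3f}")

# 3) BERT Next Sentence Prediction: probability that the generated text
#    is a plausible continuation of the prompt (label 0 = "is next").
nsp_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
inputs = nsp_tokenizer(prompt, story, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = nsp_model(**inputs).logits
is_next_prob = torch.softmax(logits, dim=1)[0, 0].item()
print(f"NSP continuation probability: {is_next_prob:.3f}")
```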
Subject
Artificial Intelligence, Computational Theory and Mathematics, Theoretical Computer Science, Control and Systems Engineering
Cited by
1 article.