Affiliation:
1. Zhejiang Sci-Tech University, Hangzhou, China
2. University of Science and Technology Beijing, Beijing, China
3. Shenzhen Technology University, Shenzhen, China
Abstract
Abstractive summarization (AS) systems, which generate text that summarizes the crucial information in an original document, have been widely adopted in recent years. Unfortunately, they may still produce factually unreliable summaries, leading to misunderstanding and distortion of information. This calls for methods that can properly evaluate the quality of AS systems. The existing reference-based evaluation approach, however, relies on reference summaries and on automatic evaluation metrics (e.g., ROUGE), so it is restricted by the availability and quality of reference summaries and by the capability of those metrics. In this study, we propose MTAS, a novel metamorphic-testing-based approach for evaluating AS in a reference-free way. Our two major contributions are (i) five metamorphic relations for AS, which involve semantic-preserving and focus-preserving transformations at the document level, and (ii) a summary consistency evaluation metric, SCY, which measures the alignment between a pair of summaries by incorporating both semantic and factual consistency. Our experimental results show that SCY has a significantly higher correlation with human judgment than a set of existing metrics. They also demonstrate that MTAS removes the dependence on reference summaries and reports a large number of summary inconsistencies, revealing various summarization issues in state-of-the-art AS systems.
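The reference-free idea described in the abstract can be illustrated with a minimal sketch. It is not the MTAS implementation: the function names are hypothetical, the synonym swap is only a toy stand-in for a semantic-preserving transformation, and a plain token-overlap (Jaccard) score stands in for SCY, whose definition is not given here. The sketch summarizes a document and its transformed version, compares the two summaries, and flags the pair when their consistency falls below an assumed threshold.

```python
import re
from typing import Callable, List, Tuple


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_consistency(a: str, b: str) -> float:
    """Toy consistency score in [0, 1]; a stand-in, NOT the SCY metric."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def swap_synonyms(doc: str) -> str:
    """Hypothetical semantic-preserving transformation (tiny synonym table)."""
    table = {"big": "large", "quick": "fast"}
    pattern = r"\b(" + "|".join(table) + r")\b"
    return re.sub(pattern, lambda m: table[m.group(1)], doc)


def lead_sentence_summarizer(doc: str) -> str:
    """Toy system under test: returns the first sentence of the document."""
    return re.split(r"(?<=[.!?])\s+", doc.strip())[0]


def check_relation(
    summarize: Callable[[str], str],
    transform: Callable[[str], str],
    docs: List[str],
    threshold: float = 0.8,  # assumed cut-off, chosen only for illustration
) -> List[Tuple[str, float]]:
    """Return (document, score) pairs whose two summaries are inconsistent."""
    violations = []
    for doc in docs:
        source_summary = summarize(doc)
        followup_summary = summarize(transform(doc))
        score = jaccard_consistency(source_summary, followup_summary)
        if score < threshold:
            violations.append((doc, score))
    return violations


if __name__ == "__main__":
    docs = ["The quick fox ran home. It slept."]
    for doc, score in check_relation(lead_sentence_summarizer, swap_synonyms, docs):
        print(f"inconsistent summaries (score={score:.2f}): {doc}")
```

No reference summary appears anywhere in the check: the follow-up summary plays the role of the oracle. The toy run also shows why the choice of consistency metric matters. A pure token-overlap score penalizes a harmless synonym swap, which is the kind of weakness a metric combining semantic and factual consistency is meant to avoid.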
Publisher
Association for Computing Machinery (ACM)