Correlating Automated and Human Evaluation of Code Documentation Generation Quality

Authors:

Hu Xing 1, Chen Qiuyuan 2, Wang Haoye 2, Xia Xin 3, Lo David 4, Zimmermann Thomas 5

Affiliations:

1. School of Software Technology, Zhejiang University, Ningbo, Zhejiang, China

2. College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China

3. Faculty of Information Technology, Monash University, VIC, Australia

4. School of Information Systems, Singapore Management University, Singapore

5. Microsoft Research, Redmond, WA, USA

Abstract

Automatic code documentation generation is a crucial task in software engineering. It not only relieves developers from writing documentation by hand but also helps them understand programs better. In particular, deep-learning-based techniques that leverage large-scale source code corpora have been widely applied to code documentation generation. These works typically use automatic metrics (such as BLEU, METEOR, ROUGE, CIDEr, and SPICE) to evaluate and compare models. These metrics score generated documentation against reference texts by measuring word overlap. Unfortunately, there is no evidence that these metrics correlate with human judgment. We conduct experiments on two popular code documentation generation tasks, code comment generation and commit message generation, to investigate whether such correlations exist. For each task, we replicate three state-of-the-art approaches and evaluate the generated documentation automatically in terms of BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. We also ask 24 participants to rate the generated documentation on three aspects (i.e., language, content, and effectiveness). Each participant is given Java methods or commit diffs along with the target documentation to be rated. The results show that the ranking of generated documentation produced by the automatic metrics differs from the ranking produced by human annotators. Thus, these automatic metrics are not reliable enough to replace human evaluation for code documentation generation tasks. In addition, METEOR shows the strongest correlation with the human evaluation metrics (a moderate Pearson correlation of about r = 0.7). However, this is still much lower than the correlation observed between different annotators (a high Pearson correlation of about r = 0.8) and the correlations reported in the literature for other tasks (e.g., neural machine translation [39]). Our study points to the need for specialized automated evaluation metrics that correlate more closely with human evaluation for code generation tasks.
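To make the evaluation setup concrete, the sketch below illustrates the kind of analysis the abstract describes: scoring generated documentation against reference texts with overlap metrics, then computing the Pearson correlation between those scores and human ratings. It is a minimal illustration, not the authors' implementation; the generated comments, reference comments, and human scores are hypothetical, and it assumes NLTK (with the 'wordnet' corpus available for METEOR) and SciPy.

```python
# Minimal sketch of the analysis described in the abstract:
# score generated documentation against references with overlap metrics,
# then correlate those scores with human ratings.
# Assumes NLTK and SciPy; nltk.download('wordnet') may be required for METEOR.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from scipy.stats import pearsonr

# Hypothetical generated comments, reference comments, and 1-5 human ratings.
generated = [
    "returns the maximum value in the list",
    "opens the file and reads all lines",
    "checks whether the user is logged in",
]
references = [
    "return the largest element of the given list",
    "read every line from the input file",
    "verify that the current user is authenticated",
]
human_scores = [4.0, 3.5, 2.0]

smooth = SmoothingFunction().method1
bleu_scores, meteor_scores = [], []
for hyp, ref in zip(generated, references):
    hyp_tok, ref_tok = hyp.split(), ref.split()
    # Sentence-level BLEU and METEOR against a single reference.
    bleu_scores.append(sentence_bleu([ref_tok], hyp_tok, smoothing_function=smooth))
    meteor_scores.append(meteor_score([ref_tok], hyp_tok))

# Pearson correlation between each automatic metric and the human ratings.
for name, scores in [("BLEU", bleu_scores), ("METEOR", meteor_scores)]:
    r, p = pearsonr(scores, human_scores)
    print(f"{name}: r = {r:.2f} (p = {p:.2f})")
```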

Funder

National Natural Science Foundation of China

National Research Foundation, Singapore

Publisher

Association for Computing Machinery (ACM)

Subject

Software

References (53 articles).

1. Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2018. Code2seq: Generating sequences from structured representations of code. In Proceedings of the International Conference on Learning Representations.

2. Peter Anderson, Basil Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision (ECCV'16). Springer.

3. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR'15), San Diego, CA, USA, May 7–9, 2015. http://arxiv.org/abs/1409.0473.

4. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72.

5. Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. 2009. Pearson correlation coefficient. In Noise Reduction in Speech Processing. Springer, 1–4.

Cited by 10 articles.

1. Transformers in source code generation: A comprehensive survey. Journal of Systems Architecture, 2024-08.

2. Automatic title completion for Stack Overflow posts and GitHub issues. Empirical Software Engineering, 2024-07-25.

3. Epic-Level Text Generation with LLM through Auto-prompted Reinforcement Learning. 2024 International Joint Conference on Neural Networks (IJCNN), 2024-06-30.

4. Test Code Generation for Telecom Software Systems Using Two-Stage Generative Model. 2024 IEEE International Conference on Communications Workshops (ICC Workshops), 2024-06-09.

5. KADEL: Knowledge-Aware Denoising Learning for Commit Message Generation. ACM Transactions on Software Engineering and Methodology, 2024-06-04.
