Affiliations:
1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
2. School of Communication and Design, Sun Yat-sen University, Guangzhou, China
3. School of Software Engineering, Sun Yat-sen University, Zhuhai, China
Abstract
With the fast development of large software projects, automatic code summarization techniques, which summarize the main functionalities of a piece of code using natural languages as comments, play essential roles in helping developers understand and maintain large software projects. Many research efforts have been devoted to building automatic code summarization approaches. Typical code summarization approaches are based on deep learning models. They transform the task into a sequence-to-sequence task, which inputs source code and outputs summarizations in natural languages. All code summarization models impose different input size limits, such as 50 to 10,000, for the input source code. However, how the input size limit affects the performance of code summarization models still remains under-explored. In this article, we first conduct an empirical study to investigate the impacts of different input size limits on the quality of generated code comments. To our surprise, experiments on multiple models and datasets reveal that setting a low input size limit, such as 20, does not necessarily reduce the quality of generated comments.
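To illustrate how such a limit is applied, the following sketch (our own simplification, not the authors' implementation) cuts whitespace-separated code tokens down to a fixed budget before they would reach a sequence-to-sequence model; real pipelines use learned subword tokenizers, and the helper name truncate_code_tokens is hypothetical:

    def truncate_code_tokens(code: str, input_size_limit: int) -> list[str]:
        """Whitespace-tokenize code and keep at most input_size_limit tokens."""
        tokens = code.split()  # real models use learned subword tokenizers
        return tokens[:input_size_limit]

    example = "public int add(int a, int b) { return a + b; }"
    print(truncate_code_tokens(example, 5))
    # -> ['public', 'int', 'add(int', 'a,', 'int']

Note that even an aggressive limit like the one above still keeps the return type and method name, which is consistent with the finding that heavy truncation need not hurt comment quality.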
Based on this finding, we further propose to extract function signatures, which concisely summarize a function's main functionality, and feed them into code summarization models in place of the full source code. Experiments and statistical tests show that inputs with signatures outperform inputs without signatures by more than 2 percentage points on average, demonstrating the effectiveness of involving function signatures in code summarization. We also invited programmers to complete a questionnaire evaluating the quality of code summaries generated from the two kinds of input at different truncation levels. The results show that function signatures produce, on average, 9.2% more high-quality comments than full code.
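For illustration, here is a minimal sketch of signature extraction under our own assumptions: Python source parsed with the standard ast module, returning only the function name and parameter list. The paper's datasets and exact extraction rules may differ, and extract_signature is a hypothetical helper:

    import ast

    def extract_signature(python_source: str) -> str:
        """Return 'name(arg1, arg2, ...)' for the first function definition."""
        tree = ast.parse(python_source)
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                args = ", ".join(a.arg for a in node.args.args)
                return f"{node.name}({args})"
        raise ValueError("no function definition found")

    src = "def binary_search(items, target):\n    ...\n"
    print(extract_signature(src))  # -> binary_search(items, target)

The design intuition is that the signature already names the operation and its inputs, so it carries much of the information a summarizer needs at a fraction of the input length.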
Funder
National Key R&D Program of China
National Natural Science Foundation of China
Natural Science Foundation of Guangdong Province
Publisher
Association for Computing Machinery (ACM)