Affiliations:
1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
2. School of Communication and Design, Sun Yat-sen University, Guangzhou, China
3. School of Software Engineering, Sun Yat-sen University, Zhuhai, China
Abstract
With the fast development of large software projects, automatic code summarization techniques, which summarize the main functionalities of a piece of code using natural languages as comments, play essential roles in helping developers understand and maintain large software projects. Many research efforts have been devoted to building automatic code summarization approaches. Typical code summarization approaches are based on deep learning models. They transform the task into a sequence-to-sequence task, which inputs source code and outputs summarizations in natural languages. All code summarization models impose different input size limits, such as 50 to 10,000, for the input source code. However, how the input size limit affects the performance of code summarization models still remains under-explored. In this article, we first conduct an empirical study to investigate the impacts of different input size limits on the quality of generated code comments. To our surprise, experiments on multiple models and datasets reveal that setting a low input size limit, such as 20, does not necessarily reduce the quality of generated comments.
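To illustrate how such a limit is applied, the following sketch (our own simplification, not the authors' implementation) cuts whitespace-separated code tokens down to a fixed budget before they would reach a sequence-to-sequence model; real pipelines use learned subword tokenizers, and the helper name truncate_code_tokens is hypothetical:

    def truncate_code_tokens(code: str, input_size_limit: int) -> list[str]:
        """Whitespace-tokenize code and keep at most input_size_limit tokens."""
        tokens = code.split()  # real models use learned subword tokenizers
        return tokens[:input_size_limit]

    example = "public int add(int a, int b) { return a + b; }"
    print(truncate_code_tokens(example, 5))
    # -> ['public', 'int', 'add(int', 'a,', 'int']

Note that even an aggressive limit like the one above still keeps the return type and method name, which is consistent with the finding that heavy truncation need not hurt comment quality.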
Based on this finding, we further propose to extract function signatures, which concisely summarize a function's main functionality, and feed them into code summarization models in place of the full source code. Experiments and statistical tests show that inputs with signatures outperform inputs without signatures by more than 2 percentage points on average, demonstrating the effectiveness of involving function signatures in code summarization. We also invited programmers to complete a questionnaire evaluating the quality of code summaries generated from the two kinds of input at different truncation levels. The results show that function signatures produce, on average, 9.2% more high-quality comments than full code.
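For illustration, here is a minimal sketch of signature extraction under our own assumptions: Python source parsed with the standard ast module, returning only the function name and parameter list. The paper's datasets and exact extraction rules may differ, and extract_signature is a hypothetical helper:

    import ast

    def extract_signature(python_source: str) -> str:
        """Return 'name(arg1, arg2, ...)' for the first function definition."""
        tree = ast.parse(python_source)
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                args = ", ".join(a.arg for a in node.args.args)
                return f"{node.name}({args})"
        raise ValueError("no function definition found")

    src = "def binary_search(items, target):\n    ...\n"
    print(extract_signature(src))  # -> binary_search(items, target)

The design intuition is that the signature already names the operation and its inputs, so it carries much of the information a summarizer needs at a fraction of the input length.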
Funder
National Key R&D Program of China
National Natural Science Foundation of China
Natural Science Foundation of Guangdong Province
Publisher
Association for Computing Machinery (ACM)