PassSum: Leveraging paths of abstract syntax trees and self‐supervision for code summarization

Authors:

Niu Changan¹, Li Chuanyi¹ (ORCID), Ng Vincent², Ge Jidong¹ (ORCID), Huang Liguo³, Luo Bin¹

Affiliations:

1. National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

2. Human Language Technology Research Institute, University of Texas at Dallas, Richardson, Texas, USA

3. Department of Computer Science, Southern Methodist University, Dallas, Texas, USA

Abstract

Code summarization aims to provide a high-level comment for a code snippet, typically describing the function and intent of the given code. Recent years have seen the successful application of data-driven code summarization. To improve model performance, numerous approaches use abstract syntax trees (ASTs) to represent the structural information of code, which most researchers consider the main factor distinguishing code from natural language. Such data-driven methods are trained on large-scale labeled datasets to obtain models with strong generalization capabilities that can be applied to new examples. Nevertheless, we argue that state-of-the-art approaches suffer from two key weaknesses: (1) inefficient encoding of ASTs and (2) reliance on a large labeled corpus for model training. These drawbacks lead to (1) oversized models, slow training, information loss, and instability, and (2) inability to be applied to programming languages with only a small amount of labeled data. In light of these weaknesses, we propose PassSum, a code summarization approach that addresses them via (1) a novel input representation containing an efficient AST encoding method and (2) three pretraining objectives that allow the model to be pretrained on a large amount of (easy-to-obtain) unlabeled data via self-supervised learning. Experimental results on code summarization for Java, Python, and Ruby methods demonstrate the superiority of PassSum over state-of-the-art methods. Further experiments show that our input representation offers both temporal and spatial advantages in addition to its performance gains. Pretraining is also shown to make the model more generalizable with less labeled data and to speed up its convergence during training.
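To make the idea of "paths of abstract syntax trees" concrete, the following minimal sketch extracts root-to-leaf node-type paths from a Python AST using the standard ast module. This is an illustrative assumption only, not PassSum's actual encoding; the function name ast_paths and the path format are hypothetical.

    # Illustrative sketch: root-to-leaf AST path extraction (not PassSum's exact method).
    import ast

    def ast_paths(code: str):
        """Return every root-to-leaf path as a list of AST node-type names."""
        tree = ast.parse(code)
        paths = []

        def walk(node, prefix):
            prefix = prefix + [type(node).__name__]
            children = list(ast.iter_child_nodes(node))
            if not children:          # leaf node: record the completed path
                paths.append(prefix)
            for child in children:
                walk(child, prefix)

        walk(tree, [])
        return paths

    if __name__ == "__main__":
        for path in ast_paths("def add(a, b):\n    return a + b"):
            print(" -> ".join(path))

Node-type sequences obtained this way could then be embedded and consumed by a sequence encoder alongside the code tokens; the representation actually used by PassSum differs in its details.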

Funder

Huawei Technologies

National Natural Science Foundation of China

National Science Foundation

Natural Science Foundation of Jiangsu Province

Publisher

Wiley

Subject

Software
