Towards Learning Generalizable Code Embeddings Using Task-agnostic Graph Convolutional Networks

Authors:

Ding Zishuo [1], Li Heng [2], Shang Weiyi [1], Chen Tse-Hsun (Peter) [1]

Affiliation:

1. Concordia University, Montreal, QC, Canada

2. Polytechnique Montréal, Montreal, QC, Canada

Abstract

Code embeddings have seen increasing application in software engineering (SE) research and practice. Despite advances in the embedding techniques applied in SE research, generalizability remains one of the main challenges. A recent study finds that code embeddings may not be readily leveraged for downstream tasks for which the embeddings were not specifically trained. Therefore, in this article, we propose GraphCodeVec, which represents source code as graphs and leverages Graph Convolutional Networks to learn more generalizable code embeddings in a task-agnostic manner. The edges in the graph representation are automatically constructed from the paths in the abstract syntax trees, and the nodes from the tokens in the source code. To evaluate the effectiveness of GraphCodeVec, we consider three downstream benchmark tasks (i.e., code comment generation, code authorship identification, and code clone detection) that were used in a prior benchmarking of code embeddings, and we add three new downstream tasks (i.e., source code classification, logging statement prediction, and software defect prediction), for a total of six downstream tasks in our evaluation. For each downstream task, we apply the embeddings learned by GraphCodeVec and the embeddings learned by four baseline approaches and compare their respective performance. We find that GraphCodeVec outperforms all the baselines in five of the six downstream tasks, and that its performance is relatively stable across different tasks and datasets. In addition, we perform ablation experiments to understand the impact of the training context (i.e., the graph context extracted from the abstract syntax trees) and the training model (i.e., the Graph Convolutional Networks) on the effectiveness of the generated embeddings. The results show that both the graph context and the Graph Convolutional Networks help GraphCodeVec produce high-quality embeddings for the downstream tasks, while the improvement from the Graph Convolutional Networks is more robust across different downstream tasks and datasets. Our findings suggest that future research and practice may consider using graph-based deep learning methods to capture the structural information of source code for SE tasks.
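The abstract describes the graph only at a high level (token nodes, edges derived from abstract syntax tree paths, and a Graph Convolutional Network on top). The following is a minimal sketch of that general idea, not the GraphCodeVec implementation. Everything specific in it is an assumption made for illustration: Python's built-in `ast` module stands in for the parser, the hypothetical `max_path_len` threshold decides which token pairs are linked, the weights are fixed rather than learned, and the propagation step is the standard normalized GCN layer, ReLU(D^-1/2 (A + I) D^-1/2 H W).

```python
import ast
import math
from itertools import combinations


def token_leaves(source):
    """Collect (token, ancestor path) pairs for identifier and constant nodes."""
    leaves = []

    def visit(node, path):
        path = path + [id(node)]  # path of AST node ids from the root
        if isinstance(node, ast.Name):
            leaves.append((node.id, path))
        elif isinstance(node, ast.arg):
            leaves.append((node.arg, path))
        elif isinstance(node, ast.Constant):
            leaves.append((repr(node.value), path))
        for child in ast.iter_child_nodes(node):
            visit(child, path)

    visit(ast.parse(source), [])
    return leaves


def build_graph(source, max_path_len=6):
    """Nodes are unique tokens; two tokens are linked when some AST path
    between their occurrences has at most max_path_len edges (assumed rule)."""
    leaves = token_leaves(source)
    vocab = sorted({tok for tok, _ in leaves})
    index = {tok: i for i, tok in enumerate(vocab)}
    adj = [[0.0] * len(vocab) for _ in vocab]
    for (tok_a, path_a), (tok_b, path_b) in combinations(leaves, 2):
        common = 0
        for x, y in zip(path_a, path_b):
            if x != y:
                break
            common += 1
        # Edges up to the lowest common ancestor and back down.
        dist = (len(path_a) - common) + (len(path_b) - common)
        if tok_a != tok_b and dist <= max_path_len:
            i, j = index[tok_a], index[tok_b]
            adj[i][j] = adj[j][i] = 1.0
    return vocab, adj


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, feats, weight):
    """One GCN propagation step with self-loops and symmetric normalization."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]  # ReLU


SOURCE = "def add(a, b):\n    total = a + b\n    return total\n"
vocab, adj = build_graph(SOURCE)

# One-hot input features and a fixed, untrained weight matrix (illustrative only).
feats = [[1.0 if i == j else 0.0 for j in range(len(vocab))] for i in range(len(vocab))]
weight = [[0.1, -0.2], [0.3, 0.4], [0.5, -0.6]]
embeddings = gcn_layer(adj, feats, weight)
```

In this sketch each row of `embeddings` is the vector for the token at the same position in `vocab`. A real system would learn `weight` from a large code corpus with a task-agnostic training objective, which the abstract does not specify and the sketch does not attempt.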

Publisher

Association for Computing Machinery (ACM)

Subject

Software


Cited by 3 articles.

1. LoGenText-Plus: Improving Neural Machine Translation Based Logging Texts Generation with Syntactic Templates;ACM Transactions on Software Engineering and Methodology;2023-12-22

2. Towards Utilizing Natural Language Processing Techniques to Assist in Software Engineering Tasks;2023 IEEE/ACM 45th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion);2023-05

3. On the Temporal Relations between Logging and Code;2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE);2023-05
