Toward Interpretable Graph Tensor Convolution Neural Network for Code Semantics Embedding

Published: 2023-07-21 Issue: 5 Volume: 32 Page: 1-40
ISSN: 1049-331X
Container-title: ACM Transactions on Software Engineering and Methodology
Language: en
Short-container-title: ACM Trans. Softw. Eng. Methodol.

Author:

Yang Jia¹, Fu Cai¹, Deng Fengyang², Wen Ming², Guo Xiaowei³, Wan Chuanhao³

Affiliation:

1. Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology, China

2. Huazhong University of Science and Technology, China

3. Huazhong University of Science and Technology

Abstract

Deep learning-based models have made significant progress in automated source code semantics embedding, and current research mainly relies on natural language-based methods and graph-based methods. However, natural language-based methods do not capture the rich semantic structural information of source code, and graph-based methods do not exploit distant information in source code because of the high cost of message-passing steps. In this article, we propose a novel interpretable model, called graph tensor convolution neural network (GTCN), to generate accurate code embeddings; it captures both the distant information of code sequences and the rich structural information of code semantics. First, we use a high-dimensional tensor to integrate various heterogeneous code graphs, such as control flow and data flow, with node sequence features. Second, inspired by the advantages of graph-based deep learning and efficient tensor computations, we propose a novel interpretable graph tensor convolution neural network for learning accurate code semantic embeddings from the code graph tensor. Finally, we evaluate the GTCN model on three popular applications: variable misuse detection, source code prediction, and vulnerability detection. Compared with current state-of-the-art methods, our model achieves higher top-1 accuracy while requiring less training time.
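
The first step in the abstract, integrating heterogeneous code graphs into one tensor, can be pictured with a small sketch. This is only an illustration under stated assumptions and not the GTCN implementation: the helper name `build_graph_tensor`, the toy edge lists, and the choice of one binary adjacency slice per edge type are all hypothetical, and node sequence features are left out.

```python
# Illustrative sketch only; not the authors' GTCN code.
# The helper name, edge lists, and tensor layout are assumptions.

def build_graph_tensor(num_nodes, edges_by_type):
    """Stack one adjacency matrix per edge type into a 3-D nested list.

    Result shape: (num_edge_types, num_nodes, num_nodes).
    """
    tensor = []
    # Sort edge-type names so the slice order is deterministic.
    for edge_type in sorted(edges_by_type):
        adjacency = [[0] * num_nodes for _ in range(num_nodes)]
        for src, dst in edges_by_type[edge_type]:
            adjacency[src][dst] = 1
        tensor.append(adjacency)
    return tensor


# Toy program with 4 statements (nodes 0..3); edges are made up.
edges = {
    "control_flow": [(0, 1), (1, 2), (2, 3)],
    "data_flow": [(0, 2), (0, 3)],
}
graph_tensor = build_graph_tensor(4, edges)

# Slice 0 is control flow, slice 1 is data flow (sorted by name).
print(len(graph_tensor), len(graph_tensor[0]), len(graph_tensor[0][0]))
```

Each slice of the resulting structure holds one graph view of the same nodes, so a model operating on it can see control-flow and data-flow relations together rather than in separate graphs.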

Funder

China NSF

Publisher

Association for Computing Machinery (ACM)

Subject

Software

Link

https://dl.acm.org/doi/pdf/10.1145/3582574

References: 81 articles.

1. Yaqin Zhou. 2019. Source codes of the paper: Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. https://github.com/epicosy/devign.

2. Yu Wang. 2020. Source codes of the paper: Learning semantic program embeddings with graph interval neural network. https://github.com/GINN-Imp/GINN.

3. Vincent J. Hellendoorn. 2020. Source codes of the paper: Global relational models of source code. https://github.com/VHellendoorn/ICLR20-Great.

4. Zhangyin Feng. 2021. Source codes of the paper: CodeBERT: A pre-trained model for programming and natural languages. https://github.com/microsoft/CodeBERT.

5. Jia Yang. 2022. Source codes of this paper. https://gitee.com/cse-sss/GTCN.

Cited by 1 article.

1. Large language model ChatGPT versus small deep learning models for self‐admitted technical debt detection: Why not together?;Software: Practice and Experience;2024-06-28