Incorporating Signal Awareness in Source Code Modeling: An Application to Vulnerability Detection

Authors:

Sahil Suneja¹, Yufan Zhuang², Yunhui Zheng¹, Jim Laredo¹, Alessandro Morari¹, Udayan Khurana¹

Affiliations:

1. IBM Research, T.J. Watson, NY

2. University of California, San Diego

Abstract

AI models of code have made significant progress over the past few years. However, many models are not actually learning task-relevant source code features. Instead, they often fit non-relevant but correlated data, leading to a lack of robustness and generalizability, and limiting the subsequent practical use of such models. In this work, we focus on improving model quality through signal awareness, i.e., learning the relevant signals in the input for making predictions. We do so by leveraging the heterogeneity of code samples in terms of their signal-to-noise content. We perform an end-to-end exploration of model signal awareness, comprising: (i) uncovering the reliance of AI models of code on task-irrelevant signals, via prediction-preserving input minimization; (ii) improving models' signal awareness by incorporating the notion of code complexity during model training, via curriculum learning; (iii) improving models' signal awareness by generating simplified signal-preserving programs and augmenting the training dataset with them; and (iv) presenting a novel interpretation of the model learning behavior from the perspective of the dataset, using its code complexity distribution. We propose a new metric to measure model signal awareness, Signal-aware Recall, which captures how much of a model's performance is attributable to task-relevant signal learning. Using a software vulnerability detection use case, our model probing approach uncovers a significant lack of signal awareness in the models, across three different neural network architectures and three datasets. Signal-aware Recall falls below 50% for models whose traditional Recall is in the high 90s, suggesting that the models pick up a lot of noise or dataset nuances while learning their logic. With our code-complexity-aware model learning enhancement techniques, we are able to steer the models toward more task-relevant learning, recording up to a 4.8× improvement in model signal awareness. Finally, we employ our model learning introspection approach to uncover the aspects of source code where the model faces difficulty, and we analyze how our learning enhancement techniques alleviate it.
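To make the probing step concrete, the sketch below shows one way to realize prediction-preserving input minimization and the Signal-aware Recall (SAR) metric described above. It is a minimal illustration, assuming a binary vulnerability detector exposed as a `predict(code) -> {0, 1}` callable and per-sample ground-truth vulnerable lines; the helper names and the greedy delta-debugging-style reduction schedule are our own assumptions, not the paper's released code.

```python
from typing import Callable, List, Set, Tuple


def minimize(lines: List[str], predict: Callable[[str], int]) -> List[str]:
    """Greedy, ddmin-style reduction: keep dropping chunks of lines as
    long as the model still predicts 'vulnerable' (1) on the remainder."""
    assert predict("\n".join(lines)) == 1, "start from a predicted positive"
    chunk = len(lines) // 2
    while chunk >= 1:
        i = 0
        while i < len(lines):
            candidate = lines[:i] + lines[i + chunk:]
            if candidate and predict("\n".join(candidate)) == 1:
                lines = candidate      # removal preserved the prediction
            else:
                i += chunk             # this chunk is needed; move on
        chunk //= 2
    return lines                       # minimal w.r.t. this schedule


def signal_aware_recall(
    samples: List[Tuple[List[str], Set[str]]],  # (code lines, vulnerable lines)
    predict: Callable[[str], int],
) -> float:
    """SAR: credit a true positive only if its minimized version still
    overlaps the ground-truth vulnerable lines (the 'signal')."""
    tp_signal, positives = 0, 0
    for code_lines, vuln_lines in samples:      # all actual positives
        positives += 1
        if predict("\n".join(code_lines)) != 1:
            continue                            # a miss counts against recall
        reduced = minimize(list(code_lines), predict)
        if set(reduced) & vuln_lines:
            tp_signal += 1                      # prediction grounded in signal
    return tp_signal / positives if positives else 0.0
```

Under this bookkeeping, SAR coincides with traditional Recall exactly when every true positive survives reduction with its vulnerable lines intact; the gap between the two is what the abstract quantifies.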
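A similarly small sketch, under the same hedged assumptions, illustrates the complexity-driven curriculum idea from item (ii): rank training samples by a cheap cyclomatic-complexity proxy and expose the model to an easy-to-hard schedule. The branch-counting proxy and the linear pacing function are illustrative choices, not necessarily the paper's exact schedule.

```python
import re
from typing import Callable, List, Tuple

# Proxy for cyclomatic complexity: 1 + number of branch points.
BRANCH = re.compile(r"\bif\b|\bfor\b|\bwhile\b|\bcase\b|&&|\|\|")


def complexity(code: str) -> int:
    return 1 + len(BRANCH.findall(code))


def curriculum_train(
    dataset: List[Tuple[str, int]],                  # (code, label) pairs
    train_step: Callable[[Tuple[str, int]], None],   # one gradient update
    epochs: int = 10,
) -> None:
    """Easy-to-hard pacing: epoch e trains on the easiest
    (e + 1) / epochs fraction of the data, ranked by complexity."""
    ranked = sorted(dataset, key=lambda s: complexity(s[0]))
    for e in range(epochs):
        cutoff = max(1, len(ranked) * (e + 1) // epochs)
        for sample in ranked[:cutoff]:
            train_step(sample)
```

The same ranking also supports the dataset-level introspection of item (iv): comparing per-sample model accuracy against this complexity distribution indicates where in the complexity spectrum learning stalls.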

Publisher

Association for Computing Machinery (ACM)

Subject

Software
