Incorporating Signal Awareness in Source Code Modeling: An Application to Vulnerability Detection

Authors:

Sahil Suneja¹, Yufan Zhuang², Yunhui Zheng¹, Jim Laredo¹, Alessandro Morari¹, Udayan Khurana¹

Affiliations:

1. IBM Research, T.J. Watson, NY

2. University of California, San Diego

Abstract

AI models of code have made significant progress over the past few years. However, many models are not actually learning task-relevant source code features. Instead, they often fit non-relevant but correlated data, leading to a lack of robustness and generalizability, and limiting the subsequent practical use of such models. In this work, we focus on improving model quality through signal awareness, i.e., learning the relevant signals in the input for making predictions. We do so by leveraging the heterogeneity of code samples in terms of their signal-to-noise content. We perform an end-to-end exploration of model signal awareness, comprising: (i) uncovering the reliance of AI models of code on task-irrelevant signals, via prediction-preserving input minimization; (ii) improving models' signal awareness by incorporating the notion of code complexity during model training, via curriculum learning; (iii) improving models' signal awareness by generating simplified signal-preserving programs and augmenting the training dataset with them; and (iv) presenting a novel interpretation of the model learning behavior from the perspective of the dataset, using its code complexity distribution. We propose a new metric to measure model signal awareness, Signal-aware Recall, which captures how much of a model's performance is attributable to task-relevant signal learning. Using a software vulnerability detection use case, our model probing approach uncovers a significant lack of signal awareness in the models, across three different neural network architectures and three datasets. Signal-aware Recall falls below 50% for models whose traditional Recall is in the high 90s, suggesting that the models pick up a lot of noise or dataset nuances while learning their logic. With our code-complexity-aware model learning enhancement techniques, we are able to steer the models toward more task-relevant learning, recording up to a 4.8× improvement in model signal awareness. Finally, we employ our model learning introspection approach to uncover the aspects of source code where the model faces difficulty, and we analyze how our learning enhancement techniques alleviate it.
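To make the probing step concrete, the sketch below shows one way to realize prediction-preserving input minimization and the Signal-aware Recall (SAR) metric described above. It is a minimal illustration, assuming a binary vulnerability detector exposed as a `predict(code) -> {0, 1}` callable and per-sample ground-truth vulnerable lines; the helper names and the greedy delta-debugging-style reduction schedule are our own assumptions, not the paper's released code.

```python
from typing import Callable, List, Set, Tuple


def minimize(lines: List[str], predict: Callable[[str], int]) -> List[str]:
    """Greedy, ddmin-style reduction: keep dropping chunks of lines as
    long as the model still predicts 'vulnerable' (1) on the remainder."""
    assert predict("\n".join(lines)) == 1, "start from a predicted positive"
    chunk = len(lines) // 2
    while chunk >= 1:
        i = 0
        while i < len(lines):
            candidate = lines[:i] + lines[i + chunk:]
            if candidate and predict("\n".join(candidate)) == 1:
                lines = candidate      # removal preserved the prediction
            else:
                i += chunk             # this chunk is needed; move on
        chunk //= 2
    return lines                       # minimal w.r.t. this schedule


def signal_aware_recall(
    samples: List[Tuple[List[str], Set[str]]],  # (code lines, vulnerable lines)
    predict: Callable[[str], int],
) -> float:
    """SAR: credit a true positive only if its minimized version still
    overlaps the ground-truth vulnerable lines (the 'signal')."""
    tp_signal, positives = 0, 0
    for code_lines, vuln_lines in samples:      # all actual positives
        positives += 1
        if predict("\n".join(code_lines)) != 1:
            continue                            # a miss counts against recall
        reduced = minimize(list(code_lines), predict)
        if set(reduced) & vuln_lines:
            tp_signal += 1                      # prediction grounded in signal
    return tp_signal / positives if positives else 0.0
```

Under this bookkeeping, SAR coincides with traditional Recall exactly when every true positive survives reduction with its vulnerable lines intact; the gap between the two is what the abstract quantifies.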
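A similarly small sketch, under the same hedged assumptions, illustrates the complexity-driven curriculum idea from item (ii): rank training samples by a cheap cyclomatic-complexity proxy and expose the model to an easy-to-hard schedule. The branch-counting proxy and the linear pacing function are illustrative choices, not necessarily the paper's exact schedule.

```python
import re
from typing import Callable, List, Tuple

# Proxy for cyclomatic complexity: 1 + number of branch points.
BRANCH = re.compile(r"\bif\b|\bfor\b|\bwhile\b|\bcase\b|&&|\|\|")


def complexity(code: str) -> int:
    return 1 + len(BRANCH.findall(code))


def curriculum_train(
    dataset: List[Tuple[str, int]],                  # (code, label) pairs
    train_step: Callable[[Tuple[str, int]], None],   # one gradient update
    epochs: int = 10,
) -> None:
    """Easy-to-hard pacing: epoch e trains on the easiest
    (e + 1) / epochs fraction of the data, ranked by complexity."""
    ranked = sorted(dataset, key=lambda s: complexity(s[0]))
    for e in range(epochs):
        cutoff = max(1, len(ranked) * (e + 1) // epochs)
        for sample in ranked[:cutoff]:
            train_step(sample)
```

The same ranking also supports the dataset-level introspection of item (iv): comparing per-sample model accuracy against this complexity distribution indicates where in the complexity spectrum learning stalls.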

Publisher

Association for Computing Machinery (ACM)

Subject

Software
