Learning from what we know: How to perform vulnerability prediction using noisy historical data

Author:

Garg Aayush, Degiovanni Renzo, Jimenez Matthieu, Cordy Maxime, Papadakis Mike, Le Traon Yves

Abstract

Vulnerability prediction refers to the problem of identifying the system components that are most likely to be vulnerable. Typically, this problem is tackled by training binary classifiers on historical data. Unfortunately, recent research has shown that such approaches underperform for two reasons: (a) the imbalanced nature of the problem, and (b) the inherently noisy historical data, i.e., most vulnerabilities are discovered much later than they are introduced. This noise misleads classifiers, as they learn to recognize actually vulnerable components as non-vulnerable. To tackle these issues, we propose TROVON, a technique that learns from known vulnerable components rather than from vulnerable and non-vulnerable components, as is typically done. We do this by contrasting the known vulnerable components with their respective fixed components. This way, TROVON learns from what we know, i.e., vulnerabilities, hence reducing the effects of noisy and imbalanced data. We evaluate TROVON by comparing it with existing techniques on three security-critical open source systems, i.e., Linux Kernel, OpenSSL, and Wireshark, with historical vulnerabilities that have been reported in the National Vulnerability Database (NVD). Our evaluation demonstrates that TROVON significantly outperforms existing vulnerability prediction techniques such as Software Metrics, Imports, Function Calls, Text Mining, Devign, LSTM, and LSTM-RF, with an improvement of 40.84% in Matthews Correlation Coefficient (MCC) score under Clean Training Data Settings, and an improvement of 35.52% under Realistic Training Data Settings.
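The abstract reports results in MCC, a metric often preferred when classes are imbalanced. The following is a minimal sketch of the standard MCC formula computed from confusion-matrix counts. It is not taken from the paper's implementation, and the counts used (1000 components, 20 of them vulnerable) are hypothetical values chosen only to illustrate why accuracy can mislead on imbalanced data.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp, fp, tn, fn):
    """Fraction of components labeled correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical data set: 1000 components, 20 of which are vulnerable.
# Classifier A labels every component as non-vulnerable.
trivial = dict(tp=0, fp=0, tn=980, fn=20)
# Classifier B finds half of the vulnerable components, with some false alarms.
useful = dict(tp=10, fp=30, tn=950, fn=10)

for name, counts in (("trivial", trivial), ("useful", useful)):
    print(f"{name}: accuracy={accuracy(**counts):.3f} mcc={mcc(**counts):.3f}")
```

Under these assumed counts, the trivial classifier has the higher accuracy but an MCC of zero, while the classifier that actually finds vulnerable components has the higher MCC.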

Funder

Fonds National de la Recherche Luxembourg

Publisher

Springer Science and Business Media LLC

Subject

Software

Cited by 8 articles.

1. An Empirical Study of the Imbalance Issue in Software Vulnerability Detection;Computer Security – ESORICS 2023;2024

2. Enabling Efficient Assertion Inference;2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE);2023-10-09

3. When Less is Enough: Positive and Unlabeled Learning Model for Vulnerability Detection;2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE);2023-09-11

4. An Investigation of Quality Issues in Vulnerability Detection Datasets;2023 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW);2023-07

5. Syntactic Versus Semantic Similarity of Artificial and Real Faults in Mutation Testing Studies;IEEE Transactions on Software Engineering;2023-07
