Learning from what we know: How to perform vulnerability prediction using noisy historical data

Published:2022-09-20 Issue:7 Volume:27
ISSN:1382-3256
Container-title:Empirical Software Engineering
language:en
Short-container-title:Empir Software Eng

Author:

Garg Aayush, Degiovanni Renzo, Jimenez Matthieu, Cordy Maxime, Papadakis Mike, Le Traon Yves

Abstract

Vulnerability prediction refers to the problem of identifying the system components that are most likely to be vulnerable. Typically, this problem is tackled by training binary classifiers on historical data. Unfortunately, recent research has shown that such approaches underperform for two reasons: a) the imbalanced nature of the problem, and b) the inherently noisy historical data, i.e., most vulnerabilities are discovered much later than they are introduced. This noise misleads classifiers, as they learn to recognize actually vulnerable components as non-vulnerable. To tackle these issues, we propose TROVON, a technique that learns from known vulnerable components rather than from vulnerable and non-vulnerable components, as is typically done. We do this by contrasting the known vulnerable components with their respective fixed versions. This way, TROVON learns from the things we know, i.e., vulnerabilities, hence reducing the effects of noisy and unbalanced data. We evaluate TROVON by comparing it with existing techniques on three security-critical open source systems, i.e., Linux Kernel, OpenSSL, and Wireshark, with historical vulnerabilities that have been reported in the National Vulnerability Database (NVD). Our evaluation demonstrates that TROVON significantly outperforms existing vulnerability prediction techniques such as Software Metrics, Imports, Function Calls, Text Mining, Devign, LSTM, and LSTM-RF, with an improvement of 40.84% in Matthews Correlation Coefficient (MCC) score under Clean Training Data Settings, and an improvement of 35.52% under Realistic Training Data Settings.

Funder

Fonds National de la Recherche Luxembourg

Publisher

Springer Science and Business Media LLC

Subject

Software

Link

https://link.springer.com/content/pdf/10.1007/s10664-022-10197-4.pdf

Reference 53 articles.

1. Abadi M, et al. (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org

2. Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate

3. Britz D, Goldie A, Luong T, Le Q (2017) Massive exploration of neural machine translation architectures. arXiv e-prints

4. Brownlee J (2021) When to use mlp, cnn, and rnn neural networks. https://machinelearningmastery.com/when-to-use-mlp-cnn-and-rnn-neural-networks. Accessed 1 May 2018

5. Brownlee J (2022) Encoder-decoder recurrent neural network models for neural machine translation. https://machinelearningmastery.com/encoder-decoder-recurrent-neural-network-models-neural-machine-translation/. Accessed 1 Feb 2018

Cited by 14 articles.

1. Survey of source code vulnerability analysis based on deep learning;Computers & Security;2025-01

2. A comprehensive analysis on software vulnerability detection datasets: trends, challenges, and road ahead;International Journal of Information Security;2024-07-23

3. Early and Realistic Exploitability Prediction of Just-Disclosed Software Vulnerabilities: How Reliable Can It Be?;ACM Transactions on Software Engineering and Methodology;2024-06-27

4. Predicting software vulnerability based on software metrics: a deep learning approach;Iran Journal of Computer Science;2024-06-05

5. On the Coupling between Vulnerabilities and LLM-Generated Mutants: A Study on Vul4J Dataset;2024 IEEE Conference on Software Testing, Verification and Validation (ICST);2024-05-27