Final published version
Licence: CC BY: Creative Commons Attribution 4.0 International License
Research output: Contribution to Journal/Magazine › Journal article › peer-review
TY - JOUR
T1 - The Impact of Hard and Easy Negative Training Data on Vulnerability Prediction Performance
AU - Debeyan, Fahad
AU - Madeyski, Lech
AU - Hall, Tracy
AU - Bowes, David
PY - 2024/5/31
Y1 - 2024/5/31
N2 - Vulnerability prediction models have been shown to perform poorly in the real world. We examine how the composition of negative training data influences vulnerability prediction model performance. Inspired by other disciplines (e.g. image processing), we focus on whether distinguishing between negative training data that is ‘easy’ to recognise from positive data (very different from positive data) and negative training data that is ‘hard’ to recognise from positive data (very similar to positive data) impacts vulnerability prediction performance. We use a range of popular machine learning algorithms, including deep learning, to build models based on vulnerability patch data curated by Reis and Abreu, as well as the MSR dataset. Our results suggest that models trained on higher ratios of easy negatives perform better, plateauing at 15 easy negatives per positive instance. We also report that different ML algorithms perform better depending on the negative sample used. Overall, we found that the negative sampling approach used significantly impacts model performance, potentially leading to overly optimistic results. The ratio of ‘easy’ versus ‘hard’ negative training data should be explicitly considered when building vulnerability prediction models for the real world.
KW - Software vulnerability prediction
KW - Vulnerability datasets
KW - Machine learning
U2 - 10.1016/j.jss.2024.112003
DO - 10.1016/j.jss.2024.112003
M3 - Journal article
VL - 211
JO - Journal of Systems and Software
JF - Journal of Systems and Software
SN - 0164-1212
M1 - 112003
ER -