What is the impact of imbalance on software defect prediction performance?

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

What is the impact of imbalance on software defect prediction performance? / Mahmood, Zaheed; Bowes, David; Lane, Peter C R et al.
PROMISE '15 Proceedings of the 11th International Conference on Predictive Models and Data Analytics in Software Engineering. New York: Association for Computing Machinery, Inc, 2015. 4.

Harvard

Mahmood, Z, Bowes, D, Lane, PCR & Hall, T 2015, What is the impact of imbalance on software defect prediction performance? in PROMISE '15 Proceedings of the 11th International Conference on Predictive Models and Data Analytics in Software Engineering, 4, Association for Computing Machinery, Inc, New York, 11th International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE 2015, Beijing, China, 21/10/15. https://doi.org/10.1145/2810146.2810150

APA

Mahmood, Z., Bowes, D., Lane, P. C. R., & Hall, T. (2015). What is the impact of imbalance on software defect prediction performance? In PROMISE '15 Proceedings of the 11th International Conference on Predictive Models and Data Analytics in Software Engineering (Article 4). Association for Computing Machinery, Inc. https://doi.org/10.1145/2810146.2810150

Vancouver

Mahmood Z, Bowes D, Lane PCR, Hall T. What is the impact of imbalance on software defect prediction performance? In PROMISE '15 Proceedings of the 11th International Conference on Predictive Models and Data Analytics in Software Engineering. New York: Association for Computing Machinery, Inc. 2015. 4. doi: 10.1145/2810146.2810150

Author

Mahmood, Zaheed ; Bowes, David ; Lane, Peter C R et al. / What is the impact of imbalance on software defect prediction performance? PROMISE '15 Proceedings of the 11th International Conference on Predictive Models and Data Analytics in Software Engineering. New York : Association for Computing Machinery, Inc, 2015.

Bibtex

@inproceedings{3124a83f71e648a89de7cc1a5697f4a2,
title = "What is the impact of imbalance on software defect prediction performance?",
abstract = "Software defect prediction performance varies over a large range. Menzies suggested there is a ceiling effect of 80% Recall [8]. Most of the data sets used are highly imbalanced. This paper asks, what is the empirical effect of using different datasets with varying levels of imbalance on predictive performance? We use data synthesised by a previous meta-analysis of 600 fault prediction models and their results. Four model evaluation measures (the Matthews Correlation Coefficient (MCC), F-Measure, Precision and Recall) are compared to the corresponding data imbalance ratio. When the data are imbalanced, the predictive performance of software defect prediction studies is low. As the data become more balanced, the predictive performance of prediction models increases, from an average MCC of 0.15, until the minority class makes up 20% of the instances in the dataset, where the MCC reaches an average value of about 0.34. As the proportion of the minority class increases above 20%, the predictive performance does not significantly increase. Using datasets with more than 20% of the instances being defective has not had a significant impact on the predictive performance when using MCC. We conclude that comparing the results of defect prediction studies should take into account the imbalance of the data.",
keywords = "Data Imbalance, Defect Prediction, Machine Learning",
author = "Zaheed Mahmood and David Bowes and Lane, {Peter C R} and Tracy Hall",
year = "2015",
month = oct,
day = "21",
doi = "10.1145/2810146.2810150",
language = "English",
booktitle = "PROMISE '15 Proceedings of the 11th International Conference on Predictive Models and Data Analytics in Software Engineering",
publisher = "Association for Computing Machinery, Inc",
note = "11th International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE 2015 ; Conference date: 21-10-2015",

}

RIS

TY - GEN

T1 - What is the impact of imbalance on software defect prediction performance?

AU - Mahmood, Zaheed

AU - Bowes, David

AU - Lane, Peter C R

AU - Hall, Tracy

PY - 2015/10/21

Y1 - 2015/10/21

N2 - Software defect prediction performance varies over a large range. Menzies suggested there is a ceiling effect of 80% Recall [8]. Most of the data sets used are highly imbalanced. This paper asks, what is the empirical effect of using different datasets with varying levels of imbalance on predictive performance? We use data synthesised by a previous meta-analysis of 600 fault prediction models and their results. Four model evaluation measures (the Matthews Correlation Coefficient (MCC), F-Measure, Precision and Recall) are compared to the corresponding data imbalance ratio. When the data are imbalanced, the predictive performance of software defect prediction studies is low. As the data become more balanced, the predictive performance of prediction models increases, from an average MCC of 0.15, until the minority class makes up 20% of the instances in the dataset, where the MCC reaches an average value of about 0.34. As the proportion of the minority class increases above 20%, the predictive performance does not significantly increase. Using datasets with more than 20% of the instances being defective has not had a significant impact on the predictive performance when using MCC. We conclude that comparing the results of defect prediction studies should take into account the imbalance of the data.

AB - Software defect prediction performance varies over a large range. Menzies suggested there is a ceiling effect of 80% Recall [8]. Most of the data sets used are highly imbalanced. This paper asks, what is the empirical effect of using different datasets with varying levels of imbalance on predictive performance? We use data synthesised by a previous meta-analysis of 600 fault prediction models and their results. Four model evaluation measures (the Matthews Correlation Coefficient (MCC), F-Measure, Precision and Recall) are compared to the corresponding data imbalance ratio. When the data are imbalanced, the predictive performance of software defect prediction studies is low. As the data become more balanced, the predictive performance of prediction models increases, from an average MCC of 0.15, until the minority class makes up 20% of the instances in the dataset, where the MCC reaches an average value of about 0.34. As the proportion of the minority class increases above 20%, the predictive performance does not significantly increase. Using datasets with more than 20% of the instances being defective has not had a significant impact on the predictive performance when using MCC. We conclude that comparing the results of defect prediction studies should take into account the imbalance of the data.

KW - Data Imbalance

KW - Defect Prediction

KW - Machine Learning

U2 - 10.1145/2810146.2810150

DO - 10.1145/2810146.2810150

M3 - Conference contribution/Paper

AN - SCOPUS:84947591877

BT - PROMISE '15 Proceedings of the 11th International Conference on Predictive Models and Data Analytics in Software Engineering

PB - Association for Computing Machinery, Inc

CY - New York

T2 - 11th International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE 2015

Y2 - 21 October 2015

ER -