Reflections on the NASA MDP data sets

Research output: Contribution to Journal/Magazine > Journal article > peer-review

Journal publication date: 2012
Journal: IET Software
Issue number: 6
Volume: 6
Number of pages: 10
Pages (from-to): 549-558
Publication status: Published
Original language: English

Abstract

Background: The NASA Metrics Data Program (MDP) data sets have been heavily used in software defect prediction research. Aim: To highlight the data quality issues present in these data sets, and the problems that can arise when they are used in a binary classification context. Method: A thorough exploration of all 13 original NASA data sets, followed by various experiments demonstrating the potential impact of duplicate data points when data mining. Conclusions: Firstly, researchers need to analyse the data that forms the basis of their findings in the context of how it will be used. Secondly, the bulk of defect prediction experiments based on the NASA MDP data sets may have led to erroneous findings. This is mainly because repeated/duplicate data points can cause substantial amounts of training and testing data to be identical.
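
The duplicate-data concern in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure or data: the helper names `leaked_fraction` and `split_rows`, the toy rows, and the split sizes are all hypothetical, chosen only to show how repeated rows can put identical points in both the training and the testing partitions, and how removing duplicates before splitting avoids that.

```python
import random


def leaked_fraction(train, test):
    """Fraction of test rows whose exact (features + label) tuple also occurs in train."""
    if not test:
        return 0.0
    seen = set(train)
    return sum(1 for row in test if row in seen) / len(test)


def split_rows(rows, n_test, seed):
    """Shuffle a copy of the rows and hold out the last n_test rows as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:-n_test], shuffled[-n_test:]


# Hypothetical toy data: (metric_a, metric_b, defective) tuples, all distinct.
unique_rows = [(i, i % 7, i % 2) for i in range(100)]
# Simulate a data set in which 60 of the rows are repeated once.
duplicated_rows = unique_rows + unique_rows[:60]

# Splitting the raw data lets copies of one row land on both sides of the split.
train, test = split_rows(duplicated_rows, n_test=48, seed=0)
print("leak with duplicates:", leaked_fraction(train, test))

# Dropping exact duplicates first (order-preserving) removes the overlap.
deduped = list(dict.fromkeys(duplicated_rows))
train_d, test_d = split_rows(deduped, n_test=30, seed=0)
print("leak after deduplication:", leaked_fraction(train_d, test_d))
```

Whether duplicates should be removed at all depends on how the data will be used, which is the abstract's first conclusion; the sketch only measures the overlap.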