Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review
TY - GEN
T1 - The Jinx on the NASA software defect data sets
AU - Petrić, Jean
AU - Bowes, David
AU - Hall, Tracy
AU - Christianson, Bruce
AU - Baddoo, Nathan
N1 - © 2016 ACM. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in EASE '16 Proceedings of the 20th International Conference on Evaluation and Assessment in Software Engineering http://dx.doi.org/10.1145/2915970.2916007
PY - 2016/6/1
Y1 - 2016/6/1
N2 - Background: The NASA datasets have previously been used extensively in studies of software defects. In 2013 Shepperd et al. presented an essential set of rules for removing erroneous data from the NASA datasets, making this data more reliable to use. Objective: We have now found additional rules, not identified by Shepperd et al., that are necessary for removing problematic data. Results: In this paper, we demonstrate the level of erroneous data still present even after cleaning using Shepperd et al.'s rules and apply our new rules to remove this erroneous data. Conclusion: Even after systematic data cleaning of the NASA MDP datasets, we found new erroneous data. Data quality should always be explicitly considered by researchers before use.
AB - Background: The NASA datasets have previously been used extensively in studies of software defects. In 2013 Shepperd et al. presented an essential set of rules for removing erroneous data from the NASA datasets, making this data more reliable to use. Objective: We have now found additional rules, not identified by Shepperd et al., that are necessary for removing problematic data. Results: In this paper, we demonstrate the level of erroneous data still present even after cleaning using Shepperd et al.'s rules and apply our new rules to remove this erroneous data. Conclusion: Even after systematic data cleaning of the NASA MDP datasets, we found new erroneous data. Data quality should always be explicitly considered by researchers before use.
KW - Data quality
KW - Machine learning
KW - Software defect prediction
U2 - 10.1145/2915970.2916007
DO - 10.1145/2915970.2916007
M3 - Conference contribution/Paper
AN - SCOPUS:84978484033
BT - EASE '16 Proceedings of the 20th International Conference on Evaluation and Assessment in Software Engineering
PB - Association for Computing Machinery, Inc
CY - New York
T2 - 20th International Conference on Evaluation and Assessment in Software Engineering, EASE 2016
Y2 - 1 June 2016 through 3 June 2016
ER -