Electronic data

  • automatically-identifying-code

    Rights statement: This is the author’s version of a work that was accepted for publication in Information and Software Technology. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Information and Software Technology, 106, 2019 DOI: 10.1016/j.infsof.2018.10.001

    Accepted author manuscript, 658 KB, PDF document

    Available under license: CC BY-NC-ND: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License

Automatically Identifying Code Features for Software Defect Prediction: Using AST N-grams

Research output: Contribution to Journal/Magazine > Journal article > peer-review

Published

Standard

Automatically Identifying Code Features for Software Defect Prediction: Using AST N-grams. / Shippey, Thomas; Bowes, David; Hall, Tracy.
In: Information and Software Technology, Vol. 106, 02.2019, p. 142-160.

Vancouver

Shippey T, Bowes D, Hall T. Automatically Identifying Code Features for Software Defect Prediction: Using AST N-grams. Information and Software Technology. 2019 Feb;106:142-160. Epub 2018 Oct 4. doi: 10.1016/j.infsof.2018.10.001

Author

Shippey, Thomas; Bowes, David; Hall, Tracy. / Automatically Identifying Code Features for Software Defect Prediction: Using AST N-grams. In: Information and Software Technology. 2019; Vol. 106. pp. 142-160.

BibTeX

@article{a5c8f961a3b6470bbea31727046e1b1c,
title = "Automatically Identifying Code Features for Software Defect Prediction: Using AST N-grams",
abstract = "Context: Identifying defects in code early is important. A wide range of static code metrics have been evaluated as potential defect indicators. Most of these metrics offer only high level insights and focus on particular pre-selected features of the code. None of the currently used metrics clearly performs best in defect prediction. Objective: We use Abstract Syntax Tree (AST) n-grams to identify features of defective Java code that improve defect prediction performance. Method: Our approach is bottom-up and does not rely on pre-selecting any specific features of code. We use non-parametric testing to determine relationships between AST n-grams and faults in both open source and commercial systems. We build defect prediction models using three machine learning techniques. Results: We show that AST n-grams are very significantly related to faults in some systems, with very large effect sizes. The occurrence of some frequently occurring AST n-grams in a method can mean that the method is up to three times more likely to contain a fault. AST n-grams can have a large effect on the performance of defect prediction models. Conclusions: We suggest that AST n-grams offer developers a promising approach to identifying potentially defective code.",
author = "Thomas Shippey and David Bowes and Tracy Hall",
note = "This is the author{\textquoteright}s version of a work that was accepted for publication in Information and Software Technology. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Information and Software Technology, 106, 2019 DOI: 10.1016/j.infsof.2018.10.001",
year = "2019",
month = feb,
doi = "10.1016/j.infsof.2018.10.001",
language = "English",
volume = "106",
pages = "142--160",
journal = "Information and Software Technology",
issn = "0950-5849",
publisher = "Elsevier",

}

RIS

TY - JOUR

T1 - Automatically Identifying Code Features for Software Defect Prediction

T2 - Using AST N-grams

AU - Shippey, Thomas

AU - Bowes, David

AU - Hall, Tracy

N1 - This is the author’s version of a work that was accepted for publication in Information and Software Technology. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Information and Software Technology, 106, 2019 DOI: 10.1016/j.infsof.2018.10.001

PY - 2019/2

Y1 - 2019/2

AB - Context: Identifying defects in code early is important. A wide range of static code metrics have been evaluated as potential defect indicators. Most of these metrics offer only high level insights and focus on particular pre-selected features of the code. None of the currently used metrics clearly performs best in defect prediction. Objective: We use Abstract Syntax Tree (AST) n-grams to identify features of defective Java code that improve defect prediction performance. Method: Our approach is bottom-up and does not rely on pre-selecting any specific features of code. We use non-parametric testing to determine relationships between AST n-grams and faults in both open source and commercial systems. We build defect prediction models using three machine learning techniques. Results: We show that AST n-grams are very significantly related to faults in some systems, with very large effect sizes. The occurrence of some frequently occurring AST n-grams in a method can mean that the method is up to three times more likely to contain a fault. AST n-grams can have a large effect on the performance of defect prediction models. Conclusions: We suggest that AST n-grams offer developers a promising approach to identifying potentially defective code.

U2 - 10.1016/j.infsof.2018.10.001

DO - 10.1016/j.infsof.2018.10.001

M3 - Journal article

VL - 106

SP - 142

EP - 160

JO - Information and Software Technology

JF - Information and Software Technology

SN - 0950-5849

ER -