Instance sampling in credit scoring: An empirical study of sample size and balancing

Management Science

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Instance sampling in credit scoring: An empirical study of sample size and balancing. / Crone, Sven F.; Finlay, Steven.
In: International Journal of Forecasting, Vol. 28, No. 1, 01.2012, p. 224-238.

Harvard

Crone, SF & Finlay, S 2012, 'Instance sampling in credit scoring: An empirical study of sample size and balancing', International Journal of Forecasting, vol. 28, no. 1, pp. 224-238. https://doi.org/10.1016/j.ijforecast.2011.07.006

APA

Crone, S. F., & Finlay, S. (2012). Instance sampling in credit scoring: An empirical study of sample size and balancing. International Journal of Forecasting, 28(1), 224-238. https://doi.org/10.1016/j.ijforecast.2011.07.006

Vancouver

Crone SF, Finlay S. Instance sampling in credit scoring: An empirical study of sample size and balancing. International Journal of Forecasting. 2012 Jan;28(1):224-238. doi: 10.1016/j.ijforecast.2011.07.006

Author

Crone, Sven F.; Finlay, Steven. / Instance sampling in credit scoring: An empirical study of sample size and balancing. In: International Journal of Forecasting. 2012; Vol. 28, No. 1. pp. 224-238.

Bibtex

@article{89b83914c7f2499a8fa1844d6cb6004d,

title = "Instance sampling in credit scoring: An empirical study of sample size and balancing",

abstract = "To date, best practice in sampling credit applicants has been established based largely on expert opinion, which generally recommends that small samples of 1500 instances each of both goods and bads are sufficient, and that the heavily biased datasets observed should be balanced by undersampling the majority class. Consequently, the topics of sample sizes and sample balance have not been subject to either formal study in credit scoring, or empirical evaluations across different data conditions and algorithms of varying efficiency. This paper describes an empirical study of instance sampling in predicting consumer repayment behaviour, evaluating the relative accuracies of logistic regression, discriminant analysis, decision trees and neural networks on two datasets across 20 samples of increasing size and 29 rebalanced sample distributions created by gradually under- and over-sampling the goods and bads respectively. The paper makes a practical contribution to model building on credit scoring datasets, and provides evidence that using samples larger than those recommended in credit scoring practice provides a significant increase in accuracy across algorithms.",

keywords = "Credit scoring, Data pre-processing, Sample size, Under-sampling, Over-sampling, Balancing",

author = "Crone, {Sven F.} and Finlay, Steven",

year = "2012",

month = jan,

doi = "10.1016/j.ijforecast.2011.07.006",

language = "English",

volume = "28",

pages = "224--238",

journal = "International Journal of Forecasting",

issn = "0169-2070",

publisher = "Elsevier Science B.V.",

number = "1",

}

RIS

TY - JOUR

T1 - Instance sampling in credit scoring: An empirical study of sample size and balancing

AU - Crone, Sven F.

AU - Finlay, Steven

PY - 2012/1

Y1 - 2012/1

N2 - To date, best practice in sampling credit applicants has been established based largely on expert opinion, which generally recommends that small samples of 1500 instances each of both goods and bads are sufficient, and that the heavily biased datasets observed should be balanced by undersampling the majority class. Consequently, the topics of sample sizes and sample balance have not been subject to either formal study in credit scoring, or empirical evaluations across different data conditions and algorithms of varying efficiency. This paper describes an empirical study of instance sampling in predicting consumer repayment behaviour, evaluating the relative accuracies of logistic regression, discriminant analysis, decision trees and neural networks on two datasets across 20 samples of increasing size and 29 rebalanced sample distributions created by gradually under- and over-sampling the goods and bads respectively. The paper makes a practical contribution to model building on credit scoring datasets, and provides evidence that using samples larger than those recommended in credit scoring practice provides a significant increase in accuracy across algorithms.

AB - To date, best practice in sampling credit applicants has been established based largely on expert opinion, which generally recommends that small samples of 1500 instances each of both goods and bads are sufficient, and that the heavily biased datasets observed should be balanced by undersampling the majority class. Consequently, the topics of sample sizes and sample balance have not been subject to either formal study in credit scoring, or empirical evaluations across different data conditions and algorithms of varying efficiency. This paper describes an empirical study of instance sampling in predicting consumer repayment behaviour, evaluating the relative accuracies of logistic regression, discriminant analysis, decision trees and neural networks on two datasets across 20 samples of increasing size and 29 rebalanced sample distributions created by gradually under- and over-sampling the goods and bads respectively. The paper makes a practical contribution to model building on credit scoring datasets, and provides evidence that using samples larger than those recommended in credit scoring practice provides a significant increase in accuracy across algorithms.

KW - Credit scoring

KW - Data pre-processing

KW - Sample size

KW - Under-sampling

KW - Over-sampling

KW - Balancing

U2 - 10.1016/j.ijforecast.2011.07.006

DO - 10.1016/j.ijforecast.2011.07.006

M3 - Journal article

VL - 28

SP - 224

EP - 238

JO - International Journal of Forecasting

JF - International Journal of Forecasting

SN - 0169-2070

IS - 1

ER -
