
On optimal selection of summary statistics for approximate Bayesian computation

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

On optimal selection of summary statistics for approximate Bayesian computation. / Nunes, Matthew A.; Balding, David J.
In: Statistical Applications in Genetics and Molecular Biology, Vol. 9, No. 1, 34, 2010.


Harvard

Nunes, MA & Balding, DJ 2010, 'On optimal selection of summary statistics for approximate Bayesian computation', Statistical Applications in Genetics and Molecular Biology, vol. 9, no. 1, 34. https://doi.org/10.2202/1544-6115.1576

APA

Nunes, M. A., & Balding, D. J. (2010). On optimal selection of summary statistics for approximate Bayesian computation. Statistical Applications in Genetics and Molecular Biology, 9(1), Article 34. https://doi.org/10.2202/1544-6115.1576

Vancouver

Nunes MA, Balding DJ. On optimal selection of summary statistics for approximate Bayesian computation. Statistical Applications in Genetics and Molecular Biology. 2010;9(1):34. doi: 10.2202/1544-6115.1576

Author

Nunes, Matthew A.; Balding, David J. / On optimal selection of summary statistics for approximate Bayesian computation. In: Statistical Applications in Genetics and Molecular Biology. 2010; Vol. 9, No. 1.

Bibtex

@article{05936214c54e42b4a073fb6c34192ccf,
title = "On optimal selection of summary statistics for approximate Bayesian computation",
abstract = "How best to summarize large and complex datasets is a problem that arises in many areas of science. We approach it from the point of view of seeking data summaries that minimize the average squared error of the posterior distribution for a parameter of interest under approximate Bayesian computation (ABC). In ABC, simulation under the model replaces computation of the likelihood, which is convenient for many complex models. Simulated and observed datasets are usually compared using summary statistics, typically in practice chosen on the basis of the investigator's intuition and established practice in the field. We propose two algorithms for automated choice of efficient data summaries. Firstly, we motivate minimisation of the estimated entropy of the posterior approximation as a heuristic for the selection of summary statistics. Secondly, we propose a two-stage procedure: the minimum-entropy algorithm is used to identify simulated datasets close to that observed, and these are each successively regarded as observed datasets for which the mean root integrated squared error of the ABC posterior approximation is minimized over sets of summary statistics. In a simulation study, we both singly and jointly inferred the scaled mutation and recombination parameters from a population sample of DNA sequences. The computationally-fast minimum entropy algorithm showed a modest improvement over existing methods while our two-stage procedure showed substantial and highly-significant further improvement for both univariate and bivariate inferences. We found that the optimal set of summary statistics was highly dataset specific, suggesting that more generally there may be no globally-optimal choice, which argues for a new selection for each dataset even if the model and target of inference are unchanged.",
keywords = "data reduction, computational statistics, likelihood free inference, entropy, sufficiency, CHAIN MONTE-CARLO, ENTROPY, LIKELIHOOD, VARIANCE, DISTRIBUTIONS, ESTIMATORS, INFERENCE, MODEL",
author = "Nunes, {Matthew A.} and Balding, {David J.}",
year = "2010",
doi = "10.2202/1544-6115.1576",
language = "English",
volume = "9",
journal = "Statistical Applications in Genetics and Molecular Biology",
issn = "2194-6302",
publisher = "Berkeley Electronic Press",
number = "1",
pages = "34",
}
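The abstract's first algorithm selects the subset of candidate summary statistics whose rejection-ABC posterior has minimum estimated entropy. The following is a toy sketch of that idea, not the authors' implementation: the Normal model, uniform prior, candidate summaries, acceptance rule, and the simplified nearest-neighbour entropy estimate are all illustrative assumptions.

```python
import math
import random
import statistics
from itertools import combinations

random.seed(0)

def simulate(mu, n=30):
    # toy model: an iid Normal(mu, 1) sample (an assumption for illustration)
    return [random.gauss(mu, 1.0) for _ in range(n)]

def summaries(data):
    # candidate summary statistics; only some are informative for mu
    return {"mean": statistics.mean(data),
            "sd": statistics.stdev(data),
            "max": max(data)}

def abc_rejection(obs_stats, keys, n_sims=2000, accept=100):
    # rejection ABC: draw mu from the prior, simulate, and keep the draws
    # whose chosen summaries lie closest (Euclidean) to the observed ones
    draws = []
    for _ in range(n_sims):
        mu = random.uniform(-5, 5)          # uniform prior (assumption)
        s = summaries(simulate(mu))
        d = math.sqrt(sum((s[k] - obs_stats[k]) ** 2 for k in keys))
        draws.append((d, mu))
    draws.sort()
    return [mu for _, mu in draws[:accept]]

def knn_entropy(sample, k=4):
    # crude 1-d kth-nearest-neighbour entropy estimate; the paper's
    # estimator is more careful (digamma terms), this is a sketch
    n = len(sample)
    total = 0.0
    for i, x in enumerate(sample):
        dists = sorted(abs(x - y) for j, y in enumerate(sample) if j != i)
        eps = max(dists[k - 1], 1e-12)
        total += math.log(2.0 * eps)
    return total / n + math.log(n) - math.log(k)

obs = summaries(simulate(1.5))              # pretend observed data, mu = 1.5
# score every non-empty subset of summaries by posterior entropy, keep the best
best = min((knn_entropy(abc_rejection(obs, ks)), ks)
           for r in (1, 2, 3)
           for ks in combinations(("mean", "sd", "max"), r))
print("selected summaries:", best[1])
```

A lower-entropy ABC posterior is more concentrated, which is the heuristic's proxy for a more informative summary set.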

RIS

TY - JOUR

T1 - On optimal selection of summary statistics for approximate Bayesian computation

AU - Nunes, Matthew A.

AU - Balding, David J.

PY - 2010

Y1 - 2010

N2 - How best to summarize large and complex datasets is a problem that arises in many areas of science. We approach it from the point of view of seeking data summaries that minimize the average squared error of the posterior distribution for a parameter of interest under approximate Bayesian computation (ABC). In ABC, simulation under the model replaces computation of the likelihood, which is convenient for many complex models. Simulated and observed datasets are usually compared using summary statistics, typically in practice chosen on the basis of the investigator's intuition and established practice in the field. We propose two algorithms for automated choice of efficient data summaries. Firstly, we motivate minimisation of the estimated entropy of the posterior approximation as a heuristic for the selection of summary statistics. Secondly, we propose a two-stage procedure: the minimum-entropy algorithm is used to identify simulated datasets close to that observed, and these are each successively regarded as observed datasets for which the mean root integrated squared error of the ABC posterior approximation is minimized over sets of summary statistics. In a simulation study, we both singly and jointly inferred the scaled mutation and recombination parameters from a population sample of DNA sequences. The computationally-fast minimum entropy algorithm showed a modest improvement over existing methods while our two-stage procedure showed substantial and highly-significant further improvement for both univariate and bivariate inferences. We found that the optimal set of summary statistics was highly dataset specific, suggesting that more generally there may be no globally-optimal choice, which argues for a new selection for each dataset even if the model and target of inference are unchanged.

AB - How best to summarize large and complex datasets is a problem that arises in many areas of science. We approach it from the point of view of seeking data summaries that minimize the average squared error of the posterior distribution for a parameter of interest under approximate Bayesian computation (ABC). In ABC, simulation under the model replaces computation of the likelihood, which is convenient for many complex models. Simulated and observed datasets are usually compared using summary statistics, typically in practice chosen on the basis of the investigator's intuition and established practice in the field. We propose two algorithms for automated choice of efficient data summaries. Firstly, we motivate minimisation of the estimated entropy of the posterior approximation as a heuristic for the selection of summary statistics. Secondly, we propose a two-stage procedure: the minimum-entropy algorithm is used to identify simulated datasets close to that observed, and these are each successively regarded as observed datasets for which the mean root integrated squared error of the ABC posterior approximation is minimized over sets of summary statistics. In a simulation study, we both singly and jointly inferred the scaled mutation and recombination parameters from a population sample of DNA sequences. The computationally-fast minimum entropy algorithm showed a modest improvement over existing methods while our two-stage procedure showed substantial and highly-significant further improvement for both univariate and bivariate inferences. We found that the optimal set of summary statistics was highly dataset specific, suggesting that more generally there may be no globally-optimal choice, which argues for a new selection for each dataset even if the model and target of inference are unchanged.

KW - data reduction

KW - computational statistics

KW - likelihood free inference

KW - entropy

KW - sufficiency

KW - CHAIN MONTE-CARLO

KW - ENTROPY

KW - LIKELIHOOD

KW - VARIANCE

KW - DISTRIBUTIONS

KW - ESTIMATORS

KW - INFERENCE

KW - MODEL

U2 - 10.2202/1544-6115.1576

DO - 10.2202/1544-6115.1576

M3 - Journal article

VL - 9

JO - Statistical Applications in Genetics and Molecular Biology

JF - Statistical Applications in Genetics and Molecular Biology

SN - 2194-6302

IS - 1

M1 - 34

ER -
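The second stage of the two-stage procedure described in the abstract scores summary choices by the error of the ABC posterior on pseudo-observed datasets whose generating parameters are known. This is a toy sketch under illustrative assumptions (Normal model, uniform prior, single-statistic candidates, plain root-mean-squared error in place of the paper's mean root integrated squared error); in the paper, the pseudo-observed datasets come from the stage-one minimum-entropy step, which is stubbed out here with random draws.

```python
import math
import random
import statistics

random.seed(1)

def simulate(mu, n=30):
    # toy model: an iid Normal(mu, 1) sample (an assumption for illustration)
    return [random.gauss(mu, 1.0) for _ in range(n)]

def abc_posterior(obs_summary, stat_fn, n_sims=1000, accept=50):
    # one-statistic rejection ABC: keep prior draws whose simulated
    # summary is closest to the pseudo-observed summary
    draws = []
    for _ in range(n_sims):
        mu = random.uniform(-5, 5)          # uniform prior (assumption)
        d = abs(stat_fn(simulate(mu)) - obs_summary)
        draws.append((d, mu))
    draws.sort()
    return [mu for _, mu in draws[:accept]]

# candidate summaries to compare (hypothetical choices)
candidates = {"mean": statistics.mean, "median": statistics.median, "max": max}

# stand-ins for the datasets stage one would identify as close to the
# observed one; crucially, their generating parameters are known
true_mus = [random.uniform(-2, 2) for _ in range(5)]

def rmse(stat_fn):
    # average squared error of ABC posterior draws around the true parameter
    errs = []
    for mu in true_mus:
        post = abc_posterior(stat_fn(simulate(mu)), stat_fn)
        errs.extend((m - mu) ** 2 for m in post)
    return math.sqrt(sum(errs) / len(errs))

scores = {name: rmse(fn) for name, fn in candidates.items()}
print("best summary by RMSE:", min(scores, key=scores.get))
```

Because the true parameters of the pseudo-observed datasets are known, the error of each candidate's ABC posterior can be measured directly, which is what lets this stage go beyond the entropy heuristic.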