A Good-Turing estimator for feature allocation models

School Of Mathematical Sciences

Associated organisational unit

Statistical Artificial Intelligence

Text available via DOI:

https://doi.org/10.1214/19-EJS1614
Final published version
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

A Good-Turing estimator for feature allocation models. / Ayed, Fadhel; Battiston, Marco; Camerlenghi, Federico et al.
In: Electronic Journal of Statistics, Vol. 13, No. 2, 01.10.2019, p. 3775-3804.

Harvard

Ayed, F, Battiston, M, Camerlenghi, F & Favaro, S 2019, 'A Good-Turing estimator for feature allocation models', Electronic Journal of Statistics, vol. 13, no. 2, pp. 3775-3804. https://doi.org/10.1214/19-EJS1614

APA

Ayed, F., Battiston, M., Camerlenghi, F., & Favaro, S. (2019). A Good-Turing estimator for feature allocation models. Electronic Journal of Statistics, 13(2), 3775-3804. https://doi.org/10.1214/19-EJS1614

Vancouver

Ayed F, Battiston M, Camerlenghi F, Favaro S. A Good-Turing estimator for feature allocation models. Electronic Journal of Statistics. 2019 Oct 1;13(2):3775-3804. Epub 2019 Sept 13. doi: 10.1214/19-EJS1614

Author

Ayed, Fadhel ; Battiston, Marco ; Camerlenghi, Federico et al. / A Good-Turing estimator for feature allocation models. In: Electronic Journal of Statistics. 2019 ; Vol. 13, No. 2. pp. 3775-3804.

Bibtex

@article{6c3693cf89404887b851650b38d5fc54,

title = "A Good-Turing estimator for feature allocation models",

abstract = "Feature allocation models generalize classical species sampling models by allowing every observation to belong to more than one species, now called features. Under the popular Bernoulli product model for feature allocation, we assume $n$ observable samples and we consider the problem of estimating the expected number $M_n$ of hitherto unseen features that would be observed if one additional individual was sampled. The interest in estimating $M_n$ is motivated by numerous applied problems where the sampling procedure is expensive, in terms of time and/or financial resources allocated, and further samples can be only motivated by the possibility of recording new unobserved features. We consider a nonparametric estimator $\hat{M}_n$ of $M_n$ which has the same analytic form of the popular Good-Turing estimator of the missing mass in the context of species sampling models. We show that $\hat{M}_n$ admits a natural interpretation both as a jackknife estimator and as a nonparametric empirical Bayes estimator. Furthermore, we give provable guarantees for the performance of $\hat{M}_n$ in terms of minimax rate optimality, and we provide with an interesting connection between $\hat{M}_n$ and the Good-Turing estimator for species sampling. Finally, we derive non-asymptotic confidence intervals for $\hat{M}_n$, which are easily computable and do not rely on any asymptotic approximation. Our approach is illustrated with synthetic data and SNP data from the ENCODE sequencing genome project.",

author = "Fadhel Ayed and Marco Battiston and Federico Camerlenghi and Stefano Favaro",

year = "2019",

month = oct,

day = "1",

doi = "10.1214/19-EJS1614",

language = "English",

volume = "13",

pages = "3775--3804",

journal = "Electronic Journal of Statistics",

issn = "1935-7524",

publisher = "Institute of Mathematical Statistics",

number = "2",

}

RIS

TY - JOUR

T1 - A Good-Turing estimator for feature allocation models

AU - Ayed, Fadhel

AU - Battiston, Marco

AU - Camerlenghi, Federico

AU - Favaro, Stefano

PY - 2019/10/1

Y1 - 2019/10/1

N2 - Feature allocation models generalize classical species sampling models by allowing every observation to belong to more than one species, now called features. Under the popular Bernoulli product model for feature allocation, we assume $n$ observable samples and we consider the problem of estimating the expected number $M_n$ of hitherto unseen features that would be observed if one additional individual was sampled. The interest in estimating $M_n$ is motivated by numerous applied problems where the sampling procedure is expensive, in terms of time and/or financial resources allocated, and further samples can be only motivated by the possibility of recording new unobserved features. We consider a nonparametric estimator $\hat{M}_n$ of $M_n$ which has the same analytic form of the popular Good-Turing estimator of the missing mass in the context of species sampling models. We show that $\hat{M}_n$ admits a natural interpretation both as a jackknife estimator and as a nonparametric empirical Bayes estimator. Furthermore, we give provable guarantees for the performance of $\hat{M}_n$ in terms of minimax rate optimality, and we provide with an interesting connection between $\hat{M}_n$ and the Good-Turing estimator for species sampling. Finally, we derive non-asymptotic confidence intervals for $\hat{M}_n$, which are easily computable and do not rely on any asymptotic approximation. Our approach is illustrated with synthetic data and SNP data from the ENCODE sequencing genome project.

U2 - 10.1214/19-EJS1614

DO - 10.1214/19-EJS1614

M3 - Journal article

VL - 13

SP - 3775

EP - 3804

JO - Electronic Journal of Statistics

JF - Electronic Journal of Statistics

SN - 1935-7524

IS - 2

ER -
