Research output: Contribution to Journal/Magazine › Journal article
TY - JOUR
T1 - Implementing Monte Carlo Tests with P-value Buckets
AU - Gandy, Axel
AU - Hahn, Georg
AU - Ding, Dong
PY - 2017/3/27
Y1 - 2017/3/27
N2 - Software packages usually report the results of statistical tests using p-values. Users often interpret these by comparing them to standard thresholds, e.g. 0.1%, 1% and 5%, which is sometimes reinforced by a star rating (***, **, *). In this article, we consider an arbitrary statistical test whose p-value p is not available explicitly, but can be approximated by Monte Carlo samples, e.g. by bootstrap or permutation tests. The standard implementation of such tests usually draws a fixed number of samples to approximate p. However, the probability that the exact and the approximated p-value lie on different sides of a threshold (the resampling risk) can be high, particularly for p-values close to a threshold. We present a method to overcome this. We consider a finite set of user-specified intervals which cover [0,1] and which can be overlapping. We call these p-value buckets. We present algorithms that, with arbitrarily high probability, return a p-value bucket containing p. We prove that for both a bounded resampling risk and a finite runtime, overlapping buckets need to be employed, and that our methods both bound the resampling risk and guarantee a finite runtime for such overlapping buckets. To interpret decisions with overlapping buckets, we propose an extension of the star rating system. We demonstrate that our methods are suitable for use in standard software, including for low p-values occurring in multiple testing settings, and that they can be computationally more efficient than standard implementations.
KW - stat.ME
M3 - Journal article
JO - arxiv.org
JF - arxiv.org
ER -
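
As a companion to the abstract above, the following is a minimal, illustrative Python sketch of the general idea of reporting a p-value bucket rather than a point estimate. It is not the algorithm of Gandy, Hahn and Ding: the bucket set, the risk level, the hypothetical exceeds() sampler, and the naive use of a repeatedly evaluated Clopper-Pearson interval are all placeholder assumptions; the paper's methods bound the resampling risk uniformly over the sampling process, which a repeatedly recomputed fixed-level confidence interval does not.

# Illustrative sketch only; not the algorithm from Gandy, Hahn & Ding.
# Draws Monte Carlo samples until a Clopper-Pearson confidence interval
# for the unknown p-value fits inside one of the (possibly overlapping)
# user-specified p-value buckets. Note: checking a fixed-level interval
# after every sample does not give uniform coverage over the whole
# sampling process, so this is conceptual rather than risk-bounding.

import random
from scipy.stats import beta

# Overlapping buckets covering [0, 1]; chosen for illustration only.
BUCKETS = [(0.0, 0.001), (0.0008, 0.0012), (0.001, 0.01), (0.008, 0.012),
           (0.01, 0.05), (0.04, 0.06), (0.05, 1.0)]

def clopper_pearson(k, n, risk=1e-3):
    """Two-sided (1 - risk) Clopper-Pearson interval for a binomial proportion."""
    lo = 0.0 if k == 0 else beta.ppf(risk / 2, k, n - k + 1)
    hi = 1.0 if k == n else beta.ppf(1 - risk / 2, k + 1, n - k)
    return lo, hi

def bucket_for_p(exceeds, max_samples=10**6, risk=1e-3):
    """Sample until some bucket contains the confidence interval for p.
    `exceeds()` is a placeholder: it should return True when a resampled
    statistic is at least as extreme as the observed one."""
    k = n = 0
    while n < max_samples:
        k += exceeds()
        n += 1
        lo, hi = clopper_pearson(k, n, risk)
        for a, b in BUCKETS:
            if a <= lo and hi <= b:
                return (a, b), k / n
    return None, k / n  # no decision within the sample budget

# Example: a permutation-style sampler whose true p-value is 0.03.
if __name__ == "__main__":
    random.seed(1)
    bucket, p_hat = bucket_for_p(lambda: random.random() < 0.03)
    print("bucket:", bucket, "estimate:", round(p_hat, 4))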