Generalized Functional Pruning Optimal Partitioning (GFPOP) for Constrained Changepoint Detection in Genomic Data

School Of Mathematical Sciences

Associated organisational units

Text available via DOI:

https://doi.org/10.18637/jss.v101.i10
Final published version
Available under license: CC BY

Keywords

Dynamic programming, optimal changepoint detection, peak detection, genomic data, R.

View graph of relations

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Generalized Functional Pruning Optimal Partitioning (GFPOP) for Constrained Changepoint Detection in Genomic Data. / Hocking, Toby Dylan; Rigaill, Guillem; Fearnhead, Paul et al.
In: Journal of Statistical Software, Vol. 101, No. 10, 07.02.2022.

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Bibtex

@article{d5474ccc86b74ad79f7de38386f2e844,

title = "Generalized Functional Pruning Optimal Partitioning (GFPOP) for Constrained Changepoint Detection in Genomic Data",

abstract = " We describe a new algorithm and R package for peak detection in genomic data sets using constrained changepoint algorithms. These detect changes from background to peak regions by imposing the constraint that the mean should alternately increase then decrease. An existing algorithm for this problem exists, and gives state-of-the-art accuracy results, but it is computationally expensive when the number of changes is large. We propose the GFPOP algorithm that jointly estimates the number of peaks and their locations by minimizing a cost function which consists of a data fitting term and a penalty for each changepoint. Empirically this algorithm has a cost that is $O(N \log(N))$ for analysing data of length $N$. We also propose a sequential search algorithm that finds the best solution with $K$ segments in $O(\log(K)N \log(N))$ time, which is much faster than the previous $O(KN \log(N))$ algorithm. We show that our disk-based implementation in the PeakSegDisk R package can be used to quickly compute constrained optimal models with many changepoints, which are needed to analyze typical genomic data sets that have tens of millions of observations. ",

keywords = "Dynamic programming, optimal changepoint detection, peak detection, genomic data, R.",

author = "Hocking, {Toby Dylan} and Guillem Rigaill and Paul Fearnhead and Guillaume Bourque",

year = "2022",

month = feb,

day = "7",

doi = "10.18637/jss.v101.i10",

language = "English",

volume = "101",

journal = "Journal of Statistical Software",

issn = "1548-7660",

publisher = "University of California at Los Angeles",

number = "10",

}

RIS

TY - JOUR

T1 - Generalized Functional Pruning Optimal Partitioning (GFPOP) for Constrained Changepoint Detection in Genomic Data

AU - Hocking, Toby Dylan

AU - Rigaill, Guillem

AU - Fearnhead, Paul

AU - Bourque, Guillaume

PY - 2022/2/7

Y1 - 2022/2/7

N2 - We describe a new algorithm and R package for peak detection in genomic data sets using constrained changepoint algorithms. These detect changes from background to peak regions by imposing the constraint that the mean should alternately increase then decrease. An existing algorithm for this problem exists, and gives state-of-the-art accuracy results, but it is computationally expensive when the number of changes is large. We propose the GFPOP algorithm that jointly estimates the number of peaks and their locations by minimizing a cost function which consists of a data fitting term and a penalty for each changepoint. Empirically this algorithm has a cost that is $O(N \log(N))$ for analysing data of length $N$. We also propose a sequential search algorithm that finds the best solution with $K$ segments in $O(\log(K)N \log(N))$ time, which is much faster than the previous $O(KN \log(N))$ algorithm. We show that our disk-based implementation in the PeakSegDisk R package can be used to quickly compute constrained optimal models with many changepoints, which are needed to analyze typical genomic data sets that have tens of millions of observations.

AB - We describe a new algorithm and R package for peak detection in genomic data sets using constrained changepoint algorithms. These detect changes from background to peak regions by imposing the constraint that the mean should alternately increase then decrease. An existing algorithm for this problem exists, and gives state-of-the-art accuracy results, but it is computationally expensive when the number of changes is large. We propose the GFPOP algorithm that jointly estimates the number of peaks and their locations by minimizing a cost function which consists of a data fitting term and a penalty for each changepoint. Empirically this algorithm has a cost that is $O(N \log(N))$ for analysing data of length $N$. We also propose a sequential search algorithm that finds the best solution with $K$ segments in $O(\log(K)N \log(N))$ time, which is much faster than the previous $O(KN \log(N))$ algorithm. We show that our disk-based implementation in the PeakSegDisk R package can be used to quickly compute constrained optimal models with many changepoints, which are needed to analyze typical genomic data sets that have tens of millions of observations.

KW - Dynamic programming

KW - optimal changepoint detection

KW - peak detection

KW - genomic data

KW - R.

U2 - 10.18637/jss.v101.i10

DO - 10.18637/jss.v101.i10

M3 - Journal article

VL - 101

JO - Journal of Statistical Software

JF - Journal of Statistical Software

SN - 1548-7660

IS - 10

ER -

Research

Associated organisational units

Links

Text available via DOI:

Keywords