Constrained Dynamic Programming and Supervised Penalty Learning Algorithms for Peak Detection in Genomic Data

School Of Mathematical Sciences

Associated organisational units

Electronic data

JMLR_GPDPA_paper
Accepted author manuscript, 699 KB, PDF document
Available under license: CC BY: Creative Commons Attribution 4.0 International License

Keywords

stat.CO, q-bio.GN, stat.ML

View graph of relations

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Constrained Dynamic Programming and Supervised Penalty Learning Algorithms for Peak Detection in Genomic Data. / Hocking, Toby Dylan; Rigaill, Guillem; Fearnhead, Paul et al.
In: Journal of Machine Learning Research, Vol. 21, 31.03.2020, p. 1-40.

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Bibtex

@article{5d488d67e72f47b59bf794bb9826e2bd,

title = "Constrained Dynamic Programming and Supervised Penalty Learning Algorithms for Peak Detection in Genomic Data",

abstract = "Peak detection in genomic data involves segmenting counts of DNA sequence reads aligned to different locations of a chromosome. The goal is to detect peaks with higher counts, and filter out background noise with lower counts. Most existing algorithms for this problem are unsupervised heuristics tailored to patterns in specific data types. We propose a supervised framework for this problem, using optimal changepoint detection models with learned penalty functions. We propose the first dynamic programming algorithm that is guaranteed to compute the optimal solution to changepoint detection problems with constraints between adjacent segment mean parameters. Implementing this algorithm requires the choice of penalty parameter that determines the number of segments that are estimated. We show how the supervised learning ideas of Rigaill et al. (2013) can be used to choose this penalty. We compare the resulting implementation of our algorithm to several baselines in a benchmark of labeled ChIP-seq data sets with two dierent patterns (broad H3K36me3 data and sharp H3K4me3 data). Whereas baseline unsupervised methods only provide accuratepeak detection for a single pattern, our supervised method achieves state-of-the-artaccuracy in all data sets. The log-linear timings of our proposed dynamic programming algorithm make it scalable to the large genomic data sets that are now common. Our implementation is available in the PeakSegOptimal R package on CRAN.",

keywords = "stat.CO, q-bio.GN, stat.ML",

author = "Hocking, {Toby Dylan} and Guillem Rigaill and Paul Fearnhead and Guillaume Bourque",

note = "Early version of this work is available as arXiv.1703.03352",

year = "2020",

month = mar,

day = "31",

language = "English",

volume = "21",

pages = "1--40",

journal = "Journal of Machine Learning Research",

issn = "1532-4435",

publisher = "Microtome Publishing",

}

RIS

TY - JOUR

T1 - Constrained Dynamic Programming and Supervised Penalty Learning Algorithms for Peak Detection in Genomic Data

AU - Hocking, Toby Dylan

AU - Rigaill, Guillem

AU - Fearnhead, Paul

AU - Bourque, Guillaume

N1 - Early version of this work is available as arXiv.1703.03352

PY - 2020/3/31

Y1 - 2020/3/31

N2 - Peak detection in genomic data involves segmenting counts of DNA sequence reads aligned to different locations of a chromosome. The goal is to detect peaks with higher counts, and filter out background noise with lower counts. Most existing algorithms for this problem are unsupervised heuristics tailored to patterns in specific data types. We propose a supervised framework for this problem, using optimal changepoint detection models with learned penalty functions. We propose the first dynamic programming algorithm that is guaranteed to compute the optimal solution to changepoint detection problems with constraints between adjacent segment mean parameters. Implementing this algorithm requires the choice of penalty parameter that determines the number of segments that are estimated. We show how the supervised learning ideas of Rigaill et al. (2013) can be used to choose this penalty. We compare the resulting implementation of our algorithm to several baselines in a benchmark of labeled ChIP-seq data sets with two dierent patterns (broad H3K36me3 data and sharp H3K4me3 data). Whereas baseline unsupervised methods only provide accuratepeak detection for a single pattern, our supervised method achieves state-of-the-artaccuracy in all data sets. The log-linear timings of our proposed dynamic programming algorithm make it scalable to the large genomic data sets that are now common. Our implementation is available in the PeakSegOptimal R package on CRAN.

AB - Peak detection in genomic data involves segmenting counts of DNA sequence reads aligned to different locations of a chromosome. The goal is to detect peaks with higher counts, and filter out background noise with lower counts. Most existing algorithms for this problem are unsupervised heuristics tailored to patterns in specific data types. We propose a supervised framework for this problem, using optimal changepoint detection models with learned penalty functions. We propose the first dynamic programming algorithm that is guaranteed to compute the optimal solution to changepoint detection problems with constraints between adjacent segment mean parameters. Implementing this algorithm requires the choice of penalty parameter that determines the number of segments that are estimated. We show how the supervised learning ideas of Rigaill et al. (2013) can be used to choose this penalty. We compare the resulting implementation of our algorithm to several baselines in a benchmark of labeled ChIP-seq data sets with two dierent patterns (broad H3K36me3 data and sharp H3K4me3 data). Whereas baseline unsupervised methods only provide accuratepeak detection for a single pattern, our supervised method achieves state-of-the-artaccuracy in all data sets. The log-linear timings of our proposed dynamic programming algorithm make it scalable to the large genomic data sets that are now common. Our implementation is available in the PeakSegOptimal R package on CRAN.

KW - stat.CO

KW - q-bio.GN

KW - stat.ML

M3 - Journal article

VL - 21

SP - 1

EP - 40

JO - Journal of Machine Learning Research

JF - Journal of Machine Learning Research

SN - 1532-4435

ER -

Research

Associated organisational units

Electronic data

Links

Keywords