- 2019BakerPhD
Final published version, 1 MB, PDF document

Available under license: CC BY-ND: Creative Commons Attribution-NoDerivatives 4.0 International License

Research output: Thesis › Doctoral Thesis

Published

**Large-scale Bayesian computation using Stochastic Gradient Markov Chain Monte Carlo.** / Baker, Jack.

Research output: Thesis › Doctoral Thesis

Baker, J 2019, 'Large-scale Bayesian computation using Stochastic Gradient Markov Chain Monte Carlo', PhD, Lancaster University. https://doi.org/10.17635/lancaster/thesis/512

Baker, J. (2019). *Large-scale Bayesian computation using Stochastic Gradient Markov Chain Monte Carlo*. Lancaster University. https://doi.org/10.17635/lancaster/thesis/512

Baker J. Large-scale Bayesian computation using Stochastic Gradient Markov Chain Monte Carlo. Lancaster University, 2019. 221 p. https://doi.org/10.17635/lancaster/thesis/512

@phdthesis{65c3c750f123490fac70d38bb7cea04e,

title = "Large-scale Bayesian computation using Stochastic Gradient Markov Chain Monte Carlo",

abstract = "Markov chain Monte Carlo (MCMC), one of the most popular methods for inference on Bayesian models, scales poorly with dataset size. This is because it requires one or more calculations over the full dataset at each iteration. Stochastic gradient Markov chain Monte Carlo (SGMCMC) has become a popular MCMC method that aims to be more scalable at large datasets. It only requires a subset of the full data at each iteration. This thesis builds upon the SGMCMC literature by providing contributions that improve the efficiency of SGMCMC; providing software that improves its ease-of-use; and removes large biases in the method for an important class of model. While SGMCMC has improved per-iteration computational cost over traditional MCMC, there have been empirical results suggesting that its overall computational cost (i.e. the cost for the algorithm to reach an arbitrary level of accuracy) is still $O(N)$, where $N$ is the dataset size. In light of this, we show how control variates can be used to develop an SGMCMC algorithm of $O(1)$, subject to two one-off preprocessing steps which each require a single pass through the dataset. While SGMCMC has gained significant popularity in the machine learning community, uptake among the statistics community has been slower. We suggest this may be due to lack of software, so as part of the contributions in this thesis we provide an R software package that automates much of the procedures required to build SGMCMC algorithms. Finally, we show that current algorithms for sampling from the simplex space using SGMCMC have inherent biases, especially when some of the parameter components are close to zero. To get around this, we develop an algorithm that is provably asymptotically unbiased. We empirically demonstrate its performance on a latent Dirichlet allocation model and a Dirichlet process model.",

author = "Jack Baker",

year = "2019",

doi = "10.17635/lancaster/thesis/512",

language = "English",

publisher = "Lancaster University",

school = "Lancaster University",

}

TY - THES

T1 - Large-scale Bayesian computation using Stochastic Gradient Markov Chain Monte Carlo

AU - Baker, Jack

PY - 2019

Y1 - 2019

N2 - Markov chain Monte Carlo (MCMC), one of the most popular methods for inference on Bayesian models, scales poorly with dataset size. This is because it requires one or more calculations over the full dataset at each iteration. Stochastic gradient Markov chain Monte Carlo (SGMCMC) has become a popular MCMC method that aims to be more scalable at large datasets. It only requires a subset of the full data at each iteration. This thesis builds upon the SGMCMC literature by providing contributions that improve the efficiency of SGMCMC; providing software that improves its ease-of-use; and removes large biases in the method for an important class of model. While SGMCMC has improved per-iteration computational cost over traditional MCMC, there have been empirical results suggesting that its overall computational cost (i.e. the cost for the algorithm to reach an arbitrary level of accuracy) is still $O(N)$, where $N$ is the dataset size. In light of this, we show how control variates can be used to develop an SGMCMC algorithm of $O(1)$, subject to two one-off preprocessing steps which each require a single pass through the dataset. While SGMCMC has gained significant popularity in the machine learning community, uptake among the statistics community has been slower. We suggest this may be due to lack of software, so as part of the contributions in this thesis we provide an R software package that automates much of the procedures required to build SGMCMC algorithms. Finally, we show that current algorithms for sampling from the simplex space using SGMCMC have inherent biases, especially when some of the parameter components are close to zero. To get around this, we develop an algorithm that is provably asymptotically unbiased. We empirically demonstrate its performance on a latent Dirichlet allocation model and a Dirichlet process model.

AB - Markov chain Monte Carlo (MCMC), one of the most popular methods for inference on Bayesian models, scales poorly with dataset size. This is because it requires one or more calculations over the full dataset at each iteration. Stochastic gradient Markov chain Monte Carlo (SGMCMC) has become a popular MCMC method that aims to be more scalable at large datasets. It only requires a subset of the full data at each iteration. This thesis builds upon the SGMCMC literature by providing contributions that improve the efficiency of SGMCMC; providing software that improves its ease-of-use; and removes large biases in the method for an important class of model. While SGMCMC has improved per-iteration computational cost over traditional MCMC, there have been empirical results suggesting that its overall computational cost (i.e. the cost for the algorithm to reach an arbitrary level of accuracy) is still $O(N)$, where $N$ is the dataset size. In light of this, we show how control variates can be used to develop an SGMCMC algorithm of $O(1)$, subject to two one-off preprocessing steps which each require a single pass through the dataset. While SGMCMC has gained significant popularity in the machine learning community, uptake among the statistics community has been slower. We suggest this may be due to lack of software, so as part of the contributions in this thesis we provide an R software package that automates much of the procedures required to build SGMCMC algorithms. Finally, we show that current algorithms for sampling from the simplex space using SGMCMC have inherent biases, especially when some of the parameter components are close to zero. To get around this, we develop an algorithm that is provably asymptotically unbiased. We empirically demonstrate its performance on a latent Dirichlet allocation model and a Dirichlet process model.

U2 - 10.17635/lancaster/thesis/512

DO - 10.17635/lancaster/thesis/512

M3 - Doctoral Thesis

PB - Lancaster University

ER -