Modelling transition dynamics in MDPs with RKHS embeddings

School Of Mathematical Sciences

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

Modelling transition dynamics in MDPs with RKHS embeddings. / Grunewalder, S.; Lever, G.; Baldassarre, L. et al.
Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. 2012.

Harvard

Grunewalder, S, Lever, G, Baldassarre, L, Pontil, M & Gretton, A 2012, Modelling transition dynamics in MDPs with RKHS embeddings. in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. <http://icml.cc/2012/papers/301.pdf>

APA

Grunewalder, S., Lever, G., Baldassarre, L., Pontil, M., & Gretton, A. (2012). Modelling transition dynamics in MDPs with RKHS embeddings. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. http://icml.cc/2012/papers/301.pdf

Vancouver

Grunewalder S, Lever G, Baldassarre L, Pontil M, Gretton A. Modelling transition dynamics in MDPs with RKHS embeddings. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. 2012

Author

Grunewalder, S. ; Lever, G. ; Baldassarre, L. et al. / Modelling transition dynamics in MDPs with RKHS embeddings. Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. 2012.

Bibtex

@inproceedings{80890f2a5fe840e28264a957f61cadfe,

title = "Modelling transition dynamics in {MDPs} with {RKHS} embeddings",

abstract = "We propose a new, nonparametric approach to learning and representing transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation. This approach makes use of a recently developed representation of conditional distributions as embeddings in a reproducing kernel Hilbert space (RKHS). Such representations bypass the need for estimating transition probabilities or densities, and apply to any domain on which kernels can be defined. This avoids the need to calculate intractable integrals, since expectations are represented as RKHS inner products whose computation has linear complexity in the number of points used to represent the embedding. We are able to provide guarantees for the proposed applications in MDPs: in the context of a value iteration algorithm, we prove convergence to either the optimal policy, or to the closest projection of the optimal policy in our model class (an RKHS), under reasonable assumptions. In experiments, we investigate a learning task in a typical classical control setting (the under-actuated pendulum), and on a navigation problem where only images from a sensor are observed. For policy optimisation we compare with least-squares policy iteration where a Gaussian process is used for value function estimation. For value estimation we also compare to the recent NPDP method. Our approach achieves better performance in all experiments.",

author = "S. Grunewalder and G. Lever and L. Baldassarre and M. Pontil and A. Gretton",

year = "2012",

language = "English",

booktitle = "Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012",

}

RIS

TY - CONF

T1 - Modelling transition dynamics in MDPs with RKHS embeddings

AU - Grunewalder, S.

AU - Lever, G.

AU - Baldassarre, L.

AU - Pontil, M.

AU - Gretton, A.

PY - 2012

Y1 - 2012

N2 - We propose a new, nonparametric approach to learning and representing transition dynamics in Markov decision processes (MDPs), which can be combined easily with dynamic programming methods for policy optimisation and value estimation. This approach makes use of a recently developed representation of conditional distributions as embeddings in a reproducing kernel Hilbert space (RKHS). Such representations bypass the need for estimating transition probabilities or densities, and apply to any domain on which kernels can be defined. This avoids the need to calculate intractable integrals, since expectations are represented as RKHS inner products whose computation has linear complexity in the number of points used to represent the embedding. We are able to provide guarantees for the proposed applications in MDPs: in the context of a value iteration algorithm, we prove convergence to either the optimal policy, or to the closest projection of the optimal policy in our model class (an RKHS), under reasonable assumptions. In experiments, we investigate a learning task in a typical classical control setting (the under-actuated pendulum), and on a navigation problem where only images from a sensor are observed. For policy optimisation we compare with least-squares policy iteration where a Gaussian process is used for value function estimation. For value estimation we also compare to the recent NPDP method. Our approach achieves better performance in all experiments.

M3 - Conference contribution/Paper

BT - Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012

ER -
