TY - GEN
T1 - Value Iteration for Long-Run Average Reward in Markov Decision Processes
AU - Ashok, Pranav
AU - Chatterjee, Krishnendu
AU - Daca, Przemyslaw
AU - Křetínský, Jan
AU - Meggendorfer, Tobias
PY - 2017/7/13
Y1 - 2017/7/13
N2 - Markov decision processes (MDPs) are standard models for probabilistic systems with non-deterministic behaviours. Long-run average rewards provide a mathematically elegant formalism for expressing long term performance. Value iteration (VI) is one of the simplest and most efficient algorithmic approaches to MDPs with other properties, such as reachability objectives. Unfortunately, a naive extension of VI does not work for MDPs with long-run average rewards, as there is no known stopping criterion. In this work our contributions are threefold. (1) We refute a conjecture related to stopping criteria for MDPs with long-run average rewards. (2) We present two practical algorithms for MDPs with long-run average rewards based on VI. First, we show that a combination of applying VI locally for each maximal end-component (MEC) and VI for reachability objectives can provide approximation guarantees. Second, extending the above approach with a simulation-guided on-demand variant of VI, we present an anytime algorithm that is able to deal with very large models. (3) Finally, we present experimental results showing that our methods significantly outperform the standard approaches on several benchmarks.
U2 - 10.1007/978-3-319-63387-9_10
DO - 10.1007/978-3-319-63387-9_10
M3 - Conference contribution/Paper
SN - 9783319633862
T3 - Lecture Notes in Computer Science
SP - 201
EP - 221
BT - Computer Aided Verification (CAV 2017)
A2 - Majumdar, R.
A2 - Kunčak, V.
PB - Springer
CY - Cham
ER -
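
The abstract describes value iteration (VI) for reachability objectives as the standard building block that the paper combines with per-MEC analysis. The following is a minimal illustrative sketch of that building block in Python on a hypothetical toy MDP (all state, action, and function names are invented for illustration). It is not the paper's MEC-based or simulation-guided on-demand algorithm, and its epsilon-based stopping test is exactly the kind of naive criterion that, as the abstract notes, gives no guarantees for long-run average rewards.

```python
# Minimal illustrative sketch, not the paper's algorithm: plain value iteration
# for a reachability objective on a tiny, hypothetical MDP. The paper's actual
# contributions (per-MEC VI, guaranteed stopping criteria, the on-demand
# anytime variant) are NOT implemented here.

def value_iteration_reach(mdp, target, epsilon=1e-8, max_iters=100_000):
    """Iterate V(s) <- max_a sum_t P(s, a, t) * V(t), with V fixed to 1 on targets.

    The loop stops when the largest per-sweep change drops below epsilon.
    This naive stopping criterion is used only for illustration; it does not
    by itself provide an approximation guarantee.
    """
    values = {s: (1.0 if s in target else 0.0) for s in mdp}
    for _ in range(max_iters):
        delta = 0.0
        for s in mdp:
            if s in target:
                continue  # target values stay pinned at 1
            best = max(
                sum(p * values[t] for t, p in successors)
                for successors in mdp[s].values()
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < epsilon:
            break
    return values


if __name__ == "__main__":
    # Hypothetical toy MDP: state -> action -> list of (successor, probability).
    toy_mdp = {
        "s0": {"go": [("s1", 0.5), ("s0", 0.5)], "quit": [("sink", 1.0)]},
        "s1": {"go": [("goal", 0.9), ("s0", 0.1)]},
        "sink": {"stay": [("sink", 1.0)]},
        "goal": {"stay": [("goal", 1.0)]},
    }
    # The computed values approximate the maximal probability of eventually
    # reaching "goal": s0 and s1 converge towards 1, while sink stays at 0.
    print(value_iteration_reach(toy_mdp, target={"goal"}))
```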