The optimal unbiased value estimator and its relation to LSTD, TD and MC

Associated organisational units

School of Mathematical Sciences

Text available via DOI:

https://doi.org/10.1007/s10994-010-5220-9
Final published version

Keywords

Optimal unbiased value estimator, Maximum likelihood value estimator, Sufficient statistics, Lehmann-Scheffe theorem

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

The optimal unbiased value estimator and its relation to LSTD, TD and MC. / Grunewalder, S.; Obermayer, K.
In: Machine Learning, Vol. 83, No. 3, 06.2011, p. 289-330.

Harvard

Grunewalder, S & Obermayer, K 2011, 'The optimal unbiased value estimator and its relation to LSTD, TD and MC', Machine Learning, vol. 83, no. 3, pp. 289-330. https://doi.org/10.1007/s10994-010-5220-9

APA

Grunewalder, S., & Obermayer, K. (2011). The optimal unbiased value estimator and its relation to LSTD, TD and MC. Machine Learning, 83(3), 289-330. https://doi.org/10.1007/s10994-010-5220-9

Vancouver

Grunewalder S, Obermayer K. The optimal unbiased value estimator and its relation to LSTD, TD and MC. Machine Learning. 2011 Jun;83(3):289-330. Epub 2010 Oct 29. doi: 10.1007/s10994-010-5220-9

Author

Grunewalder, S. ; Obermayer, K. / The optimal unbiased value estimator and its relation to LSTD, TD and MC. In: Machine Learning. 2011 ; Vol. 83, No. 3. pp. 289-330.

Bibtex

@article{0d5995295e1048af8e7145be6f8593ee,

title = "The optimal unbiased value estimator and its relation to LSTD, TD and MC",

abstract = "In this analytical study we derive the optimal unbiased value estimator (MVU) and compare its statistical risk to three well known value estimators: Temporal Difference learning (TD), Monte Carlo estimation (MC) and Least-Squares Temporal Difference Learning (LSTD). We demonstrate that LSTD is equivalent to the MVU if the Markov Reward Process (MRP) is acyclic and show that both differ for most cyclic MRPs as LSTD is then typically biased. More generally, we show that estimators that fulfill the Bellman equation can only be unbiased for special cyclic MRPs. The reason for this is that at each state the bias is calculated with a different probability measure and due to the strong coupling by the Bellman equation it is typically not possible for a set of value estimators to be unbiased with respect to each of these measures. Furthermore, we derive relations of the MVU to MC and TD. The most important of these relations is the equivalence of MC to the MVU and to LSTD for undiscounted MRPs in which MC has the same amount of information. In the discounted case this equivalence does not hold anymore. For TD we show that it is essentially unbiased for acyclic MRPs and biased for cyclic MRPs. We also order estimators according to their risk and present counter-examples to show that no general ordering exists between the MVU and LSTD, between MC and LSTD and between TD and MC. Theoretical results are supported by examples and an empirical evaluation.",

keywords = "Optimal unbiased value estimator, Maximum likelihood value estimator, Sufficient statistics, Lehmann-Scheffe theorem",

author = "S. Grunewalder and K. Obermayer",

year = "2011",

month = jun,

doi = "10.1007/s10994-010-5220-9",

language = "English",

volume = "83",

pages = "289--330",

journal = "Machine Learning",

issn = "1573-0565",

publisher = "Springer Netherlands",

number = "3",

}

RIS

TY - JOUR

T1 - The optimal unbiased value estimator and its relation to LSTD, TD and MC

AU - Grunewalder, S.

AU - Obermayer, K.

PY - 2011/6

Y1 - 2011/6

N2 - In this analytical study we derive the optimal unbiased value estimator (MVU) and compare its statistical risk to three well known value estimators: Temporal Difference learning (TD), Monte Carlo estimation (MC) and Least-Squares Temporal Difference Learning (LSTD). We demonstrate that LSTD is equivalent to the MVU if the Markov Reward Process (MRP) is acyclic and show that both differ for most cyclic MRPs as LSTD is then typically biased. More generally, we show that estimators that fulfill the Bellman equation can only be unbiased for special cyclic MRPs. The reason for this is that at each state the bias is calculated with a different probability measure and due to the strong coupling by the Bellman equation it is typically not possible for a set of value estimators to be unbiased with respect to each of these measures. Furthermore, we derive relations of the MVU to MC and TD. The most important of these relations is the equivalence of MC to the MVU and to LSTD for undiscounted MRPs in which MC has the same amount of information. In the discounted case this equivalence does not hold anymore. For TD we show that it is essentially unbiased for acyclic MRPs and biased for cyclic MRPs. We also order estimators according to their risk and present counter-examples to show that no general ordering exists between the MVU and LSTD, between MC and LSTD and between TD and MC. Theoretical results are supported by examples and an empirical evaluation.

AB - In this analytical study we derive the optimal unbiased value estimator (MVU) and compare its statistical risk to three well known value estimators: Temporal Difference learning (TD), Monte Carlo estimation (MC) and Least-Squares Temporal Difference Learning (LSTD). We demonstrate that LSTD is equivalent to the MVU if the Markov Reward Process (MRP) is acyclic and show that both differ for most cyclic MRPs as LSTD is then typically biased. More generally, we show that estimators that fulfill the Bellman equation can only be unbiased for special cyclic MRPs. The reason for this is that at each state the bias is calculated with a different probability measure and due to the strong coupling by the Bellman equation it is typically not possible for a set of value estimators to be unbiased with respect to each of these measures. Furthermore, we derive relations of the MVU to MC and TD. The most important of these relations is the equivalence of MC to the MVU and to LSTD for undiscounted MRPs in which MC has the same amount of information. In the discounted case this equivalence does not hold anymore. For TD we show that it is essentially unbiased for acyclic MRPs and biased for cyclic MRPs. We also order estimators according to their risk and present counter-examples to show that no general ordering exists between the MVU and LSTD, between MC and LSTD and between TD and MC. Theoretical results are supported by examples and an empirical evaluation.

KW - Optimal unbiased value estimator

KW - Maximum likelihood value estimator

KW - Sufficient statistics

KW - Lehmann-Scheffe theorem

U2 - 10.1007/s10994-010-5220-9

DO - 10.1007/s10994-010-5220-9

M3 - Journal article

VL - 83

SP - 289

EP - 330

JO - Machine Learning

JF - Machine Learning

SN - 1573-0565

IS - 3

ER -
