Posterior weighted reinforcement learning with state uncertainty

School Of Mathematical Sciences

Associated organisational units

Text available via DOI:

https://doi.org/10.1162/neco.2010.01-09-948
Final published version

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Posterior weighted reinforcement learning with state uncertainty. / Larsen, Tobias; Leslie, David S.; Collins, Edmund J. et al.
In: Neural Computation, Vol. 22, No. 5, 01.05.2010, p. 1149-1179.

Harvard

Larsen, T, Leslie, DS, Collins, EJ & Bogacz, R 2010, 'Posterior weighted reinforcement learning with state uncertainty', Neural Computation, vol. 22, no. 5, pp. 1149-1179. https://doi.org/10.1162/neco.2010.01-09-948

APA

Larsen, T., Leslie, D. S., Collins, E. J., & Bogacz, R. (2010). Posterior weighted reinforcement learning with state uncertainty. Neural Computation, 22(5), 1149-1179. https://doi.org/10.1162/neco.2010.01-09-948

Vancouver

Larsen T, Leslie DS, Collins EJ, Bogacz R. Posterior weighted reinforcement learning with state uncertainty. Neural Computation. 2010 May 1;22(5):1149-1179. doi: 10.1162/neco.2010.01-09-948

Author

Larsen, Tobias ; Leslie, David S. ; Collins, Edmund J. et al. / Posterior weighted reinforcement learning with state uncertainty. In: Neural Computation. 2010 ; Vol. 22, No. 5. pp. 1149-1179.

Bibtex

@article{85455f66bbe14e289f39bfb17c6932bb,

title = "Posterior weighted reinforcement learning with state uncertainty",

abstract = "Reinforcement learning models generally assume that a stimulus is presented that allows a learner to unambiguously identify the state of nature, and the reward received is drawn from a distribution that depends on that state. However, in any natural environment, the stimulus is noisy. When there is state uncertainty, it is no longer immediately obvious how to perform reinforcement learning, since the observed reward cannot be unambiguously allocated to a state of the environment. This letter addresses the problem of incorporating state uncertainty in reinforcement learning models. We show that simply ignoring the uncertainty and allocating the reward to the most likely state of the environment results in incorrect value estimates. Furthermore, using only the information that is available before observing the reward also results in incorrect estimates. We therefore introduce a new technique, posterior weighted reinforcement learning, in which the estimates of state probabilities are updated according to the observed rewards (e.g., if a learner observes a reward usually associated with a particular state, this state becomes more likely). We show analytically that this modified algorithm can converge to correct reward estimates and confirm this with numerical experiments. The algorithm is shown to be a variant of the expectation-maximization algorithm, allowing rigorous convergence analyses to be carried out. A possible neural implementation of the algorithm in the cortico-basal-ganglia-thalamic network is presented, and experimental predictions of our model are discussed.",

author = "Tobias Larsen and Leslie, {David S.} and Collins, {Edmund J.} and Rafal Bogacz",

year = "2010",

month = may,

day = "1",

doi = "10.1162/neco.2010.01-09-948",

language = "English",

volume = "22",

pages = "1149--1179",

journal = "Neural Computation",

issn = "0899-7667",

publisher = "MIT Press Journals",

number = "5",

}

RIS

TY - JOUR

T1 - Posterior weighted reinforcement learning with state uncertainty

AU - Larsen, Tobias

AU - Leslie, David S.

AU - Collins, Edmund J.

AU - Bogacz, Rafal

PY - 2010/5/1

Y1 - 2010/5/1

N2 - Reinforcement learning models generally assume that a stimulus is presented that allows a learner to unambiguously identify the state of nature, and the reward received is drawn from a distribution that depends on that state. However, in any natural environment, the stimulus is noisy. When there is state uncertainty, it is no longer immediately obvious how to perform reinforcement learning, since the observed reward cannot be unambiguously allocated to a state of the environment. This letter addresses the problem of incorporating state uncertainty in reinforcement learning models. We show that simply ignoring the uncertainty and allocating the reward to the most likely state of the environment results in incorrect value estimates. Furthermore, using only the information that is available before observing the reward also results in incorrect estimates. We therefore introduce a new technique, posterior weighted reinforcement learning, in which the estimates of state probabilities are updated according to the observed rewards (e.g., if a learner observes a reward usually associated with a particular state, this state becomes more likely). We show analytically that this modified algorithm can converge to correct reward estimates and confirm this with numerical experiments. The algorithm is shown to be a variant of the expectation-maximization algorithm, allowing rigorous convergence analyses to be carried out. A possible neural implementation of the algorithm in the cortico-basal-ganglia-thalamic network is presented, and experimental predictions of our model are discussed.

U2 - 10.1162/neco.2010.01-09-948

DO - 10.1162/neco.2010.01-09-948

M3 - Journal article

VL - 22

SP - 1149

EP - 1179

JO - Neural Computation

JF - Neural Computation

SN - 0899-7667

IS - 5

ER -
