Optimistic Bayesian sampling in contextual-bandit problems

School Of Mathematical Sciences

Associated organisational units

Electronic data

may12a
Final published version, 9.53 MB, PDF document
Available under license: None

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Optimistic Bayesian sampling in contextual-bandit problems. / May, Benedict C.; Korda, Nathan; Lee, Anthony et al.
In: Journal of Machine Learning Research, Vol. 13, 06.2012, p. 2069-2106.

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Harvard

May, BC, Korda, N, Lee, A & Leslie, DS 2012, 'Optimistic Bayesian sampling in contextual-bandit problems', Journal of Machine Learning Research, vol. 13, pp. 2069-2106. <http://jmlr.org/papers/v13/may12a.html>

APA

May, B. C., Korda, N., Lee, A., & Leslie, D. S. (2012). Optimistic Bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research, 13, 2069-2106. http://jmlr.org/papers/v13/may12a.html

Vancouver

May BC, Korda N, Lee A, Leslie DS. Optimistic Bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research. 2012 Jun;13:2069-2106.

Author

May, Benedict C. ; Korda, Nathan ; Lee, Anthony et al. / Optimistic Bayesian sampling in contextual-bandit problems. In: Journal of Machine Learning Research. 2012 ; Vol. 13. pp. 2069-2106.

Bibtex

@article{601019b1f1d94189899bb43ace66b4de,

title = "Optimistic Bayesian sampling in contextual-bandit problems",

abstract = "In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson's method throughout. ",

author = "May, {Benedict C.} and Nathan Korda and Anthony Lee and Leslie, {David S.}",

note = "This research was undertaken as part of the ALADDIN (Autonomous Learning Agents for Decentralised Data and Information Networks) project and is jointly funded by a BAE Systems and EPRSC (Engineering and Physical Sciences Research Council) strategic partnership (EP/C548051/1).",

year = "2012",

month = jun,

language = "English",

volume = "13",

pages = "2069--2106",

journal = "Journal of Machine Learning Research",

issn = "1532-4435",

publisher = "Microtome Publishing",

}

RIS

TY - JOUR

T1 - Optimistic Bayesian sampling in contextual-bandit problems

AU - May, Benedict C.

AU - Korda, Nathan

AU - Lee, Anthony

AU - Leslie, David S.

N1 - This research was undertaken as part of the ALADDIN (Autonomous Learning Agents for Decentralised Data and Information Networks) project and is jointly funded by a BAE Systems and EPRSC (Engineering and Physical Sciences Research Council) strategic partnership (EP/C548051/1).

PY - 2012/6

Y1 - 2012/6

N2 - In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson's method throughout.

AB - In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson's method throughout.

M3 - Journal article

VL - 13

SP - 2069

EP - 2106

JO - Journal of Machine Learning Research

JF - Journal of Machine Learning Research

SN - 1532-4435

ER -

Research

Associated organisational units

Electronic data

Links