Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review
TY - GEN
T1 - On-line adaptation of exploration in the one-armed bandit with covariates problem
AU - Sykulski, Adam M.
AU - Adams, Niall M.
AU - Jennings, Nicholas R.
PY - 2010
Y1 - 2010
N2 - Many sequential decision making problems require an agent to balance exploration and exploitation to maximise long-term reward. Existing policies that address this tradeoff typically have parameters that are set a priori to control the amount of exploration. In finite-time problems, the optimal values of these parameters are highly dependent on the problem faced. In this paper, we propose adapting the amount of exploration performed on-line, as information is gathered by the agent. To this end, we introduce a novel algorithm, e-ADAPT, which has no free parameters. The algorithm adapts as it plays and sequentially chooses whether to explore or exploit, driven by the amount of uncertainty in the system. We provide simulation results for the one-armed bandit with covariates problem, which demonstrate the effectiveness of e-ADAPT in correctly controlling the amount of exploration in finite-time problems and yielding rewards close to those of optimally tuned off-line policies. Furthermore, we show that e-ADAPT is robust to a high-dimensional covariate, as well as to misspecified models. Finally, we describe how our methods could be extended to other sequential decision making problems, such as dynamic bandit problems with changing reward structures.
AB - Many sequential decision making problems require an agent to balance exploration and exploitation to maximise long-term reward. Existing policies that address this tradeoff typically have parameters that are set a priori to control the amount of exploration. In finite-time problems, the optimal values of these parameters are highly dependent on the problem faced. In this paper, we propose adapting the amount of exploration performed on-line, as information is gathered by the agent. To this end, we introduce a novel algorithm, e-ADAPT, which has no free parameters. The algorithm adapts as it plays and sequentially chooses whether to explore or exploit, driven by the amount of uncertainty in the system. We provide simulation results for the one-armed bandit with covariates problem, which demonstrate the effectiveness of e-ADAPT in correctly controlling the amount of exploration in finite-time problems and yielding rewards close to those of optimally tuned off-line policies. Furthermore, we show that e-ADAPT is robust to a high-dimensional covariate, as well as to misspecified models. Finally, we describe how our methods could be extended to other sequential decision making problems, such as dynamic bandit problems with changing reward structures.
KW - Exploration-exploitation tradeoff
KW - On-line learning
KW - One-armed bandit problem
KW - Sequential decision making
U2 - 10.1109/ICMLA.2010.74
DO - 10.1109/ICMLA.2010.74
M3 - Conference contribution/Paper
SN - 9780769543000
SP - 459
EP - 464
BT - Proceedings - 9th International Conference on Machine Learning and Applications, ICMLA 2010
PB - IEEE
ER -