A Novel Implementation of Q-Learning for the Whittle Index

Associated organisational units

Text available via DOI:

https://doi.org/10.1007/978-3-030-92511-6_10
Final published version

Keywords

Multi-armed bandits, Restless bandits, Reinforcement learning, Q-learning, Markov decision processes

View graph of relations

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published

Standard

A Novel Implementation of Q-Learning for the Whittle Index. / Gibson, Lachlan J.; Jacko, Peter; Nazarathy, Yoni.
Performance Evaluation Methodologies and Tools - 14th EAI International Conference, VALUETOOLS 2021, Proceedings. ed. / Qianchuan Zhao; Li Xia. Cham: Springer, 2021. p. 154-170 (Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST; Vol. 404 LNICST).

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Harvard

Gibson, LJ, Jacko, P & Nazarathy, Y 2021, A Novel Implementation of Q-Learning for the Whittle Index. in Q Zhao & L Xia (eds), Performance Evaluation Methodologies and Tools - 14th EAI International Conference, VALUETOOLS 2021, Proceedings. Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST, vol. 404 LNICST, Springer, Cham, pp. 154-170. https://doi.org/10.1007/978-3-030-92511-6_10

APA

Gibson, L. J., Jacko, P., & Nazarathy, Y. (2021). A Novel Implementation of Q-Learning for the Whittle Index. In Q. Zhao, & L. Xia (Eds.), Performance Evaluation Methodologies and Tools - 14th EAI International Conference, VALUETOOLS 2021, Proceedings (pp. 154-170). (Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST; Vol. 404 LNICST). Springer. https://doi.org/10.1007/978-3-030-92511-6_10

Vancouver

Gibson LJ, Jacko P, Nazarathy Y. A Novel Implementation of Q-Learning for the Whittle Index. In Zhao Q, Xia L, editors, Performance Evaluation Methodologies and Tools - 14th EAI International Conference, VALUETOOLS 2021, Proceedings. Cham: Springer. 2021. p. 154-170. (Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST). Epub 2021 Oct 31. doi: 10.1007/978-3-030-92511-6_10

Author

Gibson, Lachlan J. ; Jacko, Peter ; Nazarathy, Yoni. / A Novel Implementation of Q-Learning for the Whittle Index. Performance Evaluation Methodologies and Tools - 14th EAI International Conference, VALUETOOLS 2021, Proceedings. editor / Qianchuan Zhao ; Li Xia. Cham : Springer, 2021. pp. 154-170 (Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST).

Bibtex

@inproceedings{03f3cae2b46c421e8274a6868008fabb,

title = "A Novel Implementation of Q-Learning for the Whittle Index",

abstract = "We develop a method for learning index rules for multi-armed bandits, restless bandits, and dynamic resource allocation where the underlying transition probabilities and reward structure of the system is not known. Our approach builds on an understanding of both stochastic optimisation (specifically, the Whittle index) and reinforcement learning (specifically, Q-learning). We propose a novel implementation of Q-learning, which exploits the structure of the problem considered, in which the algorithm maintains two sets of Q-values for each project: one for reward and one for resource consumption. Based on these ideas we design a learning algorithm and illustrate its performance by comparing it to the state-of-the-art Q-learning algorithm for the Whittle index by Avrachenkov and Borkar. Both algorithms rely on Q-learning to estimate the Whittle index policy, however the nature in which Q-learning is used in each algorithm is dramatically different. Our approach seems to be able to deliver similar or better performance and is potentially applicable to a much broader and more general set of problems.",

keywords = "Multi-armed bandits, Restless bandits, Reinforcement learning, Q-learning, Markov decision processes",

author = "Gibson, {Lachlan J.} and Peter Jacko and Yoni Nazarathy",

year = "2021",

month = dec,

day = "8",

doi = "10.1007/978-3-030-92511-6_10",

language = "English",

isbn = "9783030925109",

series = "Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST",

publisher = "Springer",

pages = "154--170",

editor = "Qianchuan Zhao and Li Xia",

booktitle = "Performance Evaluation Methodologies and Tools - 14th EAI International Conference, VALUETOOLS 2021, Proceedings",

}

RIS

TY - GEN

T1 - A Novel Implementation of Q-Learning for the Whittle Index

AU - Gibson, Lachlan J.

AU - Jacko, Peter

AU - Nazarathy, Yoni

PY - 2021/12/8

Y1 - 2021/12/8

N2 - We develop a method for learning index rules for multi-armed bandits, restless bandits, and dynamic resource allocation where the underlying transition probabilities and reward structure of the system is not known. Our approach builds on an understanding of both stochastic optimisation (specifically, the Whittle index) and reinforcement learning (specifically, Q-learning). We propose a novel implementation of Q-learning, which exploits the structure of the problem considered, in which the algorithm maintains two sets of Q-values for each project: one for reward and one for resource consumption. Based on these ideas we design a learning algorithm and illustrate its performance by comparing it to the state-of-the-art Q-learning algorithm for the Whittle index by Avrachenkov and Borkar. Both algorithms rely on Q-learning to estimate the Whittle index policy, however the nature in which Q-learning is used in each algorithm is dramatically different. Our approach seems to be able to deliver similar or better performance and is potentially applicable to a much broader and more general set of problems.

AB - We develop a method for learning index rules for multi-armed bandits, restless bandits, and dynamic resource allocation where the underlying transition probabilities and reward structure of the system is not known. Our approach builds on an understanding of both stochastic optimisation (specifically, the Whittle index) and reinforcement learning (specifically, Q-learning). We propose a novel implementation of Q-learning, which exploits the structure of the problem considered, in which the algorithm maintains two sets of Q-values for each project: one for reward and one for resource consumption. Based on these ideas we design a learning algorithm and illustrate its performance by comparing it to the state-of-the-art Q-learning algorithm for the Whittle index by Avrachenkov and Borkar. Both algorithms rely on Q-learning to estimate the Whittle index policy, however the nature in which Q-learning is used in each algorithm is dramatically different. Our approach seems to be able to deliver similar or better performance and is potentially applicable to a much broader and more general set of problems.

KW - Multi-armed bandits

KW - Restless bandits

KW - Reinforcement learning

KW - Q-learning

KW - Markov decision processes

U2 - 10.1007/978-3-030-92511-6_10

DO - 10.1007/978-3-030-92511-6_10

M3 - Conference contribution/Paper

SN - 9783030925109

T3 - Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST

SP - 154

EP - 170

BT - Performance Evaluation Methodologies and Tools - 14th EAI International Conference, VALUETOOLS 2021, Proceedings

A2 - Zhao, Qianchuan

A2 - Xia, Li

PB - Springer

CY - Cham

ER -

Research

Associated organisational units

Links

Text available via DOI:

Keywords