


A Novel Implementation of Q-Learning for the Whittle Index

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Publication date: 8/12/2021
Host publication: Performance Evaluation Methodologies and Tools - 14th EAI International Conference, VALUETOOLS 2021, Proceedings
Editors: Qianchuan Zhao, Li Xia
Place of publication: Cham
Number of pages: 17
ISBN (electronic): 9783030925116
ISBN (print): 9783030925109
Original language: English

Publication series

Name: Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST
Volume: 404 LNICST
ISSN (print): 1867-8211
ISSN (electronic): 1867-822X


We develop a method for learning index rules for multi-armed bandits, restless bandits, and dynamic resource allocation problems in which the underlying transition probabilities and reward structure of the system are not known. Our approach builds on both stochastic optimisation (specifically, the Whittle index) and reinforcement learning (specifically, Q-learning). We propose a novel implementation of Q-learning that exploits the structure of the problem considered: the algorithm maintains two sets of Q-values for each project, one for reward and one for resource consumption. Based on these ideas we design a learning algorithm and illustrate its performance by comparing it to the state-of-the-art Q-learning algorithm for the Whittle index by Avrachenkov and Borkar. Both algorithms rely on Q-learning to estimate the Whittle index policy; however, the way Q-learning is used in each algorithm differs dramatically. Our approach appears to deliver similar or better performance and is potentially applicable to a much broader and more general set of problems.
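The core idea described above, maintaining two Q-tables per project (one for reward, one for resource consumption) and combining them into an index estimate, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact algorithm: the three-state arm, its transition matrices `P`, rewards `r`, costs `c`, and the ratio-style index rule at the end are all hypothetical choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state restless arm; transition matrices P[action] and
# rewards r[action, state] are illustrative, not taken from the paper.
n_states, n_actions = 3, 2
P = np.array([
    [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.1, 0.3, 0.6]],   # passive
    [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]],   # active
])
r = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])  # reward per (action, state)
c = np.array([0.0, 1.0])   # resource consumed by each action (active uses 1 unit)

gamma = 0.9   # discount factor
alpha = 0.1   # learning rate
eps = 0.1     # epsilon-greedy exploration rate

Q_r = np.zeros((n_states, n_actions))  # Q-values for reward
Q_c = np.zeros((n_states, n_actions))  # Q-values for resource consumption

s = 0
for t in range(20000):
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q_r[s].argmax())
    s_next = rng.choice(n_states, p=P[a, s])
    # Two coupled tabular Q-learning updates that share the same greedy action,
    # one tracking discounted reward and one tracking discounted consumption.
    a_next = int(Q_r[s_next].argmax())
    Q_r[s, a] += alpha * (r[a, s] + gamma * Q_r[s_next, a_next] - Q_r[s, a])
    Q_c[s, a] += alpha * (c[a] + gamma * Q_c[s_next, a_next] - Q_c[s, a])
    s = s_next

# Index estimate per state: marginal reward of activating divided by marginal
# resource consumption -- a heuristic stand-in for the paper's index rule.
denom = Q_c[:, 1] - Q_c[:, 0]
index = (Q_r[:, 1] - Q_r[:, 0]) / np.where(np.abs(denom) > 1e-9, denom, 1e-9)
print(index)
```

Under a learned index rule of this kind, a resource-constrained controller would activate, at each step, the projects whose current-state index is largest, which is how Whittle-index policies are typically deployed.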