Individual Q-learning in normal form games

Associated organisational units: School of Mathematical Sciences

Text available via DOI: https://doi.org/10.1137/S0363012903437976 (final published version)

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Individual Q-learning in normal form games. / Leslie, David S.; Collins, E. J.
In: SIAM Journal on Control and Optimization, Vol. 44, No. 2, 01.01.2005, p. 495-514.

Harvard

Leslie, DS & Collins, EJ 2005, 'Individual Q-learning in normal form games', SIAM Journal on Control and Optimization, vol. 44, no. 2, pp. 495-514. https://doi.org/10.1137/S0363012903437976

APA

Leslie, D. S., & Collins, E. J. (2005). Individual Q-learning in normal form games. SIAM Journal on Control and Optimization, 44(2), 495-514. https://doi.org/10.1137/S0363012903437976

Vancouver

Leslie DS, Collins EJ. Individual Q-learning in normal form games. SIAM Journal on Control and Optimization. 2005 Jan 1;44(2):495-514. doi: 10.1137/S0363012903437976

Author

Leslie, David S. ; Collins, E. J. / Individual Q-learning in normal form games. In: SIAM Journal on Control and Optimization. 2005 ; Vol. 44, No. 2. pp. 495-514.

Bibtex

@article{2f4ecf2014c446088d4c3584c0355437,

title = "Individual Q-learning in normal form games",

abstract = "The single-agent multi-armed bandit problem can be solved by an agent that learns the values of each action using reinforcement learning. However, the multi-agent version of the problem, the iterated normal form game, presents a more complex challenge, since the rewards available to each agent depend on the strategies of the others. We consider the behavior of value-based learning agents in this situation, and show that such agents cannot generally play at a Nash equilibrium, although if smooth best responses are used, a Nash distribution can be reached. We introduce a particular value-based learning algorithm, which we call individual Q-learning, and use stochastic approximation to study the asymptotic behavior, showing that strategies will converge to Nash distribution almost surely in 2-player zero-sum games and 2-player partnership games. Player-dependent learning rates are then considered, and it is shown that this extension converges in some games for which many algorithms, including the basic algorithm initially considered, fail to converge.",

author = "Leslie, {David S.} and Collins, {E. J.}",

year = "2005",

month = jan,

day = "1",

doi = "10.1137/S0363012903437976",

language = "English",

volume = "44",

pages = "495--514",

journal = "SIAM Journal on Control and Optimization",

issn = "0363-0129",

publisher = "Society for Industrial and Applied Mathematics Publications",

number = "2",

}

RIS

TY - JOUR

T1 - Individual Q-learning in normal form games

AU - Leslie, David S.

AU - Collins, E. J.

PY - 2005/1/1

Y1 - 2005/1/1

AB - The single-agent multi-armed bandit problem can be solved by an agent that learns the values of each action using reinforcement learning. However, the multi-agent version of the problem, the iterated normal form game, presents a more complex challenge, since the rewards available to each agent depend on the strategies of the others. We consider the behavior of value-based learning agents in this situation, and show that such agents cannot generally play at a Nash equilibrium, although if smooth best responses are used, a Nash distribution can be reached. We introduce a particular value-based learning algorithm, which we call individual Q-learning, and use stochastic approximation to study the asymptotic behavior, showing that strategies will converge to Nash distribution almost surely in 2-player zero-sum games and 2-player partnership games. Player-dependent learning rates are then considered, and it is shown that this extension converges in some games for which many algorithms, including the basic algorithm initially considered, fail to converge.

U2 - 10.1137/S0363012903437976

DO - 10.1137/S0363012903437976

M3 - Journal article

VL - 44

SP - 495

EP - 514

JO - SIAM Journal on Control and Optimization

JF - SIAM Journal on Control and Optimization

SN - 0363-0129

IS - 2

ER -
