
Convergent learning algorithms for unknown reward games

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Published

Standard

Convergent learning algorithms for unknown reward games. / Chapman, Archie C.; Leslie, David S.; Rogers, Alex et al.
In: SIAM Journal on Control and Optimization, Vol. 51, No. 4, 01.01.2013, p. 3154-3180.

Research output: Contribution to Journal/Magazine › Journal article › peer-review

Harvard

Chapman, AC, Leslie, DS, Rogers, A & Jennings, NR 2013, 'Convergent learning algorithms for unknown reward games', SIAM Journal on Control and Optimization, vol. 51, no. 4, pp. 3154-3180. https://doi.org/10.1137/120893501

APA

Chapman, A. C., Leslie, D. S., Rogers, A., & Jennings, N. R. (2013). Convergent learning algorithms for unknown reward games. SIAM Journal on Control and Optimization, 51(4), 3154-3180. https://doi.org/10.1137/120893501

Vancouver

Chapman AC, Leslie DS, Rogers A, Jennings NR. Convergent learning algorithms for unknown reward games. SIAM Journal on Control and Optimization. 2013 Jan 1;51(4):3154-3180. doi: 10.1137/120893501

Author

Chapman, Archie C. ; Leslie, David S. ; Rogers, Alex et al. / Convergent learning algorithms for unknown reward games. In: SIAM Journal on Control and Optimization. 2013 ; Vol. 51, No. 4. pp. 3154-3180.

Bibtex

@article{527a0cd97ae24eaea3568026c30fbb5b,
title = "Convergent learning algorithms for unknown reward games",
abstract = "In this paper, we address the problem of convergence to Nash equilibria in games with rewards that are initially unknown and must be estimated over time from noisy observations. These games arise in many real-world applications, whenever rewards for actions cannot be prespecified and must be learned online, but standard results in game theory do not consider such settings. For this problem, we derive a multiagent version of $\mathcal{Q}$-learning to estimate the reward functions using novel forms of the $\epsilon$-greedy learning policy. Using these $\mathcal{Q}$-learning schemes to estimate reward functions, we then provide conditions guaranteeing the convergence of adaptive play and the better-reply processes to Nash equilibria in potential games and games with more general forms of acyclicity, and of regret matching to the set of correlated equilibria in generic games. A secondary result is that we prove the strong ergodicity of stochastic adaptive play and stochastic better-reply processes in the case of vanishing perturbations. Finally, we illustrate the efficacy of the algorithms in a set of randomly generated three-player coordination games and show the practical necessity of our results by demonstrating that violations of the derived learning parameter conditions can cause the algorithms to fail to converge.",
author = "Chapman, {Archie C.} and Leslie, {David S.} and Alex Rogers and Jennings, {Nicholas R.}",
year = "2013",
month = jan,
day = "1",
doi = "10.1137/120893501",
language = "English",
volume = "51",
pages = "3154--3180",
journal = "SIAM Journal on Control and Optimization",
issn = "0363-0129",
publisher = "Society for Industrial and Applied Mathematics Publications",
number = "4",

}

RIS

TY - JOUR

T1 - Convergent learning algorithms for unknown reward games

AU - Chapman, Archie C.

AU - Leslie, David S.

AU - Rogers, Alex

AU - Jennings, Nicholas R.

PY - 2013/1/1

Y1 - 2013/1/1

N2 - In this paper, we address the problem of convergence to Nash equilibria in games with rewards that are initially unknown and must be estimated over time from noisy observations. These games arise in many real-world applications, whenever rewards for actions cannot be prespecified and must be learned online, but standard results in game theory do not consider such settings. For this problem, we derive a multiagent version of $\mathcal{Q}$-learning to estimate the reward functions using novel forms of the $\epsilon$-greedy learning policy. Using these $\mathcal{Q}$-learning schemes to estimate reward functions, we then provide conditions guaranteeing the convergence of adaptive play and the better-reply processes to Nash equilibria in potential games and games with more general forms of acyclicity, and of regret matching to the set of correlated equilibria in generic games. A secondary result is that we prove the strong ergodicity of stochastic adaptive play and stochastic better-reply processes in the case of vanishing perturbations. Finally, we illustrate the efficacy of the algorithms in a set of randomly generated three-player coordination games and show the practical necessity of our results by demonstrating that violations of the derived learning parameter conditions can cause the algorithms to fail to converge.

AB - In this paper, we address the problem of convergence to Nash equilibria in games with rewards that are initially unknown and must be estimated over time from noisy observations. These games arise in many real-world applications, whenever rewards for actions cannot be prespecified and must be learned online, but standard results in game theory do not consider such settings. For this problem, we derive a multiagent version of $\mathcal{Q}$-learning to estimate the reward functions using novel forms of the $\epsilon$-greedy learning policy. Using these $\mathcal{Q}$-learning schemes to estimate reward functions, we then provide conditions guaranteeing the convergence of adaptive play and the better-reply processes to Nash equilibria in potential games and games with more general forms of acyclicity, and of regret matching to the set of correlated equilibria in generic games. A secondary result is that we prove the strong ergodicity of stochastic adaptive play and stochastic better-reply processes in the case of vanishing perturbations. Finally, we illustrate the efficacy of the algorithms in a set of randomly generated three-player coordination games and show the practical necessity of our results by demonstrating that violations of the derived learning parameter conditions can cause the algorithms to fail to converge.

U2 - 10.1137/120893501

DO - 10.1137/120893501

M3 - Journal article

VL - 51

SP - 3154

EP - 3180

JO - SIAM Journal on Control and Optimization

JF - SIAM Journal on Control and Optimization

SN - 0363-0129

IS - 4

ER -
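
The abstract describes agents that estimate initially unknown rewards from noisy observations using Q-learning-style running averages and that act with ε-greedy exploration. The following is a minimal, illustrative Python sketch of that idea for a noisy two-player coordination game; the game, the exploration schedule, and the step sizes are assumptions made for illustration only and do not reproduce the learning parameter conditions or convergence guarantees derived in the paper.

import random
from collections import defaultdict

# Hypothetical 2-player coordination game with noisy rewards: both players
# receive reward 1 (plus Gaussian noise) when they pick the same action, 0 otherwise.
ACTIONS = [0, 1, 2]

def noisy_reward(profile):
    base = 1.0 if profile[0] == profile[1] else 0.0
    return base + random.gauss(0.0, 0.1)

class QGreedyPlayer:
    """Estimates its reward for each joint action with a running average
    (a Q-learning-style update) and plays epsilon-greedily against the last
    observed opponent action. The schedules below are illustrative
    placeholders, not the conditions derived in the paper."""

    def __init__(self, actions):
        self.actions = actions
        self.q = defaultdict(float)      # estimated reward per joint action
        self.counts = defaultdict(int)   # visit counts for step-size decay

    def act(self, opponent_last, t):
        eps = 1.0 / (1.0 + 0.1 * t)      # decaying exploration (assumed schedule)
        if opponent_last is None or random.random() < eps:
            return random.choice(self.actions)
        # Greedy with respect to estimated rewards against the opponent's last action.
        return max(self.actions, key=lambda a: self.q[(a, opponent_last)])

    def update(self, own_action, opponent_action, reward):
        key = (own_action, opponent_action)
        self.counts[key] += 1
        alpha = 1.0 / self.counts[key]   # decreasing step size (running average)
        self.q[key] += alpha * (reward - self.q[key])

def run(steps=5000):
    p1, p2 = QGreedyPlayer(ACTIONS), QGreedyPlayer(ACTIONS)
    last = (None, None)
    for t in range(steps):
        a1 = p1.act(last[1], t)
        a2 = p2.act(last[0], t)
        r = noisy_reward((a1, a2))
        p1.update(a1, a2, r)
        p2.update(a2, a1, r)
        last = (a1, a2)
    return last  # the joint action played at the final step

if __name__ == "__main__":
    print("final joint action:", run())

With sufficiently slow exploration decay and decreasing step sizes, the two players typically settle on a matching action pair in this toy example; the paper's contribution is precisely the characterisation of how such exploration and estimation schedules must behave for convergence to be guaranteed.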