
Electronic data

  • SayinEtAl2021_NeurIPS

    Accepted author manuscript, 813 KB, PDF document

  • SayinEtAl2021_NeurIPS_supplemental

    Accepted author manuscript, 593 KB, PDF document


Decentralized Q-learning in Zero-sum Markov Games

Research output: Contribution to conference - Without ISBN/ISSN › Conference paper › peer-review

Published

Standard

Decentralized Q-learning in Zero-sum Markov Games. / Sayin, Muhammed O.; Zhang, Kaiqing; Leslie, David et al.
2021. Paper presented at NeurIPS 2021.

Research output: Contribution to conference - Without ISBN/ISSN › Conference paper › peer-review

Harvard

Sayin, MO, Zhang, K, Leslie, D, Basar, T & Ozdaglar, A 2021, 'Decentralized Q-learning in Zero-sum Markov Games', Paper presented at NeurIPS 2021, 6/12/21 - 14/12/21. <https://papers.nips.cc/paper/2021/hash/985e9a46e10005356bbaf194249f6856-Abstract.html>

APA

Sayin, M. O., Zhang, K., Leslie, D., Basar, T., & Ozdaglar, A. (2021). Decentralized Q-learning in Zero-sum Markov Games. Paper presented at NeurIPS 2021. https://papers.nips.cc/paper/2021/hash/985e9a46e10005356bbaf194249f6856-Abstract.html

Vancouver

Sayin MO, Zhang K, Leslie D, Basar T, Ozdaglar A. Decentralized Q-learning in Zero-sum Markov Games. 2021. Paper presented at NeurIPS 2021.

Author

Sayin, Muhammed O. ; Zhang, Kaiqing ; Leslie, David et al. / Decentralized Q-learning in Zero-sum Markov Games. Paper presented at NeurIPS 2021. 35 p.

Bibtex

@conference{a195741482dd4f5a99a40c085a7d3323,
title = "Decentralized Q-learning in Zero-sum Markov Games",
abstract = "We study multi-agent reinforcement learning (MARL) in infinite-horizon discounted zero-sum Markov games. We focus on the practical but challenging setting of decentralized MARL, where agents make decisions without coordination by a centralized controller, but only based on their own payoffs and local actions executed. The agents need not observe the opponent's actions or payoffs, possibly being even oblivious to the presence of the opponent, nor be aware of the zero-sum structure of the underlying game, a setting also referred to as radically uncoupled in the literature of learning in games. In this paper, we develop a radically uncoupled Q-learning dynamics that is both rational and convergent: the learning dynamics converges to the best response to the opponent's strategy when the opponent follows an asymptotically stationary strategy; when both agents adopt the learning dynamics, they converge to the Nash equilibrium of the game. The key challenge in this decentralized setting is the non-stationarity of the environment from an agent's perspective, since both her own payoffs and the system evolution depend on the actions of other agents, and each agent adapts her policies simultaneously and independently. To address this issue, we develop a two-timescale learning dynamics where each agent updates her local Q-function and value function estimates concurrently, with the latter happening at a slower timescale.",
author = "Sayin, {Muhammed O.} and Kaiqing Zhang and David Leslie and Tamer Basar and Asuman Ozdaglar",
year = "2021",
month = dec,
day = "6",
language = "English",
note = "NeurIPS 2021 : Thirty-fifth Conference on Neural Information Processing Systems ; Conference date: 06-12-2021 Through 14-12-2021",
url = "https://nips.cc/",

}

RIS

TY - CONF

T1 - Decentralized Q-learning in Zero-sum Markov Games

AU - Sayin, Muhammed O.

AU - Zhang, Kaiqing

AU - Leslie, David

AU - Basar, Tamer

AU - Ozdaglar, Asuman

N1 - Conference code: 35th

PY - 2021/12/6

Y1 - 2021/12/6

N2 - We study multi-agent reinforcement learning (MARL) in infinite-horizon discounted zero-sum Markov games. We focus on the practical but challenging setting of decentralized MARL, where agents make decisions without coordination by a centralized controller, but only based on their own payoffs and local actions executed. The agents need not observe the opponent's actions or payoffs, possibly being even oblivious to the presence of the opponent, nor be aware of the zero-sum structure of the underlying game, a setting also referred to as radically uncoupled in the literature of learning in games. In this paper, we develop a radically uncoupled Q-learning dynamics that is both rational and convergent: the learning dynamics converges to the best response to the opponent's strategy when the opponent follows an asymptotically stationary strategy; when both agents adopt the learning dynamics, they converge to the Nash equilibrium of the game. The key challenge in this decentralized setting is the non-stationarity of the environment from an agent's perspective, since both her own payoffs and the system evolution depend on the actions of other agents, and each agent adapts her policies simultaneously and independently. To address this issue, we develop a two-timescale learning dynamics where each agent updates her local Q-function and value function estimates concurrently, with the latter happening at a slower timescale.

AB - We study multi-agent reinforcement learning (MARL) in infinite-horizon discounted zero-sum Markov games. We focus on the practical but challenging setting of decentralized MARL, where agents make decisions without coordination by a centralized controller, but only based on their own payoffs and local actions executed. The agents need not observe the opponent's actions or payoffs, possibly being even oblivious to the presence of the opponent, nor be aware of the zero-sum structure of the underlying game, a setting also referred to as radically uncoupled in the literature of learning in games. In this paper, we develop a radically uncoupled Q-learning dynamics that is both rational and convergent: the learning dynamics converges to the best response to the opponent's strategy when the opponent follows an asymptotically stationary strategy; when both agents adopt the learning dynamics, they converge to the Nash equilibrium of the game. The key challenge in this decentralized setting is the non-stationarity of the environment from an agent's perspective, since both her own payoffs and the system evolution depend on the actions of other agents, and each agent adapts her policies simultaneously and independently. To address this issue, we develop a two-timescale learning dynamics where each agent updates her local Q-function and value function estimates concurrently, with the latter happening at a slower timescale.

M3 - Conference paper

T2 - NeurIPS 2021

Y2 - 6 December 2021 through 14 December 2021

ER -
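
Illustrative sketch

The abstract describes a two-timescale, radically uncoupled learning dynamics: each agent keeps only a local Q-function q_i(s, a_i) and a value estimate v_i(s), plays a smoothed best response to its own Q-values, and updates the value estimate on a slower timescale, never observing the opponent's actions or payoffs. The Python sketch below illustrates that structure on a toy two-state zero-sum game. The softmax temperature, constant step sizes, payoff matrices, and transition rule are illustrative assumptions, not the paper's exact algorithm or step-size conditions.

import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 2, 2           # toy zero-sum Markov game
gamma = 0.8                          # discount factor
tau = 0.1                            # softmax (smoothed best response) temperature
alpha, beta = 0.05, 0.005            # fast (Q) and slow (value) step sizes (illustrative, constant)

# Stage payoffs for agent 0 (agent 1 receives the negative): matching pennies
# in state 0 and its mirrored version in state 1.
R = np.array([[[1., -1.], [-1., 1.]],
              [[-1., 1.], [1., -1.]]])

def next_state(s, a0, a1):
    # Illustrative transition rule: switch states when the agents' actions differ.
    return (s + (a0 != a1)) % n_states

def softmax(q):
    z = np.exp((q - q.max()) / tau)
    return z / z.sum()

# Local estimates: q[i][s, a_i] and v[i][s] for each agent i; no shared information.
q = [np.zeros((n_states, n_actions)) for _ in range(2)]
v = [np.zeros(n_states) for _ in range(2)]

s = 0
for t in range(100_000):
    # Each agent plays a smoothed best response to its own local Q-values only.
    pi = [softmax(q[i][s]) for i in range(2)]
    a = [rng.choice(n_actions, p=pi[i]) for i in range(2)]
    r = [R[s, a[0], a[1]], -R[s, a[0], a[1]]]        # zero-sum payoffs
    s_next = next_state(s, a[0], a[1])

    for i in range(2):
        # Fast timescale: local Q-update toward own payoff plus discounted
        # continuation value, using only the agent's own observations.
        q[i][s, a[i]] += alpha * (r[i] + gamma * v[i][s_next] - q[i][s, a[i]])
        # Slow timescale: value estimate tracks the smoothed value of the
        # local Q-function in the current state.
        v[i][s] += beta * (softmax(q[i][s]) @ q[i][s] - v[i][s])

    s = s_next

print("agent 0 policy per state:", [np.round(softmax(q[0][st]), 2) for st in range(n_states)])
print("agent 1 policy per state:", [np.round(softmax(q[1][st]), 2) for st in range(n_states)])

Each matching-pennies stage game has the uniform mixed strategy as its equilibrium, so the printed policies give a rough visual check of the dynamics; the paper's rationality and convergence guarantees apply to its own dynamics and conditions, not to this simplified sketch.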