Electronic data

  • SayinEtAl2021_NeurIPS

    Accepted author manuscript, 813 KB, PDF document

  • SayinEtAl2021_NeurIPS_supplemental

    Accepted author manuscript, 593 KB, PDF document

Decentralized Q-learning in Zero-sum Markov Games

Research output: Contribution to conference - Without ISBN/ISSN › Conference paper › peer-review

Published
  • Muhammed O. Sayin
  • Kaiqing Zhang
  • David Leslie
  • Tamer Basar
  • Asuman Ozdaglar
Publication date: 6/12/2021
Number of pages: 35
Original language: English
Event: NeurIPS 2021: Thirty-fifth Conference on Neural Information Processing Systems - Virtual
Duration: 6/12/2021 to 14/12/2021
Conference number: 35th
https://nips.cc/

Conference

Conference: NeurIPS 2021
Period: 6/12/21 to 14/12/21

Abstract

We study multi-agent reinforcement learning (MARL) in infinite-horizon discounted zero-sum Markov games. We focus on the practical but challenging setting of decentralized MARL, where agents make decisions without coordination by a centralized controller, but only based on their own payoffs and local actions executed. The agents need not observe the opponent's actions or payoffs, possibly being even oblivious to the presence of the opponent, nor be aware of the zero-sum structure of the underlying game, a setting also referred to as radically uncoupled in the literature of learning in games. In this paper, we develop a radically uncoupled Q-learning dynamics that is both rational and convergent: the learning dynamics converges to the best response to the opponent's strategy when the opponent follows an asymptotically stationary strategy; when both agents adopt the learning dynamics, they converge to the Nash equilibrium of the game. The key challenge in this decentralized setting is the non-stationarity of the environment from an agent's perspective, since both her own payoffs and the system evolution depend on the actions of other agents, and each agent adapts her policies simultaneously and independently. To address this issue, we develop a two-timescale learning dynamics where each agent updates her local Q-function and value function estimates concurrently, with the latter happening at a slower timescale.
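The two-timescale idea in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the class name `TwoTimescaleAgent`, the softmax policy with a fixed temperature, the step-size schedules, and the single-state matching-pennies game in `run` are all assumptions chosen for illustration. The sketch only shows the structure described above: each agent sees its own state, action, and payoff, updates a local Q-function estimate on a fast step size, and updates a value function estimate on a slower one.

```python
import math
import random


class TwoTimescaleAgent:
    """Hypothetical agent that sees only its own state, action, and payoff."""

    def __init__(self, n_states, n_actions, gamma=0.9, temperature=0.5, seed=0):
        self.gamma = gamma
        self.tau = temperature  # assumed fixed softmax temperature
        self.q = [[0.0] * n_actions for _ in range(n_states)]  # local Q estimate
        self.v = [0.0] * n_states                              # value estimate
        self.visits = [0] * n_states
        self.rng = random.Random(seed)

    def policy(self, s):
        # Softmax over the local Q-values (an assumed exploration rule).
        m = max(self.q[s])
        weights = [math.exp((x - m) / self.tau) for x in self.q[s]]
        total = sum(weights)
        return [w / total for w in weights]

    def act(self, s):
        probs = self.policy(s)
        return self.rng.choices(range(len(probs)), weights=probs)[0]

    def update(self, s, a, reward, s_next):
        self.visits[s] += 1
        n = self.visits[s]
        alpha = 1.0 / n ** 0.6  # fast step size (assumed schedule)
        beta = 1.0 / n          # slow step size; beta / alpha -> 0 as n grows

        # Fast timescale: local Q update using the agent's own payoff only.
        target = reward + self.gamma * self.v[s_next]
        self.q[s][a] += alpha * (target - self.q[s][a])

        # Slow timescale: value estimate tracks the expected local Q-value.
        probs = self.policy(s)
        expected_q = sum(p * x for p, x in zip(probs, self.q[s]))
        self.v[s] += beta * (expected_q - self.v[s])


def run(steps=2000):
    """Two agents play a single-state matching-pennies game (toy example)."""
    agent1 = TwoTimescaleAgent(1, 2, seed=1)
    agent2 = TwoTimescaleAgent(1, 2, seed=2)
    for _ in range(steps):
        x, y = agent1.act(0), agent2.act(0)
        r = 1.0 if x == y else -1.0
        # Each agent receives only its own payoff; neither sees the other's action.
        agent1.update(0, x, r, 0)
        agent2.update(0, y, -r, 0)
    return agent1, agent2
```

Calling `run()` returns both agents, whose `policy(0)` and `v[0]` can then be inspected. The sketch makes no claim about convergence; the guarantees stated in the abstract apply to the authors' dynamics, not to this simplified version.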