
Electronic data

  • BravoLeslieMertikopoulosNIPS2018.pdf

    Accepted author manuscript, 308 KB, PDF document

    Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License


Bandit learning in concave N-player games

Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review

Published
Publication date: 2/12/2018
Host publication: NeurIPS Proceedings
Pages: 5661-5671
Number of pages: 11
Original language: English
Event: Thirty-second Conference on Neural Information Processing Systems - Palais des Congres de Montreal, Montreal, Canada
Duration: 2/12/2018 - 8/12/2018
https://nips.cc/Conferences/2018

Conference

Conference: Thirty-second Conference on Neural Information Processing Systems
Abbreviated title: NeurIPS 2018
Country/Territory: Canada
City: Montreal
Period: 2/12/18 - 8/12/18
Internet address: https://nips.cc/Conferences/2018

Abstract

This paper examines the long-run behavior of learning with bandit feedback in non-cooperative concave games. The bandit framework accounts for extremely low-information environments where the agents may not even know they are playing a game; as such, the agents' most sensible choice in this setting would be to employ a no-regret learning algorithm. In general, this does not mean that the players' behavior stabilizes in the long run: no-regret learning may lead to cycles, even with perfect gradient information. However, if a standard monotonicity condition is satisfied, our analysis shows that no-regret learning based on mirror descent with bandit feedback converges to Nash equilibrium with probability 1. We also derive an upper bound for the convergence rate of the process that nearly matches the best attainable rate for single-agent bandit stochastic optimization.
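The learning scheme the abstract describes, mirror descent driven only by realized payoffs, can be sketched roughly as follows. This is a minimal illustration rather than the paper's exact construction: the two-player quadratic game, the one-point gradient estimator's scaling, and the step-size and perturbation schedules are all assumptions chosen for readability.

```python
# Minimal sketch of no-regret learning with bandit (payoff-only) feedback:
# each player perturbs its action, observes a single payoff, forms a one-point
# gradient estimate, and takes a mirror-descent step (Euclidean mirror map here,
# i.e. projected gradient ascent). The game, schedules and estimator scaling
# are illustrative assumptions, not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

def payoff(i, x):
    # Assumed 2-player concave game whose pseudo-gradient is strictly monotone,
    # so it has a unique Nash equilibrium (roughly x* ~ (0.37, -0.29)).
    if i == 0:
        return -(x[0] - 0.3) ** 2 - 0.5 * x[0] * x[1]
    return -(x[1] + 0.2) ** 2 - 0.5 * x[0] * x[1]

def project(a, lo, hi):
    # Euclidean projection onto the action interval [lo, hi], per coordinate.
    return np.clip(a, lo, hi)

x = np.zeros(2)          # current action profile, one scalar action per player
T = 20_000

for t in range(1, T + 1):
    eta = 1.0 / t ** 0.75    # step size (assumed schedule)
    delta = 1.0 / t ** 0.25  # perturbation radius, shrinking over time

    # Each player draws a direction on its unit sphere ({-1, +1} in one
    # dimension) and plays a feasible perturbed action.
    u = rng.choice([-1.0, 1.0], size=2)
    x_base = project(x, -1.0 + delta, 1.0 - delta)
    x_play = x_base + delta * u

    # One-point bandit gradient estimate: (dim / delta) * observed payoff * direction.
    g_hat = np.array([(1.0 / delta) * payoff(i, x_play) * u[i] for i in range(2)])

    # Mirror-descent step on the estimate (projected ascent under the Euclidean mirror map).
    x = project(x + eta * g_hat, -1.0, 1.0)

print("long-run action profile:", x)
```

Because this toy game satisfies the strict monotonicity condition the abstract refers to, the paper's result is that such payoff-only trajectories converge to the unique Nash equilibrium with probability 1.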