Accepted author manuscript, 4.11 MB, PDF document
Available under license: CC BY-NC: Creative Commons Attribution-NonCommercial 4.0 International License
Final published version
Research output: Contribution in Book/Report/Proceedings - With ISBN/ISSN › Conference contribution/Paper › peer-review
TY - GEN
T1 - Bandits with Delayed, Aggregated Anonymous Feedback
AU - Pike-Burke, Ciara
AU - Agrawal, Shipra
AU - Szepesvári, Csaba
AU - Grünewälder, Steffen
PY - 2018/7/15
Y1 - 2018/7/15
N2 - We study a variant of the stochastic K-armed bandit problem, which we call “bandits with delayed, aggregated anonymous feedback”. In this problem, when the player pulls an arm, a reward is generated; however, it is not immediately observed. Instead, at the end of each round the player observes only the sum of a number of previously generated rewards which happen to arrive in the given round. The rewards are stochastically delayed and, due to the aggregated nature of the observations, the information of which arm led to a particular reward is lost. The question is: what is the cost of the information loss due to this delayed, aggregated anonymous feedback? Previous works have studied bandits with stochastic, non-anonymous delays and found that the regret increases only by an additive factor relating to the expected delay. In this paper, we show that this additive regret increase can be maintained in the harder delayed, aggregated anonymous feedback setting when the expected delay (or a bound on it) is known. We provide an algorithm that matches the worst-case regret of the non-anonymous problem exactly when the delays are bounded, and up to logarithmic factors or an additive variance term for unbounded delays.
AB - We study a variant of the stochastic K-armed bandit problem, which we call “bandits with delayed, aggregated anonymous feedback”. In this problem, when the player pulls an arm, a reward is generated; however, it is not immediately observed. Instead, at the end of each round the player observes only the sum of a number of previously generated rewards which happen to arrive in the given round. The rewards are stochastically delayed and, due to the aggregated nature of the observations, the information of which arm led to a particular reward is lost. The question is: what is the cost of the information loss due to this delayed, aggregated anonymous feedback? Previous works have studied bandits with stochastic, non-anonymous delays and found that the regret increases only by an additive factor relating to the expected delay. In this paper, we show that this additive regret increase can be maintained in the harder delayed, aggregated anonymous feedback setting when the expected delay (or a bound on it) is known. We provide an algorithm that matches the worst-case regret of the non-anonymous problem exactly when the delays are bounded, and up to logarithmic factors or an additive variance term for unbounded delays.
M3 - Conference contribution/Paper
T3 - Proceedings of Machine Learning Research
SP - 4105
EP - 4123
BT - Proceedings of the International Conference on Machine Learning, 10-15 July 2018, Stockholmsmässan, Stockholm, Sweden
A2 - Dy, Jennifer
PB - PMLR
ER -