Affiliation:
1. Cornell University, Ithaca, NY, USA
Abstract
We present an efficient algorithm for model-free episodic reinforcement learning on large (potentially continuous) state-action spaces. Our algorithm is based on a novel Q-learning policy with adaptive, data-driven discretization. The central idea is to maintain a finer partition of the state-action space in regions that are frequently visited in historical trajectories and have higher payoff estimates. We demonstrate how our adaptive partitions take advantage of the shape of the optimal Q-function and the joint space without sacrificing worst-case performance. In particular, we recover the regret guarantees of prior algorithms for continuous state-action spaces, which additionally require an optimal discretization as input and/or access to a simulation oracle. Moreover, experiments demonstrate how our algorithm automatically adapts to the underlying structure of the problem, resulting in much better performance compared both to heuristics and to Q-learning with uniform discretization.
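To make the partitioning idea concrete, the following is a minimal, hypothetical sketch of adaptive discretization in a simplified one-step (bandit-style) setting over a one-dimensional action space. The cell structure, splitting threshold, and learning rate are illustrative assumptions, not the constants or the full episodic algorithm from the paper: a cell is subdivided only once its visit count is large relative to its size, so resolution concentrates in frequently visited, high-payoff regions.

import random
from dataclasses import dataclass

@dataclass
class Cell:
    """One cell of the adaptive partition over the action interval [lo, hi)."""
    lo: float
    hi: float
    q: float = 1.0      # optimistic initial payoff estimate
    visits: int = 0

    def width(self) -> float:
        return self.hi - self.lo

class AdaptiveQPartition:
    """Keep a partition of the space, refine cells only where data accumulates,
    and act greedily with respect to the per-cell estimates."""

    def __init__(self) -> None:
        self.cells = [Cell(0.0, 1.0)]

    def select(self) -> Cell:
        # Greedy choice; exploration comes from the optimistic initial values.
        return max(self.cells, key=lambda c: c.q)

    def update(self, cell: Cell, reward: float) -> None:
        cell.visits += 1
        step = 1.0 / cell.visits              # simple averaging rate (illustrative)
        cell.q += step * (reward - cell.q)
        # Split a cell once its visit count is large relative to its size, so
        # finer resolution is spent only on frequently visited, promising regions.
        if cell.visits >= (1.0 / cell.width()) ** 2:
            self._split(cell)

    def _split(self, cell: Cell) -> None:
        mid = (cell.lo + cell.hi) / 2.0
        self.cells.remove(cell)
        self.cells.append(Cell(cell.lo, mid, q=cell.q))
        self.cells.append(Cell(mid, cell.hi, q=cell.q))

if __name__ == "__main__":
    # Toy usage: noisy reward peaked at action 0.7; the partition should end
    # up finest near the peak and coarse elsewhere.
    part = AdaptiveQPartition()
    for _ in range(2000):
        cell = part.select()
        action = random.uniform(cell.lo, cell.hi)
        reward = max(0.0, 1.0 - 4.0 * abs(action - 0.7)) + random.gauss(0.0, 0.1)
        part.update(cell, reward)
    finest = min(part.cells, key=lambda c: c.width())
    print(f"{len(part.cells)} cells; finest cell [{finest.lo:.3f}, {finest.hi:.3f})")

Running this toy loop typically leaves the cells near the reward peak several times narrower than those far from it, which mirrors the qualitative claim in the abstract that the discretization adapts to the structure of the problem.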
Publisher
Association for Computing Machinery (ACM)
Cited by: 12 articles.