1. Richard Bellman. “A Markovian Decision Process”. In: Journal of Mathematics and Mechanics 6.5 (1957), pp. 679–684.
2. Richard Bellman. Dynamic Programming. 1st ed. Princeton, NJ, USA: Princeton University Press, 1957.
3. John C. Gittins. “Bandit processes and dynamic allocation indices”. In: Journal of the Royal Statistical Society. Series B (Methodological) 41.2 (1979), pp. 148–177.
4. Ronald A. Howard. Dynamic Programming and Markov Processes. Technology Press and Wiley, 1960.
5. Thomas Jaksch, Ronald Ortner, and Peter Auer. “Near-optimal regret bounds for reinforcement learning”. In: Journal of Machine Learning Research 11 (2010), pp. 1563–1600.