1. Agrawal S, Goyal N (2012) Analysis of Thompson sampling for the multi-armed bandit problem. In: Mannor S, Srebro N, Williamson RC (eds) Proceedings of the 25th annual conference on learning theory, proceedings of machine learning research, vol 23. PMLR, Edinburgh, Scotland, pp 39.1–39.26. http://proceedings.mlr.press/v23/agrawal12.html
2. Agrawal S, Goyal N (2013) Further optimal regret bounds for Thompson sampling. In: Proceedings of the sixteenth international conference on artificial intelligence and statistics (AISTATS). https://www.microsoft.com/en-us/research/publication/further-optimal-regret-bounds-for-thompson-sampling/
3. Allenberg C, Auer P, Györfi L, Ottucsák G (2006) Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In: Algorithmic learning theory: 17th international conference, ALT 2006, Barcelona, Spain, October 7–10, 2006, proceedings, lecture notes in computer science. Springer, Berlin, pp 229–243. https://books.google.com/books?id=lsmpCAAAQBAJ
4. Audibert JY, Munos R, Szepesvári C (2009) Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theor Comput Sci 410(19):1876–1902
5. Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Mach Learn 47(2–3):235–256