Affiliation:
1. Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci, 32, Milan, Italy
Abstract
In this paper, we provide a unified presentation of the Configurable Markov Decision Process (Conf-MDP) framework. A Conf-MDP is an extension of the traditional Markov Decision Process (MDP) that models the possibility of configuring some environmental parameters. This configuration activity can be carried out by the learning agent itself or by an external configurator. We introduce a general definition of Conf-MDP, and then particularize it for the cooperative setting, in which the configuration is fully functional to the agent's goals, and for the non-cooperative setting, in which the agent and the configurator may have different interests. For both settings, we propose suitable solution concepts. Furthermore, we illustrate how to extend the traditional MDP value functions and Bellman operators to this new framework.
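To make the cooperative setting concrete, here is a minimal sketch of how a finite Conf-MDP could be solved by jointly choosing a configuration and a policy: run standard value iteration under each candidate configuration, then keep the configuration with the best optimal value. All names, shapes, and the toy random model below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical toy Conf-MDP: 2 states, 2 actions, 2 environment configurations.
# P[c, s, a, t] is the transition probability s -> t under action a and
# configuration c; R[c, s, a] is the reward. Both are randomly generated here.
n_states, n_actions, n_configs = 2, 2, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_configs, n_states, n_actions))
R = rng.uniform(size=(n_configs, n_states, n_actions))

def value_iteration(Pc, Rc, gamma=0.9, tol=1e-10):
    """Optimal state values for the MDP induced by one fixed configuration."""
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q[s, a] = R[s, a] + gamma * sum_t P[s, a, t] V[t]
        Q = Rc + gamma * Pc @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Cooperative solution concept (sketch): the configuration is chosen to
# maximize the agent's own optimal value, here summed over states.
values = [value_iteration(P[c], R[c]) for c in range(n_configs)]
best_c = max(range(n_configs), key=lambda c: values[c].sum())
```

This treats the configuration as fixed over the whole interaction; richer formulations in which the configuration itself is part of the decision process require the extended value functions and Bellman operators discussed in the paper.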