Reinforcement Learning and Dynamical Systems

Author:

,Lindenberg BjörnORCID,

Abstract

This thesis concerns reinforcement learning and dynamical systems in finite discrete problem domains. Artificial intelligence studies through reinforcement learning involves developing models and algorithms for scenarios when there is an agent that is interacting with an environment. By taking actions the agent may induce changes in the observed environment, where a modeled reward system reinforces correct behavior through learning algorithms. Reinforcement learning can be used in a wide variety of different domains, such as finance, robotics, games, routing and health care. However as the subject matures there is an increasing need to more heavily rely on advanced concepts in mathematics and deep learning to further our understanding of existing problems and find new algorithmic insights. Discrete dynamical systems arise in virtually any setting as soon as there is a set of elements subjected to iteration by a defining function. The function may be seen to represent the passing of time or to define the rules for state transitions. If the set of elements is finite but very large then we may find applications in several different fields such as operations research, cryptography and biology, where understanding properties of the structure and long-term behavior without explicit enumeration is key. In Paper I we extend the model of categorical reinforcement learning with a group-aided training procedure involving multiple agents. By having the agents learn through shared distributional information but act independently we argue for an accelerated learning process. We empirically show that the procedure may lead to much more robust learning, stronger individual agent performance and good ensemble efficiency. In Paper II we continue to build upon distributional reinforcement learning for finite Markov processes. A common approach among algorithms is to apply transformations on agent returns for stability and flexibility over a variety of different tasks. We show that one of the most successful methods may not work for a stochastic process. As a solution we introduce a new distributional operator that handles a large class of transformations with guaranteed theoretical convergence. We also propose an approximating single-actor algorithm based on these novel insights, which when tested achieves state-of-the-art performance compared to similar algorithms. In Paper III we focus on the issue of efficient exploration in reinforcement learning by studying the regret that a learning algorithm might have versus an optimal policy. Specifically, in the paper we derive a Bayesian regret bound for Thompson sampling on linear multi-armed bandits with Gaussian reward models when we have environment uncertainty described by a set of multivariate normal-gamma distributions. In Paper IV we derive sharp bounds on the number of iterations needed for any linear finite dynamical system to stabilize on an inner set of cycles. These bounds may be used in cycle analysis and criterion tests to understand the long-term behavior of extremely large systems.

Publisher

Linnaeus University

Reference65 articles.

1. Adrià Puigdomènech Badia, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan Daniel Guo, and Charles Blundell. Agent57: Outperforming the atari human benchmark. In International conference on machine learning, pages 507-517. PMLR, 2020.

2. Alan Mathison Turing. Mind. Mind, 59(236):433-460, 1950. [TVR97] J.N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

3. Aleksandrs Slivkins et al. Introduction to multi-armed bandits. Foundations and Trends® in Machine Learning, 12(1-2):1-286, 2019.

4. Ashwin Rao and Tikhon Jelvis. Foundations of Reinforcement Learning with Applications in Finance. CRC Press, 2022.

5. Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Combinatorial cascading bandits. Advances in Neural Information Processing Systems, 28, 2015.

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3