Learning the Optimal Control for Evolving Systems with Converging Dynamics

Authors:

Liu Qingsong¹, Fang Zhixuan²

Affiliations:

1. Tsinghua University & Shanghai Qi Zhi Institute, Beijing, China

2. Tsinghua University & Shanghai Qi Zhi Institute, Beijing, China

Abstract

We consider a principal (controller) that picks actions from a fixed action set to steer an evolving system. The actions are interpreted as different configurations or policies. We focus on systems with converging dynamics: if the principal holds the same action, the system asymptotically converges (possibly after a significant amount of time) to a unique stable state determined by that action. This phenomenon arises in diverse domains such as epidemic control, computing systems, and markets. In our model, the dynamics of the system are unknown to the principal, who receives only (possibly noisy) bandit feedback on the impact of its actions. The principal aims to learn which stable state yields the highest reward while adhering to given constraints (i.e., the optimal stable state) and to drive the system into this state as quickly as possible. A unique challenge in our model is that the principal has no prior knowledge of the stable state of each action, yet waiting for the system to converge to a suboptimal stable state costs valuable time. We measure the principal's performance in terms of regret and constraint violation. When the action set is finite, we propose a novel algorithm, termed Optimistic-Pessimistic Convergence and Confidence Bounds (OP-C2B), that learns to switch actions quickly when an action is not worth holding until its stable state is reached. This is enabled by employing "convergence bounds" to determine how far the system is from the stable states, and by choosing actions through a pessimistic assessment of the set of feasible actions while acting optimistically within this set. We establish that OP-C2B ensures sublinear regret and constraint violation simultaneously; in particular, it achieves logarithmic regret and constraint violation when the system convergence rate is linear or superlinear. Furthermore, we generalize OP-C2B to the case of an infinite action set and demonstrate that it maintains sublinear regret and constraint violation. We finally present two game control problems, mobile crowdsensing and resource allocation, that our model can address.
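The abstract describes OP-C2B only at a high level. As a rough illustration, the sketch below shows one plausible way its two ingredients could fit together for a finite action set: a per-action "convergence bound" that shrinks the longer an action is held, combined with an optimistic reward estimate maximized over a pessimistically estimated feasible set. Everything here is an assumption for illustration: `OPC2BSketch`, the scalar constraint threshold `budget`, the geometric rate `gamma`, and the Hoeffding-style confidence radius are hypothetical choices, not the paper's actual pseudocode or bounds.

```python
import math

# Illustrative sketch of an optimistic-pessimistic selection rule with a
# crude convergence bound -- NOT the paper's actual OP-C2B algorithm.
# Assumes a finite action set, a single scalar constraint with threshold
# `budget`, and a known geometric convergence rate `gamma`.
class OPC2BSketch:
    def __init__(self, n_actions: int, budget: float, gamma: float = 0.9):
        self.n = n_actions
        self.budget = budget              # constraint threshold (assumed scalar)
        self.gamma = gamma                # assumed geometric convergence rate
        self.counts = [0] * n_actions
        self.reward_sums = [0.0] * n_actions
        self.cost_sums = [0.0] * n_actions
        self.t = 0

    def select_action(self) -> int:
        # Play each action at least once before forming bounds.
        for a in range(self.n):
            if self.counts[a] == 0:
                return a
        ucb_reward, ucb_cost, lcb_cost = [], [], []
        for a in range(self.n):
            mean_r = self.reward_sums[a] / self.counts[a]
            mean_c = self.cost_sums[a] / self.counts[a]
            # Statistical confidence radius plus a convergence bound
            # gamma^counts[a] on the transient (non-stable-state) error.
            slack = (math.sqrt(2.0 * math.log(self.t + 1) / self.counts[a])
                     + self.gamma ** self.counts[a])
            ucb_reward.append(mean_r + slack)  # optimistic reward estimate
            ucb_cost.append(mean_c + slack)    # pessimistic cost estimate
            lcb_cost.append(mean_c - slack)
        # Pessimistic feasible set: keep only actions whose worst-case
        # cost estimate still satisfies the budget.
        feasible = [a for a in range(self.n) if ucb_cost[a] <= self.budget]
        if not feasible:
            # Fallback: the action most plausibly feasible.
            return min(range(self.n), key=lambda a: lcb_cost[a])
        # Act optimistically within the pessimistic feasible set.
        return max(feasible, key=lambda a: ucb_reward[a])

    def update(self, action: int, reward: float, cost: float) -> None:
        self.t += 1
        self.counts[action] += 1
        self.reward_sums[action] += reward
        self.cost_sums[action] += cost
```

In a toy environment where holding action a for h consecutive rounds yields noisy reward mu_a + gamma^h (r0_a - mu_a), the gamma^counts term inflates the bounds of rarely held actions and decays as an action is held, so the rule can abandon an action whose optimistic value falls behind before full convergence. This mirrors, in spirit, the early-switching behavior the abstract attributes to OP-C2B.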

Funder

Tsinghua University

Publisher

Association for Computing Machinery (ACM)

