Affiliation:
1. Tsinghua University & Shanghai Qi Zhi Institute, Beijing, China
2. Tsinghua University & Shanghai Qi Zhi Institute, Beijing, China
Abstract
We consider a principal or controller that can pick actions from a fixed action set to control an evolving system with converging dynamics. The actions are interpreted as different configurations or policies. By converging dynamics we mean that, if the principal holds the same action, the system asymptotically converges (possibly after a significant amount of time) to a unique stable state determined by this action. This phenomenon can be observed in diverse domains such as epidemic control, computing systems, and markets. In our model, the dynamics of the system are unknown to the principal, who receives only bandit feedback (possibly noisy) on the impact of the chosen actions. The principal aims to learn which stable state yields the highest reward while satisfying specific constraints (i.e., the optimal stable state) and to steer the system into this state as quickly as possible. A unique challenge in our model is that the principal has no prior knowledge of the stable state of each action, yet waiting for the system to converge to suboptimal stable states costs valuable time. We measure the principal's performance in terms of regret and constraint violation. For the case of a finite action set, we propose a novel algorithm, termed Optimistic-Pessimistic Convergence and Confidence Bounds (OP-C2B), that switches actions quickly when it is not worth waiting until the stable state is reached. This is enabled by employing "convergence bounds" to determine how far the system is from the stable states, and by choosing actions through a pessimistic assessment of the set of feasible actions while acting optimistically within this set. We establish that OP-C2B ensures sublinear regret and sublinear constraint violation simultaneously. In particular, OP-C2B achieves logarithmic regret and constraint violation when the system convergence rate is linear or superlinear.
Furthermore, we generalize OP-C2B to the case of an infinite action set and demonstrate that it maintains sublinear regret and constraint violation. Finally, we show that our model can address two game control problems: mobile crowdsensing and resource allocation.
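The selection rule described above (a pessimistic feasible set, optimistic choice within it, and intervals widened by a convergence bound) can be illustrated with a minimal sketch. This is not the OP-C2B algorithm itself, whose details the abstract does not give. The names `ActionStats`, `confidence_width`, `convergence_width`, and `select_action`, the UCB-style confidence term, the geometric convergence bound `c0 * rho**hold_time`, and the fallback used when no action is provably feasible are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ActionStats:
    reward_mean: float  # empirical reward estimate for this action
    cost_mean: float    # empirical constraint-cost estimate for this action
    pulls: int          # number of noisy observations collected
    hold_time: int      # consecutive rounds the action has been held


def confidence_width(pulls: int, t: int, sigma: float = 1.0) -> float:
    """Assumed UCB-style width accounting for observation noise."""
    if pulls == 0:
        return math.inf
    return sigma * math.sqrt(2.0 * math.log(max(t, 2)) / pulls)


def convergence_width(hold_time: int, c0: float = 1.0, rho: float = 0.5) -> float:
    """Assumed linear-rate bound on the distance to the stable state."""
    return c0 * rho ** hold_time


def select_action(stats: dict, t: int, cost_limit: float):
    """Pick an action: pessimistic on feasibility, optimistic on reward."""
    widths = {
        a: confidence_width(s.pulls, t) + convergence_width(s.hold_time)
        for a, s in stats.items()
    }
    # Pessimistic feasible set: even the upper bound on cost meets the limit.
    safe = [a for a, s in stats.items() if s.cost_mean + widths[a] <= cost_limit]
    if not safe:
        # Assumed fallback: try the action whose cost lower bound is smallest.
        return min(stats, key=lambda a: stats[a].cost_mean - widths[a])
    # Optimistic choice within the feasible set: largest upper bound on reward.
    return max(safe, key=lambda a: stats[a].reward_mean + widths[a])


if __name__ == "__main__":
    stats = {
        "a": ActionStats(reward_mean=0.9, cost_mean=0.8, pulls=100, hold_time=20),
        "b": ActionStats(reward_mean=0.5, cost_mean=0.1, pulls=100, hold_time=20),
        "c": ActionStats(reward_mean=0.7, cost_mean=0.2, pulls=100, hold_time=20),
    }
    print(select_action(stats, t=100, cost_limit=0.6))
```

In this toy instance, action `a` has the highest reward estimate but is excluded because its cost upper bound exceeds the limit, so the rule picks the best remaining action by reward upper bound. The convergence term shrinks as an action is held longer, which is what lets a rule of this kind discard an action before the system has fully settled.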
Publisher
Association for Computing Machinery (ACM)