Autonomous Orchestration of Distributed Discrete Event Simulations in the Presence of Resource Uncertainty

Published:2015-10-08 Issue:3 Volume:10 Page:1-20
ISSN:1556-4665
Container-title:ACM Transactions on Autonomous and Adaptive Systems
language:en
Short-container-title:ACM Trans. Auton. Adapt. Syst.

Author:

Sui Zhiquan¹, Malensek Matthew¹, Harvey Neil², Pallickara Shrideep¹

Affiliation:

1. Colorado State University, CO, USA

2. University of Guelph, Ontario, Canada

Abstract

Discrete event simulations model the behavior of complex, real-world systems. Simulating a wide range of events and conditions provides a more nuanced model, but also increases its computational footprint. To manage these processing requirements in a scalable manner, discrete event simulations can be distributed across multiple computing resources. Orchestrating the simulations in a distributed setting involves coping with resource uncertainty. We consider three key aspects of resource uncertainty: resource failures, heterogeneity, and slowdowns. Each of these aspects is managed autonomously, which involves making accurate predictions of future execution times and latencies while also accounting for differences in hardware capabilities and dynamic resource consumption profiles. Further complicating matters, individual tasks within the simulation are stateful and stochastic, requiring inter-task communication and synchronization to produce accurate outcomes. We deal with these challenges through intelligent state collection and migration, active resource monitoring, and empirical evaluation of resource capabilities under changing conditions. To underscore the viability of our solution, we provide benchmarks using a production discrete event simulation that can simultaneously sustain failures, manage resource heterogeneity, and handle slowdowns while being orchestrated by our framework.

Funder

US Department of Homeland Security's Long Range program

Publisher

Association for Computing Machinery (ACM)

Subject

Software, Computer Science (miscellaneous), Control and Systems Engineering

Link

https://dl.acm.org/doi/pdf/10.1145/2746345

Reference: 31 articles.

1. Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids

2. RAMS 2001: Current status and future directions

3. Augmenting the CAVE: An Initial Study into Close Focused, Inward Looking, Exploration in IPT Systems

4. Parallel and distributed simulation from many cores to the public cloud

Cited by 4 articles.

1. Uncertainty-aware Decisions in Cloud Computing; ACM Computing Surveys; 2022-05-31

2. Self-Adaptive Software Systems in Contested and Resource-Constrained Environments: Overview and Challenges; IEEE Access; 2021

3. Discrete-event simulation of a production process for increasing the efficiency of a newspaper production; IOP Conference Series: Materials Science and Engineering; 2019-06-07

4. Scalable network analytics for characterization of outbreak influence in voluminous epidemiology datasets; Concurrency and Computation: Practice and Experience; 2018-10-22