SoftRM

Published:2017-10-10 Issue:5s Volume:16 Page:1-19
ISSN:1539-9087
Container-title:ACM Transactions on Embedded Computing Systems
language:en
Short-container-title:ACM Trans. Embed. Comput. Syst.

Author:

Tsoutsouras Vasileios¹, Masouros Dimosthenis¹, Xydis Sotirios¹, Soudris Dimitrios¹

Affiliation:

1. National Technical University of Athens, Greece

Abstract

Many-core systems are envisioned to meet the ever-increasing demand for more powerful computing systems. To provide the necessary computing power, the number of Processing Elements integrated on-chip increases and NoC-based infrastructures are adopted to address the interconnection scalability. The advent of these new architectures surfaces the need for more sophisticated, distributed resource management paradigms, which, in addition to the extreme integration scaling, make the new systems more prone to errors manifested at both the hardware and software levels. In this work, we highlight the need for Run-Time Resource management to be enhanced with fault tolerance features and propose SoftRM, a resource management framework which can dynamically adapt to permanent failures in a self-organized, workload-aware manner. Self-organization allows the resource management agents to recover from a failure in a coordinated way by electing a new agent to replace the failed one, while workload awareness optimizes this choice according to the status of each core. We evaluate the proposed framework on Intel Single-chip Cloud Computer (SCC), a NoC-based many-core system, and customize it to achieve minimum interference on the resource allocation process. We showcase that its workload-aware features manage to utilize free resources in more than 90% of the conducted experiments. Comparison with relevant state-of-the-art fault-tolerant frameworks shows a decrease of up to 67% in the imposed overhead on application execution.

Funder

VINEYARD under H2020

E.C. funded programs AEGLE under H2020

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3126562

References: 40 articles.

1. Scaling towards kilo-core processors with asymmetric high-radix topologies

2. Distributed run-time resource management for malleable applications on many-core platforms

3. First Application of Lattice QCD to Pezy-SC Processor

4. A 5.8 pJ/Op 115 billion ops/sec, to 1.78 trillion ops/sec 32nm 1000-processor array

Cited by 3 articles.

1. A Survey of MPSoC Management toward Self-Awareness;Micromachines;2024-04-26

2. A Survey of Software-Defined Networks-on-Chip: Motivations, Challenges and Opportunities;Micromachines;2021-02-12

3. System management recovery in NoC-based many-core systems;Analog Integrated Circuits and Signal Processing;2020-03-12