
Rigel

Published: 2009-06-15 Issue: 3 Volume: 37 Page: 140-151
ISSN: 0163-5964
Container-title: ACM SIGARCH Computer Architecture News
Language: en
Short-container-title: SIGARCH Comput. Archit. News

Author:

Kelm John H.¹, Johnson Daniel R.¹, Johnson Matthew R.¹, Crago Neal C.¹, Tuohy William¹, Mahesri Aqeel¹, Lumetta Steven S.¹, Frank Matthew I.¹, Patel Sanjay J.¹

Affiliation:

1. University of Illinois, Urbana, IL, USA

Abstract

This paper considers Rigel, a programmable accelerator architecture for a broad class of data- and task-parallel computation. Rigel comprises 1000+ hierarchically-organized cores that use a fine-grained, dynamically scheduled single-program, multiple-data (SPMD) execution model. Rigel's low-level programming interface adopts a single global address space model where parallel work is expressed in a task-centric, bulk-synchronized manner using minimal hardware support. Compared to existing accelerators, which contain domain-specific hardware, specialized memories, and/or restrictive programming models, Rigel is more flexible and provides a straightforward target for a broader set of applications. We perform a design analysis of Rigel to quantify the compute density and power efficiency of our initial design. We find that Rigel can achieve a density of over 8 single-precision GFLOPS/mm² in 45nm, which is comparable to high-end GPUs scaled to 45nm. We perform experimental analysis on several applications ported to the Rigel low-level programming interface. We examine scalability issues related to work distribution, synchronization, and load-balancing for 1000-core accelerators using software techniques and minimal specialized hardware support. We find that while it is important to support fast task distribution and barrier operations, these operations can be implemented without specialized hardware using flexible hardware primitives.

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/1555815.1555774

References: 31 articles.

1. Design tradeoffs for tiled CMP on-chip networks

2. Scans as primitive parallel operations

3. Memory bandwidth limitations of future microprocessors

4. A hierarchical task queue organization for shared-memory multiprocessor systems

5. Poster reception---N-Body simulation on GPUs

Cited by 34 articles.

1. Mach-RT: A Many Chip Architecture for High Performance Ray Tracing; IEEE Transactions on Visualization and Computer Graphics; 2022-03-01

2. Monolithically Integrating Non-Volatile Main Memory over the Last-Level Cache; ACM Transactions on Architecture and Code Optimization; 2021-12-31

3. Energy-Efficient Hardware-Accelerated Synchronization for Shared-L1-Memory Multiprocessor Clusters; IEEE Transactions on Parallel and Distributed Systems; 2021-03-01

4. Ch’i: Scaling Microkernel Capabilities in Cache-Incoherent Systems; 2020 IEEE/ACM International Workshop on Runtime and Operating Systems for Supercomputers (ROSS); 2020-11

5. Transmuter; Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques; 2020-09-30