Accelerating Synchronization Using Moving Compute to Data Model at 1,000-core Multicore Scale

Published:2019-03-08 Issue:1 Volume:16 Page:1-27
ISSN:1544-3566
Container-title:ACM Transactions on Architecture and Code Optimization
language:en
Short-container-title:ACM Trans. Archit. Code Optim.

Author:

Dogan Halit¹, Ahmad Masab¹, Kahne Brian², Khan Omer³

Affiliation:

1. University of Connecticut, Storrs, Connecticut, USA

2. NXP Semiconductors, Austin, TX

3. University of Connecticut, Connecticut, USA

Abstract

Thread synchronization using the shared-memory hardware cache coherence paradigm is prevalent in multicore processors. However, as the number of cores on a chip increases, cache line ping-pong prevents performance scaling for algorithms that deploy fine-grain synchronization. This article proposes an in-hardware moving-computation-to-data model (MC) that pins shared data at dedicated cores. The critical code sections are serialized and executed at these cores in a spatial setting to enable data locality optimizations. In-hardware messages enable non-blocking and blocking communication between cores without involving the cache coherence protocol. The in-hardware MC model is implemented on the Tilera Tile-Gx72 multicore platform to evaluate scaling from 8 to 64 cores. A simulated RISC-V multicore environment is built to further evaluate the performance scaling advantages of the MC model at the 1,024-core scale. Evaluation with graph and machine-learning benchmarks shows that synchronization based on atomic instructions scales up to 512 cores, and that the MC model at the same core count outperforms it by 27% in completion time and 39% in dynamic energy consumption.
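
The idea in the abstract can be loosely illustrated in software. The following is a minimal sketch, not the paper's hardware mechanism: a Python thread stands in for a dedicated core that owns the shared data, and a `queue.Queue` stands in for the in-hardware messages. The names `OwnerCore`, `send_nonblocking`, `send_blocking`, and `run_demo` are hypothetical and chosen only for this illustration.

```python
import queue
import threading


class OwnerCore:
    """Software stand-in (an assumption) for a core that owns shared data."""

    def __init__(self):
        self._inbox = queue.Queue()    # stands in for the message channel
        self._state = {"counter": 0}   # shared data, touched only by the owner
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        # Critical sections run here one at a time, so no lock is needed.
        while True:
            func, reply = self._inbox.get()
            if func is None:           # shutdown message
                break
            result = func(self._state)
            if reply is not None:      # blocking senders wait on this queue
                reply.put(result)

    def send_nonblocking(self, func):
        """Ship a critical section to the owner and continue immediately."""
        self._inbox.put((func, None))

    def send_blocking(self, func):
        """Ship a critical section to the owner and wait for its result."""
        reply = queue.Queue(maxsize=1)
        self._inbox.put((func, reply))
        return reply.get()

    def shutdown(self):
        self._inbox.put((None, None))
        self._thread.join()


def increment(state):
    state["counter"] += 1


def read_counter(state):
    return state["counter"]


def run_demo(num_workers=8, updates_per_worker=1000):
    owner = OwnerCore()

    def worker():
        for _ in range(updates_per_worker):
            owner.send_nonblocking(increment)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The inbox is FIFO, so this blocking read is served after every increment.
    total = owner.send_blocking(read_counter)
    owner.shutdown()
    return total


if __name__ == "__main__":
    print(run_demo())
```

Because only the owner thread ever touches the counter, the workers need no locks or atomic operations. The sketch shows only the serialization and the two message styles; it says nothing about the performance or energy results reported above.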

Funder

National Science Foundation

Semiconductor Research Corporation

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3300208

Reference: 32 articles.

1. CRONO: A Benchmark Suite for Multithreaded Graph Algorithms Executing on Futuristic Multicores

2. A scalable processing-in-memory accelerator for parallel graph processing

3. R. Bayer and M. Schkolnick. 1988. Concurrency of Operations on B-trees. In Readings in Database Systems. Morgan Kaufmann Publishers Inc., San Francisco, CA, 129-139.

4. ImageNet: A large-scale hierarchical image database

Cited by 10 articles.

1. Characterization of Timing-based Software Side-channel Attacks and Mitigations on Network-on-Chip Hardware;ACM Journal on Emerging Technologies in Computing Systems;2023-06-21

2. MergePath-SpMM: Parallel Sparse Matrix-Matrix Algorithm for Graph Neural Network Acceleration;2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS);2023-04

3. Arbitrarily Parallelizable Code: A Model of Computation Evaluated on a Message-Passing Many-Core System;Computers;2022-11-18

4. Protecting On-Chip Data Access Against Timing-Based Side-Channel Attacks on Multicores;2022 IEEE International Symposium on Secure and Private Execution Environment Design (SEED);2022-09

5. SPAMeR: Speculative Push for Anticipated Message Requests in Multi-Core Systems;Proceedings of the 51st International Conference on Parallel Processing;2022-08-29