Multicore-based vector coprocessor sharing for performance and energy gains

Published:2013-09 Issue:2 Volume:13 Page:1-25
ISSN:1539-9087
Container-title:ACM Transactions on Embedded Computing Systems
Language:en
Short-container-title:ACM Trans. Embed. Comput. Syst.

Author:

Beldianu Spiridon F.¹, Ziavras Sotirios G.¹

Affiliation:

1. New Jersey Institute of Technology, Newark, NJ

Abstract

For most of the applications that make use of a dedicated vector coprocessor, its resources are not highly utilized due to the lack of sustained data parallelism, which often occurs due to vector-length variations in dynamic environments. The motivation of our work stems from: (a) the mandate for multicore designs to make efficient use of on-chip resources for low power and high performance; (b) the omnipresence of vector operations in high-performance scientific and emerging embedded applications; (c) the need to often handle a variety of vector sizes; and (d) the diverse computation needs of vector kernels in application suites. We present a robust design framework for vector coprocessor sharing in multicore environments that maximizes vector unit utilization and performance at substantially reduced energy costs. For our adaptive vector unit, which is attached to multiple cores, we propose three basic shared working policies that enforce coarse-grain, fine-grain, and vector-lane sharing. We benchmark these vector coprocessor sharing policies for a dual-core system and evaluate them using the floating-point performance, resource utilization, and power/energy consumption metrics. Benchmarking for FIR filtering, FFT, matrix multiplication, and LU factorization shows that these coprocessor sharing policies yield high utilization and performance with low energy costs. The proposed policies provide speedups of 1.2 to 2 and reduce the energy needs by about 50% as compared to a system having a single core with an attached vector coprocessor. With the performance expressed in clock cycles, the sharing policies demonstrate speedups of 3.62 to 7.92 compared to optimized Xeon runs. We also introduce performance and empirical power models that can be used by the runtime system to estimate the effectiveness of each policy in a hybrid system that can simultaneously implement this suite of shared coprocessor policies.

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Software

Link

https://dl.acm.org/doi/pdf/10.1145/2514641.2514644

Reference: 25 articles.

1. Scalar Processing Overhead on SIMD-Only Architectures

2. On-chip Vector Coprocessor Sharing for Multicores

3. VEGAS

4. Simultaneous multithreading: a platform for next-generation processors

Cited by 12 articles.

1. Spatzformer: An Efficient Reconfigurable Dual-Core RISC-V V Cluster for Mixed Scalar-Vector Workloads;2024 IEEE 35th International Conference on Application-specific Systems, Architectures and Processors (ASAP);2024-07-24

2. Occamy: Elastically Sharing a SIMD Co-processor across Multiple CPU Cores;Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3;2023-03-25

3. Vector Extensions in COTS Processors to Increase Guaranteed Performance in Real-Time Systems;ACM Transactions on Embedded Computing Systems;2023-01-24

4. A Hardware Pipeline with High Energy and Resource Efficiency for FMM Acceleration;ACM Transactions on Embedded Computing Systems;2018-03-31

5. Floating-point accelerator for biometric recognition on FPGA embedded systems;Journal of Parallel and Distributed Computing;2018-02