Scale-Out vs Scale-Up

Authors:

Reza Azimi¹, Tyler Fox¹, Wendy Gonzalez¹, Sherief Reda¹

Affiliation:

1. Brown University, Providence, RI, USA

Abstract

ARM 64-bit processing has generated enthusiasm for developing ARM-based servers targeted at both data centers and supercomputers. In addition to server-class components and hardware advancements, the ARM software environment has grown substantially over the past decade. Major development ecosystems and libraries have been ported and optimized to run on ARM, making it suitable for server-class workloads. There are two trends in available ARM SoCs: mobile-class SoCs that rely on the heterogeneous integration of a mix of CPU cores, GPGPU streaming multiprocessors (SMs), and other accelerators, and server-class SoCs that instead rely on integrating a larger number of CPU cores with no GPGPU support and a number of I/O accelerators. For scaling the number of processing cores, there are two different paradigms: mobile-class SoCs use a scale-out architecture in the form of a cluster of simpler systems connected over a network, while server-class ARM SoCs use the scale-up solution and leverage symmetric multiprocessing to pack a large number of cores on the chip. In this article, we present the ScaleSoC cluster, a scale-out solution based on mobile-class ARM SoCs. ScaleSoC leverages fast network connectivity and GPGPU acceleration to improve performance and energy efficiency compared to previous ARM scale-out clusters. We consider a wide range of modern server-class parallel workloads to study both scaling paradigms, including latency-sensitive transactional workloads, MPI-based CPU and GPGPU-accelerated scientific applications, and emerging artificial intelligence workloads. We study in depth the performance and energy efficiency of ScaleSoC compared to server-class ARM SoCs and discrete GPGPUs.
We quantify the network overhead on the performance of ScaleSoC and show that packing a large number of ARM cores on a single chip does not necessarily guarantee better performance, because shared resources, such as the last-level cache, become performance bottlenecks. We characterize the GPGPU-accelerated workloads and demonstrate that for applications that can leverage the better CPU-GPGPU balance of the ScaleSoC cluster, performance and energy efficiency improve compared to discrete GPGPUs.
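The scale-out vs scale-up trade-off summarized in the abstract can be sketched with a toy analytical model (a minimal illustration only, not the paper's methodology; the cost terms and overhead parameters below are hypothetical):

```python
import math

def scale_out_speedup(n_nodes, comm_frac):
    # Scale-out: each node computes 1/n of the work, but pays a
    # network-communication cost that grows with cluster size
    # (hypothetical logarithmic model, e.g. tree-structured collectives).
    t = 1.0 / n_nodes + comm_frac * math.log2(n_nodes)
    return 1.0 / t

def scale_up_speedup(n_cores, contention_frac):
    # Scale-up: cores share on-chip resources such as the last-level
    # cache, so each extra core adds a contention penalty
    # (hypothetical linear model).
    t = 1.0 / n_cores + contention_frac * (n_cores - 1) / n_cores
    return 1.0 / t

# With identical 1% overhead factors, neither paradigm wins by default;
# the balance depends on how each overhead grows with core count.
print(round(scale_out_speedup(16, 0.01), 2))  # 9.76
print(round(scale_up_speedup(16, 0.01), 2))   # 13.91
```

Under this sketch, which paradigm wins depends entirely on how communication cost scales with node count versus how shared-resource contention scales with core count, which is the empirical question the article studies.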

Funder

NSF

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture, Safety, Risk, Reliability and Quality, Media Technology, Information Systems, Software, Computer Science (miscellaneous)

Cited by 7 articles.

1. Full-Stack Optimizing Transformer Inference on ARM Many-Core CPU;IEEE Transactions on Parallel and Distributed Systems;2023-07

2. Task-aware Scheduling and Performance Optimization on Yitian710 SoC for GEMM-based Workloads on the Cloud;2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS);2023-06-11

3. Characterizing and Optimizing Transformer Inference on ARM Many-core Processor;Proceedings of the 51st International Conference on Parallel Processing;2022-08-29

4. NUMA-Aware DGEMM Based on 64-Bit ARMv8 Multicore Processors Architecture;Electronics;2021-08-17

5. Feasibility of image-based augmented reality guidance of total shoulder arthroplasty using microsoft HoloLens 1;Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization;2020-10-27

