Scale-Out vs Scale-Up

Authors:

Reza Azimi¹, Tyler Fox¹, Wendy Gonzalez¹, Sherief Reda¹

Affiliation:

1. Brown University, Providence, RI, USA

Abstract

ARM 64-bit processing has generated enthusiasm for developing ARM-based servers targeted at both data centers and supercomputers. In addition to server-class components and hardware advancements, the ARM software environment has grown substantially over the past decade. Major development ecosystems and libraries have been ported and optimized to run on ARM, making it suitable for server-class workloads. There are two trends in available ARM SoCs: mobile-class SoCs that rely on the heterogeneous integration of a mix of CPU cores, GPGPU streaming multiprocessors (SMs), and other accelerators, and server-class SoCs that instead rely on integrating a larger number of CPU cores with no GPGPU support and a number of I/O accelerators. For scaling the number of processing cores, there are two different paradigms: mobile-class SoCs use a scale-out architecture in the form of a cluster of simpler systems connected over a network, while server-class ARM SoCs use the scale-up solution and leverage symmetric multiprocessing to pack a large number of cores on the chip. In this article, we present the ScaleSoC cluster, a scale-out solution based on mobile-class ARM SoCs. ScaleSoC leverages fast network connectivity and GPGPU acceleration to improve performance and energy efficiency compared to previous ARM scale-out clusters. We consider a wide range of modern server-class parallel workloads to study both scaling paradigms, including latency-sensitive transactional workloads, MPI-based CPU and GPGPU-accelerated scientific applications, and emerging artificial intelligence workloads. We study in depth the performance and energy efficiency of ScaleSoC compared to server-class ARM SoCs and discrete GPGPUs.
We quantify the network overhead on the performance of ScaleSoC and show that packing a large number of ARM cores on a single chip does not necessarily guarantee better performance, because shared resources, such as the last-level cache, become performance bottlenecks. We characterize the GPGPU-accelerated workloads and demonstrate that for applications that can leverage the better CPU-GPGPU balance of the ScaleSoC cluster, performance and energy efficiency improve compared to discrete GPGPUs.
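The scale-out vs scale-up trade-off summarized in the abstract can be sketched with a toy analytical model (a minimal illustration only, not the paper's methodology; the cost terms and overhead parameters below are hypothetical):

```python
import math

def scale_out_speedup(n_nodes, comm_frac):
    # Scale-out: each node computes 1/n of the work, but pays a
    # network-communication cost that grows with cluster size
    # (hypothetical logarithmic model, e.g. tree-structured collectives).
    t = 1.0 / n_nodes + comm_frac * math.log2(n_nodes)
    return 1.0 / t

def scale_up_speedup(n_cores, contention_frac):
    # Scale-up: cores share on-chip resources such as the last-level
    # cache, so each extra core adds a contention penalty
    # (hypothetical linear model).
    t = 1.0 / n_cores + contention_frac * (n_cores - 1) / n_cores
    return 1.0 / t

# With identical 1% overhead factors, neither paradigm wins by default;
# the balance depends on how each overhead grows with core count.
print(round(scale_out_speedup(16, 0.01), 2))  # 9.76
print(round(scale_up_speedup(16, 0.01), 2))   # 13.91
```

Under this sketch, which paradigm wins depends entirely on how communication cost scales with node count versus how shared-resource contention scales with core count, which is the empirical question the article studies.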

Funder

NSF

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture, Safety, Risk, Reliability and Quality, Media Technology, Information Systems, Software, Computer Science (miscellaneous)

Cited by 7 articles.

1. Full-Stack Optimizing Transformer Inference on ARM Many-Core CPU;IEEE Transactions on Parallel and Distributed Systems;2023-07

2. Task-aware Scheduling and Performance Optimization on Yitian710 SoC for GEMM-based Workloads on the Cloud;2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS);2023-06-11

3. Characterizing and Optimizing Transformer Inference on ARM Many-core Processor;Proceedings of the 51st International Conference on Parallel Processing;2022-08-29

4. NUMA-Aware DGEMM Based on 64-Bit ARMv8 Multicore Processors Architecture;Electronics;2021-08-17

5. Feasibility of image-based augmented reality guidance of total shoulder arthroplasty using microsoft HoloLens 1;Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization;2020-10-27

