DeepFlow: A Cross-Stack Pathfinding Framework for Distributed AI Systems

Published: 2023-12-21
ISSN: 1084-4309
Container-title: ACM Transactions on Design Automation of Electronic Systems
Language: en
Short-container-title: ACM Trans. Des. Autom. Electron. Syst.

Author:

Ardalani Newsha¹, Pal Saptadeep², Gupta Puneet²

Affiliation:

1. Meta, Inc., USA

2. UCLA, USA

Abstract

Over the past decade, machine learning model complexity has grown at an extraordinary rate, as has the scale of the systems that train such large models. However, hardware utilization in large-scale AI systems is alarmingly low (5-20%). This low utilization is the cumulative effect of minor losses across different layers of the stack, exacerbated by the disconnect between engineers who design different layers and work in different industries. To address this challenge, we designed a cross-stack performance modelling and design space exploration framework. First, we introduce CrossFlow, a novel framework that enables cross-layer analysis all the way from the technology layer to the algorithmic layer. Next, we introduce DeepFlow (built on top of CrossFlow using machine learning techniques) to automate design space exploration and co-optimization across different layers of the stack. We have validated CrossFlow’s accuracy with distributed training on real commercial hardware, and we showcase several DeepFlow case studies demonstrating the pitfalls of not optimizing across the technology-hardware-software stack for what is likely the most important workload driving large development investments in all aspects of the computing stack.

Publisher

Association for Computing Machinery (ACM)

Subject

Electrical and Electronic Engineering, Computer Graphics and Computer-Aided Design, Computer Science Applications

Link

https://dl.acm.org/doi/pdf/10.1145/3635867

References (26 articles)

1. OpenAI. AI and Compute. https://openai.com/blog/ai-and-compute/. ([n. d.]).

2. Kunle Olukotun. 2020. Accelerating Software 2.0. ScaledML (2020).

3. Zhihao Jia, Matei Zaharia, and Alex Aiken. 2018. Beyond data and model parallelism for deep neural networks. arXiv preprint arXiv:1807.05358 (2018).

4. Amazon AWS Inferentia. Achieve 12x higher throughput and lowest latency for PyTorch Natural Language Processing applications out-of-the-box on AWS Inferentia. https://tinyurl.com/3mbuetmr. (accessed Sep 10, 2021).

5. Timeloop: A Systematic Approach to DNN Accelerator Evaluation

Cited by 4 articles.

1. Towards a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation;Proceedings of the 15th ACM SIGOPS Asia-Pacific Workshop on Systems;2024-09-04

2. System technology co-optimization for advanced integration;Nature Reviews Electrical Engineering;2024-09-02

3. Proof-of-Concept of a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation;Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing;2024-08-04

4. MAD-Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems;2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA);2024-06-29