SplitRPC: A {Control + Data} Path Splitting RPC Stack for ML Inference Serving

Authors:

Adithya Kumar (1), Anand Sivasubramaniam (2), Timothy Zhu (1)

Affiliations:

1. The Pennsylvania State University, University Park, USA

2. The Pennsylvania State University, University Park, USA

Abstract

The growing adoption of hardware accelerators, driven by their intelligent compiler and runtime counterparts, has democratized ML services and precipitously reduced their execution times. This motivates us to shift our attention to efficiently serving these ML services in distributed settings and to characterize the overheads imposed by the RPC mechanism (the 'RPC tax') when serving them on accelerators. RPC implementations designed over the years implicitly assume that the host CPU services the requests, and we focus on extending such work to accelerator-based services. While recent proposals calling for SmartNICs to take on this task are reasonable for simple kernels, serving complex ML models requires a more nuanced view to optimize both the data path and the control/orchestration of these accelerators. We program today's commodity network interface cards (NICs) to split the control and data paths, enabling effective transfer of control while efficiently moving the payload to the accelerator. As opposed to unified approaches that bundle these paths together and thereby limit the flexibility of each, we design and implement SplitRPC, a {control + data} path-optimizing RPC mechanism for ML inference serving. SplitRPC allows us to optimize the data path to the accelerator while the CPU retains full orchestration capabilities. We implement SplitRPC on both commodity NICs and SmartNICs and demonstrate how GPU-based ML services running different compiler/runtime systems can benefit. For a variety of ML models served using different inference runtimes, we show that SplitRPC minimizes the RPC tax while providing significant gains in throughput and latency over existing kernel-bypass approaches, without requiring expensive SmartNIC devices.
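
To make the split concrete, below is a minimal, purely illustrative Python sketch of the idea described in the abstract: the control header of an incoming request is parsed and kept on the CPU for orchestration, while the tensor payload is deposited directly into a (simulated) device buffer. The wire format, field names, and functions here are assumptions made for illustration only; they are not SplitRPC's actual protocol or NIC program.

    # Illustrative sketch only: separate an RPC message's control header
    # (handled on the CPU) from its tensor payload (destined for device
    # memory). The header layout below is an assumption, not SplitRPC's
    # actual wire format.
    import struct

    HEADER_FMT = "!IHH"              # request id, model id, flags (assumed layout)
    HEADER_SIZE = struct.calcsize(HEADER_FMT)

    def split_rpc_message(raw: bytes):
        """Split one incoming RPC message into (control, payload)."""
        req_id, model_id, flags = struct.unpack(HEADER_FMT, raw[:HEADER_SIZE])
        control = {"req_id": req_id, "model_id": model_id, "flags": flags}
        payload = raw[HEADER_SIZE:]  # in SplitRPC this would be DMA'd to the accelerator
        return control, payload

    def serve(raw: bytes, device_buffer: bytearray):
        control, payload = split_rpc_message(raw)
        # Control path: the CPU keeps full orchestration (scheduling, batching, etc.).
        # Data path: the payload lands directly in (simulated) device memory,
        # avoiding an extra bounce through host DRAM.
        device_buffer[:len(payload)] = payload
        return control

    if __name__ == "__main__":
        msg = struct.pack(HEADER_FMT, 42, 7, 0) + b"\x00" * 16  # fake tensor bytes
        buf = bytearray(64)
        print(serve(msg, buf))

In the real system the payload placement is handled by the NIC's DMA engine rather than a CPU-side copy; the sketch only shows where the control/data boundary falls.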

Funder

National Science Foundation

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications; Hardware and Architecture; Safety, Risk, Reliability and Quality; Computer Science (miscellaneous)
