Apache Nemo: A Framework for Optimizing Distributed Data Processing
Published: 2020-11-30
Issue: 3-4
Volume: 38
Page: 1-31
ISSN: 0734-2071
Container-title: ACM Transactions on Computer Systems
Language: en
Short-container-title: ACM Trans. Comput. Syst.
Author:
Song Won Wook (1),
Yang Youngseok (1),
Eo Jeongyoon (1),
Seo Jangho (2),
Kim Joo Yeon (3),
Lee Sanha (2),
Lee Gyewon (1),
Um Taegeon (1),
Cho Haeyoon (1),
Chun Byung-Gon (1)
Affiliation:
1. Seoul National University, Seoul, Rep. of Korea
2. Naver Corporation, Gyeonggi-do, Rep. of Korea
3. Samsung Electronics, Seoul, Rep. of Korea
Abstract
Optimizing scheduling and communication of distributed data processing for resource and data characteristics is crucial for achieving high performance. Existing approaches to such optimizations largely fall into two categories. First, distributed runtimes provide low-level policy interfaces to apply the optimizations, but do not ensure the maintenance of correct application semantics and thus often require significant effort to use. Second, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficient fine control.
We describe Apache Nemo, an optimization framework for distributed dataflow processing that provides fine control for high performance and also ensures correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation of dataflow, compiler optimization passes, and runtime extensions. Our evaluation results show that Nemo enables composable and reusable optimizations that bring performance improvements on par with existing specialized runtimes tailored for a specific deployment scenario. Apache Nemo is open-sourced at https://nemo.apache.org as an Apache incubator project.
Funder
Institute of Information & Communications Technology Planning & Evaluation
Korea government
BK21 FOUR Intelligence Computing
National Research Foundation of Korea
Publisher
Association for Computing Machinery (ACM)
Subject
General Computer Science
Cited by
2 articles.
1. Blaze: Holistic Caching for Iterative Data Processing. Proceedings of the Nineteenth European Conference on Computer Systems, 2024-04-22.
2. SWAN. Proceedings of the 13th ACM SIGOPS Asia-Pacific Workshop on Systems, 2022-08-23.