Affiliation:
1. National University of Singapore
2. ByteDance Inc.
Abstract
Stream processing is widely used for real-time data processing and decision-making, and tens of thousands of streaming jobs are deployed in the ByteDance cloud. Because these jobs usually run for several days or longer and their input workloads vary over time, they face diverse runtime issues such as processing lag and various failures. Resolving such issues automatically requires runtime management. However, designing a runtime management service at ByteDance scale is challenging. In particular, the service has to manage cluster-wide streaming jobs concurrently in a scalable and extensible manner, and it must also manage diverse streaming jobs effectively.
To this end, we propose StreamOps to enable cloud-native runtime management for streaming jobs in ByteDance. StreamOps has three main designs to address these challenges. 1) For scalability, StreamOps runs as a standalone lightweight control plane that manages cluster-wide streaming jobs. 2) To enable extensible runtime management, StreamOps abstracts control policies that identify and resolve runtime issues. New control policies can be implemented with a detect-diagnose-resolve programming paradigm, and each control policy is configurable for different streaming jobs according to their performance requirements. 3) To mitigate processing lag and handle failures effectively, StreamOps features three control policies, namely an auto-scaler, a straggler detector, and a job doctor, which are inspired by state-of-the-art research and production experience at ByteDance. In this paper, we introduce the design decisions we made and the lessons we learned from building StreamOps. We evaluate StreamOps in our production environment, and the experimental results further validate our system design.
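The detect-diagnose-resolve paradigm mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not show the StreamOps API, so every name below (`JobMetrics`, `ControlPolicy`, `LagAutoScaler`, the `max_lag` and `max_parallelism` parameters, and the parallelism-doubling rule) is a hypothetical stand-in chosen for illustration, not the authors' implementation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobMetrics:
    """Hypothetical snapshot of one streaming job's runtime metrics."""
    job_id: str
    lag_records: int   # input records not yet processed
    parallelism: int   # current number of parallel tasks


class ControlPolicy(ABC):
    """Hypothetical detect-diagnose-resolve interface (not the StreamOps API)."""

    @abstractmethod
    def detect(self, metrics: JobMetrics) -> bool:
        """Return True if the job shows a symptom this policy handles."""

    @abstractmethod
    def diagnose(self, metrics: JobMetrics) -> Optional[str]:
        """Return a root-cause label, or None if no actionable cause is found."""

    @abstractmethod
    def resolve(self, metrics: JobMetrics, cause: str) -> dict:
        """Return an action description for the control plane to apply."""

    def run(self, metrics: JobMetrics) -> Optional[dict]:
        # Chain the three phases; stop early when a phase finds nothing to do.
        if not self.detect(metrics):
            return None
        cause = self.diagnose(metrics)
        if cause is None:
            return None
        return self.resolve(metrics, cause)


class LagAutoScaler(ControlPolicy):
    """Toy auto-scaler: doubles parallelism when lag exceeds a per-job threshold."""

    def __init__(self, max_lag: int, max_parallelism: int):
        # Per-job configuration, standing in for "performance requirements".
        self.max_lag = max_lag
        self.max_parallelism = max_parallelism

    def detect(self, metrics: JobMetrics) -> bool:
        return metrics.lag_records > self.max_lag

    def diagnose(self, metrics: JobMetrics) -> Optional[str]:
        # Assumption: lag with headroom left is treated as under-provisioning.
        if metrics.parallelism < self.max_parallelism:
            return "under_provisioned"
        return None

    def resolve(self, metrics: JobMetrics, cause: str) -> dict:
        target = min(metrics.parallelism * 2, self.max_parallelism)
        return {"job_id": metrics.job_id, "action": "rescale", "parallelism": target}


if __name__ == "__main__":
    policy = LagAutoScaler(max_lag=1000, max_parallelism=8)
    print(policy.run(JobMetrics("job-a", lag_records=5000, parallelism=3)))
```

In this sketch a new policy only has to implement the three phase methods, and the constructor arguments play the role of per-job configuration. A real straggler detector or job doctor would need richer metrics and diagnosis logic than this toy example.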
Publisher
Association for Computing Machinery (ACM)
Subject
General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development