Fault-tolerance in the Borealis distributed stream processing system

Author:

Balazinska Magdalena (1), Balakrishnan Hari (2), Madden Samuel R. (2), Stonebraker Michael (2)

Affiliation:

1. University of Washington, Seattle, WA

2. Massachusetts Institute of Technology, Cambridge, MA

Abstract

Over the past few years, Stream Processing Engines (SPEs) have emerged as a new class of software systems, enabling low-latency processing of streams of data arriving at high rates. As SPEs mature and get used in monitoring applications that must run continuously (e.g., in network security monitoring), a significant challenge arises: SPEs must be able to handle various software and hardware faults that occur, masking them to provide high availability (HA). In this article, we develop, implement, and evaluate DPC (Delay, Process, and Correct), a protocol to handle crash failures of processing nodes and network failures in a distributed SPE. Like previous approaches to HA, DPC uses replication and masks many types of node and network failures. In the presence of network partitions, the designer of any replication system faces a choice between providing availability or data consistency across the replicas. In DPC, this choice is made explicit: the user specifies an availability bound (no result should be delayed by more than a specified delay threshold even under failure if the corresponding input is available), and DPC attempts to minimize the resulting inconsistency between replicas (not all of which might have seen the input data) while meeting the given delay threshold. Although conceptually simple, the DPC protocol tolerates the occurrence of multiple simultaneous failures as well as any further failures that occur during recovery. This article describes DPC and its implementation in the Borealis SPE. We show that DPC enables a distributed SPE to maintain low-latency processing at all times, while also achieving eventual consistency, where applications eventually receive the complete and correct output streams. Furthermore, we show that, independent of system size and failure location, it is possible to handle failures almost up to the user-specified bound in a manner that meets the required availability without introducing any inconsistency.
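The trade-off described in the abstract (delay a result up to the bound, then process it anyway, then correct it later) can be illustrated with a minimal sketch. This is not the DPC protocol as implemented in Borealis: the function `emit_plan`, its parameters, and the labels "stable", "tentative", and "correction" are hypothetical names chosen for illustration, time is an abstract integer clock, processing cost is ignored, and only a single tuple and a single failure interval are modeled.

```python
def emit_plan(arrival, failure_start, failure_end, delay_threshold):
    """Return the (time, label) emissions for one tuple under one failure.

    Hypothetical model, not the Borealis API:
      arrival          time the tuple becomes available at the node
      failure_start    time the upstream failure begins (inclusive)
      failure_end      time the failure heals (exclusive)
      delay_threshold  user-specified availability bound
    """
    # No failure in effect when the tuple arrives: emit a stable result at once.
    if not (failure_start <= arrival < failure_end):
        return [(arrival, "stable")]

    # Delay: hold the result back, but never past the availability bound.
    deadline = arrival + delay_threshold

    if failure_end <= deadline:
        # The failure healed within the bound, so the result is late
        # but stable and no inconsistency is introduced.
        return [(failure_end, "stable")]

    # Process: the bound is reached while the failure persists, so a
    # possibly incomplete (tentative) result is emitted at the deadline.
    # Correct: once the failure heals, the tentative result is replaced.
    return [(deadline, "tentative"), (failure_end, "correction")]


if __name__ == "__main__":
    print(emit_plan(0, 5, 10, 3))   # tuple arrives before the failure
    print(emit_plan(6, 5, 8, 3))    # failure shorter than the bound
    print(emit_plan(6, 5, 20, 3))   # failure longer than the bound
```

In this simplified model, a failure that heals before the deadline yields only stable output, which corresponds to the abstract's claim that failures lasting almost up to the bound can be handled without inconsistency. A longer failure yields a tentative result followed by a correction, which corresponds to eventual consistency.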

Funder

National Science Foundation

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems

Cited by 90 articles.

1. DIBA: A Re-Configurable Stream Processor;IEEE Transactions on Knowledge and Data Engineering;2024-09

2. Fault Tolerance Placement in the Internet of Things;Proceedings of the ACM on Management of Data;2024-05-29

3. Coordination-Free Replicated Datalog Streams with Application-Specific Availability;Lecture Notes in Computer Science;2024

4. A survey on the evolution of stream processing systems;The VLDB Journal;2023-11-22

5. Streaming State Validation Technique for Textual Big Data Using Apache Flink;Computational Linguistics and Intelligent Text Processing;2023
