Predicting Failures of Autoscaling Distributed Applications

Author:

Denaro Giovanni (1), El Moussa Noura (2), Heydarov Rahim (3), Lomio Francesco (4), Pezzè Mauro (2), Qiu Ketai (3)

Affiliation:

1. University of Milano-Bicocca, Milan, Italy

2. Università della Svizzera Italiana (USI), Lugano, Switzerland / Constructor Institute Schaffhausen, Schaffhausen, Switzerland

3. Università della Svizzera Italiana (USI), Lugano, Switzerland

4. Constructor Institute Schaffhausen, Schaffhausen, Switzerland

Abstract

Predicting failures in production environments allows service providers to activate countermeasures before the failures harm the users of the applications. The most successful current approaches predict failures from error states, which they identify as anomalies in time series of fixed sets of KPI values collected at runtime. They cannot handle time series of KPI sets whose size varies over time. Thus, these approaches work only with applications that run on statically configured sets of components and computational nodes, and do not scale to the many popular cloud applications that exploit autoscaling. This paper proposes Preface, a novel approach to predict failures in cloud applications that exploit autoscaling. Preface augments the neural-network-based failure predictors that have been successfully used to predict failures in statically configured applications with a Rectifier layer. The Rectifier handles KPI sets of highly variable size, such as those collected in autoscaling cloud applications, and reduces those KPIs to a fixed-size set of rectified-KPIs that can be fed to the neural-network predictor. The Preface Rectifier computes the rectified-KPIs as descriptive statistics of the original KPIs, for each logical component of the target application. The descriptive statistics shrink the highly variable sets of KPIs collected at different timestamps to a fixed set of values compatible with the input nodes of the neural-network failure predictor. The neural network can then reveal anomalies that correspond to error states before they propagate to failures that harm the users of the applications. Experiments on both a commercial application and a widely used academic exemplar confirm that Preface can predict many harmful failures early enough to activate proper countermeasures.
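The rectification idea described in the abstract can be illustrated with a minimal sketch. It is not the Preface implementation: the function name `rectify`, the snapshot layout, the component and KPI names, the choice of statistics (min, max, mean, population standard deviation), and the zero-filling of missing data are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows how per-component descriptive statistics yield a vector of the same length regardless of how many replicas are running.

```python
from statistics import mean, pstdev

# Assumed set of descriptive statistics; the abstract does not say which ones Preface uses.
STATS = {"min": min, "max": max, "mean": mean, "std": pstdev}


def rectify(snapshot, components, kpi_names):
    """Reduce a variable-size KPI snapshot to a fixed-length vector.

    snapshot:   dict mapping a logical component name to a list of per-replica
                dicts {kpi_name: value}; the list length changes with autoscaling.
    components: fixed, ordered list of logical component names.
    kpi_names:  fixed, ordered list of KPI names.
    """
    vector = []
    for component in components:
        replicas = snapshot.get(component, [])
        for kpi in kpi_names:
            values = [r[kpi] for r in replicas if kpi in r]
            for stat in STATS.values():
                # Assumption: use 0.0 when no replica reports this KPI.
                vector.append(float(stat(values)) if values else 0.0)
    return vector


# Hypothetical snapshots taken at two timestamps with different replica counts.
components = ["frontend", "cart"]
kpis = ["cpu", "latency_ms"]
t0 = {
    "frontend": [{"cpu": 0.2, "latency_ms": 10.0}],
    "cart": [{"cpu": 0.5, "latency_ms": 30.0}],
}
t1 = {
    "frontend": [
        {"cpu": 0.2, "latency_ms": 10.0},
        {"cpu": 0.6, "latency_ms": 14.0},
        {"cpu": 0.4, "latency_ms": 12.0},
    ],
    "cart": [{"cpu": 0.5, "latency_ms": 30.0}, {"cpu": 0.7, "latency_ms": 50.0}],
}
print(len(rectify(t0, components, kpis)))  # 16
print(len(rectify(t1, components, kpis)))  # 16
```

Both snapshots map to 16 values (2 components × 2 KPIs × 4 statistics), so a predictor with a fixed number of input nodes could consume either one.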

Funder

MUR, Ministero Università e Ricerca

SNF Swiss National Foundation

Publisher

Association for Computing Machinery (ACM)

