Abstract
Context: Anomaly detection in a data center is a challenging task because it must account for different services running on various resources. Current literature applies artificial intelligence and machine learning techniques to either log files or monitoring data: the former are created by services at run time, while the latter are produced by specific sensors directly on the physical or virtual machine.
Objectives: We propose a model that exploits information in both log files and monitoring data to identify patterns and detect anomalies over time, at both the service level and the machine level.
Methods: The key idea is to construct a specific dictionary for each log file, which helps to extract anomalous n-grams in the feature matrix. Several Natural Language Processing techniques, such as word clouds and topic modeling, were used to enrich this dictionary. A clustering algorithm was then applied to the feature matrix to identify and group the various types of anomalies. In parallel, a time series anomaly detection technique was applied to the sensor data in order to combine problems found in the log files with problems recorded in the monitoring data. Several services (i.e., log files) running on the same machine were grouped together with the monitoring metrics.
Results: We tested our approach on a real data center equipped with log files and monitoring data that characterize the behaviour of physical and virtual resources in production. The data were provided by the National Institute for Nuclear Physics in Italy. We observed a correspondence between anomalies in log files and in monitoring data, e.g., a decrease in memory usage or an increase in machine load. The results are extremely promising.
Conclusions: Important outcomes have emerged thanks to the integration of these two types of data. Our model requires integrating site administrators' expertise in order to consider all critical scenarios in the data center and to interpret the results properly.
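The pipeline outlined in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a small hand-written dictionary of anomalous terms (`ANOMALY_TERMS`, hypothetical), log messages already assigned to integer time buckets, and a plain z-score test standing in for the unspecified time series anomaly detection technique. The clustering and topic modeling steps are omitted, and all function names are illustrative.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical dictionary of anomalous terms for one log file (assumption).
ANOMALY_TERMS = {"error", "failed", "timeout"}


def ngrams(tokens, n=2):
    """Return the n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def anomalous_ngram_counts(log_lines, n=2):
    """Count, per time bucket, the n-grams containing a dictionary term.

    log_lines is an iterable of (bucket, message) pairs, where bucket is an
    integer time index such as minutes since the start of the log.
    """
    counts = {}
    for bucket, message in log_lines:
        tokens = message.lower().split()
        hits = [g for g in ngrams(tokens, n) if ANOMALY_TERMS & set(g)]
        counts.setdefault(bucket, Counter()).update(hits)
    return counts


def zscore_anomalies(series, threshold=2.0):
    """Return indices whose value lies more than `threshold` population
    standard deviations away from the mean of the series."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return []
    return [i for i, x in enumerate(series) if abs(x - mu) / sigma > threshold]


def co_occurring(log_counts, metric_anomalies):
    """Time buckets flagged both in the log features and in the metric."""
    log_buckets = {b for b, c in log_counts.items() if c}
    return sorted(log_buckets & set(metric_anomalies))


# Toy data: one suspicious log message at bucket 3 and a load spike at index 3.
logs = [
    (0, "service started ok"),
    (3, "connection failed after timeout"),
    (5, "request done"),
]
load = [1.0, 1.1, 0.9, 9.0, 1.0, 1.0]

counts = anomalous_ngram_counts(logs)
flagged = co_occurring(counts, zscore_anomalies(load))
```

In this toy run, `flagged` contains the single bucket where the log features and the machine load are both anomalous, which mirrors the kind of correspondence between log files and monitoring data reported in the Results.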
Subject
Computer Networks and Communications, Human-Computer Interaction