Author:
Li Min, Yang Mengyuan, Chen Pengfei
Abstract
With the growing demand for data computation and communication, the size and complexity of communication networks have grown significantly. Because of hardware and software problems, a large-scale communication network (e.g., a telecommunication network) generates a massive number of alarm events every day; a single serious failure can trigger millions of alarms, each carrying crucial information such as the time, content, and device of the exception. As the communication network expands, the number of components grows and their interactions become more complex, leading to numerous alarm events and complex alarm propagation. Moreover, many of these alarm events are redundant and take considerable effort to resolve. To reduce alarms and pinpoint root causes from them, we propose a data-driven, unsupervised alarm analysis framework that can effectively compress massive alarm events and improve the efficiency of root cause localization. In our framework, an offline learning procedure derives association-based reduction results from a period of historical alarms. An online analysis procedure then matches and compresses real-time alarms and generates root cause groups. The evaluation is based on real communication network alarms from telecom operators, and the results show that our method can associate and reduce communication network alarms with an accuracy of more than 91%, removing more than 62% of redundant alarms. In addition, we validate the method on fault data from a microservices system, where it achieves an accuracy of 95% in root cause localization. Compared with existing methods, the proposed method is more suitable for operation and maintenance analysis in communication networks.
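The offline/online split described in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm: it assumes, purely for illustration, that associations are learned by counting alarm types that co-occur in fixed time windows, and that online compression merges associated types into groups. The function names (`learn_associations`, `compress`), the parameters (`window`, `min_support`), and the alarm type labels are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def learn_associations(history, window=60, min_support=2):
    """Offline step (assumed): keep alarm-type pairs that co-occur in the
    same fixed time window at least `min_support` times."""
    buckets = {}
    for timestamp, alarm_type in history:
        buckets.setdefault(timestamp // window, set()).add(alarm_type)
    pair_counts = Counter()
    for types in buckets.values():
        # Sorting makes every pair appear in one canonical order.
        for pair in combinations(sorted(types), 2):
            pair_counts[pair] += 1
    return {pair for pair, n in pair_counts.items() if n >= min_support}


def compress(alarms, associations):
    """Online step (assumed): merge alarm types linked by a learned
    association into one group, using a small union-find."""
    types = sorted({alarm_type for _, alarm_type in alarms})
    parent = {t: t for t in types}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(types, 2):
        if (a, b) in associations:
            parent[find(b)] = find(a)
    groups = {}
    for t in types:
        groups.setdefault(find(t), []).append(t)
    return sorted(groups.values())


# Hypothetical historical alarms as (timestamp in seconds, alarm type).
history = [
    (0, "LINK_DOWN"), (5, "PORT_FAIL"),
    (70, "LINK_DOWN"), (75, "PORT_FAIL"),
    (130, "CPU_HIGH"), (200, "LINK_DOWN"), (400, "CPU_HIGH"),
]
associations = learn_associations(history)

# Hypothetical real-time alarms arriving during one incident.
live = [(1000, "PORT_FAIL"), (1002, "LINK_DOWN"), (1003, "CPU_HIGH")]
groups = compress(live, associations)
print(groups)  # [['CPU_HIGH'], ['LINK_DOWN', 'PORT_FAIL']]
```

In this toy run, three live alarms collapse into two groups because `LINK_DOWN` and `PORT_FAIL` co-occurred twice in the history. A real system would need a more robust association measure and a way to rank a likely root cause within each group, which this sketch does not attempt.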
Funder
National Key Research and Development Program of China
National Natural Science Foundation of China
Basic and Applied Basic Research Foundation of Guangdong Province
Fundamental Research Funds for the Central Universities
Subject
Computer Science Applications, Computer Vision and Pattern Recognition, Human-Computer Interaction, Computer Science (miscellaneous)
Cited by
1 article.
1. Common Cause Failures in Communication Networks;2024 Systems of Signals Generating and Processing in the Field of on Board Communications;2024-03-12