Authors:
Barroso Vasco Chibante, Elia Domenico, Grigoras Costin, Gomez Ramirez Andres, Vino Gioacchino, Wegrzynek Adam
Abstract
ALICE (A Large Ion Collider Experiment) is preparing for a major upgrade of its detector, readout and computing systems for LHC Run 3. A new facility called O2 (Online-Offline) will play a major role in data compression and event processing. To operate the experiment efficiently, we are designing a monitoring subsystem that will provide a complete overview of the overall health of O2 and detect performance degradation and component failures. The monitoring subsystem will receive and collect performance metrics at a rate of up to 600 kHz. It consists of a custom monitoring library and server-side distributed software covering five main functional tasks: parameter collection, processing, storage, visualisation and alarms. To select the most appropriate tools for these tasks, we evaluated three options: the "Modular Stack", Zabbix, and MonALISA, the monitoring tool currently used for the ALICE Grid. The first of these is a toolkit comprising collectd, Apache Flume, Apache Spark, InfluxDB, Grafana and Riemann. This paper describes the functional architecture of the monitoring subsystem. It presents a complete evaluation of the three options, the selection process, the risk assessment and the justification for the final decision. The in-depth comparison covers functional features and throughput measurements, to ensure that the required processing and storage performance can be met.
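As an illustration of what a single performance metric might look like on its way to the storage task, the following minimal sketch serialises a metric into InfluxDB line protocol, the text format accepted by InfluxDB (one of the Modular Stack components). The `Metric` class, the `to_line_protocol` function, and the metric, tag and host names are hypothetical and chosen only for this example; they are not the API of the O2 monitoring library.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

FieldValue = Union[bool, int, float, str]


@dataclass
class Metric:
    """Hypothetical container for one monitoring data point."""
    name: str                                   # measurement name
    fields: Dict[str, FieldValue]               # at least one field is required
    tags: Dict[str, str] = field(default_factory=dict)
    timestamp_ns: int = 0                       # nanoseconds since the Unix epoch


def _escape_key(text: str) -> str:
    # Tag keys, tag values and field keys escape commas, equals signs and spaces.
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _escape_measurement(text: str) -> str:
    # Measurement names escape commas and spaces only.
    return text.replace(",", "\\,").replace(" ", "\\ ")


def _format_field(value: FieldValue) -> str:
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"                      # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'                       # strings are double-quoted


def to_line_protocol(metric: Metric) -> str:
    """Serialise a metric as: measurement[,tag=value...] field=value[,...] timestamp"""
    if not metric.fields:
        raise ValueError("a metric needs at least one field")
    head = _escape_measurement(metric.name)
    for key in sorted(metric.tags):             # sorted tags give stable output
        head += f",{_escape_key(key)}={_escape_key(metric.tags[key])}"
    body = ",".join(
        f"{_escape_key(k)}={_format_field(v)}" for k, v in metric.fields.items()
    )
    return f"{head} {body} {metric.timestamp_ns}"


if __name__ == "__main__":
    sample = Metric(
        name="cpu_load",                        # hypothetical metric name
        fields={"value": 0.75},
        tags={"host": "node01"},                # hypothetical host name
        timestamp_ns=1539129600000000000,
    )
    print(to_line_protocol(sample))
```

At the metric rates quoted above, a real sender would batch many such lines per write rather than send them one by one; the sketch only shows the shape of a single data point.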