Abstract
The ever-growing scale of modern computing infrastructures calls for smarter, more automated approaches to their management. Our work focuses on file transfer failures within the Worldwide Large Hadron Collider Computing Grid and proposes a pipeline that supports distributed data management operations by suggesting potential issues to investigate. Specifically, we adopt an unsupervised learning approach that leverages Natural Language Processing and Machine Learning tools to automatically parse error messages and group similar failures. The results are presented as a summary table containing the most common textual patterns, together with time evolution charts. This approach has two main advantages. First, jointly processing the error string and the transfer's source/destination enables more informative and compact troubleshooting than inspecting each site and checking unique messages separately. As a by-product, it also reduces the number of errors to check by orders of magnitude (from unique error strings to unique categories or patterns). Second, the time evolution plots allow operators to immediately filter out secondary issues (e.g. transient ones or those already being resolved) and focus on the most serious problems first (e.g. escalating failures). As a preliminary assessment, we compare our results with the Global Grid User Support ticketing system, showing that most of our suggestions correspond to real issues (direct association) and that they cover 89% of reported incidents (inverse relationship).
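To illustrate how grouping can shrink many unique error strings into a few patterns, here is a minimal sketch in Python. It is not the pipeline described in the abstract: it swaps the unsupervised NLP and clustering stage for simple regular-expression masking, and the masking rules, function names (`to_pattern`, `summarize`), site names, and sample messages are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical masking rules (assumptions for illustration only,
# not the parser used in the paper). Order matters: URLs and IPs
# are masked before bare numbers.
MASKS = [
    (re.compile(r"[a-z]+://\S+"), "<URL>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_pattern(message):
    """Replace variable tokens with placeholders so similar errors collapse."""
    pattern = message.strip().lower()
    for regex, placeholder in MASKS:
        pattern = regex.sub(placeholder, pattern)
    return pattern


def summarize(failures):
    """Count failures per (pattern, source, destination), most common first."""
    counts = Counter(
        (to_pattern(msg), src, dst) for msg, src, dst in failures
    )
    return counts.most_common()


# Made-up sample records: (error message, source site, destination site).
failures = [
    ("Connection timed out after 300 seconds to srm://site-a.example/f1", "SITE_A", "SITE_B"),
    ("Connection timed out after 120 seconds to srm://site-a.example/f2", "SITE_A", "SITE_B"),
    ("Checksum mismatch for file 42", "SITE_C", "SITE_B"),
]
summary = summarize(failures)
for (pattern, src, dst), count in summary:
    print(count, src, dst, pattern)
```

Here three unique messages collapse into two (pattern, source, destination) groups, which mirrors on a toy scale the reduction from unique error strings to unique patterns mentioned above.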
Funder
Alma Mater Studiorum - Università di Bologna
Publisher
Springer Science and Business Media LLC
Subject
Nuclear and High Energy Physics, Computer Science (miscellaneous), Software