Abstract
The development of an LHC physics analysis involves numerous investigations that require the repeated processing of terabytes of data. Rapid completion of each of these analysis cycles is therefore central to the success of the project. We present a solution for handling and accelerating physics analyses efficiently on small institute clusters. Our solution rests on three key concepts: vectorized processing of collision events, the "MapReduce" paradigm for scaling out on computing clusters, and efficient use of SSD caching to reduce latencies in I/O operations. This work focuses on the last of these concepts, its underlying mechanism, and its implementation. Using simulations from a Higgs pair production physics analysis as an example, we achieve a factor of 6.3 improvement in the runtime for reading all input data after one cycle, and an overall speedup by a factor of 14.9 after 10 cycles, reducing the runtime from hours to minutes.
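As a rough illustration of the SSD-caching concept named in the abstract, the following minimal Python sketch caches each input file on a local SSD the first time it is read, so that later analysis cycles avoid the slow remote access. All names, paths, and the plain file copy are assumptions for illustration, not the authors' implementation.

    # Minimal sketch of a read-through SSD cache for analysis input files.
    # All names and paths below are illustrative assumptions, not the
    # authors' implementation.
    import shutil
    from pathlib import Path

    CACHE_DIR = Path("/ssd/cache")  # assumed mount point of the local SSD

    def cached_path(remote_file: str) -> Path:
        """Return a local SSD copy of remote_file, fetched on first access.

        The first analysis cycle pays the cost of copying from slow mass
        storage; every subsequent cycle reads from the SSD at low latency.
        """
        local = CACHE_DIR / Path(remote_file).name
        if not local.exists():
            CACHE_DIR.mkdir(parents=True, exist_ok=True)
            shutil.copy(remote_file, local)  # stand-in for the real transfer
        return local

    # Hypothetical usage:
    # events = read_events(cached_path("/remote/HH_sample.root"))

In a real deployment the plain file copy would be replaced by the experiment's data-transfer tooling, and cache entries would need to be keyed uniquely and evicted when the SSD fills up; the pattern itself matches what the abstract reports, a slower first cycle that fills the cache followed by much faster repeated reads.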
Publisher
Springer Science and Business Media LLC
Subject
Nuclear and High Energy Physics, Computer Science (miscellaneous), Software
References (16 articles)
1. Rieger M et al (2017) Design and execution of make-like, distributed analyses based on Spotify's pipelining package Luigi. arXiv:1706.00955 [physics.data-an]
2. Harris CR et al (2020) Array programming with NumPy. Nature. https://doi.org/10.1038/s41586-020-2649-2
3. Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, pp 137–150
4. Dask Development Team (2016) Dask: library for dynamic task scheduling. https://dask.org. Accessed 26 May 2022
5. Gray L et al (2021) CoffeaTeam/coffea: Release v0.7.11. https://doi.org/10.5281/zenodo.5762406
Cited by
1 article.
1. Distributed Execution of Dask on HPC: A Case Study. In: 2023 World Conference on Communication & Computing (WCONF), 14 July 2023