Tarazu: An Adaptive End-to-end I/O Load-balancing Framework for Large-scale Parallel File Systems

Published: 2024-04-04 Issue: 2 Volume: 20 Page: 1-42
ISSN: 1553-3077
Container-title: ACM Transactions on Storage
Language: en
Short-container-title: ACM Trans. Storage

Author:

Paul Arnab K.¹, Neuwirth Sarah², Wadhwa Bharti³, Wang Feiyi⁴, Oral Sarp⁴, Butt Ali R.⁵

Affiliation:

1. BITS Pilani, KK Birla Goa Campus, Zuarinagar, India

2. Johannes Gutenberg University Mainz, Mainz, Germany

3. IBM Research, Yorktown Heights, USA

4. Oak Ridge National Laboratory, Oak Ridge, USA

5. Virginia Tech, Blacksburg, USA

Abstract

The imbalanced I/O load on large parallel file systems affects the parallel I/O performance of high-performance computing (HPC) applications. One of the main reasons for I/O imbalances is the lack of a global view of system-wide resource consumption. While approaches to address the problem already exist, the diversity of HPC workloads combined with different file striping patterns prevents widespread adoption of these approaches. In addition, load-balancing techniques should be transparent to client applications. To address these issues, we propose Tarazu, an end-to-end control plane where clients transparently and adaptively write to a set of selected I/O servers to achieve balanced data placement. Our control plane leverages real-time load statistics for global data placement on distributed storage servers, while our design model employs trace-based optimization techniques to minimize latency for I/O load requests between clients and servers and to handle multiple striping patterns in files. We evaluate our proposed system on an experimental cluster for two common use cases: the synthetic I/O benchmark IOR and the scientific application I/O kernel HACC-I/O. We also use a discrete-time simulator with real HPC application traces from emerging workloads running on the Summit supercomputer to validate the effectiveness and scalability of Tarazu in large-scale storage environments. The results show improvements in load balancing and read performance of up to 33% and 43%, respectively, compared to the state-of-the-art.

Funder

National Science Foundation

Oak Ridge Leadership Computing Facility

National Center for Computational Sciences

Office of Science of the DOE

European High-Performance Computing Joint Undertaking

European Union’s Horizon 2020

BITS Pilani

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3641885
