Affiliation:
1. Department of Operating Systems, Algebra University, 10000 Zagreb, Croatia
2. Department of Control and Computer Engineering, Faculty of Electrical Engineering and Computing, University of Zagreb, 10000 Zagreb, Croatia
Abstract
In the past twenty years, the IT industry has moved away from using physical servers for workload management to workloads consolidated via virtualization and, in the next iteration, further consolidated into containers. Later, container workloads based on Docker and Podman were orchestrated via Kubernetes or OpenShift. On the other hand, high-performance computing (HPC) environments have been lagging in this process, as much work is still needed to figure out how to apply containerization platforms for HPC. Containers have many advantages, as they tend to have less overhead while providing flexibility, modularity, and maintenance benefits. This makes them well-suited for tasks requiring a lot of computing power that are latency- or bandwidth-sensitive. But they are complex to manage, and many daily operations are based on command-line procedures that take years to master. This paper proposes a different architecture based on seamless hardware integration and a user-friendly UI (User Interface). It also offers dynamic workload placement based on real-time performance analysis and prediction and Machine Learning-based scheduling. This solves a prevalent issue in Kubernetes: the suboptimal placement of workloads without needing individual workload schedulers, as they are challenging to write and require much time to debug and test properly. It also enables us to focus on one of the key HPC issues—energy efficiency. Furthermore, the application we developed that implements this architecture helps with the Kubernetes installation process, which is fully automated, no matter which hardware platform we use—x86, ARM, and soon, RISC-V. The results we achieved using this architecture and application are very promising in two areas—the speed of workload scheduling and workload placement on a correct node. This also enables us to focus on one of the key HPC issues—energy efficiency.