Runtime prediction of big data jobs: performance comparison of machine learning algorithms and analytical models

Published:2022-05-19 Issue:1 Volume:9
ISSN:2196-1115
Container-title:Journal of Big Data
language:en
Short-container-title:J Big Data

Author:

Ahmed Nasim, Barczak Andre L. C., Rashid Mohammad A., Susnjak Teo

Abstract

Due to the rapid growth of available data, various platforms offer parallel infrastructure that efficiently processes big data. One of the critical issues is how to use these platforms to optimise resources, and for this reason, performance prediction has been an important topic in the last few years. There are two main approaches to the problem of predicting performance. One is to fit data into an equation based on an analytical model. The other is to use machine learning (ML) in the form of regression algorithms. In this paper, we have investigated the difference in accuracy for these two approaches. While our experiments used an open-source platform called Apache Spark, the results obtained by this research are applicable to any parallel platform and are not constrained to this technology. We found that gradient boost, an ML regressor, is more accurate than any of the existing analytical models as long as the range of the prediction follows that of the training. We have investigated analytical and ML models based on interpolation and extrapolation methods with k-fold cross-validation techniques. Using the interpolation method, two analytical models, namely the 2D-plate and fully-connected models, outperform older analytical models and the kernel ridge regression algorithm but not the gradient boost regression algorithm. We found that the average accuracies of the 2D-plate and fully-connected models using interpolation are 0.962 and 0.961, respectively. However, when using the extrapolation method, the analytical models are much more accurate than the ML regressors, particularly two of the most recently proposed models (2D-plate and fully-connected). Both models are based on the communication patterns between the nodes. We found that, using extrapolation, the average accuracies of kernel ridge, gradient boost and the two proposed analytical models are 0.466, 0.677, 0.975, and 0.981, respectively. This study shows that practitioners can benefit from analytical models by being able to accurately predict the runtime outside of the range of the training data using only a few experimental operations.
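
The interpolation versus extrapolation distinction in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the model form T(n) = a + b/n is a hypothetical stand-in and not the paper's 2D-plate or fully-connected model, the runtimes are synthetic and noiseless, the nearest-point predictor is only a crude proxy for a regressor that cannot leave its training range, and relative error is used because the abstract does not define its accuracy metric.

```python
# Hypothetical analytical form T(n) = a + b / n (n = number of executors).
# This is NOT the paper's 2D-plate or fully-connected model.

def fit_inverse_model(ns, runtimes):
    """Ordinary least-squares fit of T(n) = a + b/n; returns (a, b)."""
    xs = [1.0 / n for n in ns]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(runtimes) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, runtimes))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_inverse_model(params, n):
    """Evaluate the fitted model at n executors."""
    a, b = params
    return a + b / n


def predict_nearest(ns, runtimes, n):
    """Piecewise-constant baseline: runtime of the closest training point."""
    idx = min(range(len(ns)), key=lambda i: abs(ns[i] - n))
    return runtimes[idx]


def relative_error(predicted, actual):
    return abs(predicted - actual) / actual


def true_runtime(n):
    """Synthetic ground truth (assumed values, not measured data)."""
    return 20.0 + 600.0 / n


# Train on a narrow range of executor counts.
train_ns = [2, 4, 6, 8]
train_ts = [true_runtime(n) for n in train_ns]
params = fit_inverse_model(train_ns, train_ts)

# n = 5 lies inside the training range, n = 32 lies well outside it.
errors = {}
for label, n in [("interpolation", 5), ("extrapolation", 32)]:
    actual = true_runtime(n)
    errors[(label, "analytical")] = relative_error(
        predict_inverse_model(params, n), actual)
    errors[(label, "nearest")] = relative_error(
        predict_nearest(train_ns, train_ts, n), actual)
    print(f"{label:13s} n={n:2d}  "
          f"analytical={errors[(label, 'analytical')]:.3f}  "
          f"nearest={errors[(label, 'nearest')]:.3f}")
```

Because the synthetic data is generated from the same form that is fitted, the analytical fit is exact here by construction. The sketch only shows the mechanism (a parametric form carries its shape beyond the training range, while a range-bound predictor degrades) and does not reproduce the paper's results.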

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management,Computer Networks and Communications,Hardware and Architecture,Information Systems

Link

https://link.springer.com/content/pdf/10.1186/s40537-022-00623-1.pdf

Reference: 58 articles.

1. Ghani NA, Hamid S, Hashem IAT, Ahmed E. Social media big data analytics: a survey. Comput Hum Behav. 2019;101:417–28.

2. Fang R, Pouyanfar S, Yang Y, Chen S-C, Iyengar S. Computational health informatics in the big data age: a survey. ACM Comput Surv. 2016;49(1):1–36.

3. Hirschberg J, Manning CD. Advances in natural language processing. Science. 2015;349(6245):261–6.

4. Maros A, Murai F, da Silva APC, Almeida JM, Lattuada M, Gianniti E, Hosseini M, Ardagna D. Machine learning for performance prediction of spark cloud applications. In: 2019 IEEE 12th international conference on cloud computing (CLOUD). New York: IEEE; 2019. p. 99–106.

5. Salloum S, Dautov R, Chen X, Peng PX, Huang JZ. Big data analytics on apache spark. Int J Data Sci Anal. 2016;1(3):145–64.

Cited by 7 articles.

1. Performance optimization of Spark MLlib workloads using cost efficient RICG model on exponential projective sampling;Cluster Computing;2024-05-08

2. RFCPredicModel: Prediction Algorithm of Precision Medicine in Healthcare with Big Data;Communications in Computer and Information Science;2024

3. PM100: A Job Power Consumption Dataset of a Large-scale Production HPC System;Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis;2023-11-12

4. A Novel Multi-Task Performance Prediction Model for Spark;Applied Sciences;2023-11-11

5. Predicting Sales Using Performance Comparison of Different Algorithms in Regression Algorithms;2023 IEEE International Conference on Sensors, Electronics and Computer Engineering (ICSECE);2023-08-18