Anatomy of machine learning algorithm implementations in MPI, Spark, and Flink

Published: 2017-07-02 Issue: 1 Volume: 32 Page: 61-73
ISSN: 1094-3420
Container-title: The International Journal of High Performance Computing Applications
Language: en

Author:

Kamburugamuve Supun¹, Wickramasinghe Pulasthi¹, Ekanayake Saliya², Fox Geoffrey C¹

Affiliation:

1. School of Informatics and Computing, Indiana University, Bloomington, IN, USA

2. Network Dynamics and Simulation Science Laboratory, Biocomplexity Institute, Virginia Tech, Blacksburg, VA, USA

Abstract

With the ever-increasing need to analyze large amounts of data for useful insights, it is essential to develop complex parallel machine learning algorithms that scale with both data size and the number of parallel processes. These algorithms must run on large data sets and complete in minimal time so that useful information can be extracted in a time-constrained environment. The message passing interface (MPI) is a widely used model for developing such algorithms in the high-performance computing paradigm, while Apache Spark and Apache Flink are emerging as big data platforms for large-scale parallel machine learning. Even though these big data frameworks are designed differently, they follow the data flow model for execution and user APIs. The data flow model offers fundamentally different capabilities from the MPI execution model, but the same type of parallelism can be used in applications developed in both models. This article presents three distinct machine learning algorithms implemented in MPI, Spark, and Flink, compares their performance, and identifies the strengths and weaknesses of each platform.

Publisher

SAGE Publications

Subject

Hardware and Architecture,Theoretical Computer Science,Software

Link

http://journals.sagepub.com/doi/pdf/10.1177/1094342017712976

References: 23 articles.

1. The dataflow model

2. Multidimensional Scaling by Deterministic Annealing with Iterative Majorization Algorithm

3. Java thread and process performance for parallel machine learning on multicore HPC clusters

Cited by 25 articles.

1. Performance Optimization of Machine Learning Algorithms Based on Spark;Applied Mathematics and Nonlinear Sciences;2024-01-01

2. High Performance Dataframes from Parallel Processing Patterns;Parallel Processing and Applied Mathematics;2023

3. Mobile Terminal Simulation of Network Guiding Innovation Platform;2022 International Conference on Augmented Intelligence and Sustainable Systems (ICAISS);2022-11-24

4. Dynamic Data Partitioning Strategy Based on Heterogeneous Flink Cluster;2022 5th International Conference on Artificial Intelligence and Big Data (ICAIBD);2022-05-27

5. Automatic Production Technology of Data News Based on Machine Learning Model;Wireless Communications and Mobile Computing;2022-02-11