Affiliation:
1. Johns Hopkins University
Abstract
R is one of the most popular programming languages for statistics and machine learning, but it is slow and unable to scale to large datasets. The general approach to obtaining an efficient algorithm in R is to implement it in C or FORTRAN and provide an R wrapper. FlashR accelerates and scales existing R code by parallelizing a large number of matrix functions in the R base package and scaling them beyond memory capacity with solid-state drives (SSDs). FlashR performs memory-hierarchy-aware execution to speed up parallelized R code by (i) evaluating matrix operations lazily, (ii) performing all operations in a DAG in a single execution and with only one pass over data to increase the ratio of computation to I/O, and (iii) performing two levels of matrix partitioning and reordering computation on matrix partitions to reduce data movement in the memory hierarchy. We evaluate FlashR on various machine learning and statistics algorithms on inputs of up to four billion data points. Despite the huge performance gap between SSDs and RAM, FlashR on SSDs closely tracks the performance of FlashR in memory for many algorithms. The R implementations in FlashR outperform H2O and Spark MLlib by a factor of 3 to 20.
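The lazy evaluation and single-pass execution described in the abstract can be illustrated with a minimal Python sketch. It is not FlashR's implementation or API: the class names (`Lazy`, `Source`, `Const`), the `sums` helper, and the use of in-memory list slices as stand-ins for on-disk partitions are all assumptions made for illustration. Operations build a DAG without computing anything, and materialization then reads each partition once while evaluating every requested node on it.

```python
import operator


class Lazy:
    """Deferred elementwise operation (hypothetical, not FlashR's API)."""

    def __init__(self, op=None, parents=()):
        self.op = op            # binary function applied elementwise
        self.parents = parents  # upstream nodes in the DAG

    def __add__(self, other):
        return Lazy(operator.add, (self, as_node(other)))

    def __mul__(self, other):
        return Lazy(operator.mul, (self, as_node(other)))

    def on_chunk(self, chunk, memo):
        # Evaluate this node on one partition; memo shares sub-results
        # between nodes so a common parent is computed once per partition.
        if id(self) not in memo:
            left, right = (p.on_chunk(chunk, memo) for p in self.parents)
            memo[id(self)] = [self.op(a, b) for a, b in zip(left, right)]
        return memo[id(self)]


class Source(Lazy):
    """Leaf node standing for the input data."""

    def on_chunk(self, chunk, memo):
        return chunk


class Const(Lazy):
    """Leaf node that broadcasts a scalar across the partition."""

    def __init__(self, value):
        super().__init__()
        self.value = value

    def on_chunk(self, chunk, memo):
        return [self.value] * len(chunk)


def as_node(x):
    return x if isinstance(x, Lazy) else Const(x)


def sums(nodes, chunks):
    """Materialize the sum of every node in a single pass over the partitions."""
    totals = [0.0] * len(nodes)
    for chunk in chunks:  # each partition is read exactly once
        memo = {}
        for i, node in enumerate(nodes):
            totals[i] += sum(node.on_chunk(chunk, memo))
    return totals


# Assumed toy input, split into partitions of two elements each.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]

x = Source()
sq = x * x           # nothing is computed here
shifted = sq + 1.0   # still deferred; reuses the sq node

sum_sq, sum_shifted = sums([sq, shifted], chunks)
```

Both aggregates are produced from one sweep over `chunks`, so the amount of computation per unit of data read grows with the number of operations in the DAG, which is the effect point (ii) aims for.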
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design, Software
Cited by
1 article.