A Fast Parallel Random Forest Algorithm Based on Spark

Published: 2023-05-17 Issue: 10 Volume: 13 Page: 6121
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences

Author:

Yin Linzi¹, Chen Ken¹, Jiang Zhaohui², Xu Xuemei¹

Affiliation:

1. School of Physics and Electronics, Central South University, Changsha 410012, China

2. School of Automation, Central South University, Changsha 410012, China

Abstract

To improve computational efficiency and classification accuracy in big-data settings, an optimized parallel random forest algorithm based on the Spark computing framework is proposed. First, a new Gini coefficient is defined to reduce the impact of feature redundancy and thereby raise classification accuracy. Next, to reduce the number of candidate split points and Gini coefficient calculations for continuous features, an approximate equal-frequency binning method is proposed to determine the optimal split points efficiently. Finally, based on the Apache Spark computing framework, a forest sampling index (FSI) table is defined to speed up the parallel training of decision trees and reduce data communication overhead. Experimental results show that the proposed algorithm improves the efficiency of constructing random forests while preserving classification accuracy, and that it outperforms Spark-MLRF in performance and scalability.
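To make the binning idea concrete, the following is a minimal single-machine sketch, not the authors' implementation. It assumes the standard Gini impurity (the paper's redefined Gini coefficient is not given in this record) and exact equal-frequency cut points computed by sorting (the paper's approximate variant and its Spark parallelization are likewise not reproduced here). The function names `gini`, `equal_frequency_candidates`, and `best_split` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def gini(labels):
    """Standard Gini impurity, 1 - sum(p_k^2). An assumption: not the paper's new coefficient."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def equal_frequency_candidates(values, n_bins):
    """Return at most n_bins - 1 candidate thresholds placed at equal-frequency cut points."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = set()
    for b in range(1, n_bins):
        idx = b * n // n_bins
        # Skip cut points that fall between two identical values.
        if 0 < idx < n and ordered[idx - 1] != ordered[idx]:
            cuts.add((ordered[idx - 1] + ordered[idx]) / 2.0)
    return sorted(cuts)


def best_split(values, labels, n_bins=4):
    """Evaluate only the binned candidates and return (threshold, weighted Gini impurity)."""
    n = len(values)
    best_threshold, best_score = None, float("inf")
    for t in equal_frequency_candidates(values, n_bins):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = t, score
    return best_threshold, best_score


if __name__ == "__main__":
    feature = [1, 2, 3, 4, 5, 6, 7, 8]
    target = [0, 0, 0, 0, 1, 1, 1, 1]
    print(equal_frequency_candidates(feature, 4))
    print(best_split(feature, target, n_bins=4))
```

With `n_bins` bins, at most `n_bins - 1` thresholds are scored per continuous feature instead of one per distinct value, which is the reduction in Gini calculations that the abstract describes.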

Funder

National Natural Science Foundation of China

Provincial Natural Science Foundation of Hunan

Central South University

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/13/10/6121/pdf

References (31 articles)

1. Breiman (2001). Random forests. Mach. Learn.

2. Dziak (2020). Sensitivity and specificity of information criteria. Brief. Bioinform.

3. Ali, M.A.S., Orban, R., and Rajammal Ramasamy, R. (2022). A Novel Method for Survival Prediction of Hepatocellular Carcinoma Using Feature-Selection Techniques. Appl. Sci., 12.

4. Phan, T.N., Kuch, V., and Lehnert, L.W. (2020). Land Cover Classification using Google Earth Engine and Random Forest Classifier—The Role of Image Composition. Remote Sens., 12.

5. Zheng, X., Jia, J., Chen, J., Guo, S., Sun, L., Zhou, C., and Wang, Y. (2022). Hyperspectral Image Classification with Imbalanced Data Based on Semi-Supervised Learning. Appl. Sci., 12.

Cited by 1 article.

1. Forest in the Clouds: Navigating Big Data with GRP and RFC;Lecture Notes in Networks and Systems;2024