Big data decision tree for continuous-valued attributes based on unbalanced cut points

Published:2023-08-31 Issue:1 Volume:10
ISSN:2196-1115
Container-title:Journal of Big Data
language:en
Short-container-title:J Big Data

Author:

Ma Shixiang,Zhai Junhai

Abstract

The decision tree is a widely used decision support model that can quickly mine effective decision rules from a dataset. The decision tree induction algorithm for continuous-valued attributes based on unbalanced cut points is efficient for mining decision rules; however, extending it to big data remains an unresolved problem. In this paper, two solutions are proposed: the first partitions the data into instance subsets, whereas the second partitions it into attribute subsets. The crux of both solutions is how to find the global optimal cut point from the set of local optimal cut points. In the first solution, the Gini index of the cut points is calculated across computing nodes, and the global optimal cut point is selected through communication between these nodes. In the second solution, the big data is divided by attribute subsets so that all cut points of an attribute reside on the same map node; the local optimal cut points are found on that map node, and the global optimal cut point is then obtained by aggregating all local optimal cut points in the reduce node. Finally, the proposed solutions are implemented on two big data platforms, Hadoop and Spark, and compared with three related algorithms on four datasets. Experimental results show that the proposed algorithms not only effectively solve the scalability problem but also achieve the lowest running time, the fastest speed, and the highest efficiency while preserving classification performance.
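
The attribute-partitioning idea in the abstract (find a local optimal cut point per attribute, then pick the global optimum among them) can be illustrated with a minimal single-process sketch. This is not the authors' implementation: it scores every midpoint between distinct sorted values with the standard weighted Gini index rather than restricting candidates to unbalanced cut points, and the function names `gini`, `local_best_cut`, and `global_best_cut` are hypothetical, chosen only for illustration. The "map" and "reduce" steps are plain Python functions standing in for Hadoop or Spark tasks.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def local_best_cut(values, labels):
    """'Map' step (assumed): best cut point for ONE attribute.

    Tries the midpoint between each pair of adjacent distinct values and
    returns (weighted_gini, cut_point); cut_point is None if no cut exists.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_score, best_cut = float("inf"), None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no cut point between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_score, best_cut = score, (lo + hi) / 2
    return best_score, best_cut


def global_best_cut(columns, labels):
    """'Reduce' step (assumed): smallest Gini among the local optima."""
    candidates = []
    for name, values in columns.items():
        score, cut = local_best_cut(values, labels)
        if cut is not None:
            candidates.append((score, name, cut))
    return min(candidates) if candidates else None


# Toy data: attribute x1 separates the two classes perfectly at 2.5.
columns = {"x1": [1.0, 2.0, 3.0, 4.0], "x2": [5.0, 1.0, 4.0, 2.0]}
labels = ["a", "a", "b", "b"]
result = global_best_cut(columns, labels)
print(result)
```

Because each attribute's values are scored in isolation, `local_best_cut` needs no communication with other attributes, which is the property the abstract relies on when it places all cut points of an attribute on the same map node.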

Funder

Key R&D Program of the Science and Technology Foundation of Hebei Province

Natural Science Foundation of Hebei Province

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management,Computer Networks and Communications,Hardware and Architecture,Information Systems

Link

https://link.springer.com/content/pdf/10.1186/s40537-023-00816-2.pdf

Reference: 38 articles.

1. Roh Y, Heo G, Whang SE. A survey on data collection for machine learning: a Big Data-AI integration perspective. IEEE Trans Knowl Data Eng. 2021;33(4):1328–47.

2. Chu CT, Kim SK, Lin YA, et al. Map-reduce for machine learning on multicore. In: Proceedings of the 2006 conference, advances in neural information processing systems 19. MIT Press; 2007. p. 281–8.

3. He Q, Zhuang FZ, Li JC, et al. Parallel implementation of classification algorithms based on MapReduce. RSKT 2010, lecture notes in computer science (LNAI, volume 6401). p. 655–62.

4. Xu Y, Qu W, Li Z, et al. Efficient K-means++ approximation with MapReduce. IEEE Trans Parallel Distrib Syst. 2014;25(12):3135–44.

5. Duan M, Li K, Liao X, et al. A parallel multiclassification algorithm for big data using an extreme learning machine. IEEE Trans Neural Netw Learn Syst. 2018;29(6):2337–51.

Cited by 1 article.

1. Optimal Layout Method for Roadside LiDAR and Camera;IEEE Access;2024