Abstract
The decision tree is a widely used decision support model that can quickly mine effective decision rules from a dataset. The decision tree induction algorithm for continuous-valued attributes based on unbalanced cut points is efficient for mining decision rules; however, extending it to big data remains an unresolved problem. In this paper, two solutions are proposed to solve this problem: the first is based on partitioning instance subsets, whereas the second partitions attribute subsets. The crux of both solutions is how to find the global optimal cut point from the set of local optimal cut points. In the first solution, the Gini index of the cut points is computed in parallel across computing nodes, and the global optimal cut point is selected through communication between these nodes. In the second solution, the big data are divided into subsets by attribute, so that all cut points of an attribute reside on the same map node; the local optimal cut point is found on that map node, and the global optimal cut point is then obtained by summarizing all local optimal cut points on the reduce node. Finally, the proposed solutions are implemented on two big data platforms, Hadoop and Spark, and compared with three related algorithms on four datasets. Experimental results show that the proposed algorithms not only effectively solve the scalability problem but also achieve the lowest running time, the fastest speed, and the highest efficiency while preserving classification performance.
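To make the attribute-partition idea concrete, the following is a minimal sketch (not the paper's implementation): each "map" task owns one attribute, finds its locally optimal cut point by weighted Gini index, and a "reduce" step picks the global optimum across attributes. The data layout, function names, and the use of plain midpoints as candidate cut points (rather than the paper's unbalanced cut points) are illustrative assumptions.

```python
# Sketch of per-attribute local optima followed by a global reduce step.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_cut_for_attribute(values, labels):
    """Local ('map') optimum for one attribute: (weighted Gini, cut point)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                   # no cut between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2.0    # midpoint candidate (assumption)
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        w_gini = (len(left) * gini(left) + len(right) * gini(right)) / n
        if w_gini < best[0]:
            best = (w_gini, cut)
    return best

def global_best_cut(X, y):
    """'Reduce' step: compare local optima across attribute partitions."""
    local = [(j,) + best_cut_for_attribute([row[j] for row in X], y)
             for j in range(len(X[0]))]                # one task per attribute
    return min(local, key=lambda t: t[1])              # (attribute index, Gini, cut)

if __name__ == "__main__":
    X = [[2.7, 1.1], [1.3, 3.5], [3.6, 0.9], [0.8, 2.8]]
    y = ["a", "b", "a", "b"]
    print(global_best_cut(X, y))                       # -> (0, 0.0, 2.0)
```

In a Hadoop or Spark setting, `best_cut_for_attribute` would run once per map task and only the small `(attribute, Gini, cut)` triples would be shuffled to the reduce node, which is what keeps the communication cost low in this strategy.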
Funder
the Key R&D Program of Science and Technology Foundation of Hebei Province
the Natural Science Foundation of Hebei Province
Publisher
Springer Science and Business Media LLC
Subject
Information Systems and Management, Computer Networks and Communications, Hardware and Architecture, Information Systems
Cited by
1 article.