Interpretable decision-tree induction in a big data parallel framework

Published:2017-12-20 Issue:4 Volume:27 Page:737-748
ISSN:2083-8492
Container-title:International Journal of Applied Mathematics and Computer Science
language:en
Author:

Weinberg Abraham Itzhak¹,Last Mark¹

Affiliation:

1. Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, P.O.B. 653, Beer-Sheva 8410501, Israel

Abstract

When data-mining algorithms run on big data platforms, a parallel, distributed framework such as MapReduce may be used. However, in a parallel framework, each individual model fits the data allocated to its own computing node without necessarily fitting the entire dataset. To induce a single consistent model, ensemble algorithms such as majority voting aggregate the local models rather than analyzing the entire dataset directly. Our goal is to develop an efficient algorithm for choosing one representative model from multiple, locally induced decision-tree models. The proposed SySM (syntactic similarity method) algorithm computes the similarity between the models produced by the parallel nodes and chooses the model that is most similar to the others as the best representative of the entire dataset. In 18.75% of 48 experiments on four big datasets, SySM accuracy is significantly higher than that of the ensemble; in about 43.75% of the experiments, SySM accuracy is significantly lower; in one case, the results are identical; and in the remaining 35.41% of cases, the difference is not statistically significant. Compared with ensemble methods, the representative tree models selected by the proposed methodology are more compact and interpretable, their induction consumes less memory, and, as confirmed by the empirical results, they allow faster classification of new records.
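The selection step described in the abstract (pick the local tree that is most similar to all the others) can be illustrated with a minimal sketch. This is not the authors' SySM implementation: the abstract does not specify the similarity measure, so the sketch assumes each tree is flattened to a set of split-test strings and compares trees with Jaccard similarity. The names `jaccard`, `select_representative`, and `node_trees`, as well as the example split tests, are hypothetical.

```python
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two sets; taken to be 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_representative(models: list) -> int:
    """Return the index of the model with the highest total similarity to the others.

    Assumption: each model is a frozenset of split-test strings. This stands in
    for whatever syntactic representation the paper actually uses.
    """
    n = len(models)
    if n == 0:
        raise ValueError("need at least one model")
    totals = [0.0] * n
    # Each unordered pair is compared once and credited to both members.
    for i, j in combinations(range(n), 2):
        s = jaccard(models[i], models[j])
        totals[i] += s
        totals[j] += s
    # Ties are broken in favor of the lowest index.
    return max(range(n), key=lambda i: totals[i])


# Hypothetical trees induced on three computing nodes.
node_trees = [
    frozenset({"age<=30", "income<=50k", "owns_home"}),
    frozenset({"age<=30", "income<=50k", "owns_home", "student"}),
    frozenset({"age<=30", "student"}),
]

best = select_representative(node_trees)
print(f"representative tree: node {best}")
```

Unlike majority voting, which must keep and query every local model at classification time, this kind of selection returns a single tree, which is what makes the result compact and inspectable.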

Publisher

Walter de Gruyter GmbH

Subject

Applied Mathematics,Engineering (miscellaneous),Computer Science (miscellaneous)

References: 35 articles.

1. AlSabti, K., Ranka, S. and Singh, V. (1998). Clouds: Classification for large or out-of-core datasets, Conference on Knowledge Discovery and Data Mining, New York, NY, USA, pp. 2-8.

2. Amado, N., Gama, J. and Silva, F. (2001). Parallel implementation of decision tree learning algorithms, in P. Brazdil and A. Jorge (Eds.), Progress in Artificial Intelligence, Springer, Berlin/Heidelberg, pp. 6-13. DOI: 10.1007/3-540-45329-6_4

3. Amado, N., Gama, J. and Silva, F. (2003). Exploiting parallelism in decision tree induction, ECML/PKDD Workshop on Parallel and Distributed Computing for Machine Learning, Cavtat/Dubrovnik, Croatia, pp. 13-22.

4. Andrzejak, A., Langner, F. and Zabala, S. (2013). Interpretable models from distributed data via merging of decision trees, IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Savannah, GA, USA, pp. 1-9.

Cited by 11 articles.

1. The Hybrid Cluster-And-Classify Approach;Studies in Big Data;2023

2. RfX: A Design Study for the Interactive Exploration of a Random Forest to Enhance Testing Procedures for Electrical Engines;Computer Graphics Forum;2022-03-07

3. Novel method for optimizing performance in resource constrained distributed data streams;Applied Intelligence;2022-02-16

4. Prediction of reservoir saturation field in high water cut stage by bore-ground electromagnetic method based on machine learning;Journal of Petroleum Science and Engineering;2021-09

5. The Evaluation of Online Education Course Performance Using Decision Tree Mining Algorithm;Complexity;2021-04-09