Author:
Taamneh Salah, Qawasmeh Ahmad, Aljammal Ashraf H.
Abstract
The k-means algorithm is a well-known unsupervised machine learning method that partitions a given dataset into a fixed number of clusters through iterative refinement. Running it on today’s datasets, which are characterized by high dimensionality and very large size, requires fault-tolerance mechanisms to mitigate the impact of possible failures. In this paper, we propose an actor-based implementation of the k-means algorithm. The algorithm is made fault-tolerant by periodically saving the centroids to stable storage during failure-free execution and restarting from the last saved centroids after a failure. This was implemented in two ways: optimistic checkpointing (blocking) and pessimistic checkpointing (non-blocking). The actor-based k-means algorithm was evaluated on a machine with eight cores. The experiments showed that the proposed algorithm scales very well as the number of workers increases and can be up to about 2x faster than a Java-thread-based implementation of k-means. The results also showed that the optimistic algorithm outperformed the pessimistic one, particularly in the presence of competing I/O operations. To evaluate the fault-tolerant implementations, several failures were forced during execution; the average amount of lost work ranged from 3–6%.
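The checkpoint-and-restart idea described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' actor-based implementation and does not model workers or the two checkpointing variants: it simply writes the centroids synchronously to a JSON file every few iterations and resumes from that file if one exists. All names here (`kmeans`, `save_checkpoint`, `load_checkpoint`, the `every` and `fail_at` parameters) are assumptions made for illustration.

```python
import json
import os
import tempfile


def assign(points, centroids):
    """Return, for every point, the index of its nearest centroid."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centroids):
    """Recompute each centroid as the mean of the points assigned to it."""
    new = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            new.append(list(old))  # an empty cluster keeps its old centroid
        else:
            new.append([sum(col) / len(members) for col in zip(*members)])
    return new


def save_checkpoint(path, iteration, centroids):
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with os.fdopen(fd, "w") as f:
        json.dump({"iteration": iteration, "centroids": centroids}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def kmeans(points, init_centroids, path, every=2, max_iter=100, fail_at=None):
    """Run k-means, checkpointing every `every` iterations.

    `fail_at` is a hypothetical hook that simulates a crash at the start
    of the given iteration.
    """
    state = load_checkpoint(path)
    if state is not None:
        start, centroids = state["iteration"], state["centroids"]
    else:
        start, centroids = 0, [list(c) for c in init_centroids]

    for it in range(start, max_iter):
        if fail_at is not None and it == fail_at:
            raise RuntimeError("simulated failure")
        new = update(points, assign(points, centroids), centroids)
        if new == centroids:  # converged: centroids no longer move
            return centroids, it
        centroids = new
        if (it + 1) % every == 0:
            save_checkpoint(path, it + 1, centroids)
    return centroids, max_iter
```

Calling `kmeans` again with the same `path` after a failure resumes from the last saved centroids, so only the iterations since the most recent checkpoint are repeated. A smaller `every` reduces lost work at the cost of more frequent writes.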