Affiliation:
1. School of Computer Science, University of South China, Hengyang 421200, China
Abstract
Multiobjective clustering algorithms based on particle swarm optimization (PSO) have been applied successfully in a number of applications. However, existing algorithms are implemented on a single machine and cannot be directly parallelized on a cluster, which makes it difficult for them to handle large-scale data. Distributed parallel computing frameworks make data parallelism possible, but increasing the degree of parallelism leads to unbalanced data distribution, which degrades the clustering results. In this paper, we propose a parallel multiobjective PSO weighted average clustering algorithm based on Apache Spark (Spark-MOPSO-Avg). First, the entire data set is divided into multiple partitions and cached in memory using the distributed, parallel, memory-based computing of Apache Spark. The local fitness value of each particle is calculated in parallel from the data in each partition. After the calculation is completed, only particle information is transmitted, so large numbers of data objects do not need to be transferred between nodes; this reduces network communication and thus effectively reduces the algorithm’s running time. Second, a weighted average of the local fitness values is computed to mitigate the effect of unbalanced data distribution on the results. Experimental results show that the Spark-MOPSO-Avg algorithm achieves lower information loss under data parallelism, losing about 1% to 9% in accuracy, while effectively reducing the algorithm’s time overhead. It shows good execution efficiency and parallel computing capability on a Spark distributed cluster.
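The weighted-average step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the fitness functions or the weights, so this example assumes a single objective (mean distance to the nearest centroid) and weights proportional to partition size, and it simulates partitions with plain Python lists rather than Spark. The names `local_fitness` and `weighted_average_fitness` are hypothetical.

```python
from math import dist


def local_fitness(partition, centroids):
    """Mean distance from each point in one partition to its nearest centroid."""
    return sum(min(dist(p, c) for c in centroids) for p in partition) / len(partition)


def weighted_average_fitness(partitions, centroids):
    """Combine per-partition fitness values, weighting each by partition size.

    Size-proportional weights are an assumption of this sketch.
    Empty partitions are skipped.
    """
    total = sum(len(part) for part in partitions)
    return sum(
        len(part) / total * local_fitness(part, centroids)
        for part in partitions
        if part
    )


# Toy data: two unbalanced partitions of 2-D points.
partitions = [
    [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)],
    [(10.0, 14.0)],
]
# One particle encodes one set of candidate centroids.
centroids = [(0.0, 1.0), (10.0, 11.0)]

local_values = [local_fitness(part, centroids) for part in partitions]
fitness = weighted_average_fitness(partitions, centroids)
plain_mean = sum(local_values) / len(local_values)
```

With these toy values, the small partition has a much worse local fitness than the large one. A plain mean of the local values would let that single point count as much as the other three together, while the size-weighted average matches the fitness computed over all points at once. Only the per-partition scalars need to leave each partition, which is the communication saving the abstract describes.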
Funder
National Natural Science Foundation of China
Natural Science Foundation of Hunan Province
Research Foundation of Education Bureau of Hunan Province
Hengyang Science and Technology Major Project
Subject
General Physics and Astronomy
Cited by
4 articles.