The Parallel Fuzzy C-Median Clustering Algorithm Using the Spark for the Big Data

Author:

Mallik Moksud Alam (1)

Affiliation:

1. International Islamic University Malaysia, Kuala Lumpur, Malaysia and Lords Institute of Engineering & Technology, Hyderabad, India

Abstract

Big data for sustainable development is a global issue because of the explosive growth of data. According to forecasts by the International Data Corporation (IDC), the amount of data in the world will double every 18 months, and the Global DataSphere is expected to more than double in size from 2022 to 2026. Analyzing, processing, and storing big data is a challenging research problem because of data imperfection, massive data size, computational difficulty, and lengthy evaluation time. Clustering is a fundamental technique in data analysis and data mining, and it becomes particularly challenging with big data because of the volume, velocity, and variety of the data. When the dataset is very large, clustering runs into a scalability problem: it uses more memory and takes longer to process the data. Big data frameworks such as Hadoop MapReduce and Spark are powerful tools for analyzing huge datasets on a Hadoop cluster. However, Hadoop MapReduce reads and writes data from the Hadoop Distributed File System (HDFS) in each iteration, which consumes considerable time. Apache Spark is one of the most widely used large-scale data processing engines because of its speed, low-latency in-memory computing, and powerful analytics. We therefore develop a parallel fuzzy c-median clustering algorithm on Spark for big data that can handle large datasets while maintaining high accuracy and scalability. The algorithm uses a distance-based clustering approach to determine the similarity between data points and group them, in combination with sampling and partitioning techniques. In the sampling phase, a representative subset of the dataset is selected; in the partitioning phase, the data is split into smaller subsets that can be clustered in parallel across multiple nodes.
The proposed method, implemented on the Databricks cloud platform, provides high clustering accuracy as measured by clustering evaluation metrics such as the silhouette coefficient, cost function, partition index, and clustering entropy. The experimental results show that c = 5 is the optimal number of clusters for this dataset, a choice on which the cost function agrees with the ideal silhouette coefficient of 1. To validate the proposed algorithm, a comparative study is carried out by implementing other contemporary algorithms on the same dataset. The comparison shows that the proposed approach outperforms the others, especially in computational time. The developed approach is benchmarked against existing methods such as MiniBatchKMeans, AffinityPropagation, SpectralClustering, Ward, OPTICS, and BIRCH in terms of silhouette index and cost function.
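
As a rough illustration of the fuzzy c-median idea described in the abstract, the following single-machine Python sketch alternates between a membership update based on L1 (Manhattan) distances and a per-coordinate weighted-median update of the cluster centers. It is not the author's Spark implementation: the function names (`l1_distance`, `weighted_median`, `memberships`, `fuzzy_c_medians`), the fuzzifier `m = 2`, the `init_centers` parameter, and the toy data are all assumptions made for illustration, and the sampling and partitioning phases are omitted. In a Spark setting, the per-point membership step is the part that could plausibly run independently on each partition.

```python
import random


def l1_distance(a, b):
    """Manhattan (L1) distance between two equal-length points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    return pairs[-1][0]


def memberships(point, centers, m=2.0, eps=1e-12):
    """Fuzzy memberships of one point; they are positive and sum to 1."""
    # eps avoids division by zero when a point coincides with a center.
    dists = [l1_distance(point, c) + eps for c in centers]
    power = 1.0 / (m - 1.0)  # exponent for an unsquared-distance objective
    inverse = [(1.0 / d) ** power for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]


def fuzzy_c_medians(points, c, m=2.0, iters=50, init_centers=None, seed=0):
    """Illustrative fuzzy c-medians; returns (centers, membership rows)."""
    if init_centers is None:
        # Assumption: random initial centers drawn from the data itself.
        init_centers = random.Random(seed).sample(list(points), c)
    centers = [list(center) for center in init_centers]
    dims = len(points[0])
    for _ in range(iters):
        u = [memberships(p, centers, m) for p in points]
        new_centers = []
        for j in range(c):
            weights = [row[j] ** m for row in u]
            new_centers.append(
                [weighted_median([p[d] for p in points], weights)
                 for d in range(dims)]
            )
        if new_centers == centers:  # centers stopped moving
            break
        centers = new_centers
    u = [memberships(p, centers, m) for p in points]
    return centers, u


# Toy usage: two well-separated groups of 2-D points (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centers, u = fuzzy_c_medians(points, c=2, init_centers=[(0, 0), (11, 11)])
print(centers)
print([round(row[0], 3) for row in u])
```

Because medians are robust to distant points, the result of this sketch depends strongly on the initial centers. Starting both centers inside the same group can leave the iteration stuck in a poor local minimum, which is why the usage example passes one initial center from each group.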

Publisher

Research Square Platform LLC

