Affiliation:
1. Engineering, System and Applications Laboratory, Ecole Nationale des Sciences Appliquées, Sidi Mohamed Ben Abdellah University
2. Euromed Center of Research, Euromed Polytechnic School, Euromed University of Fes
Abstract
K-means is one of the most widely used algorithms in data clustering. A number of metrics are coupled with k-means to cluster data while improving both local cluster compactness and global cluster separation. Before data are finally assigned to their clusters, selecting the optimal number of clusters is a crucial step in the clustering process. The present work aims to build a new clustering metric/heuristic that takes into account both the spatial dispersion and the inferential characteristics of the data to be clustered. Hence, in this paper, a Geometry-Inference based Clustering (GIC) heuristic is proposed for selecting the optimal number of clusters. The conceptual approach proposes the “Initial speed rate” as the main geometric parameter to be studied inferentially. The corresponding histograms are then fitted with classical distributions. A clear linear behaviour of the distributions’ parameters was detected in relation to the optimal number of clusters k* for each of the 14 datasets adopted in this work. For each dataset, the optimal k* is observed to match the change-point defined as the intersection of two clearly salient lines. All distribution fittings are tested using chi-squared tests, which show excellent fits in terms of p-values; the linear fittings are additionally assessed with R². A change-point algorithm is then run to select k*. To sum up, the GIC heuristic is fully quantitative and fully automated; no qualitative index or graphical technique is used.
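The idea of locating a change-point as the intersection of two fitted lines can be illustrated with a minimal sketch. This is not the GIC heuristic itself: the function name `two_line_change_point`, the `min_points` parameter, the exhaustive search over split positions, and the synthetic data are all assumptions made for illustration. The sketch fits one least-squares line to each side of every candidate split, keeps the split with the smallest total squared error, and returns the abscissa where the two lines cross.

```python
import numpy as np


def two_line_change_point(k_values, y_values, min_points=2):
    """Return (intersection_x, total_sse) of the best two-line fit.

    Hypothetical helper: tries every split of the points into a left and
    a right segment, fits a straight line to each, and keeps the split
    with the lowest total sum of squared errors.
    """
    k = np.asarray(k_values, dtype=float)
    y = np.asarray(y_values, dtype=float)
    best_x, best_sse = None, np.inf

    for i in range(min_points, len(k) - min_points + 1):
        # Least-squares line on each side of the candidate split.
        m1, b1 = np.polyfit(k[:i], y[:i], 1)
        m2, b2 = np.polyfit(k[i:], y[i:], 1)
        if np.isclose(m1, m2):
            continue  # parallel lines have no usable intersection

        sse = float(np.sum((y[:i] - (m1 * k[:i] + b1)) ** 2)
                    + np.sum((y[i:] - (m2 * k[i:] + b2)) ** 2))
        if sse < best_sse:
            # Abscissa where the two fitted lines cross.
            best_x, best_sse = (b2 - b1) / (m1 - m2), sse

    return best_x, best_sse


if __name__ == "__main__":
    # Synthetic, illustrative data only (not taken from the paper):
    # a steep line up to k = 4, then a nearly flat one.
    ks = list(range(1, 11))
    ys = [10 - 2 * v if v <= 4 else 2 - 0.1 * (v - 4) for v in ks]
    x_star, sse = two_line_change_point(ks, ys)
    print(f"estimated change-point: {x_star:.2f} (SSE = {sse:.3g})")
```

In practice the crossing abscissa would be rounded to an integer candidate for k*, and the quality of each linear segment could be checked with R² before accepting the result.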
Publisher
Research Square Platform LLC