Affiliation:
1. Voronezh State Technical University
Abstract
Clustering is one of the first standard steps in big data analysis and a prerequisite for subsequent classification and group forecasting. We study a viscous modification of the gravitational data clustering algorithm (VGSA), which develops an already proven approach. In VGSA, individual data records are treated as points in a multidimensional space that attract one another through a pairwise central force. The masses of the interacting points are assumed to be equal, which reflects the specifics of clustering, in contrast to the search for an optimum of an objective function, where particle masses increase as the particles approach the extremum. The choice of the pair interaction law depending on the presumed data structure is discussed. High viscosity lowers the order of the dynamic equations of motion by eliminating the acceleration term. The resulting reduced equations define a stable motion of the system, which guarantees that the results are reproduced when the algorithm is restarted. The stability of the system of equations is proved using a Lyapunov function that is an analogue of the physical potential energy. Switching off the interaction between particles at small distances provides an automatic mechanism for hierarchical clustering at different stages of the algorithm, ending with the formation of a single cluster. The relationship between VGSA and the operating principle of Kohonen's self-organizing maps, which corresponds to the gravitational redistribution of test particles, is traced. The performance of the algorithm was tested on a database in comparison with K-means clustering, Kohonen maps and the standard gravitational algorithm, and the speed and accuracy of clustering were evaluated.
We conclude that VGSA is advantageous for big data, given its automatic determination of the number of clusters, the possibility of correcting the result when records are updated, and its applicability to inaccurately specified data.
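The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The following choices are assumptions made here for illustration only: a logarithmic pair potential (attractive force of magnitude 1/r), an explicit Euler integrator, the parameter values, and all function names (`viscous_step`, `potential`, `label_clusters`). In the high-viscosity limit the acceleration term is dropped, so each point moves with velocity F_i / gamma. Pairs closer than `r_cut` stop interacting. With equal masses, the sum of pair potentials plays the role of a Lyapunov function for the continuous-time motion.

```python
import numpy as np

def pair_geometry(x):
    """Return pairwise displacement vectors and distances for points x of shape (n, d)."""
    diff = x[None, :, :] - x[:, None, :]          # diff[i, j] = x[j] - x[i]
    dist = np.linalg.norm(diff, axis=-1)
    return diff, dist

def viscous_step(x, dt, gamma=1.0, r_cut=0.05):
    """One explicit Euler step of the first-order (overdamped) equations gamma * dx/dt = F.

    Assumed force law: equal unit masses, attraction of magnitude 1/r,
    switched off for pairs closer than r_cut.
    """
    diff, dist = pair_geometry(x)
    active = dist > r_cut                         # the diagonal (dist == 0) is inactive too
    weight = np.zeros_like(dist)
    weight[active] = dist[active] ** -2.0         # (x_j - x_i) / r^2 has magnitude 1/r
    force = (weight[:, :, None] * diff).sum(axis=1)
    return x + (dt / gamma) * force

def potential(x, r_cut=0.05):
    """Sum of pair potentials ln(max(r, r_cut)); a Lyapunov candidate for the flow."""
    _, dist = pair_geometry(x)
    upper = np.triu_indices(len(x), k=1)
    return float(np.log(np.maximum(dist[upper], r_cut)).sum())

def label_clusters(x, r_link):
    """Label connected groups of points whose chained distances are at most r_link."""
    _, dist = pair_geometry(x)
    labels = -np.ones(len(x), dtype=int)
    current = 0
    for seed in range(len(x)):
        if labels[seed] >= 0:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero((dist[i] <= r_link) & (labels < 0)):
                labels[j] = current
                stack.append(j)
        current += 1
    return labels

# Synthetic data (an assumption): two well-separated groups of 20 points each.
rng = np.random.default_rng(0)
x0 = np.vstack([
    rng.normal(0.0, 0.2, size=(20, 2)),
    rng.normal(0.0, 0.2, size=(20, 2)) + np.array([10.0, 0.0]),
])

r_cut = 0.05
x = x0.copy()
for _ in range(2000):                             # integrate up to t = 0.1
    x = viscous_step(x, dt=5e-5, r_cut=r_cut)

labels = label_clusters(x, r_link=2 * r_cut)      # linking radius is an illustrative choice
print(len(set(labels.tolist())), "clusters at t = 0.1")
```

Stopping the integration early yields the fine level of the hierarchy, while integrating longer lets the long-range attraction merge the groups, in line with the final single cluster mentioned in the abstract. The step size is kept small enough that `dt * (n - 1) / r_cut**2 < 1`, so that every Euler update is a convex combination of the current positions and points cannot overshoot one another.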