Clustering Validation Inference


Figuera Pau¹, Cuzzocrea Alfredo², García Bringas Pablo¹


1. Faculty of Engineering, University of Deusto, 48007 Bilbao, Spain

2. iDEA Lab, University of Calabria, 87036 Rende, Italy


Clustering validation evaluates the quality of a classification, a crucial step in unsupervised machine learning. Many methods exist for this purpose; however, a common drawback is that they do not support statistical inference. In this study, we construct a density function for the number of clusters. To this end, we first apply smoothing techniques and then non-negative matrix factorization with the Kullback–Leibler divergence. Employing a hypothesis of unique, linearly independent, uncorrelated observational variables, we construct a sequence by varying the dimension of the span space of the factorization, using only analytical techniques. The expectation of the limit of this sequence follows a gamma probability density function. By identifying the dimension of the span space of the factorization with the number of clusters, we turn the estimation of a suitable factorization dimension into a probabilistic estimate of the number of clusters. The approach is an internal validation method, suitable for numerical and categorical multivariate data, and independent of the clustering technique. Our main achievement is a predictive clustering validation model with graphical capabilities. It reports results in terms of credibility, which makes it possible to compare them on a quantitative basis with other sources such as expert judgment.
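As a rough illustration of the factorization step only, the sketch below runs non-negative matrix factorization under the generalized Kullback–Leibler divergence for several factorization dimensions and records the resulting divergence for each. It is not the authors' analytical construction: it uses the standard multiplicative update rules as a stand-in, omits the smoothing step and the gamma density, and the names `kl_divergence`, `nmf_kl`, and `divergence_by_rank` are hypothetical.

```python
import numpy as np


def kl_divergence(X, WH, eps=1e-12):
    """Generalized Kullback-Leibler divergence D(X || WH) for non-negative arrays."""
    X = np.asarray(X, dtype=float)
    WH = np.asarray(WH, dtype=float)
    # eps guards against log(0) and division by zero.
    return float(np.sum(X * np.log((X + eps) / (WH + eps)) - X + WH))


def nmf_kl(X, k, n_iter=200, seed=0, eps=1e-12):
    """Factorize X (n x m, non-negative) as W @ H with inner dimension k.

    Assumption: standard multiplicative updates for the KL objective,
    started from a random non-negative initialization.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Update H: scale by W^T (X / WH), normalized by the column sums of W.
        WH = W @ H + eps
        H *= (W.T @ (X / WH)) / (W.sum(axis=0)[:, None] + eps)
        # Update W: scale by (X / WH) H^T, normalized by the row sums of H.
        WH = W @ H + eps
        W *= ((X / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H


def divergence_by_rank(X, ranks, **kwargs):
    """Map each factorization dimension k to the KL divergence reached at that k."""
    return {k: kl_divergence(X, np.matmul(*nmf_kl(X, k, **kwargs))) for k in ranks}
```

Calling `divergence_by_rank(X, range(1, 8))` on a non-negative data matrix `X` gives one divergence value per candidate dimension. How such a sequence is turned into a probability density for the number of clusters is the subject of the paper and is not reproduced here.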









