Affiliation:
1. CIFASIS, French Argentine International Center for Information and Systems Sciences, CONICET-UNR
Abstract
We present a new clustering validation technique named "Hypothesis Learning". The method builds on three concepts: 1) clustering cohesion, 2) clustering dispersion, and 3) quality of hypothesis. The first two notions concern the quality of individual clusters; we measure them with a classifier that estimates tightness and separation as likelihoods. The third notion evaluates how complex the clustering partition is to learn and, like cohesion and dispersion, yields a likelihood value. We then aggregate the three measures into a single index of clustering quality. The core of our work is the use of learning algorithms to estimate these three indexes. In our experiments, we tested "Hypothesis Learning" with a fast classifier, K-Nearest Neighbours (KNN); in the discussion of the method, we also explore other classifiers such as CART and Random Forest. Furthermore, our approach is novel relative to previous validation methods, mixing supervised and unsupervised algorithms with stability concepts. For instance, it uses cluster probabilities to calculate likelihoods. We also show how to regularize a classifier to handle overfitting, which makes the use of stability optional. Finally, we present experimental results comparing our approach with a similar method and with many other well-known clustering indexes.
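As a rough illustration of the general idea of scoring a partition by how well a classifier can learn it, the sketch below uses a leave-one-out KNN vote to estimate, for each point, the probability of its own cluster label. It is not the index defined in the paper: the function names (`knn_cluster_probabilities`, `learnability_scores`), the averaging step, the choice of `k`, and the toy data are all assumptions made for the example, and the per-cluster and overall averages only stand in loosely for the cohesion, dispersion, and hypothesis-quality measures described above.

```python
import math
from collections import Counter


def knn_cluster_probabilities(points, labels, k=3):
    """Leave-one-out KNN: for each point, the fraction of its k nearest
    neighbours (excluding the point itself) that fall in each cluster."""
    clusters = sorted(set(labels))
    probs = []
    for i, p in enumerate(points):
        # Distance to every other point, paired with that point's label.
        dists = sorted(
            (math.dist(p, q), labels[j])
            for j, q in enumerate(points)
            if j != i
        )
        votes = Counter(label for _, label in dists[:k])
        probs.append({c: votes[c] / k for c in clusters})
    return probs


def learnability_scores(points, labels, k=3):
    """Mean probability that KNN assigns each point to its own cluster,
    reported per cluster and over the whole partition (hypothetical score)."""
    probs = knn_cluster_probabilities(points, labels, k)
    own = [pr[lab] for pr, lab in zip(probs, labels)]
    per_cluster = {}
    for c in set(labels):
        vals = [o for o, lab in zip(own, labels) if lab == c]
        per_cluster[c] = sum(vals) / len(vals)
    overall = sum(own) / len(own)
    return per_cluster, overall


# Toy data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
good = [0, 0, 0, 0, 1, 1, 1, 1]  # partition matching the two groups
bad = [0, 1, 0, 1, 0, 1, 0, 1]   # partition that cuts across both groups

good_per_cluster, good_overall = learnability_scores(points, good)
bad_per_cluster, bad_overall = learnability_scores(points, bad)
print(good_overall, bad_overall)
```

On this toy data the partition that follows the two groups is easier for KNN to learn than the one that cuts across them, so it receives the higher score. A fuller treatment would replace the plain average with the likelihood-based aggregation the abstract refers to.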
Publisher
Research Square Platform LLC