Authors:
Kasa Siva Rajesh, Rajan Vaibhav
Abstract
Clustering is a fundamental tool for exploratory data analysis and is ubiquitous across scientific disciplines. The Gaussian Mixture Model (GMM) is a popular probabilistic and interpretable model for clustering. In many practical settings, the true data distribution, which is unknown, may be non-Gaussian and may be contaminated by noise or outliers. In such cases, clustering may still be done with a misspecified GMM. However, this may lead to incorrect classification of the underlying subpopulations. In this paper, we identify and characterize the problem of inferior clustering solutions. Similar to well-known spurious solutions, these inferior solutions have high likelihood and poor cluster interpretation; however, they differ from spurious solutions in other characteristics, such as asymmetry in the fitted components. We theoretically analyze this asymmetry and its relation to misspecification. We propose a new penalty term that is designed to avoid both inferior and spurious solutions. Using this penalty term, we develop a new model selection criterion and a new GMM-based clustering algorithm, SIA. We empirically demonstrate that, in cases of misspecification, SIA avoids inferior solutions and outperforms previous GMM-based clustering methods.
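To make the setting concrete, the sketch below (not the paper's SIA algorithm; purely an illustrative assumption) fits a two-component GMM via EM to heavy-tailed Student-t data, i.e. a misspecified model, since the Gaussian assumption does not hold for the true distribution:

```python
# Minimal illustrative sketch: EM for a 1-D two-component GMM fit to
# heavy-tailed (Student-t) data, so the Gaussian assumption is misspecified.
# This is NOT the paper's SIA method; all names here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
# Two heavy-tailed subpopulations centered at -4 and +4.
data = np.concatenate([rng.standard_t(df=3, size=300) - 4,
                       rng.standard_t(df=3, size=300) + 4])

def normal_pdf(x, mu, var):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm(x, n_iter=100):
    """Plain EM for a 2-component univariate Gaussian mixture."""
    mu = np.array([x.min(), x.max()])        # spread-out initialization
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior responsibilities of each component per point.
        dens = pi * np.stack([normal_pdf(x, m, v)
                              for m, v in zip(mu, var)], axis=1)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted parameter updates.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, var, pi

mu, var, pi = em_gmm(data)
```

Even though each component is heavy-tailed rather than Gaussian, EM still returns a maximum-likelihood GMM fit; the paper's concern is that such misspecified fits can converge to inferior solutions with asymmetric components and poor cluster interpretation.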
Publisher
Springer Science and Business Media LLC