Abstract
We propose a non-parametric method for clustering mixed data that contain both continuous and discrete random variables. The product space of the continuous and discrete sample spaces is transformed into a new product space by adaptive quantization of the continuous part. Cluster patterns on the product space are detected locally using a weighted modified chi-squared test. The algorithm requires no user input, since the number of clusters is determined automatically from the data. Simulation studies and real data analyses show that the proposed method outperforms the benchmark method, AutoClass, in various settings.
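To make the two ingredients named in the abstract concrete, the sketch below quantizes a continuous variable and then tests the resulting cells against a discrete variable. It is an illustration under stated assumptions, not the authors' algorithm: equal-count (quantile) binning stands in for the adaptive quantization, and the plain Pearson chi-squared statistic stands in for the weighted modified chi-squared test, whose exact form the abstract does not give. The function names `quantile_edges`, `assign_bin`, and `pearson_chi2` are hypothetical.

```python
from collections import Counter


def quantile_edges(values, n_bins):
    """Interior cut points giving roughly equal-count bins (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def assign_bin(x, edges):
    """Bin index of x: the number of cut points that are <= x."""
    return sum(1 for e in edges if x >= e)


def pearson_chi2(cont, disc, n_bins=2):
    """Pearson chi-squared statistic of the (bin, category) contingency table.

    Assumption: this is the ordinary Pearson statistic, used only as a
    stand-in for the weighted modified test mentioned in the abstract.
    """
    edges = quantile_edges(cont, n_bins)
    bins = [assign_bin(x, edges) for x in cont]
    joint = Counter(zip(bins, disc))   # observed cell counts
    row = Counter(bins)                # marginal counts per quantized bin
    col = Counter(disc)                # marginal counts per discrete category
    n = len(cont)
    stat = 0.0
    for b in row:
        for d in col:
            expected = row[b] * col[d] / n
            stat += (joint[(b, d)] - expected) ** 2 / expected
    return stat


# Toy data: the continuous values form two groups.
cont = [0.1, 0.2, 0.3, 0.4, 5.1, 5.2, 5.3, 5.4]
aligned = ["a", "a", "a", "a", "b", "b", "b", "b"]      # category follows the group
unrelated = ["a", "b", "a", "b", "a", "b", "a", "b"]    # category ignores the group

print(pearson_chi2(cont, aligned))    # 8.0
print(pearson_chi2(cont, unrelated))  # 0.0
```

A large statistic flags cells of the quantized product space where the continuous and discrete parts are strongly associated, which is the kind of local structure a cluster detector could act on; a statistic near zero indicates no such pattern.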
Subject
General Physics and Astronomy