Abstract
Exploiting large databases requires costly resources in terms of both storage and processing time. To assess the data correctly, pre-processing steps must be taken before analysis. In particular, categorical data must be transformed by adequately encoding every instance of each categorical variable. The encoding must preserve the patterns actually present in the data while avoiding the introduction of non-existent ones. The authors discuss CESAMO, an algorithm that statistically identifies pattern-preserving codes. The resulting database is more economical, and the approach can accommodate mixed databases. Thus, they obtain an optimal transformed representation that is considerably more compact without impairing its informational content. To establish the equivalence of the original data set (FD) and the reduced data set (RD), they apply an algorithm (AA) that relies on multivariate regression. Through the combined application of CESAMO and AA, the equivalent behavior of FD and RD can be assured with a high degree of statistical certainty.