Unsupervised discretization by two-dimensional MDL-based histogram

Published:2023-02-16 Issue:7 Volume:112 Page:2397-2431
ISSN:0885-6125
Container-title:Machine Learning
language:en
Short-container-title:Mach Learn

Author:

Yang Lincen, Baratchi Mitra, van Leeuwen Matthijs

Abstract

Unsupervised discretization is a crucial step in many knowledge discovery tasks. The state-of-the-art method for one-dimensional data infers locally adaptive histograms using the minimum description length (MDL) principle, but the multi-dimensional case is far less studied: current methods consider the dimensions one at a time (if not independently), which results in discretizations based on rectangular cells of adaptive size. Unfortunately, this approach is unable to adequately characterize dependencies among dimensions and/or results in discretizations consisting of more cells (or bins) than is desirable. To address this problem, we propose an expressive model class that allows for far more flexible partitions of two-dimensional data. We extend the state of the art for the one-dimensional case to obtain a model selection problem based on the normalized maximum likelihood, a form of refined MDL. As the flexibility of our model class comes at the cost of a vast search space, we introduce a heuristic algorithm, named PALM, which partitions each dimension alternately and then merges neighboring regions, all using the MDL principle. Experiments on synthetic data show that PALM (1) accurately reveals ground truth partitions that are within the model class (i.e., the search space), given a large enough sample size; (2) approximates well a wide range of partitions outside the model class; (3) converges, in contrast to the state-of-the-art multivariate discretization method IPD. Finally, we apply our algorithm to three spatial datasets, and we demonstrate that, compared to kernel density estimation (KDE), our algorithm not only reveals more detailed density changes, but also fits unseen data better, as measured by the log-likelihood.
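The abstract evaluates fit on unseen data by log-likelihood. The sketch below illustrates that measure for a histogram density on a fixed rectangular grid. It is not PALM and does not use the normalized maximum likelihood: the grid edges, the function names, and the add-`alpha` smoothing that keeps empty cells from producing a log of zero are all assumptions made for illustration.

```python
import bisect
import math


def locate(value, edges):
    """Index of the bin containing value; out-of-range values are clamped (assumption)."""
    i = bisect.bisect_right(edges, value) - 1
    return min(max(i, 0), len(edges) - 2)


def fit_grid_histogram(points, x_edges, y_edges, alpha=1.0):
    """Per-cell density of a 2-D histogram with add-alpha smoothing (assumption)."""
    kx, ky = len(x_edges) - 1, len(y_edges) - 1
    counts = [[0] * ky for _ in range(kx)]
    for x, y in points:
        counts[locate(x, x_edges)][locate(y, y_edges)] += 1
    # Smoothed total mass, so that the densities integrate to 1 over the grid.
    total = len(points) + alpha * kx * ky
    density = [[0.0] * ky for _ in range(kx)]
    for i in range(kx):
        for j in range(ky):
            area = (x_edges[i + 1] - x_edges[i]) * (y_edges[j + 1] - y_edges[j])
            density[i][j] = (counts[i][j] + alpha) / (total * area)
    return density


def log_likelihood(points, density, x_edges, y_edges):
    """Sum of log densities of the points under the fitted histogram."""
    return sum(
        math.log(density[locate(x, x_edges)][locate(y, y_edges)])
        for x, y in points
    )
```

To compare two discretizations in this spirit, one would fit each on training points and call `log_likelihood` on the same held-out points; the higher value indicates the better fit to unseen data.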

Funder

Dutch Research Council (NWO).

Publisher

Springer Science and Business Media LLC

Subject

Artificial Intelligence, Software

Link

https://link.springer.com/content/pdf/10.1007/s10994-022-06294-6.pdf

References (44 articles)

1. Bay, S. D. (2001). Multivariate discretization for set mining. Knowledge and Information Systems, 3(4), 491–512.

2. Biba, M., Esposito, F., Ferilli, S., Di Mauro, N., & Basile, T. M. A. (2007). Unsupervised discretization using kernel density estimation. In Proceedings of the 20th international joint conference on artificial intelligence (pp. 696–701), Morgan Kaufmann Publishers Inc., San Francisco, IJCAI’07.

3. Boullé, M. (2004). Khiops: A statistical discretization method of continuous attributes. Machine Learning, 55(1), 53–69.

4. Boullé, M. (2006). MODL: A Bayes optimal discretization method for continuous attributes. Machine Learning, 65(1), 131–165.

5. Cao, F., Ge, Y., & Wang, J. (2014). Spatial data discretization methods for geocomputation. International Journal of Applied Earth Observation and Geoinformation, 26, 432–440.