Sparse Density Trees and Lists: An Interpretable Alternative to High-Dimensional Histograms

Published:2024-04 Issue:1 Volume:3 Page:28-48
ISSN:2694-4022
Container-title:INFORMS Journal on Data Science
language:en
Short-container-title:INFORMS Journal on Data Science

Author:

Goh Siong Thye¹, Semenova Lesia², Rudin Cynthia²

Affiliation:

1. Lee Kong Chian School of Business, Singapore Management University, Singapore 178899;

2. Department of Computer Science, Duke University, Durham, North Carolina 27708

Abstract

We present sparse tree-based and list-based density estimation methods for binary/categorical data. Our density estimation models are higher-dimensional analogies to variable bin-width histograms. In each leaf of the tree (or list), the density is constant, similar to the flat density within the bin of a histogram. Histograms, however, cannot easily be visualized in more than two dimensions, whereas our models can. The accuracy of histograms fades as dimensions increase, whereas our models have priors that help with generalization. Our models are sparse, unlike high-dimensional fixed-bin histograms. We present three generative modeling methods, where the first one allows the user to specify the preferred number of leaves in the tree within a Bayesian prior. The second method allows the user to specify the preferred number of branches within the prior. The third method returns density lists (rather than trees) and allows the user to specify the preferred number of rules and the length of rules within the prior. The new approaches often yield a better balance between sparsity and accuracy of density estimates than other methods for this task. We present an application to crime analysis, where we estimate how unusual each type of modus operandi is for a house break-in.

History: David Martens served as senior editor for this article.

Funding: The authors acknowledge support from NIDA [Grant R01 DA054994].

Data Ethics & Reproducibility Note: There are no ethical issues with this algorithm that we are aware of. Data sets for testing the algorithm are either simulated or publicly available through the UCI Machine Learning Repository (Markelle Kelly, Rachel Longjohn, Kolby Nottingham, The UCI Machine Learning Repository, https://archive.ics.uci.edu). The housebreak data were obtained through the Cambridge Police Department, Cambridge, MA. The code capsule is available on Code Ocean at https://doi.org/10.24433/CO.2985251.v1 and in the e-companion to this article (available at https://doi.org/10.1287/ijds.2021.0001).
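
The abstract's statement that the density is constant within each leaf, as in a histogram bin, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it uses no Bayesian prior and does not learn the tree; the leaves are specified by hand, and every name here (leaf_volume, fit_leaf_densities, density, the toy data) is hypothetical. It assumes the usual histogram estimate, where a leaf's density is the fraction of points falling in it divided by the number of category combinations it covers.

```python
from itertools import product
from math import prod


def leaf_volume(leaf, cardinalities):
    """Number of distinct category combinations covered by a leaf.

    `leaf` maps feature index -> required category value. Features not
    mentioned are unconstrained, so each contributes its full cardinality.
    """
    return prod(c for j, c in enumerate(cardinalities) if j not in leaf)


def in_leaf(x, leaf):
    """True if point x satisfies every feature constraint of the leaf."""
    return all(x[j] == v for j, v in leaf.items())


def fit_leaf_densities(data, leaves, cardinalities):
    """Constant density per leaf: (fraction of points in leaf) / (leaf volume)."""
    n = len(data)
    return [
        sum(in_leaf(x, leaf) for x in data) / n / leaf_volume(leaf, cardinalities)
        for leaf in leaves
    ]


def density(x, leaves, densities):
    """Look up the flat density of the leaf that contains x."""
    for leaf, d in zip(leaves, densities):
        if in_leaf(x, leaf):
            return d
    raise ValueError("x falls in no leaf; leaves must partition the space")


# Toy example (assumed data): three binary features, three hand-picked leaves.
cardinalities = [2, 2, 2]
leaves = [{0: 0}, {0: 1, 1: 0}, {0: 1, 1: 1}]
data = [(0, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 1),
        (1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 1, 0)]

leaf_densities = fit_leaf_densities(data, leaves, cardinalities)

# Summing the density over every cell of the space should give 1.
total = sum(density(x, leaves, leaf_densities)
            for x in product(range(2), repeat=3))
```

The first leaf constrains only one feature, so it spreads its probability mass evenly over four cells, which mirrors a wide bin in a variable bin-width histogram.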

Publisher

Institute for Operations Research and the Management Sciences (INFORMS)

Link

https://pubsonline.informs.org/doi/pdf/10.1287/ijds.2021.0001
