Sparse Density Trees and Lists: An Interpretable Alternative to High-Dimensional Histograms

Authors:

Goh Siong Thye (1), Semenova Lesia (2), Rudin Cynthia (2)

Affiliation:

1. Lee Kong Chian School of Business, Singapore Management University, Singapore 178899;

2. Department of Computer Science, Duke University, Durham, North Carolina 27708

Abstract

We present sparse tree-based and list-based density estimation methods for binary/categorical data. Our density estimation models are higher-dimensional analogues of variable bin-width histograms. In each leaf of the tree (or list), the density is constant, similar to the flat density within the bin of a histogram. Histograms, however, cannot easily be visualized in more than two dimensions, whereas our models can. The accuracy of histograms fades as dimensions increase, whereas our models have priors that help with generalization. Our models are sparse, unlike high-dimensional fixed-bin histograms. We present three generative modeling methods. The first allows the user to specify the preferred number of leaves in the tree within a Bayesian prior. The second allows the user to specify the preferred number of branches within the prior. The third returns density lists (rather than trees) and allows the user to specify the preferred number of rules and the length of rules within the prior. The new approaches often yield a better balance between sparsity and accuracy of density estimates than other methods for this task. We present an application to crime analysis, where we estimate how unusual each type of modus operandi is for a house break-in.

History: David Martens served as senior editor for this article.

Funding: The authors acknowledge support from NIDA [Grant R01 DA054994].

Data Ethics & Reproducibility Note: There are no ethical issues with this algorithm that we are aware of. Data sets for testing the algorithm are either simulated or publicly available through the UCI Machine Learning Repository (Markelle Kelly, Rachel Longjohn, Kolby Nottingham, The UCI Machine Learning Repository, https://archive.ics.uci.edu). The housebreak data were obtained through the Cambridge Police Department, Cambridge, MA. The code capsule is available on Code Ocean at https://doi.org/10.24433/CO.2985251.v1 and in the e-companion to this article (available at https://doi.org/10.1287/ijds.2021.0001).
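
To make the histogram analogy in the abstract concrete, the following is a minimal sketch of a piecewise-constant density over binary data. It is not the authors' method: it omits the Bayesian priors and the search over tree or list structures, and it takes the leaves as given. The function names (`leaf_density`, `density_at`), the toy data, and the leaf encoding as a dict of fixed features are all assumptions made for illustration. Each leaf's density is its share of the data divided by the number of binary configurations the leaf contains, mirroring how a histogram divides bin mass by bin width.

```python
from itertools import product


def leaf_density(data, leaves, n_features):
    """Return one constant density per leaf for binary data.

    Each leaf is a dict {feature_index: required_value}. The leaves are
    assumed (hypothetically) to be disjoint and to cover {0, 1}^n_features.
    """
    n = len(data)
    densities = []
    for leaf in leaves:
        # Number of observations that satisfy every condition of the leaf.
        count = sum(all(x[j] == v for j, v in leaf.items()) for x in data)
        # "Volume" of the leaf: configurations of the unconstrained features.
        volume = 2 ** (n_features - len(leaf))
        densities.append(count / n / volume)
    return densities


def density_at(x, leaves, densities):
    """Look up the constant density of the leaf that contains point x."""
    for leaf, d in zip(leaves, densities):
        if all(x[j] == v for j, v in leaf.items()):
            return d
    raise ValueError("point not covered by any leaf")


# Toy data set: 8 observations of 3 binary features (illustrative only).
data = [
    (1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 0, 1),
    (0, 1, 0), (0, 1, 1), (0, 0, 0), (1, 0, 0),
]

# A three-leaf tree: split on feature 0, then on feature 1 when feature 0 is 0.
leaves = [{0: 1}, {0: 0, 1: 1}, {0: 0, 1: 0}]

densities = leaf_density(data, leaves, n_features=3)

# Summing the density over all 2^3 configurations should give total mass 1.
total = sum(density_at(x, leaves, densities) for x in product((0, 1), repeat=3))
```

Here three leaves summarize a space of eight configurations, which is the kind of sparsity the abstract contrasts with fixed-bin histograms; a low value from `density_at` would flag a configuration as unusual under this toy model.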

Publisher

Institute for Operations Research and the Management Sciences (INFORMS)
