Affiliation:
1. University of Maryland, College Park, MD
2. University of Pittsburgh, S. Bouquet Street, Pittsburgh, PA
3. University of Minnesota, Minneapolis, MN
4. University of Iowa, Iowa City, IA
Abstract
Cluster detection is widely used in applications such as public health, public safety, and transportation. Given a collection of data points, we aim to detect density-connected spatial clusters with varying geometric shapes and densities, under the constraint that the clusters are statistically significant. The problem is challenging because many societal applications and domain science studies have low tolerance for spurious results, and clusters may have arbitrary shapes and varying densities. Clustering is a classical topic in data mining and learning, and a myriad of techniques have been developed to detect clusters with both varying shapes and densities (e.g., density-based, hierarchical, spectral, or deep clustering methods). However, the vast majority of these techniques do not consider statistical rigor and are susceptible to detecting spurious clusters formed as a result of natural randomness. On the other hand, scan statistic approaches explicitly control the rate of spurious results, but they typically assume a single “hotspot” of over-density, and many rely on further assumptions such as a tessellated input space. To unite the strengths of both lines of work, we propose a statistically robust formulation of a multi-scale DBSCAN, namely Significant DBSCAN+, to identify significant clusters that are density connected. As we will show, incorporating statistical rigor is a powerful mechanism that allows Significant DBSCAN+ to outperform state-of-the-art clustering techniques in various scenarios. We also propose computational enhancements to speed up the proposed approach. Experimental results show that Significant DBSCAN+ can simultaneously improve the success rate of true cluster detection (e.g., 10–20% increases in absolute F1 scores) and substantially reduce the rate of spurious results (e.g., from hundreds or thousands of spurious detections to none or just a few across 100 datasets), and that the acceleration methods improve efficiency for both clustered and non-clustered data.
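The general idea of attaching a significance test to density-based clusters can be illustrated with a minimal sketch. This is not the Significant DBSCAN+ algorithm from the paper: it runs a plain single-scale DBSCAN, uses cluster size as the test statistic, and assumes a uniform null model over the data's bounding box, all of which are simplifying assumptions made here for illustration. The function names (`dbscan`, `significant_clusters`, `make_demo_points`) and all parameter values are hypothetical.

```python
import random
from collections import deque


def dbscan(points, eps, min_pts):
    """Plain DBSCAN on 2-D points; returns clusters as lists of point indices."""
    n = len(points)
    eps2 = eps * eps
    # Brute-force neighbor lists (each point counts itself); O(n^2), demo only.
    neighbors = [
        [j for j in range(n)
         if (points[i][0] - points[j][0]) ** 2
         + (points[i][1] - points[j][1]) ** 2 <= eps2]
        for i in range(n)
    ]
    labels = [None] * n  # None = unvisited, -1 = noise, k >= 0 = cluster id
    clusters = []
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cid = len(clusters)
        labels[i] = cid
        members = [i]
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cid  # noise reached from a core point: border point
                members.append(j)
            elif labels[j] is None:
                labels[j] = cid
                members.append(j)
                if len(neighbors[j]) >= min_pts:  # core point: keep expanding
                    queue.extend(neighbors[j])
        clusters.append(members)
    return clusters


def significant_clusters(points, eps, min_pts, trials=99, alpha=0.05, seed=0):
    """Keep clusters whose size is rarely matched by the largest cluster
    found in uniformly random data (assumed null model, assumed statistic)."""
    rng = random.Random(seed)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    null_max = []
    for _ in range(trials):
        sim = [(rng.uniform(x0, x1), rng.uniform(y0, y1)) for _ in points]
        sizes = [len(c) for c in dbscan(sim, eps, min_pts)]
        null_max.append(max(sizes, default=0))
    kept = []
    for cluster in dbscan(points, eps, min_pts):
        exceed = sum(1 for m in null_max if m >= len(cluster))
        p_value = (exceed + 1) / (trials + 1)  # Monte Carlo p-value
        if p_value <= alpha:
            kept.append((cluster, p_value))
    return kept


def make_demo_points(seed=1, n_background=150, n_blob=40):
    """Uniform background in the unit square plus one tight Gaussian blob."""
    rng = random.Random(seed)
    background = [(rng.random(), rng.random()) for _ in range(n_background)]
    blob = [(rng.gauss(0.5, 0.01), rng.gauss(0.5, 0.01)) for _ in range(n_blob)]
    return background + blob
```

Calling `significant_clusters(make_demo_points(), eps=0.03, min_pts=5)` returns `(member_indices, p_value)` pairs for the clusters that pass the test. Comparing each cluster against the largest cluster in each simulated dataset is one simple way to keep small clusters that arise by chance from being reported; the paper's own formulation may differ.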
Funder
NSF
Google’s AI for Social Good Impact Scholars program
Dean’s Research Initiative Award at the University of Maryland
USGS
Pitt Momentum Fund Award
USDOD
USDOE
NIH
USDA
Minnesota Supercomputing Institute
Safety Research using Simulation University Transportation Center
US-DOT’s University Transportation Centers Program
Publisher
Association for Computing Machinery (ACM)
Subject
Artificial Intelligence, Theoretical Computer Science
Cited by
7 articles.