Significant DBSCAN+: Statistically Robust Density-based Clustering

Author:

Xie Yiqun (1), Jia Xiaowei (2), Shekhar Shashi (3), Bao Han (4), Zhou Xun (4)

Affiliation:

1. University of Maryland, College Park, MD

2. University of Pittsburgh, S. Bouquet Street, Pittsburgh, PA

3. University of Minnesota, Minneapolis, MN

4. University of Iowa, Iowa City, IA

Abstract

Cluster detection is important and widely used in a variety of applications, including public health, public safety, and transportation. Given a collection of data points, we aim to detect density-connected spatial clusters with varying geometric shapes and densities, under the constraint that the clusters are statistically significant. The problem is challenging because many societal applications and domain science studies have low tolerance for spurious results, and clusters may have arbitrary shapes and varying densities. Clustering is a classical topic in data mining and learning, and a myriad of techniques have been developed to detect clusters with both varying shapes and densities (e.g., density-based, hierarchical, spectral, or deep clustering methods). However, the vast majority of these techniques do not consider statistical rigor and are susceptible to detecting spurious clusters formed as a result of natural randomness. On the other hand, scan statistic approaches explicitly control the rate of spurious results, but they typically assume a single "hotspot" of over-density, and many rely on further assumptions such as a tessellated input space. To unite the strengths of both lines of work, we propose a statistically robust formulation of a multi-scale DBSCAN, namely Significant DBSCAN+, to identify significant clusters that are density connected. As we will show, incorporation of statistical rigor is a powerful mechanism that allows the new Significant DBSCAN+ to outperform state-of-the-art clustering techniques in various scenarios. We also propose computational enhancements to speed up the proposed approach. Experimental results show that Significant DBSCAN+ can simultaneously improve the success rate of true cluster detection (e.g., 10–20% increases in absolute F1 scores) and substantially reduce the rate of spurious results (e.g., from hundreds or thousands of spurious detections to none or just a few across 100 datasets), and that the acceleration methods can improve efficiency for both clustered and non-clustered data.
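To make the idea of a statistically significant density-based cluster concrete, the sketch below pairs a plain single-scale DBSCAN with a generic Monte Carlo test against complete spatial randomness. It is not the Significant DBSCAN+ algorithm from the paper: the test statistic (size of the largest cluster), the uniform null model on the unit square, and the helper names `dbscan`, `largest_cluster_size`, and `monte_carlo_p_value` are all assumptions made for illustration.

```python
import math
import random


def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN; returns one label per point (-1 = noise)."""
    n = len(points)
    # Neighborhoods include the point itself, as in the usual DBSCAN definition.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue  # already assigned, or not a core point
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            if len(neighbors[p]) < min_pts:
                continue  # border point: joins the cluster but does not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    frontier.append(q)
        cluster += 1
    return labels


def largest_cluster_size(points, eps, min_pts):
    """Hypothetical test statistic: number of points in the largest cluster."""
    sizes = {}
    for label in dbscan(points, eps, min_pts):
        if label != -1:
            sizes[label] = sizes.get(label, 0) + 1
    return max(sizes.values(), default=0)


def monte_carlo_p_value(points, eps, min_pts, trials=99, seed=0):
    """Estimate how often uniform random data yields a cluster at least as large.

    Assumes the study area is the unit square (an illustrative assumption).
    """
    rng = random.Random(seed)
    observed = largest_cluster_size(points, eps, min_pts)
    at_least = 0
    for _ in range(trials):
        null_points = [(rng.random(), rng.random()) for _ in range(len(points))]
        if largest_cluster_size(null_points, eps, min_pts) >= observed:
            at_least += 1
    # The observed data counts as one of the trials, so p is never exactly 0.
    return (at_least + 1) / (trials + 1)


# Synthetic example: uniform background noise, with and without a dense blob.
rng = random.Random(42)
background = [(rng.random(), rng.random()) for _ in range(60)]
blob = [(0.5 + rng.gauss(0, 0.02), 0.5 + rng.gauss(0, 0.02)) for _ in range(40)]

p_clustered = monte_carlo_p_value(background + blob, eps=0.05, min_pts=5)
p_random = monte_carlo_p_value(background, eps=0.05, min_pts=5)
print("p-value with blob:", p_clustered)
print("p-value noise only:", p_random)
```

A cluster would be reported only when its p-value falls below a chosen significance level such as 0.05. A method of the kind the abstract describes would additionally need to handle multiple scales and multiple clusters of different densities, which this single-statistic sketch does not attempt.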

Funder

NSF

Google’s AI for Social Good Impact Scholars program

Dean’s Research Initiative Award at the University of Maryland

USGS

Pitt Momentum Fund Award

USDOD

USDOE

NIH

USDA

Minnesota Supercomputing Institute

Safety Research using Simulation University Transportation Center

US-DOT’s University Transportation Centers Program

Publisher

Association for Computing Machinery (ACM)

Subject

Artificial Intelligence, Theoretical Computer Science


Cited by 7 articles.

1. Enhanced scan statistic with tightened window for detecting irregularly shaped hotspots;International Journal of Geographical Information Science;2024-09-05

2. A new method for predicting precipitation δ¹⁸O distribution based on deep learning and spatio-temporal clustering;Hydrological Sciences Journal;2024-08

3. EVALUATION OF PROVINCES IN TÜRKİYE WITH HEALTH INDICATORS BY DENSITY-BASED SPATIAL CLUSTERING ANALYSIS;Anadolu Üniversitesi İktisadi ve İdari Bilimler Fakültesi Dergisi;2024-06-30

4. Isomorphic Graph Embedding for Progressive Maximal Frequent Subgraph Mining;ACM Transactions on Intelligent Systems and Technology;2023-12-19

5. Spatial hotspot detection in the presence of global spatial autocorrelation;International Journal of Geographical Information Science;2023-06-01
