FINEX: A Fast Index for Exact & Flexible Density-Based Clustering

Author:

Thiel Konstantin Emil (1), Kocher Daniel (1), Augsten Nikolaus (1), Hütter Thomas (1), Mann Willi (2), Schmitt Daniel (1)

Affiliation:

1. University of Salzburg, Salzburg, Austria

2. Celonis SE, Munich, Austria

Abstract

Density-based clustering aims to find groups of similar objects (i.e., clusters) in a given dataset. Applications include process mining and anomaly detection. The clustering result is determined by two user parameters (ε, MinPts), which are typically unknown in advance. Users therefore need to test various settings interactively until they find satisfactory clusterings. However, existing solutions suffer from the following limitations: (a) ineffective pruning of expensive neighborhood computations; (b) approximate clustering, where objects are falsely labeled as noise; (c) restricted parameter tuning that is limited to ε while MinPts stays constant, which reduces the set of explorable clusterings; (d) inflexibility in terms of applicable data types and distance functions. We propose FINEX, a linear-space index that overcomes these limitations. Our index provides exact clusterings and can be queried with either of the two parameters. FINEX avoids neighborhood computations where possible and reduces the complexity of the remaining computations by leveraging fundamental properties of density-based clusters. Hence, our solution is efficient and flexible regarding data types and distance functions. Moreover, FINEX respects the original and straightforward notion of density-based clustering. In our experiments on 12 large real-world datasets from various domains, FINEX frequently outperforms state-of-the-art techniques for exact clustering by orders of magnitude.
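
To make the role of the two parameters concrete, the following is a minimal sketch of a naive DBSCAN-style clustering, not the FINEX index itself. It assumes Euclidean distance over tuples of numbers and counts an object as part of its own ε-neighborhood; the function names `region_query` and `dbscan` are hypothetical and chosen for illustration only. Every neighborhood is computed by a full scan, which is exactly the expensive step that an index would try to avoid.

```python
from math import dist


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise.

    Illustrative assumption: a point is a core object if its eps-neighborhood
    (itself included) holds at least min_pts points.
    """
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border object
            continue
        # points[i] is a core object: start a new cluster and expand it.
        labels[i] = cluster
        frontier = list(neighbors)
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border object reached from a core object
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                frontier.extend(j_neighbors)  # j is core: keep expanding
        cluster += 1
    return labels


# Toy data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
print(dbscan(points, eps=1.5, min_pts=4))
```

Changing only MinPts from 3 to 4 on the same toy data turns every object into noise, which illustrates why tuning both parameters, rather than ε alone, widens the set of clusterings a user can explore.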

Funder

Austrian Science Fund

Publisher

Association for Computing Machinery (ACM)

