S-RASTER: contraction clustering for evolving data streams

Published: 2020-08-13 Issue: 1 Volume: 7 Page:
ISSN: 2196-1115
Container-title: Journal of Big Data
Language: en
Short-container-title: J Big Data

Author:

Ulm Gregor, Smith Simon, Nilsson Adrian, Gustavsson Emil, Jirstrand Mats

Abstract

Contraction Clustering (RASTER) is a single-pass algorithm for density-based clustering of 2D data. It can process arbitrary amounts of data in linear time and in constant memory, quickly identifying approximate clusters. It also exhibits good scalability in the presence of multiple CPU cores. RASTER exhibits very competitive performance compared to standard clustering algorithms, but at the cost of decreased precision. Yet, RASTER is limited to batch processing and unable to identify clusters that only exist temporarily. In contrast, S-RASTER is an adaptation of RASTER to the stream processing paradigm that is able to identify clusters in evolving data streams. This algorithm retains the main benefits of its parent algorithm, i.e. single-pass linear time cost and constant memory requirements for each discrete time step within a sliding window. The sliding window is efficiently pruned, and clustering is still performed in linear time. Like RASTER, S-RASTER trades off an often negligible amount of precision for speed. Our evaluation shows that competing algorithms are at least 50% slower. Furthermore, S-RASTER shows good qualitative results, based on standard metrics. It is very well suited to real-world scenarios where clustering does not happen continually but only periodically.
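
The abstract does not spell out the algorithm's steps, so the following is only a rough sketch of what a single-pass, density-based clustering of 2D points can look like. It assumes a grid-based approach: points are projected onto tiles by truncating coordinates, tiles with enough points are kept, and adjacent dense tiles are joined. The function name `contraction_cluster` and the parameters `precision`, `threshold`, and `min_size` are hypothetical and not taken from the paper. The sketch covers batch processing only, not the sliding-window behaviour of S-RASTER.

```python
import math
from collections import Counter


def contraction_cluster(points, precision=1, threshold=3, min_size=2):
    """Illustrative grid-based clustering sketch (assumed design, not the paper's code).

    points:    iterable of (x, y) floats
    precision: number of decimal places kept when projecting onto tiles (assumption)
    threshold: minimum number of points for a tile to count as dense (assumption)
    min_size:  minimum number of dense tiles per reported cluster (assumption)
    """
    scale = 10 ** precision

    # Single pass over the data: project each point onto a tile and count.
    counts = Counter(
        (math.floor(x * scale), math.floor(y * scale)) for x, y in points
    )

    # Keep only tiles that hold at least `threshold` points.
    dense = {tile for tile, n in counts.items() if n >= threshold}

    # Join dense tiles that touch each other (8-neighbourhood) into clusters.
    clusters = []
    while dense:
        stack = [dense.pop()]
        cluster = set(stack)
        while stack:
            tx, ty = stack.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbour = (tx + dx, ty + dy)
                    if neighbour in dense:
                        dense.remove(neighbour)
                        cluster.add(neighbour)
                        stack.append(neighbour)
        if len(cluster) >= min_size:
            clusters.append(cluster)
    return clusters
```

In this sketch the work per point is a constant-time dictionary update, and the memory used depends on the number of distinct tiles rather than the number of points, which is one way the linear-time and bounded-memory properties described in the abstract could arise. The returned clusters are sets of tile coordinates, not the original points, which reflects the precision trade-off the abstract mentions.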

Funder

VINNOVA

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management,Computer Networks and Communications,Hardware and Architecture,Information Systems

Link

https://link.springer.com/content/pdf/10.1186/s40537-020-00336-3.pdf

Reference: 39 articles.

1. Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the 1998 ACM SIGMOD international conference on management of data. SIGMOD ’98. New York: ACM; 1998. p. 94–105.

2. Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data. Data Min Knowl Discov. 2005;11(1):5–33.

3. Bação F, Lobo V, Painho M. Self-organizing maps as substitutes for k-means clustering. In: International conference on computational science. Berlin: Springer; 2005. p. 476–83.

4. Bär A, Finamore A, Casas P, Golab L, Mellia M. Large-scale network traffic monitoring with DBStream, a system for rolling big data analysis. In: 2014 IEEE international conference on big data (Big Data). New York: IEEE; 2014. p. 165–70.

5. Bifet A, Holmes G, Kirkby R, Pfahringer B. MOA: massive online analysis. J Mach Learn Res. 2010;11(May):1601–4.