Data Stream Clustering: An In-depth Empirical Study

Published:2023-06-13 Issue:2 Volume:1 Page:1-26
ISSN:2836-6573
Container-title:Proceedings of the ACM on Management of Data
language:en
Short-container-title:Proc. ACM Manag. Data

Author:

Wang Xin¹, Wang Zhengru², Wu Zhenyu³, Zhang Shuhao⁴, Shi Xuanhua⁵, Lu Li⁶

Affiliation:

1. Ohio State University, Columbus, OH, USA

2. Nvidia, Shanghai, China

3. University of Manchester, Manchester, United Kingdom

4. Singapore University of Technology and Design, Singapore, Singapore

5. Huazhong University of Science and Technology, Wuhan, China

6. Sichuan University, Chengdu, China

Abstract

Data Stream Clustering (DSC) plays an important role in mining continuous and unlabeled data streams in real-world applications. Over the last decades, numerous DSC algorithms have been proposed with promising clustering accuracy and efficiency. Despite the significant differences among existing DSC algorithms, they are commonly built around four key design aspects: summarizing data structure, window model, outlier detection mechanism, and offline refinement strategy. However, there is a lack of empirical studies on these key design aspects in the same codebase using real-world workloads with distinct characteristics. As a result, it is difficult for researchers to improve upon the state-of-the-art. In this paper, we conduct such a study of DSC on its four key design aspects. We implemented state-of-the-art variants of all of these design choices in an open-sourced platform from scratch and evaluated them using both real-world and synthetic workloads. Our analysis identifies the fundamental issues and trade-offs of each design choice in terms of both accuracy and efficiency. We even find that combining flexible design choices led to the development of a new algorithm called Benne, which can be tuned to achieve either better accuracy or better efficiency compared to the state-of-the-art.

Funder

Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 2

National Research Foundation, Singapore and Infocomm Media Development Authority under its Future Communications Research & Development Programme

Key R&D Program of Hubei

National Key R&D Program of China

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3589307

Reference 40 articles.

1. [n.d.]. Covertype. http://archive.ics.uci.edu/ml/datasets/Covertype.

2. [n.d.]. Sensor. https://www.cse.fau.edu/xqzhu/stream.html.

3. [n.d.]. Ticat. https://github.com/innerr/ticat.

4. Marcel R. Ackermann et al. 2012. StreamKM: A Clustering Algorithm for Data Streams. ACM J. Exp. Algorithmics 17 (May 2012), 30.

5. Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. 2003. A Framework for Clustering Evolving Data Streams. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29 (Berlin, Germany) (VLDB '03). VLDB Endowment, 81-92.

Cited by 2 articles.

1. Ocean: Online Clustering and Evolution Analysis for Dynamic Streaming Data; 2024 IEEE 40th International Conference on Data Engineering (ICDE); 2024-05-13

2. An Efficient Fuzzy Stream Clustering Method Based on Granular-Ball Structure; 2024 IEEE 40th International Conference on Data Engineering (ICDE); 2024-05-13