Temporal silhouette: validation of stream clustering robust to concept drift
-
Published:2023-11-10
-
ISSN:0885-6125
-
Container-title:Machine Learning
-
language:en
-
Short-container-title:Mach Learn
Author:
Iglesias Vázquez Félix, Zseby Tanja
Abstract
Stream clustering is required in applications where data is generated continuously or periodically and must be processed with its temporal nature in mind. In the absence of ground truth, internal validation is the only option for evaluating clustering quality. Traditional internal validation is also commonly used in stream clustering, even though it becomes inconsistent when the data evolves. Recent approaches opt for incremental validation, but these are closer to change detection than to validation, and they limit themselves by imposing online validation on online analysis. In this work we study the impact of concept drift on the validation of stream clustering and propose the Temporal Silhouette index, which makes internal validation conform to streaming data. We conduct tests with more than 200 datasets and compare the performance of four popular stream clustering algorithms using seven validation methods (three static internal, three incremental internal, one external) and the proposed index. Results show the suitability of the Temporal Silhouette index for stream clustering validation in the presence of concept drift and different types of outliers. The demand for reliable unsupervised learning in applications that process data in streams is ever-increasing, and such reliability inevitably requires validation. This highlights the significance of the approach proposed in this work.
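The inconsistency of static internal validation under concept drift can be illustrated with a small sketch. The code below is not the Temporal Silhouette index, whose definition is not given in this record; it only computes the classic (static) silhouette on a synthetic one-dimensional stream and compares it with a naive per-time-step average. The stream, the drift rate of 0.5 per step, the cluster gap of 3.0, and all function names are assumptions made for illustration.

```python
from statistics import mean


def silhouette(points, labels):
    """Classic (static) mean silhouette width for 1-D points."""
    clusters = {}
    for x, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(x)
    scores = []
    for x, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) < 2 or len(clusters) < 2:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(abs(x - y) for y in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(abs(x - y) for y in other)
            for key, other in clusters.items()
            if key != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Hypothetical stream: two clusters that stay 3.0 apart at every time step
# while both drift by 0.5 per step (assumed values, for illustration only).
stream = []  # entries are (time step, position, cluster label)
for t in range(20):
    centre = 0.5 * t
    for offset in (-0.1, 0.1):
        stream.append((t, centre + offset, 0))
        stream.append((t, centre + 3.0 + offset, 1))

# Static validation: ignores time and scores the whole stream at once.
static_score = silhouette(
    [x for _, x, _ in stream], [lab for _, _, lab in stream]
)

# Naive time-aware alternative: score each time step separately, then average.
windowed_score = mean(
    silhouette(
        [x for s, x, _ in stream if s == t],
        [lab for s, _, lab in stream if s == t],
    )
    for t in range(20)
)

print(f"static silhouette:   {static_score:.3f}")
print(f"per-step silhouette: {windowed_score:.3f}")
```

In this sketch the two clusters are well separated at every instant, yet the static score is much lower than the per-step score because drift smears each cluster across the region the other one occupies at a different time.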
Publisher
Springer Science and Business Media LLC
Subject
Artificial Intelligence,Software