Abstract
How can we efficiently and scalably cluster high-dimensional data? The k-means algorithm clusters data by iteratively reducing intra-cluster Euclidean distances until convergence. While it finds applications from recommendation engines to image segmentation, its application to high-dimensional data is hindered by the need to repeatedly compute Euclidean distances among points and centroids. In this paper, we propose Marigold (k-means for high-dimensional data), a scalable algorithm for k-means clustering in high dimensions. Marigold prunes distance calculations by means of (i) a tight distance-bounding scheme; (ii) a stepwise calculation over a multiresolution transform; and (iii) exploiting the triangle inequality. To our knowledge, such an arsenal of pruning techniques has not been hitherto applied to k-means. Our work is motivated by time-critical Angle-Resolved Photoemission Spectroscopy (ARPES) experiments, where it is vital to detect clusters among high-dimensional spectra in real time. In a thorough experimental study with real-world data sets we demonstrate that Marigold efficiently clusters high-dimensional data, achieving approximately one order of magnitude improvement over prior art.
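To illustrate the third pruning idea named in the abstract, the sketch below shows a generic triangle-inequality test in the k-means assignment step. It is not Marigold's implementation: the function names (`euclidean`, `assign_brute_force`, `assign_with_pruning`) are hypothetical, and the distance-bounding scheme and the multiresolution transform are not modeled. The test relies only on the standard fact that if d(c_best, c_j) >= 2 d(p, c_best), then centroid c_j cannot be strictly closer to p than c_best, so d(p, c_j) need not be computed.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_brute_force(points, centroids):
    """Reference assignment: compute every point-to-centroid distance."""
    return [
        min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
        for p in points
    ]


def assign_with_pruning(points, centroids):
    """Assign each point to its nearest centroid, skipping distance
    computations that the triangle inequality proves unnecessary.

    Returns (labels, skipped), where skipped counts the avoided
    point-to-centroid distance computations.
    """
    k = len(centroids)
    # Centroid-to-centroid distances: k*k values, computed once per iteration.
    cc = [[euclidean(centroids[i], centroids[j]) for j in range(k)]
          for i in range(k)]

    labels, skipped = [], 0
    for p in points:
        best, best_d = 0, euclidean(p, centroids[0])
        for j in range(1, k):
            # d(p, c_j) >= d(c_best, c_j) - d(p, c_best), so when
            # d(c_best, c_j) >= 2 * d(p, c_best) we get d(p, c_j) >= d(p, c_best).
            if cc[best][j] >= 2.0 * best_d:
                skipped += 1
                continue
            d = euclidean(p, centroids[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.2), (10.0, 10.0), (9.8, 10.1), (0.0, 10.0)]
    cts = [(0.0, 0.0), (10.0, 10.0), (0.0, 9.0)]
    labels, skipped = assign_with_pruning(pts, cts)
    print(labels, "skipped:", skipped)
```

The saving grows with dimensionality, because each skipped test replaces a full d-dimensional distance computation with a lookup into the k-by-k centroid distance table.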
Publisher
Association for Computing Machinery (ACM)
Subject
General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development