Abstract
Large-scale learning algorithms are essential for modern data collections, which may contain billions of data points. Here we study the design of parallel \(k\)-clustering algorithms, which include the \(k\)-median, \(k\)-medoids, and \(k\)-means clustering problems. We design efficient parallel algorithms for these problems and prove that they still compute constant-factor approximations to the optimal solution for stable clustering instances. In addition to our theoretical results, we present computational experiments showing that our \(k\)-median and \(k\)-means algorithms work well in practice: we are able to find better clusterings than state-of-the-art coreset constructions using samples of the same size.
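For context, the three objectives named above are the standard ones; the abstract does not restate them, so the following is a sketch of their usual definitions for a point set \(X\) and a set \(C\) of \(k\) centers (stated here in the Euclidean case, with the metric case obtained by replacing \(\|x - c\|\) by a general distance \(d(x, c)\)):
\[
\mathrm{cost}_{\mathrm{median}}(C) \;=\; \sum_{x \in X} \min_{c \in C} \|x - c\|,
\qquad
\mathrm{cost}_{\mathrm{means}}(C) \;=\; \sum_{x \in X} \min_{c \in C} \|x - c\|^2,
\]
with \(k\)-medoids defined like \(k\)-median under the additional constraint \(C \subseteq X\). A constant-factor approximation is then a solution \(C\) with \(\mathrm{cost}(C) \le \alpha \cdot \mathrm{cost}(C^*)\) for some fixed constant \(\alpha\), where \(C^*\) is an optimal set of \(k\) centers.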