Solving k-center clustering (with outliers) in MapReduce and streaming, almost as accurately as sequentially

Published:2019-03 Issue:7 Volume:12 Page:766-778
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Ceccarello Matteo¹, Pietracaprina Andrea², Pucci Geppino²

Affiliation:

1. IT University and BARC, Copenhagen, Denmark

2. University of Padova, Padova, Italy

Abstract

Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular k-center variant which, given a set S of points from some metric space and a parameter k < |S|, requires identifying a subset of k centers in S that minimizes the maximum distance of any point of S from its closest center. A more general formulation, introduced to deal with noisy datasets, features a further parameter z and allows up to z points of S (outliers) to be disregarded when computing the maximum distance from the centers. We present coreset-based 2-round MapReduce algorithms for the above two formulations of the problem, and a 1-pass Streaming algorithm for the case with outliers. For any fixed ϵ > 0, the algorithms yield solutions whose approximation ratios are a mere additive term ϵ away from those achievable by the best known polynomial-time sequential algorithms, a result that substantially improves upon the state of the art. Our algorithms are rather simple and adapt to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. Specifically, our analysis shows that the algorithms become very space-efficient for the important case of small (constant) D. These theoretical results are complemented with a set of experiments on real-world and synthetic datasets of up to over a billion points, which show that our algorithms yield higher-quality solutions than the state of the art while featuring excellent scalability, and that they also lend themselves to sequential implementations much faster than existing ones.
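
The problem statement in the abstract can be made concrete with a small sketch. The Python below is illustrative only and is not the authors' implementation: `radius` evaluates the k-center objective while disregarding up to z outliers, `farthest_first` is the classical greedy farthest-first traversal (a well-known 2-approximation for k-center without outliers), and `two_round_sketch` mimics the general shape of a coreset-based 2-round computation. All function names, the Euclidean metric, and the `coreset_size` parameter are assumptions made for this example; in the paper, the coreset size is governed by ϵ and the doubling dimension D, which this sketch does not model.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def radius(points: Sequence[Point], centers: Sequence[Point], z: int = 0) -> float:
    """k-center objective: the largest distance from a point to its closest
    center, after disregarding the z farthest points (the outliers)."""
    dists = sorted(min(math.dist(p, c) for c in centers) for p in points)
    kept = dists[: len(dists) - z]
    return kept[-1] if kept else 0.0


def farthest_first(points: Sequence[Point], k: int) -> List[Point]:
    """Greedy farthest-first traversal: start from the first point, then
    repeatedly add the point farthest from the centers chosen so far.
    Assumes points is non-empty and k <= len(points)."""
    centers = [points[0]]
    # closest[i] = distance from points[i] to its nearest chosen center
    closest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=closest.__getitem__)
        centers.append(points[i])
        closest = [min(d, math.dist(p, points[i])) for p, d in zip(points, closest)]
    return centers


def two_round_sketch(partitions: Sequence[Sequence[Point]], k: int,
                     coreset_size: int) -> List[Point]:
    """Hypothetical 2-round pattern, simulated sequentially.
    Round 1: each (non-empty) partition selects up to coreset_size points.
    Round 2: the union of these small coresets is clustered in one place."""
    coreset = [c
               for part in partitions
               for c in farthest_first(part, min(coreset_size, len(part)))]
    return farthest_first(coreset, k)


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (10, 0), (11, 0), (100, 0)]
    fixed = [(0, 0), (10, 0)]
    print(radius(pts, fixed))        # the far point (100, 0) dominates
    print(radius(pts, fixed, z=1))   # one outlier disregarded
    print(two_round_sketch([pts[:3], pts[3:]], k=2, coreset_size=2))
```

With the two fixed centers, the single distant point dominates the radius when z = 0 and is ignored when z = 1, which is the motivation the abstract gives for the formulation with outliers. Note that plain farthest-first traversal is drawn toward such distant points, so it is not by itself a suitable method for the outlier variant.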

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3317315.3317319

Cited by 29 articles.

1. MapReduce algorithms for robust center-based clustering in doubling metrics;Journal of Parallel and Distributed Computing;2024-12

2. Fast and Accurate Fair k-Center Clustering in Doubling Metrics;Proceedings of the ACM Web Conference 2024;2024-05-13

3. New algorithms for fair k-center problem with outliers and capacity constraints;Theoretical Computer Science;2024-05

4. Massively parallel and streaming algorithms for balanced clustering;Theoretical Computer Science;2024-02

5. Streaming Fair k-Center Clustering over Massive Dataset with Performance Guarantee;Lecture Notes in Computer Science;2024