Affiliation:
1. Georgia Institute of Technology, Atlanta, GA
2. Carnegie Mellon University, Pittsburgh, PA
Abstract
A common approach to clustering data is to view data objects as points in a metric space, and then to optimize a natural distance-based objective such as the k-median, k-means, or min-sum score. For applications such as clustering proteins by function or clustering images by subject, the implicit hope in taking this approach is that the optimal solution for the chosen objective will closely match the desired "target" clustering (e.g., a correct clustering of proteins by function or of images by who is in them). However, most distance-based objectives, including those mentioned here, are NP-hard to optimize. So, this assumption by itself is not sufficient, assuming P ≠ NP, to achieve low-error clusterings via polynomial-time algorithms.
In this article, we show that we can bypass this barrier if we slightly extend this assumption to ask that for some small constant c, not only the optimal solution, but also all c-approximations to the optimal solution, differ from the target on at most some ϵ fraction of points; we call this (c,ϵ)-approximation-stability. We show that under this condition, it is possible to efficiently obtain low-error clusterings even if the property holds only for values c for which the objective is known to be NP-hard to approximate. Specifically, for any constant c > 1, (c,ϵ)-approximation-stability of k-median or k-means objectives can be used to efficiently produce a clustering of error O(ϵ) with respect to the target clustering, as can stability of the min-sum objective if the target clusters are sufficiently large. Thus, we can perform nearly as well in terms of agreement with the target clustering as if we could approximate these objectives to this NP-hard value.
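To make the stability condition concrete, the following is a minimal brute-force sketch for the k-median objective on tiny instances. It is not the article's algorithm: it enumerates every labeling, so it runs in exponential time and only illustrates the definition. The function names (`kmedian_cost`, `clustering_error`, `is_approximation_stable`) are hypothetical, centers are assumed to be restricted to the data points, and the error between two clusterings is assumed to be the fraction of points that disagree under the best matching of cluster labels.

```python
from itertools import permutations, product


def kmedian_cost(points, labels, k, dist):
    """k-median cost of a partition: each non-empty cluster pays the sum of
    distances to its best center (assumption: centers are data points)."""
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            total += min(sum(dist(m, ctr) for m in members) for ctr in points)
    return total


def clustering_error(labels_a, labels_b, k):
    """Fraction of points on which two clusterings disagree, minimized over
    all relabelings of the clusters (assumed error measure)."""
    n = len(labels_a)
    best = n
    for perm in permutations(range(k)):
        mismatches = sum(1 for a, b in zip(labels_a, labels_b) if perm[a] != b)
        best = min(best, mismatches)
    return best / n


def is_approximation_stable(points, target, k, c, eps, dist):
    """Check (c, eps)-approximation-stability by exhaustive enumeration:
    every clustering with cost <= c * OPT must be within eps of the target."""
    costs = {
        labels: kmedian_cost(points, labels, k, dist)
        for labels in product(range(k), repeat=len(points))
    }
    opt = min(costs.values())
    return all(
        clustering_error(labels, target, k) <= eps
        for labels, cost in costs.items()
        if cost <= c * opt
    )


def line_dist(a, b):
    """Distance between two points on the real line."""
    return abs(a - b)


# Two well-separated groups: only the target partition is near-optimal.
separated = is_approximation_stable(
    [0, 1, 2, 100, 101, 102], (0, 0, 0, 1, 1, 1), k=2, c=1.5, eps=0.0,
    dist=line_dist,
)

# Evenly spaced points: several optimal partitions disagree with the target.
evenly_spaced = is_approximation_stable(
    [0, 1, 2, 3], (0, 0, 1, 1), k=2, c=1.0, eps=0.1, dist=line_dist,
)
```

In this sketch, `separated` evaluates to `True` and `evenly_spaced` to `False`: in the second instance the partition {0}, {1, 2, 3} is also optimal yet differs from the target on a quarter of the points.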
Funder
Division of Computing and Communication Foundations
Microsoft Research
Google
Publisher
Association for Computing Machinery (ACM)
Subject
Artificial Intelligence, Hardware and Architecture, Information Systems, Control and Systems Engineering, Software
Cited by
47 articles.
1. Strategyproof Facility Location in Perturbation Stable Instances;Web and Internet Economics;2022
2. k-center Clustering under Perturbation Resilience;ACM Transactions on Algorithms;2020-04-27
3. Index;Foundations of Data Science;2020-01-23
4. Background Material;Foundations of Data Science;2020-01-23
5. Wavelets;Foundations of Data Science;2020-01-23