Author:
Liu Junjie, Jiang Rongxin, Liu Xuesong, Zhou Fan, Chen Yaowu, Shen Chen
Abstract
Despite promising progress, large-scale clustering still faces two main challenges: (i) the high time and space complexity of K-nearest neighbors (KNN) search, which most methods overlook, and (ii) the low recall rate caused by naively splitting the dataset. In this paper, we propose a novel framework for large-scale clustering, named large-scale clustering via recall KNN and subgraph segmentation (LS-RKSS), which clusters faster while guaranteeing clustering performance. It can handle large-scale data of up to 100 million points on a single T4 GPU with less than 10% of the running time. The framework has two components, recall KNN (RKNN) and subgraph segmentation (SS), which together address the primary challenges of large-scale clustering. First, recall KNN performs efficient similarity search among dense vectors with lower time and space complexity than traditional exact KNN search. Then, subgraph segmentation splits the whole dataset into multiple subgraphs based on the recall KNN results. Given the recall rate of RKNN relative to traditional exact search methods, we prove theoretically that dividing the dataset into multiple subgraphs using recall KNN and subgraph segmentation is a more reasonable and effective approach. Finally, clusters are generated independently on each subgraph, and the final clustering result is obtained by combining the results of all subgraphs. Extensive experiments demonstrate that LS-RKSS outperforms previous large-scale clustering methods in both effectiveness and efficiency.
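The three stages described in the abstract (neighbor search, splitting into subgraphs, independent clustering followed by merging) can be illustrated with a minimal sketch. This is not the authors' implementation: a brute-force exact KNN stands in for recall KNN, thresholded connected components found with union-find stand in for both subgraph segmentation and the per-subgraph clustering step, and all function names, parameters, and threshold values are hypothetical.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_edges(points, k):
    """Brute-force top-k neighbours per point.

    Assumption: exact search is used here only as a stand-in for the
    paper's recall KNN, whose details the abstract does not give.
    """
    edges = []
    for i, p in enumerate(points):
        sims = [(cosine(p, q), j) for j, q in enumerate(points) if j != i]
        sims.sort(reverse=True)
        edges.extend((i, j, s) for s, j in sims[:k])
    return edges


def components(n, edges, threshold):
    """Connected components over edges with similarity >= threshold."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, s in edges:
        if s >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())


def cluster(points, k=2, split_thr=0.5, cluster_thr=0.9):
    """Split the KNN graph into subgraphs, cluster each, merge labels."""
    edges = knn_edges(points, k)
    labels = [-1] * len(points)
    next_label = 0
    # Stage 2 (hypothetical): a loose threshold yields coarse subgraphs.
    for sub in components(len(points), edges, split_thr):
        index = {v: t for t, v in enumerate(sub)}
        local = [(index[i], index[j], s)
                 for i, j, s in edges if i in index and j in index]
        # Stage 3 (hypothetical): a stricter threshold inside each subgraph;
        # a running label offset combines the per-subgraph results.
        for group in components(len(sub), local, cluster_thr):
            for t in group:
                labels[sub[t]] = next_label
            next_label += 1
    return labels


demo_points = [(1.0, 0.0), (0.95, 0.05), (0.0, 1.0), (0.05, 0.95)]
demo_labels = cluster(demo_points, k=1)
```

Because each subgraph is processed without reference to the others, the inner loop is the part that could be distributed or run in batches; the sketch keeps it sequential for clarity.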
Funder
Zhejiang Provincial Natural Science Foundation of China
Publisher
Springer Science and Business Media LLC