Author:
Ukey Nimish, Zhang Guangjian, Yang Zhengyi, Li Binghao, Li Wei, Zhang Wenjie
Abstract
Given a user dataset $$\varvec{U}$$ and an object dataset $$\varvec{I}$$, a kNN join query in high-dimensional space returns the $$\varvec{k}$$ nearest neighbors of each object in dataset $$\varvec{U}$$ from the object dataset $$\varvec{I}$$. The kNN join is a basic and necessary operation in many applications, such as databases, data mining, computer vision, multimedia, machine learning, and recommendation systems. In the real world, datasets are frequently updated as objects are added or removed. In this paper, we propose novel methods for continuous kNN join over dynamic high-dimensional data. We first propose the HDR$$^+$$ Tree, which supports more efficient insertion, deletion, and batch updates. Observing that existing methods rely on globally correlated datasets for effective dimensionality reduction, we then propose the HDR Forest, which clusters the dataset and constructs multiple HDR Trees to capture local correlations among the data. As a result, the HDR Forest is able to process non-globally correlated datasets efficiently. Two novel optimisations are applied to the proposed HDR Forest: precomputation of the PCA states of data items and pruning-based kNN recomputation during item deletion. For completeness, we also present a proof that distances can be computed in the PCA-reduced dimensions used by the HDR Tree. Extensive experiments on real-world datasets show that the proposed methods and optimisations outperform the baseline algorithms of naive RkNN join and HDR Tree.
Funder
University of New South Wales
Publisher
Springer Science and Business Media LLC
Subject
Computer Networks and Communications, Hardware and Architecture, Software