Scalable hierarchical clustering by composition rank vector encoding and tree structure

Published: 2020-04-13

Author:

Lai Xiao, Tian Pu

Abstract

Supervised machine learning, especially deep learning based on a wide variety of neural network architectures, has contributed tremendously to fields such as marketing, computer vision and natural language processing. However, the development of unsupervised machine learning algorithms has been a bottleneck of artificial intelligence. Clustering is a fundamental unsupervised task in many different subjects. Unfortunately, no present algorithm is satisfactory for clustering high-dimensional data with strong nonlinear correlations. In this work, we propose a simple and highly efficient hierarchical clustering algorithm based on encoding by composition rank vectors and a tree structure, and demonstrate its utility by clustering protein structural domains. No record comparison, an expensive and essential step common to all present clustering algorithms, is involved. Consequently, the algorithm achieves hierarchical clustering with linear time and space complexity and is thus applicable to arbitrarily large datasets. The key factor in this algorithm is the definition of composition, which depends on the physical nature of the target data and therefore needs to be constructed case by case. Nonetheless, the algorithm is general and applicable to any high-dimensional data with strong nonlinear correlations. We hope this algorithm will inspire a rich research field of encoding-based clustering well beyond composition rank vector trees.
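The general idea of encoding-based clustering without record comparison can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes, purely for illustration, that a record is a protein sequence, that its composition is the count of each of the 20 standard amino acids, and that its rank vector lists the amino acids in order of decreasing count. The names `ALPHABET`, `composition_rank_vector`, `build_tree` and `clusters_at_depth` are hypothetical. Each record is encoded once and placed in a tree keyed by its rank vector, so records sharing a path prefix fall into the same cluster at that depth and no pair of records is ever compared.

```python
from collections import Counter

# Assumption: the 20 standard amino acids define the composition components.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def composition_rank_vector(seq, alphabet=ALPHABET):
    """Return the alphabet symbols ordered by decreasing count in seq.

    Ties are broken by alphabet order (an arbitrary choice for this sketch).
    Symbols outside the alphabet are ignored.
    """
    counts = Counter(seq)
    return tuple(sorted(alphabet, key=lambda s: (-counts[s], alphabet.index(s))))


def build_tree(records, depth):
    """Insert every record into a nested-dict tree along its rank-vector prefix.

    Each record is visited once, so the cost grows linearly with the
    number of records for a fixed alphabet and depth.
    """
    tree = {}
    for name, seq in records.items():
        node = tree
        for symbol in composition_rank_vector(seq)[:depth]:
            node = node.setdefault(symbol, {})
        # "_members" cannot collide with a one-letter symbol key.
        node.setdefault("_members", []).append(name)
    return tree


def clusters_at_depth(records, depth):
    """Group record names by the first `depth` entries of their rank vector."""
    groups = {}
    for name, seq in records.items():
        key = composition_rank_vector(seq)[:depth]
        groups.setdefault(key, []).append(name)
    return groups


if __name__ == "__main__":
    toy = {"d1": "AAACCD", "d2": "AAAACCDD", "d3": "CCCAAD"}
    print(clusters_at_depth(toy, 1))
    print(clusters_at_depth(toy, 2))
```

Cutting the tree at a shallower depth gives coarser clusters and a deeper cut gives finer ones, which is where the hierarchy comes from in this sketch. A realistic composition definition would have to reflect the physical nature of the data, as the abstract notes.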

Publisher

Cold Spring Harbor Laboratory
