Abstract
One of the key challenges of k-means clustering is seed selection, that is, the estimation of the initial centroids, since the clustering result depends heavily on this choice. Alternatives such as k-means++ have mitigated this limitation by estimating the centroids using an empirical probability distribution. However, with high-dimensional and complex datasets such as those obtained from molecular simulation, k-means++ fails to partition the data in an optimal manner. Furthermore, the stochastic elements in all flavors of k-means++ lead to a lack of reproducibility. K-means N-Ary Natural Initiation (NANI) is presented as an alternative that tackles this challenge by using efficient n-ary comparisons both to identify high-density regions in the data and to select a diverse set of initial conformations. Centroids generated by NANI are not only representative of the data and different from one another, helping k-means to partition the data accurately, but also deterministic, providing consistent cluster populations across replicates. In peptide and protein folding molecular simulations, NANI created compact and well-separated clusters and accurately found metastable states that agree with the literature. NANI can cluster diverse datasets and can be used as a standalone tool or as part of our MDANCE clustering package.
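The general idea of deterministic, density-aware seeding can be illustrated with a small sketch. This is not the NANI implementation and not the MDANCE API: the function name `deterministic_seeds`, the parameter `top_fraction`, and the centrality proxy are all assumptions made for illustration. The sketch ranks points by their summed squared distance to every other point, computed from two whole-set aggregates rather than an explicit pairwise matrix, keeps the most central fraction as candidates, and then picks a diverse subset by greedy farthest-point selection. No random numbers are involved, so repeated calls return the same seeds.

```python
import numpy as np


def deterministic_seeds(X, k, top_fraction=0.1):
    """Hypothetical helper: pick k k-means seeds with no randomness.

    Simplified stand-in for density-then-diversity seeding; not the
    procedure or API of the NANI paper.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    # Sum over j of ||x_i - x_j||^2, obtained from whole-set aggregates:
    # n * ||x_i||^2 - 2 * x_i . sum_j(x_j) + sum_j ||x_j||^2
    col_sum = X.sum(axis=0)
    sq_norms = np.einsum("ij,ij->i", X, X)
    total_sq = n * sq_norms - 2.0 * (X @ col_sum) + sq_norms.sum()

    # Keep the most central points as candidates (assumed proxy for density).
    n_keep = min(n, max(k, int(np.ceil(top_fraction * n))))
    candidates = np.argsort(total_sq, kind="stable")[:n_keep]

    # Greedy farthest-point selection among the candidates for diversity.
    chosen = [candidates[0]]
    min_d = np.sum((X[candidates] - X[chosen[0]]) ** 2, axis=1)
    while len(chosen) < k:
        nxt = candidates[int(np.argmax(min_d))]
        chosen.append(nxt)
        d_new = np.sum((X[candidates] - X[nxt]) ** 2, axis=1)
        min_d = np.minimum(min_d, d_new)

    return X[np.array(chosen)]
```

The returned array could be passed as explicit initial centroids to any k-means implementation that accepts them, for example the `init` argument of scikit-learn's `KMeans` with `n_init=1`.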
Publisher
Cold Spring Harbor Laboratory
Cited by
1 article.