Affiliation:
1. Department of Computer Science and Engineering, K.L.N. College of Engineering, India
2. Department of Electrical and Electronics Engineering, Thiagarajar College of Engineering, India
Abstract
In this article, a new initial centroid selection for a K-means document clustering algorithm, namely, Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means (DIC-DOC- K-means), to improve the performance of text document clustering is proposed. The first centroid is the document having the minimum standard deviation of its term frequency. Each of the other subsequent centroids is selected based on the dissimilarities of the previously selected centroids. For comparing the performance of the proposed DIC-DOC- K-means algorithm, the results of the K-means, K-means++ and weighted average of terms-based initial centroid selection + K-means (Weight_Avg_Initials + K-means) clustering algorithms are considered. The results show that the proposed DIC-DOC- K-means algorithm performs significantly better than the K-means, K-means++ and Weight_Avg_Initials+ K-means clustering algorithms for Reuters-21578 and WebKB with respect to purity, entropy and F-measure for most of the cluster sizes. The cluster sizes used for Reuters-8 are 8, 16, 24 and 32 and those for WebKB are 4, 8, 12 and 16. The results of the proposed DIC-DOC- K-means give a better performance for the number of clusters that are equal to the number of classes in the data set.
Subject
Library and Information Sciences,Information Systems
Cited by
15 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献