Author:
Thanh Nguyen Chi, Yamada Koichi, Unehara Muneyuki
Abstract
Document clustering is a text mining technique for unsupervised document organization. It helps users browse and navigate large sets of documents. Ho et al. proposed a Tolerance Rough Set Model (TRSM) [1] to improve the vector space model, which represents documents as vectors of terms, and applied it to document clustering. In this paper, we analyze their model and propose a new model for efficient document clustering. We introduce the Similarity Rough Set Model (SRSM) as another model for representing documents in document clustering. The model is evaluated by experiments on test collections. The experimental results show that the SRSM document clustering method outperforms the one based on TRSM, and that the results of SRSM are less affected by the parameter value than those of TRSM.
Publisher
Fuji Technology Press Ltd.
Subject
Artificial Intelligence, Computer Vision and Pattern Recognition, Human-Computer Interaction
Reference
18 articles.
1. T. B. Ho and K. Funakoshi, “Information retrieval using rough sets,” J. of Japanese Society for Artificial Intelligence, Vol.13, No.3, pp. 424-433, 1997.
2. Y. Zhao and G. Karypis, “Hierarchical clustering algorithms for document datasets,” Data Mining and Knowledge Discovery, Vol.10, No.2, pp. 141-168, 2005.
3. I. S. Dhillon and D. S. Modha, “Concept decompositions for large sparse text data using clustering,” Machine Learning, Vol.42, No.1-2, pp. 143-175, 2001.
4. M. Steinbach, G. Karypis, and V. Kumar, “A comparison of document clustering techniques,” Proc. of the KDD Workshop on Text Mining, 2000.
5. Y. Li, S. M. Chung, and J. D. Holt, “Text document clustering based on frequent word meaning sequences,” Data and Knowledge Engineering, Vol.64, No.1, pp. 381-404, 2008.
Cited by
3 articles.