Abstract
Scalable big data analysis frameworks are of paramount importance in the modern web society, which is characterized by a huge number of resources, including electronic text documents. Document clustering is an important field in text mining and is commonly used for document organization, browsing, summarization and classification. Hierarchical clustering methods construct a hierarchical structure that, combined with the produced clusters, can be useful in managing documents: it makes browsing and navigation easier and quicker, and it leverages the structural relationships to return only information relevant to the users’ queries. Nevertheless, the high computational cost and memory usage of baseline hierarchical clustering algorithms render them inappropriate for the vast number of documents that must be handled daily. In this paper, we propose a new scalable hierarchical clustering framework, which uses the frequency of the topics in the documents to overcome these limitations. Our work consists of a binary tree construction algorithm that creates a hierarchy of the documents using three metrics (Identity, Entropy, Bin Similarity), and a branch breaking algorithm which composes the final clusters by applying thresholds to each branch of the tree. The clustering algorithm is followed by a meta-clustering module which makes use of graph theory to gain insight into the connections among the leaf clusters. The feature vectors representing each document derive from topic modeling. At the implementation level, the clustering method has been dockerized in order to facilitate its deployment on cloud computing infrastructures. Finally, the proposed framework is evaluated on several datasets of varying size and content, achieving a significant reduction in both memory consumption and computational time over existing hierarchical clustering algorithms. The experiments also include performance testing on cloud resources using different setups, and the results are promising.
Publisher
Springer Science and Business Media LLC
Subject
Computer Networks and Communications, Software
Cited by
8 articles.