Using Bisect K-Means Clustering Technique in the Analysis of Arabic Documents

Published: 2016-03-08 Issue: 3 Volume: 15 Page: 1-13
ISSN: 2375-4699
Container-title: ACM Transactions on Asian and Low-Resource Language Information Processing
Language: en
Short-container-title: ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Author:

Abuaiadah Diab¹

Affiliation:

1. Waikato Institute of Technology

Abstract

In this article, I investigate the performance of the bisect K-means clustering algorithm compared to the standard K-means algorithm in the analysis of Arabic documents. The experiments included five commonly used similarity and distance functions (Pearson correlation coefficient, cosine, Jaccard coefficient, Euclidean distance, and averaged Kullback-Leibler divergence) and three leading stemmers. Measured by purity, the bisect K-means clearly outperformed the standard K-means in all settings, by varying margins. For the bisect K-means, the best purity reached 0.927 with the Pearson correlation coefficient function, while for the standard K-means, the best purity reached 0.884 with the Jaccard coefficient function. Removing stop words significantly improved the results of the bisect K-means but produced only minor improvements in the results of the standard K-means. Stemming provided an additional minor improvement in all settings except the combination of the averaged Kullback-Leibler divergence function and the root-based stemmer, where purity deteriorated by more than 10%. These experiments were conducted using a dataset with nine categories, each containing 300 documents.
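The two ideas the abstract relies on, bisecting a cluster with 2-means until K clusters exist and scoring the result by purity (the fraction of documents that fall in the majority category of their cluster), can be illustrated with a minimal sketch. This is not the article's implementation: the squared Euclidean distance, the deterministic seeding, the rule of always splitting the largest cluster, the function names, and the toy vectors are all assumptions made for illustration.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def two_means(vectors, idx, iters=20):
    """Split the documents in idx into two groups with plain 2-means."""
    # Assumed seeding: the first vector and the vector farthest from it.
    c0 = vectors[idx[0]]
    c1 = max((vectors[i] for i in idx), key=lambda v: sq_dist(v, c0))
    left, right = list(idx), []
    for _ in range(iters):
        new_left = [i for i in idx if sq_dist(vectors[i], c0) <= sq_dist(vectors[i], c1)]
        new_right = [i for i in idx if sq_dist(vectors[i], c0) > sq_dist(vectors[i], c1)]
        if not new_left or not new_right:
            break  # degenerate split; keep the last valid one
        left, right = new_left, new_right
        new_c0 = mean([vectors[i] for i in left])
        new_c1 = mean([vectors[i] for i in right])
        if (new_c0, new_c1) == (c0, c1):
            break  # centroids stopped moving
        c0, c1 = new_c0, new_c1
    return left, right

def bisecting_kmeans(vectors, k):
    """Repeatedly bisect the largest cluster (an assumed rule) until k clusters exist."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        left, right = two_means(vectors, largest)
        if not right:
            clusters.append(largest)  # cannot be split further
            break
        clusters += [left, right]
    return clusters

def purity(clusters, labels):
    """Fraction of documents that belong to the majority label of their cluster."""
    majority = sum(Counter(labels[i] for i in c).most_common(1)[0][1] for c in clusters)
    return majority / len(labels)

# Toy 2-D "documents" in three categories (made-up data, not the article's dataset).
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
labels = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
clusters = bisecting_kmeans(vectors, 3)
print(clusters, purity(clusters, labels))
```

In a setting like the one the abstract describes, `sq_dist` would be replaced by one of the five similarity or distance functions, and the vectors would be term-weight vectors built after stop-word removal and stemming.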

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/2812809

References: 32 articles.

1. On the Impact of Dataset Characteristics on Arabic Document Classification

2. Towards an error-free Arabic stemming

3. F. Archetti, P. Campanelli, E. Fersini, and E. Messina. 2006. A Hierarchical Document Clustering Environment Based on the Induced Bisecting k-Means. Springer, Berlin.

4. P. Berkhin. 2001. Survey of Clustering Data Mining Techniques. Retrieved from http://www.accrue.com/products/rp_cluster_review.pdf.

5. Effect of ISRI Stemming on Similarity Measure for Arabic Document Clustering

Cited by 24 articles.

1. Using unsupervised learning to classify inlet water for more stable design of water reuse in industrial parks; Water Science & Technology; 2024-03-19

2. An unsupervised automatic organization method for Professor Shirakawa’s hand-notated documents of oracle bone inscriptions; International Journal on Document Analysis and Recognition (IJDAR); 2024-03-05

3. Reading Scene Text with Aggregated Temporal Convolutional Encoder; ACM Transactions on Asian and Low-Resource Language Information Processing; 2023-11-20

4. Hybrid approach for text categorization: A case study with Bangla news article; Journal of Information Science; 2023-06

5. Social determinants of health derived from people with opioid use disorder: Improving data collection, integration and use with cross-domain collaboration and reproducible, data-centric, notebook-style workflows; Frontiers in Medicine; 2023-03-02