Affiliation:
1. Department of Mathematical Engineering, Faculty of Mathematical Engineering and Physics Engineering, Polytechnic University of Tirana, Albania
Abstract
Cluster analysis is a statistical approach that identifies homogeneous groups (clusters) within data. The closeness of data points is measured quantitatively using distance functions. In text data mining specifically, clustering is used to group words by the similarity of their occurrence within texts and to classify texts by topic or author. Hierarchical clustering is a powerful technique for identifying natural groupings within datasets, which makes it especially useful for unsupervised text classification. This paper uses cluster analysis to group Albanian texts by author. Using agglomerative hierarchical clustering, we classify Albanian texts by author according to the similarity of their word frequencies. The similarity of texts is evaluated using cosine and Euclidean distances. Considering two study cases, with and without Albanian stop words, we conclude that the best clustering of the Albanian documents by author is achieved by Ward’s method with cosine distance after removing stop words, with 87% accuracy.
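The pipeline summarized in the abstract (word frequencies, optional stop-word removal, cosine or Euclidean distance, agglomerative clustering with Ward's method) can be sketched roughly as follows. This is a minimal standard-library illustration, not the authors' implementation. The stop-word list, the toy documents, and all function names are hypothetical. Ward's criterion is defined for squared Euclidean distances, so the cosine variant here assumes one common workaround: using 2 × (cosine distance), which equals the squared Euclidean distance between length-normalized vectors. The abstract does not say how the paper combines Ward's method with cosine distance.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only;
# the paper's actual Albanian stop-word list is not given in the abstract.
STOP_WORDS = frozenset({"dhe", "në", "të", "e", "për"})

def word_frequencies(text, stop_words=frozenset()):
    """Relative word frequencies of a text, optionally dropping stop words."""
    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in stop_words]
    total = len(tokens) or 1
    return {w: c / total for w, c in Counter(tokens).items()}

def to_vectors(freqs):
    """Align per-document frequency dicts on a shared, sorted vocabulary."""
    vocab = sorted(set().union(*freqs))
    return [[f.get(w, 0.0) for w in vocab] for f in freqs]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0  # convention chosen here for empty documents
    return 1.0 - dot / (na * nb)

def ward_clusters(vectors, k, sq_dist):
    """Agglomerative clustering with Ward's criterion, stopped at k clusters.

    sq_dist(a, b) must return a squared distance; cluster-to-cluster
    distances are updated with the Lance-Williams formula for Ward linkage.
    """
    clusters = {i: [i] for i in range(len(vectors))}
    d = {(i, j): sq_dist(vectors[i], vectors[j])
         for i in clusters for j in clusters if i < j}
    next_id = len(vectors)
    while len(clusters) > k:
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])  # closest pair
        na, nb = len(clusters[a]), len(clusters[b])
        merged = clusters.pop(a) + clusters.pop(b)
        for c, members in clusters.items():
            nc = len(members)
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = ((na + nc) * dac + (nb + nc) * dbc - nc * dab) / (na + nb + nc)
        del d[(a, b)]
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(m) for m in clusters.values())

if __name__ == "__main__":
    # Toy documents standing in for texts by two different authors.
    docs = [
        "mali dhe deti dhe mali",
        "mali dhe deti",
        "libri dhe shkolla dhe libri",
        "libri dhe shkolla",
    ]
    vecs = to_vectors([word_frequencies(doc, STOP_WORDS) for doc in docs])
    print(ward_clusters(vecs, 2, lambda a, b: 2 * cosine_distance(a, b)))
    print(ward_clusters(vecs, 2, lambda a, b: euclidean(a, b) ** 2))
```

Passing the squared distance as a function keeps the two study cases comparable: the same clustering routine runs with either distance, and the stop-word case is toggled by the `stop_words` argument of `word_frequencies`.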
Publisher
World Scientific and Engineering Academy and Society (WSEAS)
Cited by
1 article.
1. Diffusion Models for Image Generation to Enhance Health Literacy;2024 IEEE 12th International Conference on Healthcare Informatics (ICHI);2024-06-03