Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling

Published: 2020-11-05 Issue: 11 Volume: 11 Page: 518
ISSN: 2078-2489
Container-title: Information
Language: en
Short-container-title: Information

Author:

Mustafa Mubashar, Zeng Feng, Ghulam Hussain, Muhammad Arslan Hafiz

Abstract

Document clustering groups documents according to certain semantic features. Topic models have a rich semantic structure and considerable potential for helping users understand document corpora. Unfortunately, because topic models are purely unsupervised, this potential is stymied on text documents whose categories overlap. To solve this problem, some semi-supervised models have been proposed for the English language. However, no such work is available for Urdu, a low-resource language. Document clustering therefore remains a challenging task in Urdu, which has its own morphology, syntax and semantics. In this study, we propose a semi-supervised framework for Urdu document clustering that deals with the challenges of Urdu morphology. The proposed model combines pre-processing techniques, the seeded-LDA model and Gibbs sampling; we name it seeded-Urdu Latent Dirichlet Allocation (seeded-ULDA). We apply the proposed model and other methods to Urdu news datasets for categorization. Two conditions are considered for document clustering. One is the “dataset without overlapping”, in which all classes are distinct. The other is the “dataset with overlapping”, in which the categories overlap and the classes are connected to each other. The aim of this study is threefold. First, it shows that unsupervised models (Latent Dirichlet Allocation (LDA), non-negative matrix factorization (NMF) and K-means) give satisfactory results on the dataset without overlapping. Second, it shows that these unsupervised models do not perform well on the dataset with overlapping because, on this dataset, they find some topics that are neither entirely meaningful nor effective in extrinsic tasks. Third, the proposed semi-supervised model, seeded-ULDA, performs well on both datasets because it offers a straightforward and effective way to instruct topic models to find topics of specific interest. The paper shows that seeded-ULDA provides significantly better results than the unsupervised algorithms.
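The idea of steering a topic model with seed words and fitting it by Gibbs sampling can be illustrated with a minimal sketch. This is not the authors' implementation: the seeding here is a simplified variant that boosts the topic-word prior for seed words and initialises seed tokens to their topic, and the function names, the `boost` parameter, the hyperparameter values and the Latin-script toy tokens (standing in for pre-processed Urdu tokens) are all assumptions made for illustration.

```python
import random


def seeded_lda(docs, seeds, n_iter=200, alpha=0.1, beta=0.01, boost=5.0, rng_seed=0):
    """Collapsed Gibbs sampling for LDA with seed-boosted topic-word priors.

    docs  : list of token lists (assumed to be pre-processed already)
    seeds : one set of seed words per topic (hypothetical input format)
    Returns (doc-topic counts, topic-word counts, vocabulary).
    """
    rng = random.Random(rng_seed)
    vocab = sorted({w for doc in docs for w in doc})
    w2i = {w: i for i, w in enumerate(vocab)}
    K, V = len(seeds), len(vocab)

    # Asymmetric prior: a seed word gets extra pseudo-counts in its own topic.
    prior = [[beta + (boost if w in seeds[k] else 0.0) for w in vocab] for k in range(K)]
    prior_sum = [sum(row) for row in prior]

    ndk = [[0] * K for _ in docs]      # tokens of document d assigned to topic k
    nkw = [[0] * V for _ in range(K)]  # tokens of word w assigned to topic k
    nk = [0] * K                       # total tokens assigned to topic k
    z = []                             # topic assignment of every token

    # Initialisation: seed words start in a seeded topic, other words at random.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            seeded = [k for k in range(K) if w in seeds[k]]
            k = rng.choice(seeded) if seeded else rng.randrange(K)
            z[d].append(k)
            ndk[d][k] += 1
            nkw[k][w2i[w]] += 1
            nk[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                wi, k = w2i[w], z[d][n]
                # Remove the token's current assignment from the counts.
                ndk[d][k] -= 1
                nkw[k][wi] -= 1
                nk[k] -= 1
                # Conditional topic distribution for this token (unnormalised).
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][wi] + prior[t][wi]) / (nk[t] + prior_sum[t])
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][n] = k
                ndk[d][k] += 1
                nkw[k][wi] += 1
                nk[k] += 1
    return ndk, nkw, vocab


def assign_clusters(ndk):
    """Cluster label of each document = its most frequent topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in ndk]


# Toy corpus: Latin-script tokens stand in for pre-processed Urdu tokens.
docs = [
    ["cricket", "match", "team", "cricket", "match"],
    ["match", "cricket", "team", "match", "cricket"],
    ["election", "vote", "party", "election", "vote"],
    ["vote", "election", "party", "vote", "election"],
]
seeds = [{"cricket", "match"}, {"election", "vote"}]  # topic 0: sports, topic 1: politics

ndk, nkw, vocab = seeded_lda(docs, seeds)
clusters = assign_clusters(ndk)
print(clusters)
```

Because the seed sets fix which topic index corresponds to which category, the resulting cluster labels are directly interpretable, which a purely unsupervised topic model does not guarantee.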

Publisher

MDPI AG

Subject

Information Systems

Link

https://www.mdpi.com/2078-2489/11/11/518/pdf

Cited by 8 articles.

1. A review on semi-supervised clustering;Information Sciences;2023-06

2. Discovering Coherent Topics from Urdu Text: A Comparative Study of Statistical Models, Clustering Techniques and Word Embedding;2023 6th International Conference on Information and Computer Technologies (ICICT);2023-03

3. Optimized Feature Representation for Odia Document Clustering;Data Management, Analytics and Innovation;2023

4. Comparative analysis with topic modeling and word embedding methods after the Aegean Sea earthquake on Twitter;Evolving Systems;2022-07-23

5. Evaluation of clustering techniques on Urdu News head-lines: a case of short length text;Journal of Experimental & Theoretical Artificial Intelligence;2022-06-24