Albanian Text Classification: Bag of Words Model and Word Analogies

Published:2019-04-01 Issue:1 Volume:10 Page:74-87
ISSN:1847-9375
Container-title:Business Systems Research Journal
language:en
Author:

Kadriu Arbana¹, Abazi Lejla¹, Abazi Hyrije¹

Affiliation:

1. SEE University, Tetovo, Macedonia

Abstract

Background: Text classification is an important task in information retrieval. Its objective is to assign new text documents to a set of predefined classes using supervised algorithms. Objectives: We focus on text classification of Albanian news articles using two approaches. Methods/Approach: In the first approach, the words in a collection are treated as independent components, each assigned a corresponding vector in the vector space. Here we used nine classifiers from the scikit-learn package, training them on part of the news articles (80%) and testing accuracy on the remaining articles. In the second approach, text classification treats words according to their semantic and syntactic similarities, assuming that a word is formed from character n-grams. In this case, we used fastText, a hierarchical classifier that considers local word order as well as sub-word information. We measured the accuracy of each classifier separately and also analyzed training and testing time. Results: Our results show that the bag of words model outperforms fastText when the dataset is not large. FastText performs better when classifying multi-label text. Conclusions: News articles can serve as a benchmark for testing classification algorithms on Albanian texts. The best results are achieved with a bag of words model, with an accuracy of 94%.
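The two text representations named in the abstract can be illustrated with a minimal standard-library sketch. This is not the authors' code: the function names `bag_of_words`, `char_ngrams`, and `split_80_20` are hypothetical, the sample documents are invented, and the n-gram range of 3 to 6 with `<` and `>` boundary markers is assumed from fastText's usual sub-word scheme rather than stated in the abstract.

```python
from collections import Counter


def bag_of_words(docs):
    """Map each document to a count vector over a shared, sorted vocabulary.

    Words are treated as independent components; word order is discarded.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0] * len(vocab)
        for tok, count in Counter(toks).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocab, vectors


def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of a word wrapped in '<' and '>' markers.

    Assumed sub-word scheme: every n-gram with n_min <= n <= n_max.
    """
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


def split_80_20(items):
    """Split a list into the first 80% (training) and the last 20% (testing)."""
    cut = int(len(items) * 0.8)
    return items[:cut], items[cut:]


if __name__ == "__main__":
    # Invented toy documents, not taken from the paper's dataset.
    vocab, vectors = bag_of_words(["sport lajme sport", "lajme ekonomi"])
    print(vocab, vectors)
    print(char_ngrams("lajme", 3, 3))
    print(split_80_20(list(range(10))))
```

In practice the count vectors would be passed to a classifier (the abstract mentions nine from scikit-learn), and a real split would normally shuffle the articles first; both steps are left out here to keep the sketch dependency-free.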

Publisher

Walter de Gruyter GmbH

Subject

Management of Technology and Innovation,Economics, Econometrics and Finance (miscellaneous),Information Systems,Management Information Systems

Link

https://www.sciendo.com/pdf/10.2478/bsrj-2019-0006

Cited by 13 articles.

1. Using Cluster Analysis for Author Classification of Albanian Texts: A Study on the Effectiveness of Stop Words;WSEAS TRANSACTIONS ON COMPUTER RESEARCH;2023-10-19

2. Analysis of effective techniques and algorithms in terms of “text mining” to predict the authorship in Albanian language;CRJ;2023-09-18

3. Albanian Authorship Attribution Model;2023 12th Mediterranean Conference on Embedded Computing (MECO);2023-06-06

4. Spectral Analysis, Agglomerative, Mean Shift and Affinity Propagation Algorithms, Use on the Content from Social Media for Low-Resource Languages;2023 46th MIPRO ICT and Electronics Convention (MIPRO);2023-05-22

5. Systematic Literature Review of Information Extraction From Textual Data: Recent Methods, Applications, Trends, and Challenges;IEEE Access;2023