Multi-Class Document Classification Using Lexical Ontology-Based Deep Learning
Published: 2023-05-17
Issue: 10
Volume: 13
Page: 6139
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences
Author:
Yelmen Ilkay (1, 2), Gunes Ali (1), Zontul Metin (3)
Affiliation:
1. Department of Computer Engineering, Faculty of Engineering, Istanbul Aydin University, Istanbul 34295, Turkey
2. Turkcell Group Company Digital Educational Technologies Inc., Ankara 06800, Turkey
3. Department of Computer Engineering, Faculty of Engineering and Natural Sciences, Sivas Science and Technology University, Sivas 58100, Turkey
Abstract
With the recent growth of the Internet, the volume of data has also increased. In particular, the growing amount of unstructured data makes data management difficult. Classification is needed before the data can be used for various purposes. Because manually classifying an ever-increasing volume of data for analysis and evaluation is impractical, automatic classification methods are needed. In addition, achieving good performance on imbalanced, multi-class classification is challenging: as the number of classes increases, so does the number of decision boundaries a learning algorithm has to learn. Therefore, this paper proposes an improved model that uses the WordNet lexical ontology and BERT to learn deeper features of the text, thereby improving classification performance. Classification success was observed to increase when using 11 general WordNet lexicographer files, which are based on synonym sets (synsets), syntactic categories, and logical groupings. WordNet was used for feature dimension reduction. In the experiments, word embedding methods were first used without dimension reduction, and Random Forest (RF), Support Vector Machine (SVM), and Multi-Layer Perceptron (MLP) algorithms were then employed to perform classification. These experiments were then repeated with dimension reduction performed by WordNet. In addition to the machine learning models, experiments were also conducted with the pretrained BERT model, with and without WordNet. The experimental results showed that, on an unstructured, seven-class, imbalanced dataset, the proposed model obtained the highest accuracy, 93.77%.
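The abstract describes using WordNet lexicographer files to reduce feature dimensionality. The sketch below illustrates the general idea only and is not the authors' pipeline: each token is replaced by a coarse category, so the feature space shrinks from one dimension per word to one per category. The lookup table `TOY_LEXNAMES` and the helper names `reduce_to_categories` and `to_vector` are hypothetical and hand-written for illustration. The category labels follow WordNet's lexicographer file naming, and a real setup would presumably obtain them from WordNet itself (for example through NLTK's `Synset.lexname()`).

```python
from collections import Counter

# Hypothetical word -> lexicographer-file lookup, hand-picked for illustration.
# In practice these labels would be looked up in WordNet rather than hard-coded.
TOY_LEXNAMES = {
    "dog": "noun.animal",
    "cat": "noun.animal",
    "horse": "noun.animal",
    "bread": "noun.food",
    "cheese": "noun.food",
    "run": "verb.motion",
    "walk": "verb.motion",
    "say": "verb.communication",
    "tell": "verb.communication",
}


def reduce_to_categories(tokens, lookup=TOY_LEXNAMES, unknown="other"):
    """Map each token to its coarse category and count the categories."""
    return Counter(lookup.get(tok.lower(), unknown) for tok in tokens)


def to_vector(counts, categories):
    """Lay the category counts out in a fixed column order."""
    return [counts.get(cat, 0) for cat in categories]


doc = "Dog and cat run while the horse will walk".split()

# Fixed feature order: the known categories, plus a bucket for unmapped words.
categories = sorted(set(TOY_LEXNAMES.values())) + ["other"]

counts = reduce_to_categories(doc)
vector = to_vector(counts, categories)

print(categories)
print(vector)
```

In this toy case a nine-word vocabulary collapses to five features. A vector of this kind could then be passed to a classifier such as the RF, SVM, or MLP models mentioned in the abstract; how the authors combined the reduced features with word embeddings or BERT is not specified here.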
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science