Affiliation:
1. National Research University «Moscow Power Engineering Institute»
Abstract
Binary classifiers are studied on balanced text samples. The samples are formed from scientific publications in the field of Computer Science. The first class contains articles on «Text Data Mining» (the «TDM» class); the second contains works on other Computer Science topics (the «non-TDM» class). All the main stages of text document preprocessing are considered, and document representation models are analyzed. The binary classification problem is formulated, and the quality indicators used in the study are given. A method for forming a sample from the Russian digital library (Elibrary) is proposed. The generated sample consists of bibliographic descriptions of documents (title, abstract, and keywords). An exploratory analysis was carried out and the structure of the sample was studied. «Term clouds» for the two classes are constructed and analyzed, and the documents are visualized using t-distributed stochastic neighbor embedding (t-SNE). Based on a review and analysis of known classifiers, the following methods were selected for the study: k-nearest neighbors, random forest, gradient boosting, logistic regression, and the support vector machine. Profile methods, which build a vector (profile) of the most informative terms determined by the frequency of occurrence of terms and classes, are also used in the study. The parameters of the methods were tuned using five-fold cross-validation. On our sample, the best classification quality was demonstrated by the methods based on the ensemble (collective) decision-making principle (random forest, gradient boosting) and by the support vector machine. The best classifier, gradient boosting, achieved a proportion of correct answers (accuracy) of about 0.98, with recall and precision of about 0.99. The other (simpler) methods also showed generally good classification quality: for the least accurate method, k-nearest neighbors, accuracy, recall, and precision were 0.90, 0.81, and 0.91, respectively.
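The quality indicators named in the abstract (accuracy, precision, recall) can be illustrated with a minimal Python sketch. The function name binary_metrics and the toy labels are assumptions made purely for illustration; they are not the authors' code or data, and the abstract does not state which class was treated as positive (the sketch assumes 1 = «TDM», 0 = «non-TDM»).

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                   positive: int = 1) -> Dict[str, float]:
    """Compute accuracy, precision and recall for a binary classifier.

    `positive` is the label treated as the positive class
    (assumed here to be the «TDM» class; this is an illustrative choice).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one labelled document is required")

    # Count the four cells of the confusion matrix.
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1
        elif p != positive and t == positive:
            fn += 1
        else:
            tn += 1

    total = tp + fp + fn + tn
    return {
        # Proportion of correct answers over all documents.
        "accuracy": (tp + tn) / total,
        # Share of predicted positives that are truly positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of true positives that were found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels, invented for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On a balanced sample such as the one described, accuracy is a meaningful summary on its own, while precision and recall show separately how many «non-TDM» documents are wrongly accepted and how many «TDM» documents are missed.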