Algorithms for the classification of text documents, taking into account proximity in the attribute space

Author:

Zhaksybaev Darhan (1), Bakiev Murat (1)

Affiliation:

1. L.N. Gumilev Eurasian National University

Abstract

Text classification, in which documents are assigned to categories by supervised learning, is one of the key problems in text mining research. Because a considerable number of text classification algorithms exist, an overview of them is needed to make it easier to navigate the classification tools currently available. Many text representation schemes and classification/learning algorithms for assigning text documents to predefined categories can be found in the literature, but some of them call for more detailed analysis and have untapped potential. The purpose of this study is to provide an overview of different text representation schemes and a comparison of the classifiers used to assign text documents to predefined categories. The methodology was comparative: modern classification approaches were compared by the criteria they rely on, the algorithms they use and their time complexity, as were methods of analysis, modelling and combination. As a result of the study, several algorithms or combinations of algorithms have been proposed as hybrid approaches to automatic document classification. In the comparison of supervised machine learning algorithms, the SVM (Support Vector Machine) classifier was recognised as one of the most effective text classification methods. It was concluded that SVM captures the inherent characteristics of the data and embodies the structural risk minimisation (SRM) principle, which minimises an upper bound on the generalisation error rather than only the error on the training data, as the empirical risk minimisation principle does.
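As a rough illustration of the kind of classifier the abstract singles out, the sketch below trains a linear SVM on bag-of-words counts by subgradient descent on the L2-regularised hinge loss. It is not the authors' implementation: the toy documents, the stop-word list, the function names and the hyperparameters (`lam`, `lr`, `epochs`) are all assumptions made for the example, and the bias term is omitted for brevity. The regularisation term is what penalises model capacity (it favours a wide margin), which is the part loosely corresponding to the SRM idea mentioned above.

```python
from collections import Counter

# Hypothetical stop-word list, chosen only for this toy example.
STOPWORDS = {"the", "a", "on", "and"}

def tokenize(text):
    """Lowercase, split on whitespace, drop stop words."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

def build_vocab(docs):
    """Sorted list of all distinct tokens seen in the training documents."""
    return sorted({tok for d in docs for tok in tokenize(d)})

def vectorize(text, vocab):
    """Bag-of-words term counts; words outside the vocabulary are ignored."""
    counts = Counter(tokenize(text))
    return [float(counts[w]) for w in vocab]

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Subgradient descent on lam/2 * ||w||^2 + hinge loss, labels in {-1, +1}.

    No bias term is learned (an assumption to keep the sketch short).
    """
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            # Shrink step from the L2 regulariser (the capacity penalty).
            w = [wj * (1.0 - lr * lam) for wj in w]
            # Hinge-loss step: only examples inside the margin contribute.
            if margin < 1.0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
    return w

def predict(text, w, vocab):
    """Return +1 or -1 according to the sign of the linear score."""
    score = sum(wj * xj for wj, xj in zip(w, vectorize(text, vocab)))
    return 1 if score >= 0 else -1

# Toy training set (assumed): +1 = sport, -1 = finance.
train = [
    ("the team won the match", 1),
    ("players scored a late goal", 1),
    ("the bank raised interest rates", -1),
    ("stock markets fell on inflation fears", -1),
]
docs = [d for d, _ in train]
labels = [lab for _, lab in train]

vocab = build_vocab(docs)
X = [vectorize(d, vocab) for d in docs]
w = train_linear_svm(X, labels)

print(predict("the goal won the match", w, vocab))              # 1 (sport)
print(predict("interest rates and inflation fears", w, vocab))  # -1 (finance)
```

In practice one would use an established library implementation with TF-IDF features rather than this hand-rolled loop; the sketch only shows where the hinge loss and the regulariser enter the update.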

Publisher

Infra-M Academic Publishing House

Subject

General Medicine

Cited by 1 article.
