K-NN’S NEAREST NEIGHBORS METHOD FOR CLASSIFYING TEXT DOCUMENTS BY THEIR TOPICS

Author:

Boyko N. I.,Mykhailyshyn V. Yu.

Abstract

Context. Optimization of the method of nearest neighbors k-NN for the classification of text documents by their topics and experimentally solving the problem based on the method. Objective. The study aims to study the method of nearest neighbors k-NN for classifying text documents by their topics. The task of the study is to classify text documents by their topics based on a dataset for the optimal time and with high accuracy. Method. The k-nearest neighbors (k-NN) method is a metric algorithm for automatic object classification or regression. The k-NN algorithm stores all existing data and categorizes the new point based on the distance between the new point and all points in the training set. For this, a certain distance metric, such as Euclidean distance, is used. In the learning process, k-NN stores all the data from the training set, so it belongs to the “lazy” algorithms since learning takes place at the time of classification. The algorithm makes no assumptions about the distribution of data and it is nonparametric. The task of the k-NN algorithm is to assign a certain category to the test document x based on the categories k of the nearest neighbors from the training dataset. The similarity between the test document x and each of the closest neighbors is scored by the category to which the neighbor belongs. If several of k’s closest neighbors belong to the same category, then the similarity score of that category for the test document x is calculated as the sum of the category scores for each of these closest neighbors. After that, the categories are ranked by score, and the test document is assigned to the category with the highest score. Results. The k-NN method for classifying text documents has been successfully implemented. Experiments have been conducted with various methods that affect the efficiency of k-NN, such as the choice of algorithm and metrics. The results of the experiments showed that the use of certain methods can improve the accuracy of classification and the efficiency of the model. Conclusions. Displaying the results on different metrics and algorithms showed that choosing a particular algorithm and metric can have a significant impact on the accuracy of predictions. The application of the ball tree algorithm, as well as the use of different metrics, such as Manhattan or Euclidean distance, can lead to improved results. Using clustering before applying k-NN has been shown to have a positive effect on results and allows for better grouping of data and reduces the impact of noise or misclassified points, which leads to improved accuracy and class distribution.

Publisher

National University "Zaporizhzhia Polytechnic"

Subject

General Medicine

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3