Affiliation:
1. Computer Engineering Department, Imam Khomeini International University, Qazvin, Iran
2. Department of Electrical Engineering, Raja University, Qazvin, Iran
Abstract
The increasing size of text data in databases calls for appropriate
classification and analysis in order to acquire knowledge and improve the
quality of decision-making in organizations. The process of discovering
hidden patterns in a data set, called data mining, requires access to
quality data in order to obtain valid results. Detecting and removing
anomalous data is one of the pre-processing and data-cleaning steps
in this process. Methods for anomalous data detection are generally
classified into three groups: supervised, semi-supervised, and
unsupervised. This research offers an unsupervised approach for
spotting anomalous data in text collections. The proposed method uses a
combination of two approaches (clustering-based and distance-based) to
detect anomalies in text data. In order to evaluate the
efficiency of the proposed approach, the method is applied to four labeled
data sets. The accuracy of the Naïve Bayes and decision tree classification
algorithms is compared before and after removal of anomalous data with
the proposed method and with other methods such as density-based spatial
clustering of applications with noise (DBSCAN). With the proposed method,
an accuracy of more than 92.39% can be achieved. In general, the results
show that the proposed method performs well in most cases.
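As a rough illustration of how a clustering-based step and a distance-based step can be combined, the following Python sketch clusters normalized term-frequency vectors with a small k-means and then flags documents whose distance to their own cluster centroid exceeds the mean distance plus z standard deviations. The abstract does not specify the vectorization, the clustering algorithm, or the threshold rule, so all of these choices, along with the names tf_vector, kmeans, and flag_anomalies and the parameters k and z, are assumptions made for this example and not the authors' method.

```python
import math
from collections import Counter


def tf_vector(text, vocab):
    """Unit-length term-frequency vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iters=20):
    """Plain k-means with deterministic farthest-first initialization."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        # Next seed: the vector farthest from all seeds chosen so far.
        far = max(vectors, key=lambda v: min(euclidean(v, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid for every vector.
        labels = [min(range(k), key=lambda c: euclidean(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def flag_anomalies(docs, k=2, z=1.5):
    """Indices of documents unusually far from their own cluster centroid.

    The cutoff (mean + z * population std of all distances) is an assumed,
    illustrative rule; the abstract does not state the actual criterion.
    """
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = [tf_vector(d, vocab) for d in docs]
    labels, centroids = kmeans(vectors, k)                                # clustering-based step
    dists = [euclidean(v, centroids[lab]) for v, lab in zip(vectors, labels)]  # distance-based step
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return [i for i, d in enumerate(dists) if d > mean + z * std]


# Toy corpus: two clear topics plus one document that fits neither well.
docs = [
    "stock market price",
    "stock market price",
    "stock market trade",
    "football match goal",
    "football match goal",
    "football match score",
    "stock sugar flour",
]
print(flag_anomalies(docs))
```

On this toy corpus the last document is assigned to the finance cluster but lies far from its centroid, so it is the one flagged. In an evaluation like the one described in the abstract, the flagged documents would be removed before training a classifier and the accuracy compared with the unfiltered baseline.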
Publisher
National Library of Serbia
Cited by
5 articles.