Author:
Borisova I.A., Kutnenko O.A.
Abstract
The paper proposes a new approach to data censoring that corrects diagnostic errors in data sets whose samples are described in high-dimensional
feature spaces. This case is treated as a separate task because most outlier detection and data filtering methods, both statistical and metric,
break down in high-dimensional spaces. At the same time, in medical diagnostics, given the complexity of the objects and phenomena studied, a large number
of descriptive characteristics is the norm rather than the exception. To solve this problem, an approach is proposed that focuses on local similarity between
objects of the same class and uses the function of rival similarity (FRiS function) as the similarity measure. To clean the data of misclassified objects
efficiently, the approach selects the most informative and relevant low-dimensional feature subspace, namely the one in which class separability after correction
is maximal. Class separability here means the similarity of objects of one class to each other and their dissimilarity to objects of another class. Cleaning data of
class errors can consist either in correcting them or in removing outlier objects from the data set. The described method was implemented as the FRiS-LCFS algorithm
(FRiS Local Censoring with Feature Selection) and tested on model and real biomedical problems, including the diagnosis of prostate cancer from DNA microarray data.
The developed algorithm proved competitive with standard methods for filtering data in high-dimensional spaces.
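The abstract does not state the FRiS formula, so the following is only an illustrative sketch, not the FRiS-LCFS algorithm itself. It assumes the form commonly given for rival similarity, F = (r_rival - r_own) / (r_rival + r_own), where r_own is the distance from an object to its nearest neighbour of the same class and r_rival is the distance to its nearest neighbour of another class. The function names, the Euclidean metric, the toy data, and the "F < 0" flagging rule are all assumptions made for illustration; feature subspace selection is not shown.

```python
import numpy as np

def fris(r_own, r_rival):
    """Rival similarity (assumed form): near +1 when the object is much
    closer to its own class, near -1 when it is much closer to the rival."""
    return (r_rival - r_own) / (r_rival + r_own)

def local_fris_scores(X, y):
    """Score each object by comparing its nearest same-class neighbour
    with its nearest other-class neighbour (Euclidean distance assumed)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances between all objects.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # an object is not its own neighbour
    same = y[:, None] == y[None, :]
    r_own = np.where(same, dist, np.inf).min(axis=1)
    r_rival = np.where(~same, dist, np.inf).min(axis=1)
    return fris(r_own, r_rival)

# Toy one-feature data: the last object lies near class 0 but is labelled 1.
X = [[0.0], [0.2], [0.4], [10.0], [10.2], [10.4], [1.4]]
y = [0, 0, 0, 1, 1, 1, 1]

scores = local_fris_scores(X, y)
# Hypothetical flagging rule: a negative score means the object is closer
# to a rival class than to its own, so its label is suspect.
suspects = np.flatnonzero(scores < 0)
print(suspects)
```

On this toy set only the last object receives a negative score. Whether such an object should be relabelled or removed is the decision the abstract describes as correction versus removal; this sketch only flags candidates.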
Publisher
Institute of Mathematical Problems of Biology of RAS (IMPB RAS)
Subject
Applied Mathematics, Biomedical Engineering