Cleaning Data Sets with Diagnostic Errors in the High-Dimensional Feature Spaces

Published:2019-10-07 Issue:2 Volume:14 Page:464-476
ISSN:1994-6538
Container-title:Mathematical Biology and Bioinformatics
Short-container-title:Math.Biol.Bioinf.

Author:

Borisova I.A., Kutnenko O.A.

Abstract

The paper proposes a new approach to data censoring that corrects diagnostic errors in data sets whose samples are described in high-dimensional feature spaces. This case is treated as a separate task because most outlier detection and data filtering methods, both statistical and metric, stop working in high-dimensional spaces. At the same time, in medical diagnostics, given the complexity of the objects and phenomena studied, a large number of descriptive characteristics is the norm rather than the exception. To solve this problem, the authors propose an approach that focuses on local similarity between objects of the same class and uses the function of rival similarity (FRiS function) as the similarity measure. To clean the data of misclassified objects efficiently, the approach selects the most informative and relevant low-dimensional feature subspace, the one in which class separability after correction is maximal. Class separability here means the similarity of objects of one class to each other and their dissimilarity to objects of another class. Cleaning the data of class errors can consist either of correcting them or of removing outlier objects from the data set. The method was implemented as the FRiS-LCFS algorithm (FRiS Local Censoring with Feature Selection) and tested on model and real biomedical problems, including prostate cancer diagnosis based on DNA microarray analysis. The algorithm proved competitive with standard methods for filtering data in high-dimensional spaces.
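
The abstract does not state the FRiS function itself. As a rough illustration of the rival-similarity idea it describes, the sketch below uses the form commonly given in the FRiS literature, F = (r_rival - r_own) / (r_rival + r_own), where r_own is the distance from an object to its nearest neighbour in its own class and r_rival is the distance to its nearest neighbour in any other class. The formula choice, the Euclidean metric, and the name `fris_scores` are assumptions made for this example; this is not the FRiS-LCFS algorithm, and feature selection is not shown.

```python
import numpy as np

def fris_scores(X, y):
    """Rival similarity of each object to its own class versus the nearest other class.

    Illustrative sketch only. Assumes Euclidean distance, at least two objects
    per class, and no exact duplicates (otherwise a score can be NaN).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances between all objects.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # an object is not its own neighbour
    same = y[:, None] == y[None, :]
    # Distance to the nearest object with the same label.
    r_own = np.where(same, dist, np.inf).min(axis=1)
    # Distance to the nearest object with a different label (the "rival").
    r_rival = np.where(~same, dist, np.inf).min(axis=1)
    return (r_rival - r_own) / (r_rival + r_own)

# Toy data: the last object carries label 1 but lies among the class-0 objects.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [0.5, 0.5]])
y = np.array([0, 0, 1, 1, 1])
scores = fris_scores(X, y)
suspect = int(np.argmin(scores))  # index of the object most similar to its rival class
```

Scores lie in [-1, 1]: values near 1 mean an object sits close to its own class, while strongly negative values mark objects that look more like a rival class and are therefore candidates for relabeling or removal. In this toy set the mislabeled last object receives the lowest score.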

Publisher

Institute of Mathematical Problems of Biology of RAS (IMPB RAS)

Subject

Applied Mathematics, Biomedical Engineering

Reference: 23 articles.

1. Introduction to Statistical Data Editing and Imputation

2. Barnett V., Lewis T. Outliers in Statistical Data. Chichester: John Wiley and Sons; 1994. 584 p.

3. Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data

4. Farcomeni A., Greco L. Robust Methods for Data Reduction. Chapman and Hall/CRC; 2015. 297 p.

5. Teng C.M. A comparison of noise handling techniques. In: Proceedings of the Fourteenth International Florida Artificial Intelligence Research Society Conference. 2001. P. 269–273.