Efficient technique of microarray missing data imputation using clustering and weighted nearest neighbour-Reference-Cited by-同舟云学术

Efficient technique of microarray missing data imputation using clustering and weighted nearest neighbour

Published:2021-12 Issue:1 Volume:11 Page:
ISSN:2045-2322
Container-title:Scientific Reports
language:en
Short-container-title:Sci Rep

Author:

Dubey Aditya,Rasool Akhtar

Abstract

AbstractFor most bioinformatics statistical methods, particularly for gene expression data classification, prognosis, and prediction, a complete dataset is required. The gene sample value can be missing due to hardware failure, software failure, or manual mistakes. The missing data in gene expression research dramatically affects the analysis of the collected data. Consequently, this has become a critical problem that requires an efficient imputation algorithm to resolve the issue. This paper proposed a technique considering the local similarity structure that predicts the missing data using clustering and top K nearest neighbor approaches for imputing the missing value. A similarity-based spectral clustering approach is used that is combined with the K-means. The spectral clustering parameters, cluster size, and weighting factors are optimized, and after that, missing values are predicted. For imputing each cluster’s missing value, the top K nearest neighbor approach utilizes the concept of weighted distance. The evaluation is carried out on numerous datasets from a variety of biological areas, with experimentally inserted missing values varying from 5 to 25%. Experimental results prove that the proposed imputation technique makes accurate predictions as compared to other imputation procedures. In this paper, for performing the imputation experiments, microarray gene expression datasets consisting of information of different cancers and tumors are considered. The main contribution of this research states that local similarity-based techniques can be used for imputation even when the dataset has varying dimensionality and characteristics.

Publisher

Springer Science and Business Media LLC

Subject

Multidisciplinary

Link

https://www.nature.com/articles/s41598-021-03438-x.pdf

Reference35 articles.

1. Kurgan, L., Cios, K., Sontag, M., Accurso, F. & Frankatchden, A. Mining the cystic fibrosis data. Next Generation of Data-Mining Applications 415–444 (2005).

2. Lockhart, D. J. & Winzeleer, E. A. Genomics, gene expression and dna arrays. Nature 405, 827–836 (2000).

3. Saeys, Y., Inza, I. & Larrañaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics 23, 2507–2517 (2007).

4. Moskon, M. & Mraz, M. Systematic approach to computational design of gene regulatory networks with information processing capabilities. IEEE/ACM Trans. Comput. Biol. Bioinf. 11, 431–440 (2014).

5. Chan, H., Tsui, S. & Mok, T. Data mining on dna sequences of hepatitis b virus. IEEE/ACM Trans. Comput. Biol. Bioinf. 8, 428–440 (2011).

Cited by 28 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. Optimised multiple data partitions for cluster-wise imputation of missing values in gene expression data;Expert Systems with Applications;2024-12

2. CDRM: Causal disentangled representation learning for missing data;Knowledge-Based Systems;2024-09

3. Leveraging Quadratic Polynomials in Python for Advanced Data Analysis;F1000Research;2024-08-20

4. NNAWA: A Granular Nearest Neighbor Imputation Technique Based on Alpha-Weighted Average;2024 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE);2024-06-30

5. Revisiting the Problem of Missing Values in High-Dimensional Data and Feature Selection Effect;IFIP Advances in Information and Communication Technology;2024