Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes-Reference-Cited by-同舟云学术

Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes

Published:2008-01-10 Issue:1 Volume:9 Page:
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en
Short-container-title:BMC Bioinformatics

Author:

Brock Guy N,Shaffer John R,Blakesley Richard E,Lotz Meredith J,Tseng George C

Abstract

Abstract Background Gene expression data frequently contain missing values, however, most down-stream analyses for microarray experiments require complete data. In the literature many methods have been proposed to estimate missing values via information of the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions for which each method is preferred remains largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures × time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set. Results We found that the optimal imputation algorithms (LSA, LLS, and BPCA) are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrixes and found that, by incorporating this information, the entropy-based selection (EBS) scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS) scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost. Conclusion Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS and BPCA) are competitive with each other. Global-based imputation methods (PLS, SVD, BPCA) performed better on mcroarray data with lower complexity, while neighbour-based methods (KNN, OLS, LSA, LLS) performed better in data with higher complexity. We also found that the EBS and STS schemes serve as complementary and effective tools for selecting the optimal imputation algorithm.

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/1471-2105-9-12.pdf

Reference31 articles.

1. Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB: Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17(6):520–525. 10.1093/bioinformatics/17.6.520

2. Oba S, Sato MA, Takemasa I, Monden M, Matsubara K, Ishii S: A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 2003, 19(16):2088–2096. 10.1093/bioinformatics/btg287

3. Sehgal MS, Gondal I, Dooley LS: Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data. Bioinformatics 2005, 21(10):2417–2423. 10.1093/bioinformatics/bti345

4. Gan X, Liew AW, Yan H: Microarray missing data imputation based on a set theoretic framework and biological knowledge. Nucleic Acids Res 2006, 34(5):1608–1619. 10.1093/nar/gkl047

5. Tuikkala J, Elo L, Nevalainen OS, Aittokallio T: Improving missing value estimation in microarray data with gene ontology. Bioinformatics 2006, 22(5):566–572. 10.1093/bioinformatics/btk019

Cited by 78 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. Semi-Supervised Learning in Bioinformatics;Reference Module in Life Sciences;2024

2. Big Data in Gastroenterology Research;International Journal of Molecular Sciences;2023-01-27

3. Computational approaches for predicting variant impact: An overview from resources, principles to applications;Frontiers in Genetics;2022-09-29

4. Development and Validation of a New Multiparametric Random Survival Forest Predictive Model for Breast Cancer Recurrence with a Potential Benefit to Individual Outcomes;Cancer Management and Research;2022-03

5. Latent triple trajectories of substance use as predictors for the onset of antisocial personality disorder among urban African American and Puerto Rican adults: A 22-year longitudinal study;Substance Abuse;2022-01-24