A real data-driven simulation strategy to select an imputation method for mixed-type trait data

Published: 2023-03-22 Issue: 3 Volume: 19 Page: e1010154
ISSN: 1553-7358
Container-title: PLOS Computational Biology
Language: en
Short-container-title: PLoS Comput Biol

Author:

May Jacqueline A., Feng Zeny, Adamowicz Sarah J.

Abstract

Missing observations in trait datasets pose an obstacle for analyses in myriad biological disciplines. Considering the mixed results of imputation, the wide variety of available methods, and the varied structure of real trait datasets, a framework for selecting a suitable imputation method is advantageous. We invoked a real data-driven simulation strategy to select an imputation method for a given mixed-type (categorical, count, continuous) target dataset. Candidate methods included mean/mode imputation, k-nearest neighbour, random forests, and multivariate imputation by chained equations (MICE). Using a trait dataset of squamates (lizards and amphisbaenians; order: Squamata) as a target dataset, a complete-case dataset consisting of species with nearly complete information was formed for the imputation method selection. Missing data were induced by removing values from this dataset under different missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). For each method, combinations with and without phylogenetic information from single gene (nuclear and mitochondrial) or multigene trees were used to impute the missing values for five numerical and two categorical traits. The performances of the methods were evaluated under each missingness mechanism by determining the mean squared error and proportion falsely classified rates for numerical and categorical traits, respectively. A random forest method supplemented with a nuclear-derived phylogeny resulted in the lowest error rates for the majority of traits, and this method was used to impute missing values in the original dataset. Data with imputed values better reflected the characteristics and distributions of the original data compared to complete-case data. However, caution should be taken when imputing trait data as phylogeny did not always improve performance for every trait and in every scenario. Ultimately, these results support the use of a real data-driven simulation strategy for selecting a suitable imputation method for a given mixed-type trait dataset.
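The simulation loop described in the abstract (remove known values under MCAR, MAR, or MNAR, impute them, then score the imputations) can be illustrated with a minimal Python sketch. This is not the authors' code: the helper names (`induce_missing`, `impute_mean`, `impute_mode`, `mse`, `pfc`), the synthetic traits, and the rank-weighted way of generating MAR and MNAR masks are all assumptions made for illustration, and only the simplest candidate method (mean/mode imputation) is shown.

```python
import numpy as np

def ranks(v):
    """Ranks 1..n of a numeric vector (ties broken by position)."""
    return np.argsort(np.argsort(v)) + 1.0

def induce_missing(x, covariate, mechanism, frac, rng):
    """Return a boolean mask (True = value removed) covering round(frac * n) entries.

    Assumed scheme: removal probability is uniform (MCAR), increases with an
    observed covariate (MAR), or increases with the trait's own value (MNAR).
    """
    n = len(x)
    k = int(round(frac * n))
    if mechanism == "MCAR":
        weights = np.ones(n)
    elif mechanism == "MAR":
        weights = ranks(covariate)
    elif mechanism == "MNAR":
        weights = ranks(x)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    idx = rng.choice(n, size=k, replace=False, p=weights / weights.sum())
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    return mask

def impute_mean(x, mask):
    """Fill removed numerical values with the mean of the retained values."""
    out = np.asarray(x, dtype=float).copy()
    out[mask] = out[~mask].mean()
    return out

def impute_mode(x, mask):
    """Fill removed categorical values with the most frequent retained category."""
    values, counts = np.unique(x[~mask], return_counts=True)
    out = x.copy()
    out[mask] = values[np.argmax(counts)]
    return out

def mse(true, imputed, mask):
    """Mean squared error over the removed entries of a numerical trait."""
    return float(np.mean((true[mask] - imputed[mask]) ** 2))

def pfc(true, imputed, mask):
    """Proportion falsely classified over the removed entries of a categorical trait."""
    return float(np.mean(true[mask] != imputed[mask]))

# Synthetic stand-ins for a complete-case dataset (hypothetical traits).
rng = np.random.default_rng(0)
n = 200
covariate = rng.normal(size=n)                       # fully observed trait
numeric = covariate + rng.normal(scale=0.5, size=n)  # numerical trait to impute
categorical = rng.choice(np.array(["a", "b", "c"]), size=n, p=[0.6, 0.3, 0.1])

results = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    num_mask = induce_missing(numeric, covariate, mechanism, 0.2, rng)
    # Category labels have no order, so only MCAR/MAR-style weights apply here.
    cat_mech = "MAR" if mechanism == "MNAR" else mechanism
    cat_mask = induce_missing(covariate, covariate, cat_mech, 0.2, rng)
    results[mechanism] = {
        "mse": mse(numeric, impute_mean(numeric, num_mask), num_mask),
        "pfc": pfc(categorical, impute_mode(categorical, cat_mask), cat_mask),
    }

for mechanism, scores in results.items():
    print(mechanism, scores)
```

In a fuller version of this sketch, each candidate method (k-nearest neighbour, random forests, MICE, with or without phylogenetic information) would replace the mean/mode step, the loop would be repeated over many simulated masks, and the method with the lowest error for most traits would be applied to the original dataset.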

Funder

Canada First Research Excellence Fund

University of Guelph

Natural Sciences and Engineering Research Council of Canada

Genome Canada and Ontario Genomics and by the Ontario Ministry of Economic Development, Job Creation and Trade

Publisher

Public Library of Science (PLoS)

Subject

Computational Theory and Mathematics; Cellular and Molecular Neuroscience; Genetics; Molecular Biology; Ecology; Modeling and Simulation; Ecology, Evolution, Behavior and Systematics

References (80 articles)

1. Extreme lifespan of the human fish (Proteus anguinus): a challenge for ageing mechanisms; Y Voituron; Biol Lett, 2011

2. Global gradients of avian longevity support the classic evolutionary theory of ageing; M Valcu; Ecography, 2014

3. Amphibians over the edge: silent extinction risk of Data Deficient species; SD Howard; Divers Distrib, 2014

4. Species’ traits influenced their response to recent climate change; M Pacifici; Nat Clim Change, 2017

5. Nonrandom variation in within-species sample size and missing data in phylogenetic comparative studies; LZ Garamszegi; Syst Biol, 2011

Cited by 2 articles.

1. Benchmarking imputation methods for categorical biological data; Methods in Ecology and Evolution; 2024-07-24

2. Functional diversity metrics can perform well with highly incomplete data sets; Methods in Ecology and Evolution; 2023-09-29