Abstract
Trait datasets underpin a large share of ecological and evolutionary research, being used to infer ancestral morphologies, to quantify species extinction risks, or to evaluate the functional diversity of biological communities. These datasets, however, are often plagued by missing data, for instance due to incomplete sampling or limited data and resource availability. Several imputation methods exist to predict missing values and have been successfully evaluated and used to fill the gaps in datasets of quantitative traits. Here we explore the performance of different imputation methods on discrete biological traits, i.e., qualitative or categorical traits such as diet or habitat. We develop a bioinformatics pipeline to impute trait data that combines phylogenetic, machine learning, and deep learning methods and integrates a simulation framework to evaluate their performance on synthetic datasets. Using this pipeline, we run a wide range of simulations under different missing rates, mechanisms, and biases, and under different evolutionary models. Our results indicate that a new ensemble approach, in which we combine the results of a selection of imputation methods, provides the most robust and accurate prediction of missing discrete traits. We apply our pipeline to an incomplete trait dataset of 1015 elasmobranch species (including sharks and rays) and find that the predictions are highly accurate, as judged by an expert-based assessment of the missing traits. Our bioinformatics pipeline, implemented in an open-source R package, facilitates the application and comparison of multiple imputation methods to make robust predictions of missing trait values in biological datasets.
Publisher
Cold Spring Harbor Laboratory