Benchmarking imputation methods for discrete biological data

Published: 2023-04-07

Author:

Gendre Matthieu, Hauffe Torsten, Pimiento Catalina, Silvestro Daniele

Abstract

Trait datasets underpin a large share of ecological and evolutionary research: they are used to infer ancestral morphologies, to quantify species extinction risks, and to evaluate the functional diversity of biological communities. These datasets, however, are often plagued by missing data, for instance because of incomplete sampling or limited data and resource availability. Several imputation methods exist to predict missing values, and they have been successfully evaluated and used to fill gaps in datasets of quantitative traits. Here we explore the performance of different imputation methods on discrete biological traits, i.e., qualitative or categorical traits such as diet or habitat. We develop a bioinformatics pipeline to impute trait data that combines phylogenetic, machine learning, and deep learning methods and integrates a simulation framework to evaluate their performance on synthetic datasets. Using this pipeline, we run a wide range of simulations under different missing rates, missingness mechanisms, and biases, and under different evolutionary models. Our results indicate that a new ensemble approach, which combines the results of a selection of imputation methods, provides the most robust and accurate prediction of missing discrete traits. We apply our pipeline to an incomplete trait dataset of 1015 elasmobranch species (including sharks and rays) and find a high imputation accuracy, based on an expert-based assessment of the missing traits. Our bioinformatics pipeline, implemented in an open-source R package, facilitates the application and comparison of multiple imputation methods to make robust predictions of missing trait values in biological datasets.
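
The evaluation idea described in the abstract (hide known values, impute them with several methods, combine the predictions, and score against the truth) can be illustrated with a small sketch. This is not the authors' R pipeline: the three toy imputers, the majority-vote combination rule, the use of list position as a stand-in for phylogenetic proximity, and all function names below are assumptions chosen only to make the idea concrete in Python.

```python
import random
from collections import Counter


def mask_mcar(values, missing_rate, rng):
    """Hide a fraction of values completely at random (one possible mechanism)."""
    n_missing = round(len(values) * missing_rate)
    hidden = sorted(rng.sample(range(len(values)), n_missing))
    masked = list(values)
    for i in hidden:
        masked[i] = None
    return masked, hidden


def impute_mode(masked):
    """Toy imputer 1: fill every gap with the most common observed state."""
    observed = [v for v in masked if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in masked]


def impute_nearest(masked):
    """Toy imputer 2: copy the state of the nearest observed entry.

    List position is a hypothetical stand-in for relatedness between species.
    """
    observed = [i for i, v in enumerate(masked) if v is not None]
    filled = list(masked)
    for i, v in enumerate(masked):
        if v is None:
            nearest = min(observed, key=lambda j: abs(i - j))
            filled[i] = masked[nearest]
    return filled


def impute_random(masked, rng):
    """Toy imputer 3: draw a state in proportion to observed frequencies."""
    observed = [v for v in masked if v is not None]
    return [rng.choice(observed) if v is None else v for v in masked]


def majority_vote(predictions):
    """Combine per-method predictions; ties go to the earliest method listed."""
    combined = []
    for states in zip(*predictions):
        counts = Counter(states)
        best = max(counts.values())
        combined.append(next(s for s in states if counts[s] == best))
    return combined


def accuracy(truth, imputed, hidden):
    """Share of hidden entries whose imputed state matches the true state."""
    return sum(truth[i] == imputed[i] for i in hidden) / len(hidden)


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic trait with runs of identical states, so neighbors are informative.
    truth = (["benthic"] * 6 + ["pelagic"] * 4) * 10
    masked, hidden = mask_mcar(truth, missing_rate=0.2, rng=rng)
    methods = {
        "mode": impute_mode(masked),
        "nearest": impute_nearest(masked),
        "random": impute_random(masked, rng),
    }
    methods["ensemble"] = majority_vote(list(methods.values()))
    for name, imputed in methods.items():
        print(name, round(accuracy(truth, imputed, hidden), 3))
```

Repeating such a run over several missing rates and masking schemes, and averaging the scores, gives a simple benchmark table; the actual combination rule and methods used in the paper may differ from this sketch.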

Publisher

Cold Spring Harbor Laboratory
