Gdaphen, R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data-Reference-Cited by-同舟云学术

Gdaphen, R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data

Published:2023-01-26 Issue:1 Volume:24 Page:
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en
Short-container-title:BMC Bioinformatics

Author:

Muñiz Moreno Maria del Mar,Gavériaux-Ruff Claire,Herault Yann^ORCID

Abstract

Abstract Background In individuals or animals suffering from genetic or acquired diseases, it is important to identify which clinical or phenotypic variables can be used to discriminate between disease and non-disease states, the response to treatments or sexual dimorphism. However, the data often suffers from low number of samples, high number of variables or unbalanced experimental designs. Moreover, several parameters can be recorded in the same test. Thus, correlations should be assessed, and a more complex statistical framework is necessary for the analysis. Packages already exist that provide analysis tools, but they are not found together, rendering the decision method and implementation difficult for non-statisticians. Result We present Gdaphen, a fast joint-pipeline allowing the identification of most important qualitative and quantitative predictor variables to discriminate between genotypes, treatments, or sex. Gdaphen takes as input behavioral/clinical data and uses a Multiple Factor Analysis (MFA) to deal with groups of variables recorded from the same individuals or anonymize genotype-based recordings. Gdaphen uses as optimized input the non-correlated variables with 30% correlation or higher on the MFA-Principal Component Analysis (PCA), increasing the discriminative power and the classifier’s predictive model efficiency. Gdaphen can determine the strongest variables that predict gene dosage effects thanks to the General Linear Model (GLM)-based classifiers or determine the most discriminative not linear distributed variables thanks to Random Forest (RF) implementation. Moreover, Gdaphen provides the efficacy of each classifier and several visualization options to fully understand and support the results as easily readable plots ready to be included in publications. We demonstrate Gdaphen capabilities on several datasets and provide easily followable vignettes. Conclusions Gdaphen makes the analysis of phenotypic data much easier for medical or preclinical behavioral researchers, providing an integrated framework to perform: (1) pre-processing steps as data imputation or anonymization; (2) a full statistical assessment to identify which variables are the most important discriminators; and (3) state of the art visualizations ready for publication to support the conclusions of the analyses. Gdaphen is open-source and freely available at https://github.com/munizmom/gdaphen, together with vignettes, documentation for the functions and examples to guide you in each own implementation.

Funder

Agence Nationale de la recherche

Horizon 2020 Framework Programme

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/s12859-022-05111-0.pdf

Reference31 articles.

1. Serdar CC, Cihan M, Yücel D, Serdar MA. Sample size, power and effect size revisited: simplified and practical approaches in pre-clinical, clinical and laboratory studies. Biochem Med (Zagreb). 2021;31(1):010502.

2. Faber J, Fonseca LM. How sample size influences research outcomes. Dental Press J Orthod. 2014;19(4):27–9.

3. Barbour DL. Precision medicine and the cursed dimensions. NPJ Digit Med. 2019;2(1):4.

4. Vanhoeyveld J, Martens D. Imbalanced classification in sparse and large behaviour datasets. Data Min Knowl Discov. 2018;32(1):25–82.

5. de Cnudde S, Ramon Y, Martens D, Provost F. Deep learning on big, sparse. Behav Data Big Data. 2019;7(4):286–307.

Cited by 2 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. Shaping down syndrome brain cognitive and molecular changes due to aging using adult animals from the Ts66Yah murine model;Neurobiology of Disease;2024-06

2. Cognitive and molecular characterization of the Ts66Yah murine model of Down syndrome: deepening on hippocampal changes associated with genotype and aging;2024-01-02