Affiliation:
1. Department of Biostatistics and Epidemiology, University of Massachusetts–Amherst, Amherst, Massachusetts, USA
2. Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, USA
3. Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA
4. Division of Biostatistics, University of California–Berkeley, Berkeley, California, USA
5. Division of Biostatistics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
Abstract
Gaussian graphical models (GGMs) are a popular form of network model in which nodes represent features in multivariate normal data and edges reflect conditional dependencies between these features. GGM estimation is an active area of research. Currently available tools for GGM estimation require investigators to make several choices regarding algorithms, scoring criteria, and tuning parameters. An estimated GGM may be highly sensitive to these choices, and the accuracy of each method can vary based on structural characteristics of the network such as topology, degree distribution, and density. Because these characteristics are a priori unknown, it is not straightforward to establish universal guidelines for choosing a GGM estimation method. We address this problem by introducing SpiderLearner, an ensemble method that constructs a consensus network from multiple estimated GGMs. Given a set of candidate methods, SpiderLearner estimates the optimal convex combination of results from each method using a likelihood‐based loss function. K‐fold cross‐validation is applied in this process, reducing the risk of overfitting. In simulations, SpiderLearner performs better than or comparably to the best candidate methods according to a variety of metrics, including relative Frobenius norm and out‐of‐sample likelihood. We apply SpiderLearner to publicly available ovarian cancer gene expression data including 2013 participants from 13 diverse studies, demonstrating our tool's potential to identify biomarkers of complex disease. SpiderLearner is implemented as flexible, extensible, open‐source code in the R package ensembleGGM at https://github.com/katehoffshutta/ensembleGGM.
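The ensemble idea in the abstract (a convex combination of candidate estimates, weighted by a likelihood‐based loss on held‐out folds) can be illustrated with a minimal sketch. It is not the ensembleGGM API, which is an R package: every function name below is hypothetical. It assumes the candidate precision matrices have already been estimated on each training fold, restricts itself to two candidates, and uses a plain grid search where a real implementation would more likely use a constrained optimizer. The loss is the Gaussian negative log‐likelihood up to constants, tr(SΘ) − log det Θ, evaluated on the held‐out sample covariance S.

```python
import math


def log_det_spd(a):
    """Log-determinant of a symmetric positive definite matrix via Cholesky."""
    n = len(a)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    raise ValueError("matrix is not positive definite")
                chol[i][j] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(n))


def trace_product(a, b):
    """trace(a @ b) for two square matrices of the same size."""
    n = len(a)
    return sum(a[i][k] * b[k][i] for i in range(n) for k in range(n))


def neg_loglik(theta, s_test):
    """Gaussian negative log-likelihood of precision matrix theta on held-out
    sample covariance s_test, up to additive constants and scale."""
    return trace_product(s_test, theta) - log_det_spd(theta)


def combine(thetas, weights):
    """Weighted sum of candidate precision matrices (weights sum to 1)."""
    n = len(thetas[0])
    return [
        [sum(w * t[i][j] for w, t in zip(weights, thetas)) for j in range(n)]
        for i in range(n)
    ]


def best_two_way_weight(fold_thetas, fold_s_test, steps=100):
    """Grid-search the weight alpha on candidate A (1 - alpha on candidate B)
    that minimizes the held-out loss summed over folds.

    fold_thetas[k] -- (theta_a, theta_b) fitted on the training part of fold k
    fold_s_test[k] -- sample covariance of the held-out part of fold k
    """
    best_alpha, best_loss = 0.0, float("inf")
    for step in range(steps + 1):
        alpha = step / steps
        loss = sum(
            neg_loglik(combine(pair, (alpha, 1.0 - alpha)), s)
            for pair, s in zip(fold_thetas, fold_s_test)
        )
        if loss < best_loss:
            best_alpha, best_loss = alpha, loss
    return best_alpha, best_loss


# Toy usage with a single fold and made-up 2x2 matrices (illustrative only).
theta_a = [[1.0, 0.0], [0.0, 1.0]]                  # an overly sparse candidate
theta_b = [[5 / 3, -4 / 3], [-4 / 3, 5 / 3]]        # an overly dense candidate
s_holdout = [[1.0, 0.5], [0.5, 1.0]]                # held-out sample covariance
alpha, loss = best_two_way_weight([(theta_a, theta_b)], [s_holdout])
print(alpha, loss)
```

Because a convex combination of positive definite matrices is itself positive definite, every point on the grid yields a valid precision matrix, which is one reason a convex combination is a convenient way to form a consensus network.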
Funder
U.S. National Library of Medicine
Subject
Statistics and Probability, Epidemiology