Abstract
The promise of using machine learning (ML) to extract scientific insights from high-dimensional datasets is tempered by the frequent presence of confounding variables, and it behooves scientists to determine whether a model has extracted the desired information or has instead fallen prey to bias. Due both to features of many natural phenomena and to practical constraints of experimental design, complex bioscience datasets tend to be organized in nested hierarchies, which can obfuscate the origin of a confounding effect and obviate traditional methods of confounder amelioration. We propose a simple non-parametric statistical method called the Rank-to-Group (RTG) score that can identify hierarchical confounder effects in raw data and ML-derived data embeddings. We show that RTG scores correctly assign the effects of hierarchical confounders in cases where linear methods such as regression fail. In a large public biomedical image dataset, we discover unreported effects of experimental design. We then use RTG scores to discover cross-modal correlated variability in a complex multi-phenotypic biological dataset. This approach should be of general use in experiment–analysis cycles and to ensure confounder robustness in ML models.
Publisher
Cold Spring Harbor Laboratory