Abstract
Machine learning (ML) is increasingly used to guide biological discovery in biomedicine, for example by prioritizing promising small molecules in drug discovery. In these applications, ML models predict the properties of biological systems, and researchers use the predictions to prioritize candidates as new biological hypotheses for downstream experimental validation. However, when applied to unseen situations, these models can be overconfident and produce a large number of false positives. One way to address this issue is to quantify the model’s prediction uncertainty and return a set of hypotheses whose false discovery rate (FDR) is controlled at a level pre-specified by the researcher. We propose CPEC, an ML framework for FDR-controlled biological discovery. We demonstrate its effectiveness using enzyme function annotation as a case study, simulating the discovery process of identifying the functions of less-characterized enzymes. CPEC integrates a deep learning model with a statistical tool known as conformal prediction, providing accurate and FDR-controlled function predictions for a given enzyme. Conformal prediction equips the predictive model with rigorous statistical guarantees and ensures that the expected FDR will not exceed a user-specified level with high probability. Evaluation experiments show that CPEC achieves reliable FDR control, better or comparable prediction performance at a lower FDR than existing methods, and accurate predictions for enzymes under-represented in the training data. We expect CPEC to be a useful tool for biological discovery applications where a high yield rate in validation experiments is desired but the experimental budget is limited.
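To make the idea of FDR-controlled prediction concrete, the following is a minimal sketch of threshold calibration in the style of conformal risk control. It is not the CPEC implementation: the function names, the threshold grid, and the finite-sample correction `(n * risk + 1) / (n + 1)` are assumptions chosen for illustration. Given model scores and true labels on a held-out calibration set, it scans thresholds from strict to permissive and keeps the most permissive one reached before the corrected average false discovery proportion first exceeds the target level `alpha`.

```python
import numpy as np

def false_discovery_proportion(scores, labels, threshold):
    """Per-example FDP: fraction of predicted labels that are wrong.

    scores: (n_examples, n_labels) array of model scores.
    labels: (n_examples, n_labels) array of 0/1 ground truth.
    An example with no predicted labels contributes an FDP of 0.
    """
    labels = np.asarray(labels).astype(bool)
    predicted = np.asarray(scores) >= threshold
    n_predicted = predicted.sum(axis=1)
    n_false = (predicted & ~labels).sum(axis=1)
    return np.where(n_predicted > 0, n_false / np.maximum(n_predicted, 1), 0.0)

def calibrate_threshold(cal_scores, cal_labels, alpha, grid=None):
    """Pick a score threshold from calibration data (hypothetical helper).

    Scans from the strictest threshold downward and stops at the first
    violation, because the FDP is not guaranteed to be monotone in the
    threshold. Returns None if even the strictest threshold fails.
    """
    if grid is None:
        grid = np.linspace(1.0, 0.0, 101)  # assumed grid over [0, 1] scores
    n = np.asarray(cal_scores).shape[0]
    chosen = None
    for t in grid:
        risk = false_discovery_proportion(cal_scores, cal_labels, t).mean()
        # Finite-sample correction with the loss bounded by 1 (assumption).
        adjusted = (n * risk + 1.0) / (n + 1)
        if adjusted > alpha:
            break
        chosen = float(t)
    return chosen

def predict_set(scores, threshold):
    """Indices of labels whose score clears the calibrated threshold."""
    return np.flatnonzero(np.asarray(scores) >= threshold)
```

At test time, `predict_set` would be applied to a new enzyme's score vector using the calibrated threshold. A smaller `alpha` yields a stricter threshold and fewer predicted functions, which mirrors the trade-off between yield and experimental budget described in the abstract.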
Funder
National Institute of General Medical Sciences
Amazon
University of Illinois at Urbana-Champaign
Publisher
Public Library of Science (PLoS)