Author:
Jararweh Ala, Macaulay Oladimeji, Arredondo David, Oyebamiji Olufunmilola M, Hu Yue, Tafoya Luis, Zhang Yanfu, Virupakshappa Kushal, Sahu Avinash
Abstract
Representation learning approaches leverage sequence, expression, and network data, but utilize only a fraction of the rich textual knowledge accumulated in the scientific literature. We present LitGene, an interpretable transformer-based model that refines gene representations by integrating textual information. The model is enhanced through a Contrastive Learning (CL) approach that identifies semantically similar genes sharing a Gene Ontology (GO) term. LitGene demonstrates accuracy across eight benchmark predictions of protein properties and robust zero-shot learning capabilities, enabling the prediction of new potential disease risk genes in obesity, asthma, hypertension, and schizophrenia. LitGene’s SHAP-based interpretability tool illuminates the basis for identified disease-gene associations. An automated statistical framework gauges literature support for AI biomedical predictions, providing validation and improving reliability. LitGene’s integration of textual and genetic information mitigates data biases, enhances biomedical predictions, and promotes ethical AI practices by ensuring transparent, equitable, open, and evidence-based insights. LitGene code is open source and also available for use via a public web interface at litgene.avisahuai.com.
Publisher
Cold Spring Harbor Laboratory