Abstract
Background
Partitioning diseases into distinct subtypes is crucial for study, effective treatment, and the discovery of potential cures. The Open Targets Platform integrates biomedical, genetic, and biochemical datasets with the goal of empowering disease ontologies and gene targets. However, many disease annotations remain incomplete, necessitating laborious expert medical input. This burden is particularly acute for rare and orphan diseases, where resources are limited.
Results
We present a machine learning approach to identifying diseases with potential subtypes, using the approximately 23,000 diseases documented in Open Targets. We derive and describe novel features for predicting diseases with subtypes, using direct evidence. Machine learning models were applied to analyze feature importance and evaluate predictive performance for discovering known subtypes. Our model achieves a ROC AUC of 89.1%. We integrated pre-trained deep learning language models and showed their benefits. Furthermore, we identify 515 disease candidates predicted to possess previously unannotated subtypes.
Conclusions
Our models can partition diseases into distinct subtypes. This methodology enables a robust, scalable approach for improving knowledge-based annotations and a comprehensive assessment of disease ontology tiers. Our candidates are attractive targets for further study and personalized medicine, potentially aiding in the unveiling of new therapeutic indications for sought-after targets.
Publisher
Cold Spring Harbor Laboratory