Abstract
Computational methods for identifying gene–disease associations can use both genomic and phenotypic information to prioritize genes and variants that may be associated with genetic diseases. Phenotype-based methods commonly rely on comparing phenotypes observed in a patient with a database of genotype-to-phenotype associations using a measure of semantic similarity, and are primarily limited by the quality and completeness of this database as well as the quality of phenotypes assigned to a patient. Genotype-to-phenotype associations used by these methods are largely derived from the literature and coded using phenotype ontologies. Large Language Models (LLMs) have been trained on large amounts of text and have shown potential to answer complex questions across multiple domains. Here, we demonstrate that LLMs can prioritize disease-associated genes as well as, or better than, dedicated bioinformatics methods that rely on calculated phenotype similarity. The LLMs use only natural language information as background knowledge and do not require ontology-based phenotyping or structured genotype-to-phenotype knowledge. We use a cohort of undiagnosed patients with rare diseases and show that LLMs can be used to provide diagnostic support that helps in identifying plausible candidate genes.
Publisher
Cold Spring Harbor Laboratory