Abstract
Whereas protein language models have demonstrated remarkable efficacy in predicting the effects of missense variants, DNA counterparts have not yet achieved a similar competitive edge for genome-wide variant effect predictions, especially in complex genomes such as that of humans. To address this challenge, we here introduce GPN-MSA, a novel framework for DNA language models that leverages whole-genome sequence alignments across multiple species and takes only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC, and OMIM) and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and non-coding variants.
Publisher
Cold Spring Harbor Laboratory