Abstract
Background
Genomic testing, such as exome sequencing and genome sequencing, is widely used to diagnose rare Mendelian disorders. Because these tests identify a large number of variants, interpreting the final variant list and identifying the disease-causing variant can be labor-intensive and time-consuming, even after likely benign variants are filtered out. The burden grows when other variant types, such as structural variants, must be considered together with small variants. One way to accelerate interpretation is to prioritize all variants accurately so that the most likely diagnostic variant(s) are clearly distinguished from the rest.
Methods
To comprehensively predict genomic test results, we developed a deep learning-based variant prioritization system that leverages multiple instance learning and takes multiple variant types as input. We additionally adopted learning to rank (LTR) for optimal prioritization. We retrospectively developed and validated the model with 5-fold cross-validation in 23,115 patients with suspected rare diseases who underwent whole exome sequencing. Furthermore, we conducted an ablation test to confirm the effectiveness of LTR and the importance of permutational features for model interpretation. We also compared the prioritization performance with that of publicly available variant prioritization tools.
Results
The model showed an average AUROC of 0.92 for the genomic test results. It achieved a hit rate at 5 of 96.8% when prioritizing single nucleotide variants (SNVs)/small insertions and deletions (INDELs) and copy number variants (CNVs) together, and 95.0% when prioritizing CNVs alone. For SNV/INDEL-only prioritization, our model outperformed publicly available variant prioritization tools. In addition, the ablation test showed that the model using LTR significantly outperformed the baseline model without LTR in variant prioritization (p=0.007).
Conclusion
A deep learning model leveraging multiple instance learning accurately predicted genetic testing conclusions while prioritizing multiple types of variants. This model is expected to accelerate variant interpretation by helping to identify disease-causing variants more quickly in rare genetic diseases.
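For readers unfamiliar with the "hit rate at 5" metric reported in the Results, the following is a minimal sketch of how such a metric is commonly computed. It assumes that a patient counts as a hit when at least one known causal variant appears among the top k ranked variants, and that only patients with a known causal variant enter the denominator. The abstract does not state the exact definition used in the study, and the function name, data layout, and variant identifiers below are hypothetical.

```python
from typing import Dict, List, Set


def hit_rate_at_k(
    ranked_variants: Dict[str, List[str]],
    causal_variants: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Fraction of patients with at least one causal variant in the top k.

    Assumed definition; the study's exact definition may differ.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    # Only patients with a known causal variant can contribute a hit.
    evaluable = [p for p in ranked_variants if causal_variants.get(p)]
    if not evaluable:
        raise ValueError("no patients with a known causal variant")
    hits = sum(
        1
        for p in evaluable
        if causal_variants[p] & set(ranked_variants[p][:k])
    )
    return hits / len(evaluable)


# Toy data with hypothetical identifiers (not taken from the study).
# Each list is ordered from highest to lowest predicted priority and
# mixes SNV/INDEL and CNV entries, as in joint prioritization.
ranked = {
    "patient_1": ["snv_a", "cnv_b", "snv_c"],
    "patient_2": ["snv_d", "snv_e", "snv_f", "snv_g", "snv_h", "cnv_i"],
    "patient_3": ["cnv_j", "snv_k"],
}
causal = {
    "patient_1": {"cnv_b"},
    "patient_2": {"cnv_i"},  # ranked 6th, so a miss at k=5
    "patient_3": {"cnv_j"},
}

rate_at_5 = hit_rate_at_k(ranked, causal, k=5)
```

In this toy example two of the three patients have their causal variant within the top 5, so `rate_at_5` is two thirds; raising `k` to 6 would also capture the second patient.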
Publisher
Cold Spring Harbor Laboratory