Author:
Ramasubramanian Abhirami, Sunderam Uma, Srinivasan Rajgopal
Abstract
Synonymous mutations can have a deleterious effect leading to disease, even though they do not alter the protein sequence. Genomic sites at which synonymous variants occur are frequently highly conserved across species. Several prediction methods have been developed to assess the impact of synonymous mutations, and they depend heavily on validated sets of both deleterious and benign synonymous mutations. However, unlike for missense mutations, validated data for deleterious synonymous mutations are sparse. Rather than develop a model for predicting the pathogenicity of synonymous variants, we seek to understand the relative importance of the various factors that lead to conservation at sites of synonymous variants. We built machine learning models using various features on a large set of reported and generated synonymous variants (Zeng Z et al., 2019) to predict conservation at genomic sites, measured by Genomic Evolutionary Rate Profiling – Rejected Substitution (GERP RS) base scores and Phylogenetic p-values for 100 vertebrates (PP100). We used the extreme gradient boosting classifier to classify sites as high, medium or low conservation at different cutoffs. Our experiments report an AUC between 0.74 and 0.79, and the sensitivity was significant. Of the features we explored, a few properties that are independent of the alternate allele were repeatedly flagged as having high impact. These findings provide information that predictors can use to further improve models of synonymous variant impact.
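The labelling step described above, in which sites are assigned to high, medium or low conservation classes at a chosen pair of cutoffs, can be sketched as follows. This is a minimal illustration only: the function name `conservation_class`, the example scores, the cutoff values, and the convention that a score equal to a cutoff falls into the higher class are all assumptions, not values or code from the study.

```python
from bisect import bisect_right


def conservation_class(score, low_cutoff, high_cutoff):
    """Map a conservation score (e.g. GERP RS) to 'low', 'medium' or 'high'.

    A score equal to a cutoff is placed in the higher class (an assumption).
    """
    if low_cutoff >= high_cutoff:
        raise ValueError("low_cutoff must be below high_cutoff")
    class_names = ("low", "medium", "high")
    # bisect_right counts how many cutoffs are less than or equal to the score.
    return class_names[bisect_right((low_cutoff, high_cutoff), score)]


# Hypothetical scores and cutoffs, for illustration only.
scores = [-3.1, 0.5, 2.0, 4.7, 5.9]
labels = [conservation_class(s, low_cutoff=2.0, high_cutoff=4.0) for s in scores]
print(labels)
```

Repeating the labelling with different cutoff pairs yields the alternative class definitions on which a classifier could then be trained and compared.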
Publisher
Cold Spring Harbor Laboratory