Abstract
Purpose: Prior studies demonstrate the significance of specific cis-regulatory variants in retinal disease; however, determining the functional impact of regulatory variants remains a major challenge. In this study, we use a machine learning approach, trained on epigenomic data from the adult human retina, to systematically quantify the predicted impact of cis-regulatory variants.

Methods: We used human retinal DNA accessibility data (ATAC-seq) to determine a set of 18.9k high-confidence putative cis-regulatory elements (CREs). Eighty percent of these elements were used to train a machine learning model based on a gapped k-mer support vector machine. In silico saturation mutagenesis and variant scoring were applied to predict the functional impact of all potential single nucleotide variants (SNVs) within cis-regulatory elements. Impact scores were tested in a 20% hold-out dataset and compared to population allele frequency, phylogenetic conservation, transcription factor (TF) binding motifs, and existing massively parallel reporter assay (MPRA) data.

Results: We generated a model that distinguishes between human retinal regulatory elements and negative test sequences with 95% accuracy. All possible SNVs were scored across a hold-out test set of 3.7k human retinal CREs. Variants with negative impact scores correlated with reduced population allele frequency, higher phylogenetic conservation of the reference allele, disruption of predicted TF binding motifs, and massively parallel reporter expression.

Conclusions: We demonstrated the utility of human retinal epigenomic data for training a machine learning model to predict the impact of non-coding regulatory sequence variants. Our model accurately scored sequences and predicted putative TF binding motifs. This approach has the potential to expedite the characterization of pathogenic non-coding sequence variants in the context of unexplained retinal disease.
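The in silico saturation mutagenesis step described in the Methods can be illustrated with a minimal sketch. It is not the authors' pipeline: the trained gapped k-mer support vector machine is replaced here by a hypothetical table of contiguous k-mer weights (`toy_weights`), and the function names `score_sequence` and `saturation_mutagenesis` are assumptions made for illustration. The impact of each variant is taken as the score of the mutated sequence minus the score of the reference sequence, so a negative value means the variant is predicted to weaken the element.

```python
BASES = "ACGT"

def score_sequence(seq, kmer_weights, k):
    """Sum the weights of every contiguous k-mer in seq (unlisted k-mers score 0).

    Stand-in for a trained model's sequence score; a real gapped k-mer SVM
    would be used here instead of this hypothetical lookup table.
    """
    return sum(kmer_weights.get(seq[i:i + k], 0.0)
               for i in range(len(seq) - k + 1))

def saturation_mutagenesis(seq, kmer_weights, k):
    """Return {(position, ref, alt): impact} for every possible SNV in seq."""
    ref_score = score_sequence(seq, kmer_weights, k)
    impacts = {}
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt == ref:
                continue  # only true substitutions are variants
            mutated = seq[:pos] + alt + seq[pos + 1:]
            # Impact = change in score caused by the single substitution.
            impacts[(pos, ref, alt)] = (
                score_sequence(mutated, kmer_weights, k) - ref_score
            )
    return impacts

# Hypothetical weights: two overlapping 4-mers marked as "regulatory".
toy_weights = {"TAAT": 2.0, "AATC": 1.0}
impacts = saturation_mutagenesis("GCTAATCC", toy_weights, k=4)

# List the most disruptive variants first.
for (pos, ref, alt), impact in sorted(impacts.items(), key=lambda kv: kv[1])[:3]:
    print(f"pos {pos}: {ref}>{alt} impact {impact:+.1f}")
```

In this toy example, substitutions inside the weighted k-mers receive negative impact scores, while substitutions in the flanking bases score zero. This mirrors the kind of per-base signal that the abstract compares against TF binding motifs.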
Publisher
Cold Spring Harbor Laboratory