Abstract
Aims
Monogenic diabetes is a group of diseases caused by rare variants in single genes. Multiple genes have been reported to cause monogenic diabetes, but information on the variants is not unified across resources. In this work, we aimed to develop an automated pipeline that collects all genetic variants associated with monogenic diabetes from different resources, unifies the data, and translates the genetic sequences into protein sequences.

Methods
The pipeline is written in Python using Jupyter notebooks. It consists of 6 modules that can be run separately. The translation step is performed using the ProVar tool, which is also written in Python. All code, along with the intermediate and final results, is publicly available for access and reuse.

Results
The resulting database contains 2701 genomic variants in total and is divided into two levels: variants reported to be associated with monogenic diabetes, and variants with evidence of pathogenicity. Of these, 2565 variants were found in the ClinVar database and the remaining 136 were found in the literature, showing that the overlap between resources is incomplete.

Conclusions
We have developed an automated pipeline for collecting and harmonizing data on genetic variants associated with monogenic diabetes. Furthermore, we have translated variant genetic sequences into protein sequences, accounting for all protein isoforms and their variants. This allows researchers to consolidate information on variant genes and proteins associated with monogenic diabetes and facilitates their study using proteomics or structural biology. Our open and flexible implementation using Jupyter notebooks allows the pipeline to be tailored, modified, and applied to other rare diseases.

Research in context
Monogenic diabetes is a group of Mendelian diseases with an autosomal-dominant pattern of inheritance.
Monogenic diabetes is mainly caused by rare genetic variants that are usually evaluated manually.
The data on the variants are stored in several resources and are not unified in terms of genomic coordinates, alleles, and variant annotation.
What can be done to systematically evaluate the variants and their protein consequences?
In this work, we have created an automated Jupyter notebook-based pipeline for collecting and unifying the variants associated with monogenic diabetes.
A database of the genetic variants was created and translated into all possible variant protein sequences.
These results will be used for the analysis of proteomics data and for protein structure modeling.
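
As an illustration of the unification problem described above (records from different resources disagreeing on how genomic coordinates and alleles are written), the following is a minimal Python sketch of one way to harmonize variant records and measure the overlap between resources. It is not the pipeline's actual code: the names `VariantKey`, `normalize`, and `merge_sources`, the choice of a (chromosome, position, reference allele, alternate allele) key, and all example records are assumptions made for illustration only, and the coordinates are made up rather than real variants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantKey:
    """Hypothetical unified identifier for a genomic variant."""
    chrom: str
    pos: int
    ref: str
    alt: str


def normalize(chrom, pos, ref, alt):
    """Map one raw record onto a unified key.

    Assumed conventions: drop a leading "chr" prefix, store the
    position as an integer, and upper-case the alleles.
    """
    chrom = str(chrom).strip()
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return VariantKey(chrom.upper(), int(pos), ref.strip().upper(), alt.strip().upper())


def merge_sources(sources):
    """Merge records from several resources.

    `sources` maps a resource name to a list of (chrom, pos, ref, alt)
    tuples. Returns a dict from unified key to the set of resource
    names that reported the variant.
    """
    merged = {}
    for name, records in sources.items():
        for record in records:
            merged.setdefault(normalize(*record), set()).add(name)
    return merged


# Made-up example records, not real variants.
clinvar_records = [("chr12", 1000, "g", "A"), ("7", 2000, "C", "T")]
literature_records = [("12", "1000", "G", "a"), ("chr20", 3000, "A", "G")]

merged = merge_sources({"clinvar": clinvar_records, "literature": literature_records})
literature_only = [key for key, names in merged.items() if names == {"literature"}]

print(len(merged), "unique variants;", len(literature_only), "found only in the literature")
```

In this sketch, the first ClinVar record and the first literature record collapse to the same key after normalization, so variants reported by more than one resource are counted once and variants reported by a single resource remain identifiable.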
Publisher
Cold Spring Harbor Laboratory