Author:
Shukla Rohit, Singh Tiratha Raj
Abstract
Background and Objective: Alzheimer's disease (AD) is a progressive neurodegenerative disorder characterized by memory loss. Advances in next-generation sequencing technologies have made an enormous amount of AD-associated genomics data available. However, the involvement of these genes in AD remains an open research question, because the algorithms used to identify them are based on statistical techniques. AlzGenPred was therefore developed to identify AD-associated genes from large datasets.
Methods: To develop AlzGenPred, we compiled a benchmark dataset of 1086 AD and non-AD genes and used them as positive and negative datasets. We generated several feature sets, including fused features, and evaluated them with machine learning methods. Hyperparameter tuning was then applied and the final model was selected. The proposed method was validated on the AlzGene and transcriptomics datasets and is provided as a standalone tool.
Results: A total of 13504 features belonging to eight different encoding schemes were generated from these sequences and evaluated using 16 ML algorithms. The evaluation shows that network-based features can classify AD genes, whereas sequence-based features cannot. We then generated 24 different fused features (6020 D) from the sequence-based features and fed them into a two-step LightGBM-based recursive feature selection method, which improved accuracy by up to 5-7%. The eight selected fused features, together with CKSAAP, were then used for hyperparameter tuning, but they showed <70% accuracy. Network-based features were therefore used to build the CatBoost-based ML method called AlzGenPred, which achieved 96.55% accuracy and 98.99% AUROC. The developed method was tested on the AlzGene dataset, where it showed 96.43% accuracy. The model was then also validated using the transcriptomics dataset.
Conclusion: Validation of AlzGenPred on the AlzGene dataset and on transcriptomics datasets obtained from human, mouse, and ES-derived neural cells shows that it can classify omics data and sort out the AD-associated genes. The predicted genes can be taken directly to the wet lab for further testing, which will reduce labor cost and time. AlzGenPred is developed as a standalone package and is available at https://www.bioinfoindia.org/alzgenpred/ and https://github.com/shuklarohit815/AlzGenPred.
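As an illustration of one of the sequence-based encodings named in the abstract, CKSAAP (composition of k-spaced amino acid pairs) can be sketched as follows. This is a minimal sketch and not the AlzGenPred implementation: the function name `cksaap`, the `max_gap` parameter and its default, and the choice to normalize by the number of valid pairs at each gap are all assumptions made for illustration.

```python
from itertools import product

# The 20 standard amino acids and all 400 ordered pairs built from them.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(sequence, max_gap=2):
    """Return a CKSAAP feature vector of length 400 * (max_gap + 1).

    For each gap k in 0..max_gap, count residue pairs separated by k
    positions and divide by the number of valid pairs at that gap.
    `max_gap` is an illustrative assumption, not a value from the source.
    """
    seq = sequence.upper()
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = 0
        for i in range(len(seq) - gap - 1):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:  # skip pairs containing non-standard residues
                counts[pair] += 1
                total += 1
        # Frequencies for this gap; all zeros if the sequence is too short.
        features.extend(counts[p] / total if total else 0.0 for p in PAIRS)
    return features


if __name__ == "__main__":
    vector = cksaap("MKTAYIAKQR", max_gap=2)
    print(len(vector))
```

Each gap contributes a 400-dimensional block, so the vector length grows linearly with `max_gap`; a feature matrix built this way could then be passed to any of the classifiers mentioned above.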
Publisher
Cold Spring Harbor Laboratory