Abstract
Genome-wide association studies (GWAS) identify the variants (single nucleotide polymorphisms, SNPs) associated with a disease phenotype within populations. These genetic differences contribute to variation in incidence and mortality, especially for prostate cancer in the African population. Given the complexity of cancer, it is imperative to identify the variants that contribute to the development of the disease. The standard univariate analysis employed in GWAS may not capture the non-linear additive interactions between variants, which might affect the risk of developing prostate cancer. Interactions in complex diseases such as prostate cancer are usually non-linear and would benefit from a non-linear machine learning method, namely the gradient boosting algorithm XGBoost (extreme gradient boosting). We applied the XGBoost algorithm and an iterative SNP selection algorithm to find the top features (SNPs) that best predict the risk of developing prostate cancer with a Support Vector Machine (SVM). After appropriate quality control, the dataset comprised 907 subjects and 1,798,727 input features. The algorithm involved ten trials of 5-fold cross-validation to optimize the hyperparameters on the dataset and the second module of the prediction task, which uses the SVM. The model achieved an area under the ROC curve (AUC-ROC) of 0.66, 0.57 and 0.55 on the Train, Dev and Test sets, respectively. The area under the Precision-Recall curve was 0.69, 0.60 and 0.57 on the Train, Dev and Test sets, respectively. Furthermore, the final number of predictive risk variants was 2,798, associated with 847 Ensembl genes. Interaction analysis showed that the gene interaction network contained 339 nodes and 622 edges. These results provide evidence that the non-linear machine learning approach offers excellent possibilities for understanding the genetic basis of complex diseases.
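As an aid to reading the reported AUC-ROC values, the following is a minimal sketch, not the authors' code, of how the metric can be computed from case/control labels and classifier scores. It uses the rank-based interpretation: the AUC-ROC equals the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The function name `auc_roc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
def auc_roc(labels, scores):
    """Rank-based AUC-ROC: fraction of (case, control) pairs in which
    the case has the higher score; tied pairs count as 0.5."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy example (hypothetical data): 1 = case, 0 = control.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.3, 0.4, 0.1]
toy_auc = auc_roc(toy_labels, toy_scores)
```

Under this definition, a value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from controls.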
Publisher
Cold Spring Harbor Laboratory