The use of class imbalanced learning methods on ULSAM data to predict the case–control status in genome-wide association studies-Reference-Cited by-同舟云学术

The use of class imbalanced learning methods on ULSAM data to predict the case–control status in genome-wide association studies

Published:2023-11-30 Issue:1 Volume:10 Page:
ISSN:2196-1115
Container-title:Journal of Big Data
language:en
Short-container-title:J Big Data

Author:

Öztornaci R. Onur,Syed Hamzah,Morris Andrew P.,Taşdelen Bahar

Abstract

AbstractMachine learning (ML) methods for uncovering single nucleotide polymorphisms (SNPs) in genome-wide association study (GWAS) data that can be used to predict disease outcomes are becoming increasingly used in genetic research. Two issues with the use of ML models are finding the correct method for dealing with imbalanced data and data training. This article compares three ML models to identify SNPs that predict type 2 diabetes (T2D) status using the Support vector machine SMOTE (SVM SMOTE), The Adaptive Synthetic Sampling Approach (ADASYN), Random under sampling (RUS) on GWAS data from elderly male participants (165 cases and 951 controls) from the Uppsala Longitudinal Study of Adult Men (ULSAM). It was also applied to SNPs selected by the SMOTE, SVM SMOTE, ADASYN, and RUS clumping method. The analysis was performed using three different ML models: (i) support vector machine (SVM), (ii) multilayer perceptron (MLP) and (iii) random forests (RF). The accuracy of the case–control classification was compared between these three methods. The best classification algorithm was a combination of MLP and SMOTE (97% accuracy). Both RF and SVM achieved good accuracy results of over 90%. Overall, methods used against unbalanced data, all three ML algorithms were found to improve prediction accuracy.

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management,Computer Networks and Communications,Hardware and Architecture,Information Systems

Link

https://link.springer.com/content/pdf/10.1186/s40537-023-00853-x.pdf

Reference66 articles.

1. Fadista J, Manning AK, Florez JC, Groop L. The (in) famous GWAS P-value threshold revisited and updated for low-frequency variants. Eur J Hum Genet. 2016;24:1202–5.

2. Szymczak S, Biernacka JM, Cordell HJ, González-Recio O, König IR, Zhang H, Sun YV. Machine learning in genome-wide association studies. Genet Epidemiol. 2009;33:S51–7.

3. Cosgun E, Limdi NA, Duarte CW. High-dimensional pharmacogenetic prediction of a continuous trait using machine learning techniques with application to warfarin dose prediction in African Americans. Bioinformatics. 2011;27:1384–9.

4. Tang Y, Zhang Y-Q, Chawla NV, Krasser S. SVMs modeling for highly imbalanced classification, IEEE transactions on systems, man, and cybernetics. Part B. 2008;39:281–8.

5. Dai X, Fu G, Zhao S, Zeng Y. Statistical learning methods applicable to genome-wide association studies on unbalanced case-control disease data. Genes. 2021;12:736.