Classification of Micro-array Data in Apache Spark Framework

Published:2020-11-01 Issue:3 Volume:928 Page:032067
ISSN:1757-8981
Container-title:IOP Conference Series: Materials Science and Engineering
Short-container-title:IOP Conf. Ser.: Mater. Sci. Eng.

Author:

Albaldawi Wafaa S.,Almuttairi Rafah M.

Abstract

Apache Spark is an emerging big data analytics technology. Machine learning (ML) frameworks built on Spark are more scalable than traditional ML frameworks. We build SVMWithSGD (SVM with Stochastic Gradient Descent) and LinearRegressionWithSGD models using the Spark Python API (PySpark) to classify normal and tumor microarray samples. A microarray measures the expression levels of thousands of genes in a tissue or cell type. Feature extraction and cross-validation are used to ensure effectiveness. The SVMWithSGD and LinearRegressionWithSGD models achieve accuracies of more than 80%. This paper studies the effect of feature selection methods, using a filter approach, on the accuracy and running time of supervised cancer classification. A comparative evaluation of different selection methods, namely Principal Component Analysis (PCA), Independent Component Analysis (ICA) and Locally Linear Embedding (LLE), is carried out with the SVMWithSGD or LogisticRegressionWithSGD classifier, using datasets of prostate, cancer, lung and Huntington’s Disease samples. The classification results show that the SVMWithSGD classifier achieves the highest accuracy but takes more time than LogisticRegressionWithSGD (LGWithSGD). With SVMWithSGD as the classifier, PCA gives the best results on the Borovecki, Gordon, and Chowdary datasets, while ICA gives the best results on the Singh and Chin datasets. With LGWithSGD as the classifier, PCA gives the best results on the Borovecki and Gordon datasets, ICA on the Chowdary and Singh datasets, and LLE on the Chin dataset.
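
The abstract names SVM trained with stochastic gradient descent and cross-validation as its core ingredients. As a rough illustration of that idea only, the sketch below trains a linear SVM by SGD on the L2-regularized hinge loss and scores it with k-fold cross-validation. It is plain Python on synthetic data, not the authors' PySpark code and not the MLlib implementation; the function names, parameter names (`iterations`, `step`, `reg_param`), the decaying step size, and the toy "expression" data are all assumptions made for this example.

```python
import random


def train_svm_sgd(samples, labels, iterations=200, step=0.1, reg_param=0.01, seed=0):
    """Train a linear SVM by SGD on the L2-regularized hinge loss.

    Labels are 0/1 and are mapped to -1/+1 internally (assumption of this sketch).
    """
    rng = random.Random(seed)
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for t in range(1, iterations + 1):
        rate = step / (t ** 0.5)           # decaying step size (assumed schedule)
        i = rng.randrange(len(samples))    # pick one training sample at random
        x, y = samples[i], 2 * labels[i] - 1
        margin = y * (sum(w * v for w, v in zip(weights, x)) + bias)
        for j in range(n_features):
            grad = reg_param * weights[j]  # gradient of the L2 penalty
            if margin < 1:                 # hinge loss is active for this sample
                grad -= y * x[j]
            weights[j] -= rate * grad
        if margin < 1:
            bias += rate * y
    return weights, bias


def predict(model, x):
    """Return class 1 if the linear score is positive, else class 0."""
    weights, bias = model
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else 0


def cross_validate(samples, labels, k=5, seed=0):
    """Mean held-out accuracy over k folds."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in indices if i not in held_out]
        model = train_svm_sgd([samples[i] for i in train_idx],
                              [labels[i] for i in train_idx])
        correct = sum(predict(model, samples[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k


def make_toy_data(n_per_class=40, n_genes=5, seed=1):
    """Synthetic stand-in for normal (0) and tumor (1) expression profiles."""
    rng = random.Random(seed)
    samples, labels = [], []
    for label, centre in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            samples.append([rng.gauss(centre, 0.5) for _ in range(n_genes)])
            labels.append(label)
    return samples, labels


samples, labels = make_toy_data()
mean_accuracy = cross_validate(samples, labels)
print(f"mean 5-fold accuracy: {mean_accuracy:.2f}")
```

In a real pipeline of the kind the abstract describes, the toy data would be replaced by microarray features after a reduction step such as PCA, and training would be delegated to the Spark classifiers rather than this hand-written loop.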

Publisher

IOP Publishing

Subject

General Medicine

Link

https://iopscience.iop.org/article/10.1088/1757-899X/928/3/032067/pdf

Reference: 24 articles.

1. Feature selection and classification for gene expression data using novel correlation based overlapping score method via Chou’s 5-steps rule;Wahid;Chemometrics and Intelligent Laboratory Systems,2020

2. On the classification techniques in data mining for microarray data classification;Adiwijaya;Journal of Physics: Conference Series,2018

3. A hybrid of clustering and quantum genetic algorithm for relevant genes selection for cancer microarray data;Sardana;International Journal of Knowledge-based and Intelligent Engineering Systems,2016

4. Gene Expression Data Classification Using Support Vector Machine and Mutual Information-based Gene Selection;Vanitha;Procedia Computer Science,2015

5. Identification of cancerous gene groups from microarray data by employing adaptive genetic and support vector machine technique;Shukla;Computational Intelligence,2019