Hybrid ANOVA and LASSO Methods for Feature Selection and Linear Support Vector, Multilayer Perceptron and Random Forest Classifiers Based on Spark Environment for Microarray Data Classification

Author:

Albaldawi Wafaa S,Almuttairi Rafah M

Abstract

Abstract Microarray dataset frequently contains a countless number of insignificant and irrelevant genes that might lead to loss of valuable data. The classes with both high importance and high significance gene sets are commonly preferred for selecting the genes, which determines the sample classification into their particular classes. This property has obtained a lot of importance among the specialists and experts in microarray dataset classification. The trained classifier model is tested for cancer datasets and Huntington disease data (HD) which consists of Prostate cancer (Singh) dataset comprising 102 samples, 52 of which are tumors and 50 are normal with 12625 genes. The lung cancer (Gordon) dataset comprises 181 samples, 150 of which are normal and 31 are tumors with 12533 genes. The breast cancer (Chin) dataset comprises 118 samples, 43 of which are normal and 75 are tumors with 22215 genes. The breast cancer (Chowdary) dataset comprises 104 samples, 62 of which are normal and 42 are tumors with 22283 genes. Finally, the Huntington disease (Borovecki) dataset comprises 31 samples, 14 of which are normal and 17 are with Huntington’s disease with 22283 genes. This paper uses Multilayer Perceptron Classifier (MLP), Random Forest (RF) and Linear Support Vector classifier (LSVC) classification algorithms with six different feature selection methods named as Principal Component Analysis (PCA), Extra Tree Classifier (ETC), Analysis of Variance (ANOVA), Least Absolute Shrinkage and Selection Operator (LASSO), Chi-Square and Random Forest Regressor (RFR). Further, the paper presents a comparative analysis on the obtained classification accuracy and time consumed among the models in Spark environment and in conventional system. Performance parameters such as accuracy and time consumed are applied in this comparative analysis to analyze the behavior of the classifiers in the two environments. Th results indicate that the models in spark environment was extremely effective for processing large-dimension data, which cannot be processed with conventional implementation related to a some algorithms. After that, a proposed hybrid model containing embedded approach (LASSO) and the Filter (ANOVA) approach was used to select the optimized features form the high dimensional dataset. With the reduced dimension of features, classification is performed on the reduced data set to classify the samples into normal or abnormal and applied in spark in hadoop cluster (distributed manner). The proposed model achieved accuracy of 100% in case of Borovecki dataset when using all classifiers, 100% in case of Singh, Chowdary and Gordon datasets when classified with RF and LSVC classifiers. Also, accuracy was 96% in case of Chin dataset when using RF classifier with optimal genes with respect to accuracy and time consumed.

Publisher

IOP Publishing

Subject

General Medicine

Reference35 articles.

1. Classification of microarray data using SVM mapreduce

2. Bi-Level Dimensionality Reduction Methods Using Feature Selection and Feature Extraction;Veerabhadrappa;Int. J. Comput. Appl.,2010

3. The Influence of Feature Selection Methods on Accuracy, Stability and Interpretability of Molecular Signatures;Haury;PLoS ONE,2011

4. A Survey on Semi-Supervised Feature Selection Methods;Sheikhpour;Pattern Recogn,2017

5. Feature Selection Methods And Algorithms;Ladha;Int. j. Eng.,2011

Cited by 3 articles. 订阅此论文施引文献 订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3