A scalable and effective rough set theory-based approach for big data pre-processing

Published:2020-05-02 Issue:8 Volume:62 Page:3321-3386
ISSN:0219-1377
Container-title:Knowledge and Information Systems
language:en
Short-container-title:Knowl Inf Syst

Author:

Chelly Dagdia Zaineb, Zarges Christine, Beck Gaël, Lebbah Mustapha

Abstract

A big challenge in the knowledge discovery process is to perform data pre-processing, specifically feature selection, on a large amount of data with a high-dimensional attribute set. A variety of techniques have been proposed in the literature to deal with this challenge, with different degrees of success, as most of these techniques need further information about the given input data for thresholding, need noise levels to be specified, or rely on feature ranking procedures. To overcome these limitations, rough set theory (RST) can be used to discover the dependency within the data and reduce the number of attributes in an input data set while using the data alone and requiring no supplementary information. However, when it comes to massive data sets, RST reaches its limits, as it is computationally very expensive. In this paper, we propose a scalable and effective rough set theory-based approach for large-scale data pre-processing, specifically for feature selection, under the Spark framework. In our detailed experiments, data sets with up to 10,000 attributes have been considered, revealing that our proposed solution achieves a good speedup and performs its feature selection task well without sacrificing performance, thus making it relevant to big data.
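
To make the idea of "discovering the dependency within the data" concrete, the following is a minimal single-machine sketch of the classical rough set dependency degree and a QuickReduct-style greedy attribute search. It is not the Spark-based method proposed in the paper; the function names (`partition`, `dependency`, `quick_reduct`) and the toy decision table are assumptions made purely for illustration.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def dependency(rows, cond_attrs, decision):
    """Dependency degree: share of rows whose equivalence class under
    cond_attrs carries a single decision value (the positive region)."""
    positive = 0
    for block in partition(rows, cond_attrs):
        if len({rows[i][decision] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)


def quick_reduct(rows, cond_attrs, decision):
    """Greedy forward selection: add the attribute that raises the
    dependency degree most until it matches that of the full attribute set."""
    target = dependency(rows, cond_attrs, decision)
    selected = []
    while dependency(rows, selected, decision) < target:
        candidates = [a for a in cond_attrs if a not in selected]
        best = max(candidates,
                   key=lambda a: dependency(rows, selected + [a], decision))
        selected.append(best)
    return selected


# Hypothetical toy decision table: conditions a, b, c and decision d.
rows = [
    {"a": 0, "b": 0, "c": 1, "d": "no"},
    {"a": 0, "b": 1, "c": 1, "d": "yes"},
    {"a": 1, "b": 0, "c": 0, "d": "no"},
    {"a": 1, "b": 1, "c": 0, "d": "yes"},
    {"a": 1, "b": 1, "c": 1, "d": "yes"},
]

full = dependency(rows, ["a", "b", "c"], "d")
only_a = dependency(rows, ["a"], "d")
reduct = quick_reduct(rows, ["a", "b", "c"], "d")
```

In this toy table the decision is fully determined by attribute `b`, so the greedy search keeps `b` alone. Each dependency evaluation rescans and regroups every row, and the search repeats it for many candidate subsets, which hints at why the abstract describes RST as computationally expensive on massive data sets.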

Funder

H2020 Marie Sklodowska-Curie Actions

Publisher

Springer Science and Business Media LLC

Subject

Artificial Intelligence, Hardware and Architecture, Human-Computer Interaction, Information Systems, Software

Link

https://link.springer.com/content/pdf/10.1007/s10115-020-01467-y.pdf

Reference: 44 articles.

1. Afendi FM, Ono N, Nakamura Y, Nakamura K, Darusman LK, Kibinge N, Morita AH, Tanaka K, Horai H, Altaf-Ul-Amin M et al (2013) Data mining methods for omics and knowledge of crude medicinal plants toward big data biology. Comput Struct Biotechnol J 4(5):1–14

2. Aghdam MH, Ghasem-Aghaee N, Basiri ME (2009) Text feature selection using ant colony optimization. Expert Syst Appl 36(3):6843–6853

3. Ahmed S, Zhang M, Peng L (2013) Enhanced feature selection for biomarker discovery in LC-MS data using GP. In: Evolutionary computation (CEC), 2013 IEEE congress on. IEEE, pp 584–591