A Distributed Attribute Reduction Algorithm for High-Dimensional Data under the Spark Framework

Published: 2022-04-05 Issue: 1 Volume: 15
ISSN: 1875-6883
Container-title: International Journal of Computational Intelligence Systems
Language: en
Short-container-title: Int J Comput Intell Syst

Author:

Wu Zhengjiang, Mei Qiuyu, Zhang Yaning, Yang Tian, Luo Junwei

Abstract

Attribute reduction is an important issue in rough set theory. However, rough set theory-based attribute reduction algorithms need to be improved to deal with high-dimensional data, and a distributed version of the algorithm is necessary for it to handle big data effectively. Partitioning the attribute space is an important research direction. This paper proposes a distributed attribute reduction algorithm based on cosine similarity (DARCS) for pre-processing high-dimensional data under the Spark framework. First, to avoid repeated calculation over similar attributes, the algorithm groups similar attributes into multiple clusters according to a similarity measure. One attribute is then selected at random from each cluster as its representative, and the representatives form a candidate attribute subset that participates in the subsequent reduction operation. At the same time, to improve computing efficiency, an improved method is introduced to calculate the attribute dependency in the divided sub-attribute space. Experiments on eight datasets show that, while avoiding critical information loss, DARCS improves reduction ability by 0.32 to 39.61% and computing efficiency by 31.32 to 93.79% compared with the distributed attribute reduction algorithm based on random partitioning of the attribute space.
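The two ideas named in the abstract, grouping similar attributes by cosine similarity and measuring attribute dependency, can be illustrated with a small single-machine sketch. It is not the DARCS implementation: the greedy threshold grouping, the `threshold` and `seed` parameters, and all function names are assumptions made for illustration, and no Spark code is involved. The dependency function uses the classical rough-set definition (size of the positive region divided by the size of the universe), not the improved method the paper introduces.

```python
import math
import random
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero columns
    return dot / (norm_u * norm_v)


def cluster_attributes(columns, threshold):
    """Greedy grouping (an assumption, not the paper's procedure): an
    attribute joins the first cluster whose first member it resembles
    with cosine similarity >= threshold, else it starts a new cluster."""
    clusters = []
    for idx, col in enumerate(columns):
        for cluster in clusters:
            if cosine_similarity(columns[cluster[0]], col) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


def pick_representatives(clusters, seed=0):
    """Select one attribute index at random from each cluster."""
    rng = random.Random(seed)
    return [rng.choice(cluster) for cluster in clusters]


def dependency(rows, attrs, decision):
    """Classical rough-set dependency: |POS_attrs(decision)| / |U|."""
    blocks = defaultdict(list)
    for row in rows:
        # Rows with equal values on attrs fall into one equivalence class.
        blocks[tuple(row[a] for a in attrs)].append(row[decision])
    # A class is in the positive region if all its decisions agree.
    positive = sum(len(d) for d in blocks.values() if len(set(d)) == 1)
    return positive / len(rows)


# Toy table: columns 0-2 are condition attributes, column 3 is the decision.
rows = [(1, 2, 0, 0), (2, 4, 1, 0), (3, 6, 0, 1), (4, 8, 1, 1)]
columns = [[row[j] for row in rows] for j in range(3)]
clusters = cluster_attributes(columns, threshold=0.99)
candidates = pick_representatives(clusters)
print(clusters, candidates, dependency(rows, candidates, 3))
```

In the toy table, attribute 1 is an exact multiple of attribute 0, so the two share a cluster and only one of them enters the candidate subset. In a distributed setting each cluster or sub-attribute space would be processed on a separate partition, which this sketch does not attempt.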

Funder

National Natural Science Foundation of China

Publisher

Springer Science and Business Media LLC

Subject

Computational Mathematics,General Computer Science

Link

https://link.springer.com/content/pdf/10.1007/s44196-022-00076-7.pdf

Cited by 2 articles.

1. Task allocation algorithm for distributed large data stream group computing in the era of digital intelligence. Journal of Intelligent & Fuzzy Systems (2024-04-18)

2. An Acceleration Method for Attribute Reduction Based on Attribute Synthesis. Rough Sets (2023)