Large-Scale Online Feature Selection for Ultra-High Dimensional Sparse Data

Published:2017-11-30 Issue:4 Volume:11 Page:1-22
ISSN:1556-4681
Container-title:ACM Transactions on Knowledge Discovery from Data
language:en
Short-container-title:ACM Trans. Knowl. Discov. Data

Author:

Wu Yue¹, Hoi Steven C. H.², Mei Tao³, Yu Nenghai⁴

Affiliation:

1. University of Science and Technology of China, Singapore Management University, Hefei, China

2. Singapore Management University, Stamford Road, Singapore

3. University of Science and Technology of China, Microsoft Research Asia, Beijing, China

4. University of Science and Technology of China, Hefei, China

Abstract

Feature selection (FS) is an important technique in machine learning and data mining, especially for large-scale high-dimensional data. Most existing studies have been restricted to batch learning, which is often inefficient and scales poorly when handling big data in the real world. Because real data may arrive sequentially and continuously, batch learning has to retrain the model on newly arriving data, which is computationally intensive. Online feature selection (OFS) is a promising new paradigm that is more efficient and scalable than batch learning algorithms. However, existing online algorithms usually fall short in efficacy. In this article, we present a novel second-order OFS algorithm that is simple yet effective, very fast, and extremely scalable for large-scale ultra-high dimensional sparse data streams. The basic idea is to exploit second-order information to choose the subset of important features whose weights have high confidence. Unlike existing OFS methods, which often suffer from very high computational cost, we devise a novel algorithm with a MaxHeap-based approach, which is not only more effective than the existing first-order algorithms but also significantly more efficient and scalable. Our extensive experiments validate that the proposed technique achieves highly competitive accuracy compared with state-of-the-art batch FS methods, while its computational cost is orders of magnitude lower. Impressively, on a billion-scale synthetic dataset (1 billion dimensions, 1 billion non-zero features, and 1 million samples), the proposed algorithm takes less than 3 minutes to run on a single PC.
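The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that a second-order learner maintains a per-feature uncertainty value (called `sigma` here, a hypothetical name) where a smaller value means higher confidence, and that at most `budget` features are kept. A bounded max-heap tracks the largest uncertainty among the kept features, so each candidate costs O(log budget) instead of a full sort. The function names and the dictionary-based sparse representation are assumptions made for illustration.

```python
import heapq


def select_confident_features(sigma, budget):
    """Return indices of the `budget` features with the smallest sigma.

    sigma: dict mapping feature index -> uncertainty (assumed: lower = more confident).
    """
    if budget <= 0:
        return set()
    # heapq is a min-heap, so store (-sigma_i, i); heap[0] is then the kept
    # feature with the LARGEST sigma, i.e. the first candidate for eviction.
    heap = []
    for i, s in sigma.items():
        if len(heap) < budget:
            heapq.heappush(heap, (-s, i))
        elif -s > heap[0][0]:
            # The new feature is more confident than the worst kept one.
            heapq.heapreplace(heap, (-s, i))
    return {i for _, i in heap}


def truncate(weights, sigma, budget):
    """Zero out (drop) all weights except those of the selected features."""
    keep = select_confident_features(sigma, budget)
    return {i: w for i, w in weights.items() if i in keep}


# Hypothetical sparse model with four non-zero features.
weights = {0: 2.0, 1: -0.3, 2: 1.1, 3: 0.7}
sigma = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.05}
selected = truncate(weights, sigma, budget=2)
```

In this toy input, features 1 and 3 have the lowest uncertainty, so only their weights survive truncation, even though feature 0 has the largest weight magnitude. That contrast is the point of selecting by confidence rather than by weight size.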

Funder

International Research Centres in Singapore Funding Initiative

National Research Foundation, Prime Minister's Office, Singapore

National Natural Science Foundation of China

Key Laboratory Foundation of the Chinese Academy of Sciences

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3070646

Reference: 50 articles.

1. V. Bolón-Canedo, N. Sánchez-Maroño, and A. Alonso-Betanzos. 2015. Recent advances and emerging challenges of feature selection in the context of big data. Knowledge-Based Systems 86, C (Sept. 2015), 33-45. 10.1016/j.knosys.2015.05.014

2. A review of feature selection methods on synthetic data

Cited by 19 articles.

1. Feature Selection in the Data Stream Based on Incremental Markov Boundary Learning;IEEE Transactions on Neural Networks and Learning Systems;2023-10

2. Dynamic Graph Learning for Feature Selection;Dynamic Graph Learning for Dimension Reduction and Data Clustering;2023-09-21

3. Feature subset selection for data and feature streams: a review;Artificial Intelligence Review;2023-07-13

4. Adaptive Collaborative Soft Label Learning for Unsupervised Multi-View Feature Selection;ACM Transactions on Knowledge Discovery from Data;2023-06-28

5. Feature selection for online streaming high-dimensional data: A state-of-the-art review;Applied Soft Computing;2022-09