Quantifying the impact of uninformative features on the performance of supervised classification and dimensionality reduction algorithms

Published:2023-12-01 Issue:4 Volume:1
ISSN:2770-9019
Container-title:APL Machine Learning
language:en

Author:

Lei Weihua¹, Zanchettin Cleber¹², Ho Zoey E.³, Nunes Amaral Luís A.¹⁴⁵

Affiliation:

1. Department of Physics and Astronomy, Northwestern University, Evanston, Illinois 60208, USA

2. Centro de Informática, Universidade Federal de Pernambuco, Recife, Pernambuco 52061080, Brazil

3. Department of Engineering Sciences and Applied Mathematics, Northwestern University, Evanston, Illinois 60208, USA

4. Department of Chemical and Biological Engineering, Northwestern University, Evanston, Illinois 60208, USA

5. Northwestern Institute on Complex Systems (NICO), Northwestern University, Evanston, Illinois 60208, USA

Abstract

Machine learning approaches have become critical tools in data mining and knowledge discovery, especially when attempting to uncover relationships in high-dimensional data. However, researchers have noticed that a large fraction of features in high-dimensional datasets are commonly uninformative (too noisy or irrelevant). Because optimal feature selection is an NP-hard task, it is essential to understand how uninformative features impact the performance of machine learning algorithms. Here, we conduct systematic experiments on algorithms from a wide range of taxonomy families using synthetic datasets with different numbers of uninformative features and different numbers of patterns to be learned. Upon visual inspection, we classify these algorithms into four groups with varying robustness against uninformative features. For the algorithms in three of the groups, we find that when the number of uninformative features exceeds the number of data instances per pattern to be learned, the algorithms fail to learn the patterns. Finally, we investigate whether increasing the distinguishability of patterns or adding training instances can mitigate the effect of uninformative features. Surprisingly, we find that uninformative features still cause algorithms to suffer big losses in performance, even when patterns should be easily distinguishable. Analyses of real-world data show that our conclusions hold beyond the synthetic datasets we study systematically.
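The kind of experiment the abstract describes can be illustrated with a small sketch. This is not the authors' protocol: the two-class Gaussian data, the 1-nearest-neighbor classifier, the function names (`make_dataset`, `one_nn_accuracy`, `run`), and all parameter values are assumptions chosen only to show how pure-noise features can be appended to an otherwise easy problem and the resulting accuracy compared.

```python
import random


def make_dataset(n_per_class, n_noise, separation, rng):
    """Two classes in 2 informative dimensions plus n_noise uninformative ones.

    Informative features are drawn around a class-specific center;
    uninformative features are N(0, 1) regardless of class.
    """
    X, y = [], []
    for label in (0, 1):
        center = label * separation
        for _ in range(n_per_class):
            informative = [rng.gauss(center, 1.0) for _ in range(2)]
            noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
            X.append(informative + noise)
            y.append(label)
    return X, y


def one_nn_accuracy(X_train, y_train, X_test, y_test):
    """Accuracy of a 1-nearest-neighbor classifier (squared Euclidean distance)."""
    correct = 0
    for x, label in zip(X_test, y_test):
        dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in X_train]
        nearest = dists.index(min(dists))
        correct += y_train[nearest] == label
    return correct / len(y_test)


def run(n_noise, seed=0):
    """Train and test on independently drawn samples with n_noise noise features."""
    rng = random.Random(seed)
    X_train, y_train = make_dataset(20, n_noise, 6.0, rng)
    X_test, y_test = make_dataset(20, n_noise, 6.0, rng)
    return one_nn_accuracy(X_train, y_train, X_test, y_test)


# 20 instances per class; compare no noise features with far more noise
# features than instances per class (hypothetical values).
acc_clean = run(0)
acc_noisy = run(1000)
print(f"accuracy with 0 uninformative features:    {acc_clean:.2f}")
print(f"accuracy with 1000 uninformative features: {acc_noisy:.2f}")
```

Sweeping `n_noise` over a range of values, and repeating over several seeds, would give a rough degradation curve for this one classifier; the paper's own experiments cover many algorithm families and should be consulted for the actual results.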

Funder

National Science Foundation

Publisher

AIP Publishing

Link

https://pubs.aip.org/aip/aml/article-pdf/doi/10.1063/5.0170229/18252440/046118_1_5.0170229.pdf
