Affiliations:
1. Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Rio Grande do Sul, Brazil
2. Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada
3. Institute of Biosciences, Federal University of Rio Grande do Sul, Porto Alegre, Rio Grande do Sul, Brazil
4. National Institute of Science and Technology ‐ Forensic Science, Porto Alegre, Rio Grande do Sul, Brazil
5. Center for Biotechnology, Federal University of Rio Grande do Sul, Porto Alegre, Rio Grande do Sul, Brazil
Abstract
Feature selection algorithms are frequently employed in the preprocessing stage of machine learning pipelines applied to biological data to identify relevant features. The use of feature selection in gene expression studies began at the end of the 1990s with the analysis of human cancer microarray datasets. Since then, gene expression technology has been perfected, the Human Genome Project has been completed, new microarray platforms have been created and discontinued, and RNA‐seq has gradually replaced microarrays. However, most feature selection methods in the last two decades were designed, evaluated, and validated on the same datasets from the microarray technology's infancy. In this review of over 1200 publications regarding feature selection and gene expression, published between 2010 and 2020, we found that 57% of the publications used at least one outdated dataset, 23% used only outdated data, and 32% did not cite data sources. Other issues include referencing databases that are no longer available, the slow adoption of RNA‐seq datasets, and bias toward human cancer data, even for methods designed for a broader scope. The most popular datasets, some of them 23 years old, commonly contain mislabeled samples, experimental biases, and distribution shifts, and pose no real classification challenge. These problems are more prevalent in publications with computer science backgrounds than in publications from biology, and they can lead to inaccurate and misleading biological results.
This article is categorized under:
Algorithmic Development > Biological Data Mining
Technologies > Machine Learning
Funder
Conselho Nacional de Desenvolvimento Científico e Tecnológico
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul
Global Affairs Canada