A cautionary tale about properly vetting datasets used in supervised learning predicting metabolic pathway involvement

Published: 2023-10-05

Author:

Huckvale Erik D., Moseley Hunter N.B.

Abstract

The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several previously published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (the KEGG-SMILES dataset) and contained a sizable proportion (∼26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset, because the same entry can appear in both the training set and the testing set of a given fold. Therefore, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction of model k-fold cross-validation performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting previously published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
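
The leakage described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes that duplicates are exact repeats of the same SMILES string (the criterion used in the paper may differ), and the function names and toy data are hypothetical. It builds a toy list with 26% duplicates, measures how many test entries in a k-fold split also occur in the training folds, and repeats the measurement after de-duplication.

```python
import random
from collections import Counter


def duplicate_fraction(smiles):
    """Fraction of entries that repeat an earlier, identical string."""
    counts = Counter(smiles)
    extra = sum(c - 1 for c in counts.values())
    return extra / len(smiles)


def kfold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def leaked_test_fraction(smiles, k=5, seed=0):
    """Mean fraction of test entries whose string also occurs in training."""
    folds = kfold_indices(len(smiles), k, seed)
    fractions = []
    for i, test in enumerate(folds):
        # Strings seen in every fold except the held-out one.
        train_strings = {
            smiles[j]
            for f, fold in enumerate(folds) if f != i
            for j in fold
        }
        leaked = sum(smiles[j] in train_strings for j in test)
        fractions.append(leaked / len(test))
    return sum(fractions) / k


def deduplicate(smiles):
    """Keep the first occurrence of each string, preserving order."""
    return list(dict.fromkeys(smiles))


# Hypothetical toy data: 74 distinct strings (simple alkane chains)
# plus 26 exact repeats, i.e. 26% duplicate entries.
unique = ["C" * (i + 1) for i in range(74)]
toy = unique + unique[:26]

print("duplicate fraction:", duplicate_fraction(toy))
print("leaked test fraction (raw):", leaked_test_fraction(toy))
print("leaked test fraction (de-duplicated):",
      leaked_test_fraction(deduplicate(toy)))
```

A model evaluated on the raw split can score well on the leaked test entries simply by memorizing them, which is the inflation the abstract warns about. Exact string matching is the simplest duplicate criterion; two different SMILES strings can encode the same molecule, so a real vetting step would likely also need canonicalization with a cheminformatics toolkit.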

Publisher

Cold Spring Harbor Laboratory

Cited by 4 articles.

1. Predicting The Pathway Involvement Of Metabolites Based on Combined Metabolite and Pathway Features;2024-04-02

2. In the AI science boom, beware: your results are only as good as your data;Nature;2024-02-01

3. Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites;Metabolites;2023-11-01

4. Benchmark dataset for training machine learning models to predict the pathway involvement of metabolites;2023-10-05