Harmonization of supervised machine learning practices for efficient source attribution of Listeria monocytogenes based on genomic data

Published:2023-09-22 Issue:1 Volume:24
ISSN:1471-2164
Container-title:BMC Genomics
language:en
Short-container-title:BMC Genomics

Author:

Castelli Pierluigi, De Ruvo Andrea, Bucciacchio Andrea, D’Alterio Nicola, Cammà Cesare, Di Pasquale Adriano, Radomski Nicolas

Abstract

Background: Machine learning tools based on genomic data are promising for real-time surveillance activities that perform source attribution of foodborne bacteria such as Listeria monocytogenes. Given the heterogeneity of machine learning practices, our aim was to identify those influencing the source prediction performance of the usual holdout method combined with the repeated k-fold cross-validation method.

Methods: A large collection of 1 100 L. monocytogenes genomes with known sources was built according to several genomic metrics to ensure authenticity and completeness of genomic profiles. Based on these genomic profiles (i.e. 7-locus alleles, core alleles, accessory genes, core SNPs and pan kmers), we developed a versatile workflow assessing the prediction performance of different combinations of training dataset splitting (i.e. 50, 60, 70, 80 and 90%), data preprocessing (i.e. with or without near-zero variance removal), and learning models (i.e. BLR, ERT, RF, SGB, SVM and XGB). The performance metrics included accuracy, Cohen’s kappa, F1-score, area under the receiver operating characteristic curve, the precision-recall curve or the precision-recall gain curve, and execution time.

Results: The average testing accuracies from accessory genes and pan kmers were significantly higher than the accuracies from core alleles or SNPs. While the accuracies from 70 and 80% of training dataset splitting were not significantly different, those from 80% were significantly higher than those from the other tested proportions. The near-zero variance removal did not allow results to be produced for 7-locus alleles, did not significantly affect the accuracy for core alleles, accessory genes and pan kmers, and significantly decreased the accuracy for core SNPs. The SVM and XGB models did not differ significantly from each other in accuracy and reached significantly higher accuracies than BLR, SGB, ERT and RF, in that order. However, the SVM model required more computing power than the XGB model, especially for large numbers of descriptors such as core SNPs and pan kmers.

Conclusions: In addition to recommendations about machine learning practices for L. monocytogenes source attribution based on genomic data, the present study also provides a freely available workflow to solve other balanced or unbalanced multiclass phenotypes from binary and categorical genomic profiles of other microorganisms without source code modifications.
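
The abstract names a holdout split combined with repeated k-fold cross-validation, with optional near-zero variance removal, but does not show how these steps fit together. The following is a minimal, standard-library Python sketch of that general scheme, not the authors' workflow. The function names, the stratified splitting, the seed handling, and the near-zero variance thresholds (a frequency ratio above 95/5 and fewer than 10% unique values, modeled on the defaults of R's caret `nearZeroVar`) are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_holdout(labels, train_fraction, seed=0):
    """Split sample indices into train/test sets, class by class.

    Stratifying by source class is an assumption of this sketch.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


def repeated_kfold(indices, k, repeats, seed=0):
    """Yield (fit, validation) index lists: k folds, reshuffled per repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = list(indices)
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for i, validation in enumerate(folds):
            fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield fit, validation


def is_near_zero_variance(column, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a feature dominated by one value (thresholds are assumptions)."""
    top = Counter(column).most_common(2)
    if len(top) == 1:
        return True  # constant feature
    freq_ratio = top[0][1] / top[1][1]
    percent_unique = 100.0 * len(set(column)) / len(column)
    return freq_ratio > freq_cut and percent_unique < unique_cut


# Hypothetical toy data: 100 genomes from two made-up source classes.
labels = ["food"] * 60 + ["animal"] * 40
train, test = stratified_holdout(labels, train_fraction=0.8)
splits = list(repeated_kfold(train, k=10, repeats=3))
```

In this sketch the held-out test indices are never passed to `repeated_kfold`, so model tuning on the folds stays separate from the final testing accuracy. A feature filter such as `is_near_zero_variance` would be applied to columns of the training set only.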

Funder

Italian Ministry of Health

Publisher

Springer Science and Business Media LLC

Subject

Genetics,Biotechnology

Link

https://link.springer.com/content/pdf/10.1186/s12864-023-09667-w.pdf

Cited by 1 article.

1. Using GWAS and Machine Learning to Identify and Predict Genetic Variants Associated with Foodborne Bacteria Phenotypic Traits;Methods in Molecular Biology;2024-09-06