LANDMark: an ensemble approach to the supervised selection of biomarkers in high-throughput sequencing data

Published:2022-03-31 Issue:1 Volume:23 Page:
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en
Short-container-title:BMC Bioinformatics

Author:

Rudar Josip,Porter Teresita M.,Wright Michael,Golding G. Brian,Hajibabaei Mehrdad

Abstract

Background: Identification of biomarkers, which are measurable characteristics of biological datasets, can be challenging. Although amplicon sequence variants (ASVs) can be considered potential biomarkers, identifying important ASVs in high-throughput sequencing datasets is challenging. Noise, algorithmic failures to account for specific distributional properties, and feature interactions can complicate the discovery of ASV biomarkers. In addition, these issues can impact the replicability of various models and elevate false-discovery rates. Contemporary machine learning approaches can be leveraged to address these issues. Ensembles of decision trees are particularly effective at classifying the types of data commonly generated in high-throughput sequencing (HTS) studies due to their robustness when the number of features in the training data is orders of magnitude larger than the number of samples. In addition, when combined with appropriate model introspection algorithms, machine learning algorithms can also be used to discover and select potential biomarkers. However, the construction of these models could introduce various biases which potentially obfuscate feature discovery.

Results: We developed a decision tree ensemble, LANDMark, which uses oblique and non-linear cuts at each node. In synthetic and toy tests LANDMark consistently ranked as the best classifier and often outperformed the Random Forest classifier. When trained on the full metabarcoding dataset obtained from Canada’s Wood Buffalo National Park, LANDMark was able to create highly predictive models and achieved an overall balanced accuracy score of 0.96 ± 0.06. The use of recursive feature elimination did not impact LANDMark’s generalization performance and, when trained on data from the BE amplicon, it was able to outperform the Linear Support Vector Machine, Logistic Regression models, and Stochastic Gradient Descent models (p ≤ 0.05). Finally, LANDMark distinguishes itself due to its ability to learn smoother non-linear decision boundaries.

Conclusions: Our work introduces LANDMark, a meta-classifier which blends the characteristics of several machine learning models into a decision tree and ensemble learning framework. To our knowledge, this is the first study to apply this type of ensemble approach to amplicon sequencing data and we have shown that analyzing these datasets using LANDMark can produce highly predictive and consistent models.

Funder

Food from Thought Project, Canada First Research Excellence Fund, Canada

Government of Canada through the Genomics Research and Development Initiative (GRDI) Ecobiomics Project

Natural Sciences and Engineering Research Council of Canada

Genome Canada

Ontario Genomics

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/s12859-022-04631-z.pdf

References: 95 articles.

1. Callahan BJ, McMurdie PJ, Holmes SP. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. ISME J. 2017;11:2639–43.

2. Breitwieser FP, Lu J, Salzberg SL. A review of methods and databases for metagenomic classification and assembly. Briefings Bioinform. 2017;2017:1–15.

3. Auer L, Mariadassou M, O’Donohue M, Klopp C, Hernandez-Raquet G. Analysis of large 16S rRNA Illumina data sets: Impact of singleton read filtering on microbial community description. Mol Ecol Resour. 2017;17(6):122–32.

4. Mysara M, Njima M, Leys N, Raes J, Monsieurs P. From reads to operational taxonomic units: an ensemble processing pipeline for MiSeq amplicon sequencing data. Gigascience. 2017;6(2):1–10.

5. Quince C, Lanzen A, Davenport RJ, Turnbaugh PJ. Removing noise from pyrosequenced amplicons. BMC Bioinform. 2011;12:38.

Cited by 6 articles.

1. Sequence signatures within the genome of SARS-CoV-2 can be used to predict host source;Microbiology Spectrum;2024-04-02

2. Multivariate and multi-dimensional CFAR radar image for breast cancer detection;Signal, Image and Video Processing;2023-10-11

3. Microbial biomarkers of tree water status for next‐generation biomonitoring of forest ecosystems;Molecular Ecology;2023-10-10

4. Translational informatics for human microbiota: data resources, models and applications;Briefings in Bioinformatics;2023-05

5. Decision Tree Ensembles Utilizing Multivariate Splits Are Effective at Investigating Beta Diversity in Medically Relevant 16S Amplicon Sequencing Data;Microbiology Spectrum;2023-04-13