Benchmark of Data Processing Methods and Machine Learning Models for Gut Microbiome-Based Diagnosis of Inflammatory Bowel Disease

Published: 2022-02-14
Volume: 13
ISSN: 1664-8021
Container-title: Frontiers in Genetics
Short-container-title: Front. Genet.

Author:

Kubinski Ryszard, Djamen-Kepaou Jean-Yves, Zhanabaev Timur, Hernandez-Garcia Alex, Bauer Stefan, Hildebrand Falk, Korcsmaros Tamas, Karam Sani, Jantchou Prévost, Kafi Kamran, Martin Ryan D.

Abstract

Patients with inflammatory bowel disease (IBD) wait months and undergo numerous invasive procedures between the initial appearance of symptoms and receiving a diagnosis. To reduce time to diagnosis and improve patient wellbeing, machine learning algorithms capable of diagnosing IBD from the gut microbiome’s composition are currently being explored. To date, these models have had limited clinical application because their performance decreases when they are applied to a new cohort of patient samples. Various methods have been developed to analyze microbiome data, which may improve the generalizability of machine learning IBD diagnostic tests. Given this abundance of methods, there is a need to benchmark the performance and generalizability of various machine learning pipelines (from data processing to training a machine learning model) for microbiome-based IBD diagnostic tools. We collected fifteen 16S rRNA microbiome datasets (7,707 samples) from North America to benchmark combinations of gut microbiome features, data normalization and transformation methods, batch effect correction methods, and machine learning models. Pipeline generalizability to new cohorts of patients was evaluated with two binary classification metrics following leave-one-dataset-out (LODO) cross-validation, in which all samples from one study were left out of the training set and used for testing. We demonstrate that taxonomic features processed with a compositional transformation method and batch effect correction with the naive zero-centering method attain the best classification performance. In addition, machine learning models that identify non-linear decision boundaries between labels are more generalizable than those that are linearly constrained. Lastly, we illustrate the importance of generating a curated training dataset to ensure similar performance across patient demographics. These findings will help improve the generalizability of machine learning models as we move towards non-invasive diagnostic and disease management tools for patients with IBD.
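
As an illustration only, the evaluation scheme named in the abstract can be sketched in a few lines of Python. This is not the authors' pipeline. It assumes the compositional transformation is a centered log-ratio (CLR) with a pseudocount of 0.5, and that zero-centering means subtracting each dataset's per-feature mean. The toy counts, the dataset labels, and the function names `clr`, `zero_center_by_dataset`, and `lodo_splits` are all hypothetical.

```python
import math
from collections import defaultdict


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's taxon counts.

    The pseudocount (an assumed value) avoids taking log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def zero_center_by_dataset(features, dataset_ids):
    """Subtract each dataset's per-feature mean from its own samples."""
    groups = defaultdict(list)
    for i, d in enumerate(dataset_ids):
        groups[d].append(i)
    centered = [None] * len(features)
    for idx in groups.values():
        n_feat = len(features[idx[0]])
        means = [sum(features[i][j] for i in idx) / len(idx)
                 for j in range(n_feat)]
        for i in idx:
            centered[i] = [features[i][j] - means[j] for j in range(n_feat)]
    return centered


def lodo_splits(dataset_ids):
    """Yield (held_out_id, train_indices, test_indices), one fold per dataset."""
    for held_out in sorted(set(dataset_ids)):
        train = [i for i, d in enumerate(dataset_ids) if d != held_out]
        test = [i for i, d in enumerate(dataset_ids) if d == held_out]
        yield held_out, train, test


# Toy data: three hypothetical studies, three taxa per sample.
counts = [[10, 0, 5], [3, 7, 1], [8, 2, 2], [1, 9, 4], [6, 6, 0], [2, 3, 9]]
dataset_ids = ["A", "A", "B", "B", "C", "C"]

features = zero_center_by_dataset([clr(row) for row in counts], dataset_ids)
folds = list(lodo_splits(dataset_ids))
for held_out, train, test in folds:
    print(held_out, "train:", train, "test:", test)
```

In each fold a classifier would be fitted on the `train` rows of `features` and scored on the `test` rows, so every study serves once as an unseen cohort. Note that centering the held-out study uses its feature means but not its labels.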

Funder

Biotechnology and Biological Sciences Research Council

Horizon 2020

Publisher

Frontiers Media SA

Subject

Genetics (clinical), Genetics, Molecular Medicine

Cited by 10 articles.

1. A Machine Learning-Based Diagnostic Model for Crohn’s Disease and Ulcerative Colitis Utilizing Fecal Microbiome Analysis;Microorganisms;2023-12-24

2. Nine (not so simple) steps: a practical guide to using machine learning in microbial ecology;mBio;2023-12-21

3. Overview of data preprocessing for machine learning applications in human microbiome research;Frontiers in Microbiology;2023-10-05

4. Potential Oral Microbial Markers for Differential Diagnosis of Crohn’s Disease and Ulcerative Colitis Using Machine Learning Models;Microorganisms;2023-06-26

5. Conditional Forest Models Built Using Metagenomic Data Accurately Predicted Salmonella Contamination in Northeastern Streams;Microbiology Spectrum;2023-04-13