An Improved Machine Learning-Based Approach to Assess the Microbial Diversity in Major North Indian River Ecosystems

Author:

Choudhury Nalinikanta (1, 2), Sahu Tanmaya Kumar (3), Rao Atmakuri Ramakrishna (2, 4), Rout Ajaya Kumar (5, 6), Behera Bijay Kumar (5, 6)

Affiliation:

1. ICAR—Indian Agricultural Research Institute, New Delhi 110012, India

2. ICAR—Indian Agricultural Statistics Research Institute, New Delhi 110012, India

3. ICAR—Indian Grassland and Fodder Research Institute, Jhansi 284003, India

4. Indian Council of Agricultural Research (ICAR), New Delhi 110001, India

5. ICAR—Central Inland Fisheries Research Institute, West Bengal 700120, India

6. Rani Lakshmi Bai Central Agricultural University, Jhansi 284003, India

Abstract

Rapidly evolving high-throughput sequencing (HTS) technologies generate voluminous genomic and metagenomic sequences, which can help classify microbial communities with high accuracy in many ecosystems. Conventionally, rule-based binning techniques are used to classify contigs or scaffolds based on either sequence composition or sequence similarity. However, accurate classification of microbial communities remains a major challenge because of the massive data volumes involved and the need for efficient binning methods and classification algorithms. We therefore implemented iterative K-Means clustering for the initial binning of metagenomic sequences and applied various machine learning algorithms (MLAs) to classify newly identified unknown microbes. The clusters were annotated with the NCBI BLAST program, which grouped the assembled scaffolds into five classes: bacteria, archaea, eukaryota, viruses and others. The annotated cluster sequences were then used to train the MLAs and develop prediction models for classifying unknown metagenomic sequences. We used metagenomic datasets from samples collected from the Ganga (Kanpur and Farakka) and the Yamuna (Delhi) rivers in India for clustering and for training the MLA models. The performance of the MLAs was evaluated by 10-fold cross-validation. The model based on Random Forest outperformed the other learning algorithms considered. The proposed method can be used for annotating metagenomic scaffolds/contigs and is complementary to existing methods of metagenomic data analysis. The source code of an offline predictor with the best prediction model is available at https://github.com/Nalinikanta7/metagenomics.
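
The composition-based binning step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which sequence features or K-Means settings were used, so everything below is an assumption for illustration: dinucleotide frequencies as features, a plain Lloyd-style K-Means seeded with the first k points, and four made-up toy scaffolds. The function names `kmer_profile` and `kmeans` are hypothetical and are not taken from the authors' repository.

```python
from itertools import product
import math


def kmer_profile(seq, k=2):
    """Return normalized k-mer frequencies over the alphabet ACGT (assumed feature set)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on very short input
    return [counts[w] / total for w in kmers]


def kmeans(points, k, max_iter=100):
    """Plain Lloyd-style K-Means; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy scaffolds (invented for illustration): two AT-rich, two GC-rich.
scaffolds = [
    "ATATATATTATAATAT",
    "GCGCGGCCGCGCCGGC",
    "AATTATATAATATTAA",
    "GGCCGCGCGGCGCCGC",
]
features = [kmer_profile(s) for s in scaffolds]
labels, centroids = kmeans(features, k=2)
print(labels)  # [0, 1, 0, 1]
```

In a workflow like the one the abstract outlines, cluster labels of this kind would then be annotated (for example by BLAST) and the annotated sequences used as training data for a classifier such as Random Forest; that part is not reproduced here.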

Funder

Post Graduate School, ICAR-Indian Agricultural Research Institute

Indian Council of Agricultural Research

Publisher

MDPI AG

Subject

Genetics (clinical), Genetics

