
LINflow: a computational pipeline that combines an alignment-free with an alignment-based method to accelerate generation of similarity matrices for prokaryotic genomes

Published: 2021-03-24 Volume: 9 Page: e10906
ISSN: 2167-8359
Container-title: PeerJ
Language: en

Author:

Tian Long¹, Mazloom Reza², Heath Lenwood S.², Vinatzer Boris A.¹

Affiliation:

1. School of Plant and Environmental Sciences, Virginia Tech, Blacksburg, VA, USA

2. Department of Computer Science, Virginia Tech, Blacksburg, VA, USA

Abstract

Background: Computing genomic similarity between strains is a prerequisite for genome-based prokaryotic classification and identification. Genomic similarity was first computed as Average Nucleotide Identity (ANI) values based on the alignment of genomic fragments. Because this is computationally expensive, faster and cheaper alignment-free methods have been developed to estimate ANI. However, these methods do not reach the accuracy of alignment-based methods.

Methods: Here we introduce LINflow, a computational pipeline that infers pairwise genomic similarity in a set of genomes. LINflow uses the fast alignment-free sourmash tool to identify the genome in a dataset that is most similar to a query genome, and then uses the alignment-based pyani software to compute a precise ANI value between the query genome and that most similar genome. This is repeated for each new genome that is added to the dataset. The sequentially computed ANI values are stored as Life Identification Numbers (LINs), which are then used to infer all other pairwise ANI values in the set. We tested LINflow on four sets, 484 genomes in total, and compared its running time and the generated similarity matrices with those of other tools.

Results: LINflow is up to 150 times faster than pyani, and pairwise ANI values generated by LINflow are highly correlated with those computed by pyani. However, because LINflow infers most pairwise ANI values instead of computing them directly, its values occasionally depart from the ANI values computed by pyani.

Conclusions: LINflow is a fast and memory-efficient pipeline to infer similarity among a large set of prokaryotic genomes. Its ability to quickly add new genome sequences to an already computed similarity matrix makes LINflow particularly useful for projects in which new genome sequences need to be regularly added to an existing dataset.
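The workflow described in the abstract can be illustrated with a small Python sketch. This is not the LINflow code and it does not call sourmash or pyani: the `fast_search` and `precise_ani` callables stand in for those tools, and the threshold values, the function names, and the exact LIN assignment rule (share the closest genome's leading positions up to the ANI reached, then take a new number at the next position) are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical ANI thresholds (percent), one per LIN position. Illustrative only.
THRESHOLDS = [70.0, 80.0, 90.0, 95.0, 99.0]

LIN = Tuple[int, ...]


def assign_lin(ani_to_best: float, best_lin: LIN, existing: List[LIN]) -> LIN:
    """Derive a LIN for a new genome from the ANI to its closest genome."""
    # Number of leading positions shared with the closest genome.
    shared = sum(1 for t in THRESHOLDS if ani_to_best >= t)
    if shared == len(THRESHOLDS):
        return best_lin  # above every threshold: same LIN in this sketch
    prefix = best_lin[:shared]
    # Take the next unused number at the first position that differs.
    used = [lin[shared] for lin in existing if lin[:shared] == prefix]
    new_value = max(used) + 1
    return prefix + (new_value,) + (0,) * (len(THRESHOLDS) - shared - 1)


def add_genome(
    name: str,
    lins: Dict[str, LIN],
    fast_search: Callable[[str, List[str]], str],  # stand-in for sourmash
    precise_ani: Callable[[str, str], float],      # stand-in for pyani
) -> LIN:
    """Add one genome: one fast search plus a single precise ANI computation."""
    if not lins:
        lins[name] = (0,) * len(THRESHOLDS)
        return lins[name]
    best = fast_search(name, list(lins))
    ani = precise_ani(name, best)
    lins[name] = assign_lin(ani, lins[best], list(lins.values()))
    return lins[name]


def infer_ani(lin_a: LIN, lin_b: LIN) -> float:
    """Infer a lower bound on ANI from the shared leading LIN positions."""
    shared = 0
    for x, y in zip(lin_a, lin_b):
        if x != y:
            break
        shared += 1
    return THRESHOLDS[shared - 1] if shared else 0.0


# Toy data: made-up ANI values for three made-up genomes.
TOY_ANI = {
    frozenset(("A", "B")): 96.0,
    frozenset(("A", "C")): 85.0,
    frozenset(("B", "C")): 85.0,
}


def toy_ani(a: str, b: str) -> float:
    return TOY_ANI[frozenset((a, b))]


def toy_search(query: str, candidates: List[str]) -> str:
    # A real alignment-free search would use an estimate, not the exact values.
    return max(candidates, key=lambda c: toy_ani(query, c))


lins: Dict[str, LIN] = {}
for genome in ["A", "B", "C"]:
    add_genome(genome, lins, toy_search, toy_ani)

for a, b in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(a, b, infer_ani(lins[a], lins[b]))
```

In this sketch each added genome costs one precise ANI computation rather than one per existing genome, and the B–C similarity is never computed directly but read off the two LINs. That is also why an inferred value can differ from a directly computed one: the inference only recovers the threshold bracket, not the exact ANI.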

Funder

National Science Foundation

College of Agriculture and Life Sciences at Virginia Polytechnic Institute and State University

Virginia Agricultural Experiment Station and the Hatch Program of the National Institute of Food and Agriculture, US Department of Agriculture

Publisher

PeerJ

Subject

General Agricultural and Biological Sciences; General Biochemistry, Genetics and Molecular Biology; General Medicine; General Neuroscience

Link

https://peerj.com/articles/10906.pdf


Cited by 6 articles.

1. Rapid and Accurate Estimation of Genetic Relatedness Between Millions of Viral Genome Pairs Using MANIAC;2024-04-28

2. Genomic delineation and description of species and within-species lineages in the genus Pantoea;Frontiers in Microbiology;2023-11-09

3. Dysgonomonas mossii Strain Shenzhen WH 0221, a New Member of the Genus Dysgonomonas Isolated from the Blood of a Patient with Diabetic Nephropathy, Exhibits Multiple Antibiotic Resistance;Microbiology Spectrum;2022-08-31

4. LINgroups as a principled approach to compare and integrate multiple bacterial taxonomies;Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics;2022-08-07

5. Meta-analysis of the Ralstonia solanacearum species complex (RSSC) based on comparative evolutionary genomics and reverse ecology;Microbial Genomics;2022-03-17