Affiliation:
1. School of Plant and Environmental Sciences, Virginia Tech, Blacksburg, VA, USA
2. Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
Abstract
Background
Computing genomic similarity between strains is a prerequisite for genome-based classification and identification of prokaryotes. Genomic similarity was first computed as Average Nucleotide Identity (ANI), based on the alignment of genomic fragments. Because alignment is computationally expensive, faster alignment-free methods have been developed to estimate ANI. However, these methods do not reach the accuracy of alignment-based methods.
Methods
Here we introduce LINflow, a computational pipeline that infers pairwise genomic similarity in a set of genomes. LINflow combines the speed of the alignment-free tool sourmash, which identifies the genome in a dataset that is most similar to a query genome, with the precision of the alignment-based software pyani, which computes ANI between the query genome and that most similar genome. This is repeated for each new genome added to the dataset. The sequentially computed ANI values are stored as Life Identification Numbers (LINs), which are then used to infer all other pairwise ANI values in the set. We tested LINflow on four sets comprising 484 genomes in total and compared its run time and the resulting similarity matrices with those of other tools.
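The workflow described above can be illustrated with a minimal Python sketch. It is not LINflow's actual code: the threshold list, the function names, and the two callables `fast_best_match` (standing in for the sourmash search) and `precise_ani` (standing in for the pyani computation) are all hypothetical, and the sketch infers only an ANI interval from two LINs rather than a single value.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Lin = Tuple[int, ...]

# Hypothetical ANI thresholds (percent), one per LIN position.
# Illustrative values only, not the thresholds used by LINflow.
THRESHOLDS: List[float] = [70.0, 80.0, 90.0, 95.0, 99.0]


def shared_positions(ani: float) -> int:
    """Count the leading LIN positions two genomes with this ANI share."""
    return sum(1 for t in THRESHOLDS if ani >= t)


def assign_lin(best_lin: Lin, ani: float, existing: Iterable[Lin]) -> Lin:
    """Derive a new LIN from the most similar genome's LIN and the ANI to it."""
    k = shared_positions(ani)
    n = len(THRESHOLDS)
    if k == n:
        # Indistinguishable at the finest threshold: reuse the same LIN.
        return tuple(best_lin)
    prefix = tuple(best_lin[:k])
    # Take the next unused value at position k among LINs with this prefix.
    used = [lin[k] for lin in existing if tuple(lin[:k]) == prefix]
    return prefix + (max(used) + 1,) + (0,) * (n - k - 1)


def add_genome(
    name: str,
    dataset: Dict[str, Lin],
    fast_best_match: Callable[[str, List[str]], str],
    precise_ani: Callable[[str, str], float],
) -> Lin:
    """Add one genome using a single fast search and a single precise ANI."""
    if not dataset:
        lin: Lin = (0,) * len(THRESHOLDS)  # the first genome gets all zeros
    else:
        best = fast_best_match(name, list(dataset))  # alignment-free step
        ani = precise_ani(name, best)                # alignment-based step
        lin = assign_lin(dataset[best], ani, list(dataset.values()))
    dataset[name] = lin
    return lin


def infer_ani_bounds(a: Lin, b: Lin) -> Tuple[float, float]:
    """Infer an ANI interval for two genomes from their shared LIN prefix."""
    p = 0
    for x, y in zip(a, b):
        if x != y:
            break
        p += 1
    lower = THRESHOLDS[p - 1] if p > 0 else 0.0
    upper = THRESHOLDS[p] if p < len(THRESHOLDS) else 100.0
    return lower, upper
```

In this sketch each added genome costs one fast search and one precise ANI computation, instead of one precise comparison against every genome already in the set. The price is that pairs never compared directly receive only an inferred similarity, so the result can differ from a directly computed ANI.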
Results
LINflow is up to 150 times faster than pyani, and the pairwise ANI values it generates are highly correlated with those computed by pyani. However, because LINflow infers most pairwise ANI values instead of computing them directly, its values occasionally depart from those computed by pyani. In conclusion, LINflow is a fast and memory-efficient pipeline for inferring similarity among a large set of prokaryotic genomes. Its ability to quickly add new genome sequences to an already computed similarity matrix makes LINflow particularly useful for projects in which new genome sequences are regularly added to an existing dataset.
Funder
National Science Foundation
College of Agriculture and Life Sciences at Virginia Polytechnic Institute and State University
Virginia Agricultural Experiment Station and the Hatch Program of the National Institute of Food and Agriculture, US Department of Agriculture
Subject
General Agricultural and Biological Sciences, General Biochemistry, Genetics and Molecular Biology, General Medicine, General Neuroscience
Cited by
6 articles.