BinDash 2.0: New MinHash Scheme Allows Ultra-fast and Accurate Genome Search and Comparisons

Published: 2024-03-14

Authors:

Zhao Jianshu, Zhao Xiaofei, Pierre-Both Jean, Konstantinidis Konstantinos T.

Abstract

Motivation: Comparing a large number of genomes in terms of their genomic distance is becoming increasingly challenging because the number of microbial genomes deposited in public databases keeps growing. Today we may need to estimate pairwise distances between millions or even billions of genomes. Few software tools can perform such comparisons efficiently.

Results: Here we update the multi-threaded software BinDash by implementing several new MinHash algorithms and computational optimizations (e.g., Single Instruction, Multiple Data, SIMD) for ultra-fast and accurate genome search and comparison at the trillion scale. Specifically, we implemented b-bit one-permutation rolling MinHash with optimal/faster densification, accelerated with SIMD. With BinDash 2, we can perform 0.1 trillion (10^11) pairwise genome comparisons in about 1.8 hours on a decent computer cluster, or in several hours on a personal laptop, an improvement of ∼50% or more over the original version. The ANI (average nucleotide identity) estimated by BinDash correlates well with that of accurate but much slower ANI estimators such as FastANI or alignment-based ANI. In line with the findings from comparing 90K genomes (∼10^9 comparisons) via FastANI, the 85%-95% ANI gap is consistent in our study of ∼10^11 prokaryotic genome comparisons via BinDash 2, which points to fundamental ecological and evolutionary forces that keep species-like units (e.g., >95% ANI) together.

Availability and implementation: BinDash is released under the Apache 2.0 license at: https://github.com/zhaoxiaofei/bindash

Contact: kostas.konstantinidis@gatech.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
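The scheme named in the Results (b-bit one-permutation MinHash with densification) can be illustrated with a minimal Python sketch. This is not BinDash's code. The hash function (BLAKE2b), the way k-mers are assigned to bins, the re-hashing used to fill empty bins, the parameter names `m`, `b` and `k`, and the Mash-style conversion from Jaccard index to ANI are all assumptions chosen for illustration. Canonical k-mers, rolling hashes and SIMD are left out.

```python
import hashlib
import math


def kmer_set(seq, k):
    """Return the set of k-mers in seq (no reverse-complement handling)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(text):
    """Deterministic 64-bit hash. BLAKE2b is a stand-in, not BinDash's hash."""
    digest = hashlib.blake2b(text.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def oph_sketch(kmers, m, b):
    """One-permutation MinHash: hash each k-mer once, keep the minimum per bin."""
    if not kmers:
        raise ValueError("cannot sketch an empty k-mer set")
    bins = [None] * m
    for kmer in kmers:
        h = hash64(kmer)
        idx, val = h % m, h // m  # bin index and value within the bin
        if bins[idx] is None or val < bins[idx]:
            bins[idx] = val
    # Densification (assumed variant): each empty bin re-hashes its own index
    # until it lands on a bin that was non-empty before densification, then
    # copies that value. Both sketches use the same probe sequence.
    dense = list(bins)
    for i in range(m):
        attempt = 0
        while dense[i] is None:
            attempt += 1
            dense[i] = bins[hash64(f"{i}:{attempt}") % m]
    mask = (1 << b) - 1  # b-bit MinHash: keep only the lowest b bits
    return [v & mask for v in dense]


def jaccard_estimate(sketch_a, sketch_b, b):
    """Fraction of matching bins, corrected for chance b-bit collisions."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b)) / len(sketch_a)
    collision = 2.0 ** -b
    return min(1.0, max(0.0, (matches - collision) / (1.0 - collision)))


def ani_from_jaccard(j, k):
    """Mash-style conversion (an assumption here): D = -ln(2J / (1 + J)) / k."""
    if j <= 0.0:
        return 0.0
    distance = -math.log(2.0 * j / (1.0 + j)) / k
    return min(1.0, max(0.0, 1.0 - distance))
```

To compare two sequences, build `kmer_set(seq, k)` for each, sketch both with the same `m` and `b`, and pass the resulting Jaccard estimate to `ani_from_jaccard`. Each bin occupies only `b` bits, so a sketch stays small and bin comparisons reduce to equality tests on short integers, which is the kind of operation that vectorizes well.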

Publisher

Cold Spring Harbor Laboratory

Cited by 1 article.

1. A survey of k-mer methods and applications in bioinformatics. Computational and Structural Biotechnology Journal, 2024-12.