Abstract
Genome search and/or classification is a key step in microbiome studies and has recently become more challenging because the number of available (reference) genomes keeps increasing and traditional methods do not scale well with larger databases. By combining a k-mer hashing-based genomic distance metric (ProbMinHash) with a graph-based nearest neighbor search algorithm (Hierarchical Navigable Small World Graphs, or HNSW), we developed a new program, GSearch, that is at least ten times faster than alternative tools, owing to its O(log(N)) time complexity, while maintaining high accuracy. GSearch can identify/classify 8,000 query genomes against all available microbial and viral species with sequenced genome representatives (n ≈ 65,000) within several minutes on a personal laptop, using only ∼6 GB of memory. Further, GSearch can scale to millions of database genomes by means of a database splitting strategy. Therefore, GSearch solves a major bottleneck of microbiome studies that require genome search and/or classification.
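To make the idea in the abstract concrete, the following is a minimal sketch of k-mer hashing-based distance estimation followed by a nearest-neighbor lookup. It is not GSearch's implementation and makes two simplifying assumptions: classic (unweighted) MinHash stands in for ProbMinHash, and a linear scan over the database stands in for the HNSW graph search, so this version runs in O(N) per query rather than O(log(N)). All function names (`kmers`, `minhash_sketch`, `sketch_distance`, `nearest`) and parameter defaults are hypothetical and chosen only for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer under a given seed."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=4, num_hashes=64):
    """Classic MinHash sketch: the minimum hash value per seeded hash function.

    Assumption: this is plain MinHash, used here as a stand-in for ProbMinHash.
    """
    km = kmers(seq, k)
    if not km:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(x, seed) for x in km) for seed in range(num_hashes)]


def sketch_distance(a, b):
    """Estimate 1 - Jaccard similarity as the fraction of mismatching slots."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 1.0 - matches / len(a)


def nearest(query_sketch, database):
    """Return the name of the database sketch closest to the query.

    Assumption: a linear scan; a graph index such as HNSW would replace this.
    """
    return min(database, key=lambda name: sketch_distance(query_sketch, database[name]))


# Toy example with two made-up sequences that share no 4-mers.
genome_a = "ACGTACGTACGTACGT"
genome_b = "TTTTGGGGCCCCAAAA"
database = {
    "genome_a": minhash_sketch(genome_a),
    "genome_b": minhash_sketch(genome_b),
}
best_hit = nearest(minhash_sketch(genome_a), database)
```

In this sketch the per-genome signature has a fixed size regardless of genome length, which is what keeps memory use small; the speedup described in the abstract would come from replacing the linear scan in `nearest` with a graph-based index.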
Publisher
Cold Spring Harbor Laboratory
Cited by
2 articles.