Affiliation:
1. Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, United States
2. School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, United States
Abstract
Motivation
Genomic distance estimation is a critical workload because exact computation of whole-genome similarity metrics such as Average Nucleotide Identity (ANI) incurs prohibitive runtime overhead. Genome sketching is a fast and memory-efficient way to estimate ANI by distilling representative k-mers from the original sequences. In this work, we present HyperGen, which improves accuracy, runtime performance, and memory efficiency for large-scale ANI estimation. Unlike existing genome sketching algorithms that convert large genome files into discrete k-mer hashes, HyperGen leverages hyperdimensional computing (HDC), an emerging computing paradigm, to encode genomes into quasi-orthogonal vectors (hypervectors, HVs) in a high-dimensional space. HVs are compact and can preserve more information, allowing accurate ANI estimation with smaller sketch sizes. In particular, the HV sketch representation in HyperGen allows ANI to be estimated with vector multiplication, which naturally benefits from highly optimized general matrix multiply (GEMM) routines. As a result, HyperGen enables efficient sketching and ANI estimation for massive genome collections.
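The idea of encoding k-mers as quasi-orthogonal hypervectors and estimating ANI from vector products can be illustrated with a minimal Python sketch. This is not the HyperGen implementation (which is written in Rust); it rests on several assumptions that are not stated in the abstract: each k-mer is mapped to a pseudo-random bipolar (+1/-1) vector seeded by a hash of the k-mer, a genome sketch is the element-wise sum of those vectors, set sizes and the intersection size are approximated by dot products divided by the dimension, and the Jaccard index is converted to ANI with the Mash-style formula ANI = 1 + ln(2J / (1 + J)) / k. The names `DIM`, `K`, `kmer_hv`, `sketch`, and `estimate_ani` are hypothetical, and k-mer sampling and canonical k-mers are omitted for brevity.

```python
import hashlib
import math
import random

DIM = 2048  # hypervector dimension (assumed value, for illustration only)
K = 21      # k-mer length (assumed value, for illustration only)


def kmers(seq, k=K):
    """Return the set of distinct k-mers in seq (no sampling, no canonical form)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_hv(kmer, dim=DIM):
    """Map a k-mer to a pseudo-random bipolar vector seeded by its hash.

    Independent random bipolar vectors are nearly orthogonal in high
    dimensions, which is the property this sketch relies on.
    """
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    bits = rng.getrandbits(dim)
    return [1 if (bits >> i) & 1 else -1 for i in range(dim)]


def sketch(seq, k=K, dim=DIM):
    """Sum the hypervectors of all distinct k-mers into one sketch vector."""
    hv = [0] * dim
    for km in kmers(seq, k):
        for i, v in enumerate(kmer_hv(km, dim)):
            hv[i] += v
    return hv


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def estimate_ani(hv_a, hv_b, k=K):
    """Estimate ANI from two sketches using only dot products."""
    dim = len(hv_a)
    size_a = dot(hv_a, hv_a) / dim  # approximates the k-mer count of A
    size_b = dot(hv_b, hv_b) / dim  # approximates the k-mer count of B
    inter = dot(hv_a, hv_b) / dim   # approximates the shared k-mer count
    union = size_a + size_b - inter
    if union <= 0:
        return 0.0
    jaccard = min(max(inter / union, 0.0), 1.0)
    if jaccard == 0.0:
        return 0.0
    # Mash-style conversion from Jaccard index to ANI.
    return 1.0 + math.log(2.0 * jaccard / (1.0 + jaccard)) / k
```

Under these assumptions, comparing one query against many references reduces to dot products between the query sketch and each reference sketch. Stacking the reference sketches as rows of a matrix turns the whole search into a matrix-vector or matrix-matrix product, which is where GEMM routines would apply.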
Results
We evaluate HyperGen's sketching and database search performance on several genome datasets at various scales. HyperGen achieves comparable or superior ANI estimation error and linearity relative to other sketch-based tools. Measurements show that HyperGen is one of the fastest tools for both genome sketching and database search. At the same time, HyperGen produces memory-efficient sketch files while maintaining high ANI estimation accuracy.
Availability and implementation
A Rust implementation of HyperGen is freely available under the MIT license as an open-source software project at https://github.com/wh-xu/Hyper-Gen. The scripts to reproduce the experimental results can be accessed at https://github.com/wh-xu/experiment-hyper-gen.
Funder
Center for Processing with Intelligent Storage and Memory
Publisher
Oxford University Press (OUP)