HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors

Published: 2024-03-08

Author:

Xu Weihong, Hsu Po-Kai, Moshiri Niema, Yu Shimeng, Rosing Tajana

Abstract

Motivation: Genomic distance estimation is a critical workload since exact computation for whole-genome similarity metrics such as Average Nucleotide Identity (ANI) incurs prohibitive runtime overhead. Genome sketching is a fast and memory-efficient solution to estimate ANI similarity by distilling representative k-mers from the original sequences. In this work, we present HyperGen that improves accuracy, runtime performance, and memory efficiency for large-scale ANI estimation. Unlike existing genome sketching algorithms that convert large genome files into discrete k-mer hashes, HyperGen leverages the emerging hyperdimensional computing (HDC) to encode genomes into quasi-orthogonal vectors (Hypervector, HV) in high-dimensional space. HV is compact and can preserve more information, allowing for accurate ANI estimation while reducing required sketch sizes. In particular, the HV sketch representation in HyperGen allows efficient ANI estimation using vector multiplication, which naturally benefits from highly optimized general matrix multiply (GEMM) routines. As a result, HyperGen enables the efficient sketching and ANI estimation for massive genome collections.

Results: We evaluate HyperGen’s sketching and database search performance using several genome datasets at various scales. HyperGen is able to achieve comparable or superior ANI estimation error and linearity compared to other sketch-based counterparts. The measurement results show that HyperGen is one of the fastest tools for both genome sketching and database search. Meanwhile, HyperGen produces memory-efficient sketch files while ensuring high ANI estimation accuracy.

Availability: A Rust implementation of HyperGen is freely available under the MIT license as an open-source software project at https://github.com/wh-xu/Hyper-Gen. The scripts to reproduce the experimental results can be accessed at https://github.com/wh-xu/experiment-hyper-gen.

Contact: wexu@ucsd.edu
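The idea of encoding k-mers into hypervectors and estimating ANI from vector products can be illustrated with a toy sketch. This is not HyperGen's code, and every name and parameter in it (`kmer_hashes`, `sketch`, `estimate_jaccard`, `estimate_ani`, `K`, `DIM`) is hypothetical. It assumes each k-mer hash seeds a pseudo-random ±1 vector, that a genome's sketch is the sum of those vectors, and that a Jaccard estimate is converted to ANI with the Mash-style formula 1 + ln(2J / (1 + J)) / k. It skips k-mer sampling and canonical (reverse-complement) k-mers for brevity.

```python
import hashlib
import math
import random

K = 21      # k-mer length (assumed; a common choice for ANI sketching)
DIM = 8192  # hypervector dimension (assumed for this toy example)


def kmer_hashes(seq, k=K):
    """Hash every k-mer of seq to a 64-bit integer; return the distinct hashes."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def sketch(seq, k=K, dim=DIM):
    """Encode a sequence as one hypervector: the sum of a +/-1 vector per k-mer."""
    hv = [0] * dim
    for h in kmer_hashes(seq, k):
        # The k-mer hash seeds a PRNG, so equal k-mers always map to the same
        # vector while distinct k-mers map to nearly orthogonal ones.
        bits = format(random.Random(h).getrandbits(dim), "0{}b".format(dim))
        hv = [x + (1 if b == "1" else -1) for x, b in zip(hv, bits)]
    return hv


def dot(a, b):
    """Plain dot product of two equal-length integer vectors."""
    return sum(x * y for x, y in zip(a, b))


def estimate_jaccard(hv_a, hv_b):
    """Estimate k-mer set Jaccard similarity from dot products alone."""
    dim = len(hv_a)
    inter = dot(hv_a, hv_b) / dim   # approximates |A intersect B|
    size_a = dot(hv_a, hv_a) / dim  # approximates |A|
    size_b = dot(hv_b, hv_b) / dim  # approximates |B|
    union = size_a + size_b - inter
    if union <= 0:
        return 0.0
    return max(0.0, min(1.0, inter / union))


def estimate_ani(hv_a, hv_b, k=K):
    """Convert the Jaccard estimate to ANI with the Mash-style formula."""
    j = estimate_jaccard(hv_a, hv_b)
    if j <= 0.0:
        return 0.0
    return max(0.0, 1.0 + math.log(2.0 * j / (1.0 + j)) / k)


# Toy data: a random 400 bp "genome" and a copy with 8 substitutions (2%).
rng = random.Random(42)
ref = "".join(rng.choice("ACGT") for _ in range(400))
swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
mut = "".join(swap[c] if i % 50 == 25 else c for i, c in enumerate(ref))

hv_ref, hv_mut = sketch(ref), sketch(mut)
print("estimated ANI:", round(estimate_ani(hv_ref, hv_mut), 4))
```

Because every comparison reduces to dot products, stacking many sketches as the rows of a matrix turns an all-against-all database search into a single matrix multiplication, which is where an optimized GEMM routine would apply.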

Publisher

Cold Spring Harbor Laboratory
