Abstract
Dimension reduction (or embedding) is a popular way to visualize data and a fundamental technique in many applications. Non-linear dimension reduction methods such as t-SNE and UMAP have been widely used for visualizing single-cell RNA sequencing data and for metagenomic binning, and have therefore received considerable attention in bioinformatics and computational biology. In this paper, we improve UMAP-like non-linear dimension reduction algorithms by updating the graph-based nearest neighbor search algorithm (we use Hierarchical Navigable Small World graphs, or HNSW, instead of K-graph) and by combining several aspects of t-SNE and UMAP into a new non-linear dimension reduction algorithm. We also provide several additional features, including computation of LID (Local Intrinsic Dimension) and hubness. Both reflect structures and properties of the underlying data that strongly affect the nearest neighbor search in traditional UMAP-like algorithms, and thus the quality of the embeddings. In addition, we combine the improved non-linear dimension reduction algorithm with probabilistic data structures such as MinHash-like sketches (e.g., ProbMinHash) for large-scale visualization of biological sequence data. Our library is called annembed; it is implemented and fully parallelized in Rust. We benchmark it against the popular tools mentioned above using standard test datasets, and it shows competitive accuracy. Additionally, we apply our library to three real-world problems: visualizing a large-scale microbial genomic database, visualizing single-cell RNA sequencing data, and metagenomic binning. These showcase the performance, scalability and efficiency of the library when distance computation is expensive or when the number of data points is large (e.g., millions or billions). Annembed can be found here: https://github.com/jean-pierreBoth/annembed
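To make the two diagnostics named in the abstract concrete, the following is a minimal sketch of how LID and hubness can be estimated from nearest-neighbor distances. It is not annembed's API: every function name here is hypothetical, the neighbor search is an exact brute-force scan rather than HNSW, LID uses the standard maximum-likelihood estimator on the k nearest distances, and hubness is summarized as the skewness of the k-occurrence counts, which is one common convention and an assumption about what the library reports.

```rust
/// Euclidean distance between two points of equal dimension.
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Exact k nearest neighbours by brute force (a stand-in for HNSW).
/// Returns, for each point, its neighbours as (index, distance), closest first.
fn knn_brute(points: &[Vec<f64>], k: usize) -> Vec<Vec<(usize, f64)>> {
    points
        .iter()
        .enumerate()
        .map(|(i, p)| {
            let mut nbrs: Vec<(usize, f64)> = points
                .iter()
                .enumerate()
                .filter(|(j, _)| *j != i)
                .map(|(j, q)| (j, euclidean(p, q)))
                .collect();
            nbrs.sort_by(|a, b| a.1.total_cmp(&b.1));
            nbrs.truncate(k);
            nbrs
        })
        .collect()
}

/// Maximum-likelihood LID estimate from ascending neighbour distances:
/// LID = -1 / mean(ln(r_i / r_k)), where r_k is the largest distance.
/// Returns None when the estimate is undefined (empty input, zero
/// distances, or all distances equal).
fn lid_mle(dists: &[f64]) -> Option<f64> {
    let r_k = *dists.last()?;
    if dists.iter().any(|&r| r <= 0.0) {
        return None;
    }
    let mean_log = dists.iter().map(|&r| (r / r_k).ln()).sum::<f64>() / dists.len() as f64;
    if mean_log == 0.0 {
        None
    } else {
        Some(-1.0 / mean_log)
    }
}

/// k-occurrence: how many times each point appears in other points' k-NN lists.
fn k_occurrence(knn: &[Vec<(usize, f64)>]) -> Vec<usize> {
    let mut counts = vec![0usize; knn.len()];
    for nbrs in knn {
        for &(j, _) in nbrs {
            counts[j] += 1;
        }
    }
    counts
}

/// Skewness of the k-occurrence counts; large positive values indicate
/// a few "hub" points that dominate the neighbour lists.
fn skewness(counts: &[usize]) -> f64 {
    let n = counts.len() as f64;
    let mean = counts.iter().map(|&c| c as f64).sum::<f64>() / n;
    let m2 = counts.iter().map(|&c| (c as f64 - mean).powi(2)).sum::<f64>() / n;
    let m3 = counts.iter().map(|&c| (c as f64 - mean).powi(3)).sum::<f64>() / n;
    if m2 == 0.0 {
        0.0
    } else {
        m3 / m2.powf(1.5)
    }
}
```

In this sketch, one would call `knn_brute` once, pass each point's sorted distances to `lid_mle`, and pass the whole neighbor table to `k_occurrence` followed by `skewness`. The brute-force scan is quadratic in the number of points, so it only illustrates the definitions and is not suitable at the scales the abstract targets.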
Publisher
Cold Spring Harbor Laboratory