Improving ISOMAP Efficiency with RKS: A Comparative Study with t-Distributed Stochastic Neighbor Embedding on Protein Sequences
Authors:
Ali Sarwan¹, Patterson Murray¹
Affiliation:
1. Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
Abstract
Data visualization plays a crucial role in gaining insights from high-dimensional datasets. ISOMAP is a popular algorithm that maps high-dimensional data into a lower-dimensional space while preserving the underlying geometric structure. However, ISOMAP can be computationally expensive, especially for large datasets, because it requires computing the pairwise distances between data points. This study aims to improve its efficiency with an approximate method based on random kitchen sinks (RKS), which provides a faster way to compute the kernel matrix. Using RKS significantly reduces the computational complexity of ISOMAP while still yielding a meaningful low-dimensional representation of the data. We compare the approximate, RKS-based ISOMAP approach with the traditional t-distributed stochastic neighbor embedding (t-SNE) algorithm. For the comparison, we compute distance matrices from the original high-dimensional data and from the low-dimensional data produced by both t-SNE and ISOMAP. The quality of the low-dimensional embeddings is measured using several metrics, including mean squared error (MSE), mean absolute error (MAE), and explained variance score (EVS). Additionally, the runtime of each algorithm is recorded to assess its computational efficiency. The comparison is conducted on a set of protein sequences, which are used in many bioinformatics tasks. We use three different embedding methods, based on k-mers, minimizers, and position weight matrices (PWM), to capture various aspects of the underlying structure and the relationships between the protein sequences. By comparing the different embeddings and evaluating the RKS-based approximate ISOMAP approach against t-SNE, we provide insights into the efficacy of our proposed approach. Our goal is to retain the quality of the low-dimensional embeddings while improving the computational performance.
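To illustrate the kernel approximation that the abstract refers to, the following is a minimal sketch of random kitchen sinks in the form of random Fourier features for a Gaussian (RBF) kernel. It is not the authors' implementation: the function names, the choice of the RBF kernel, the values of `gamma` and `n_components`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def rks_features(X, n_components=2000, gamma=0.5, seed=0):
    """Map X to random Fourier features Z so that Z @ Z.T approximates
    the RBF kernel exp(-gamma * ||x - y||^2). All defaults are assumed values."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies follow the Fourier transform of the RBF kernel: N(0, 2 * gamma * I).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_components))
    # Random phase shifts drawn uniformly from [0, 2 * pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)
    return np.sqrt(2.0 / n_components) * np.cos(X @ W + b)


def exact_rbf_kernel(X, gamma=0.5):
    """Exact RBF kernel matrix, computed from all pairwise squared distances."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


# Synthetic stand-in for sequence embeddings (hypothetical data, not protein sequences).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(50, 5))

Z = rks_features(X)          # explicit feature map, shape (n_samples, n_components)
K_approx = Z @ Z.T           # approximate kernel matrix
K_exact = exact_rbf_kernel(X)

max_abs_err = np.max(np.abs(K_approx - K_exact))
mean_abs_err = np.mean(np.abs(K_approx - K_exact))
print(f"max abs error:  {max_abs_err:.4f}")
print(f"mean abs error: {mean_abs_err:.4f}")
```

The approximation error shrinks roughly as one over the square root of `n_components`, so this parameter trades accuracy against runtime. The potential saving comes from working with the explicit feature matrix `Z` instead of forming all pairwise kernel entries directly.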
Funder
Molecular Basis of Disease (MBD) fellowship at Georgia State University; Startup Grant at Georgia State University