Enhancing t-SNE Performance for Biological Sequencing Data through Kernel Selection

Published: 2023-08-22

Authors:

Chourasia Prakash, Murad Taslim, Ali Sarwan, Patterson Murray

Abstract

Biological sequencing data encode the genetic code for many different proteins and offer vital insight into the genetic evolution of viruses. While machine learning approaches are becoming increasingly popular for many “Big Data” problems, they have made little progress in comprehending the nature of such data. One such approach is t-distributed Stochastic Neighbour Embedding (t-SNE), a general-purpose method for representing high-dimensional data in a low-dimensional (LD) space while preserving the similarity between data points. Traditionally, t-SNE is used with the Gaussian kernel. However, because the Gaussian kernel is not data-dependent, it determines each local bandwidth based on one local point only. This makes it computationally expensive and hence limits its scalability. Moreover, it can misrepresent some structures in the data. An alternative is the isolation kernel, a data-dependent method with a single parameter to tune when computing the kernel. Although the isolation kernel scales better and better preserves similarity in LD space, it may still not perform optimally in some cases. This paper presents a perspective on improving the performance of t-SNE and argues that kernel selection can affect this performance. We use 9 different kernels to evaluate their impact on the performance of t-SNE, using SARS-CoV-2 “spike” protein sequences. With three different embedding methods, we show that the cosine similarity kernel gives the best results and enhances the performance of t-SNE.
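To make the idea of swapping the kernel concrete, the following is a minimal sketch of how a cosine similarity kernel could be turned into the pairwise affinities that t-SNE tries to preserve. It is not the authors' implementation. The k-mer count embedding, the toy sequences, and the helper names (`kmer_vector`, `cosine_kernel`, `joint_affinities`) are assumptions made purely for illustration; the abstract does not specify the three embedding methods or how the kernel is plugged into t-SNE.

```python
from collections import Counter
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard 20-letter protein alphabet


def kmer_vector(seq, k=2):
    """Count overlapping k-mers of `seq` (hypothetical embedding for illustration)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed ordering over all possible k-mers so every vector has the same length.
    return [counts["".join(p)] for p in product(AMINO_ACIDS, repeat=k)]


def cosine_kernel(vectors):
    """Pairwise cosine similarity matrix K[i][j] = <x_i, x_j> / (|x_i| |x_j|)."""
    norms = [sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            K[i][j] = dot / (norms[i] * norms[j]) if norms[i] and norms[j] else 0.0
    return K


def joint_affinities(K):
    """Zero the diagonal and normalize so all entries sum to 1.

    Assumption: this stands in for the Gaussian-based affinities of standard
    t-SNE. Count vectors are non-negative, so every similarity is in [0, 1].
    """
    n = len(K)
    off = [[K[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    total = sum(map(sum, off))
    if total == 0:
        raise ValueError("all sequences are mutually orthogonal; no affinities")
    return [[x / total for x in row] for row in off]


# Toy sequences, made up for illustration (not real spike protein data).
seqs = ["MFVFLVLLPLVSSQ", "MFVFLVLLPLVSSQ", "MFVFFVLLPLVSSQ", "QCVNLTTRTQLPPA"]
vectors = [kmer_vector(s) for s in seqs]
K = cosine_kernel(vectors)
P = joint_affinities(K)
```

Unlike a Gaussian kernel, this kernel needs no per-point bandwidth search. The resulting matrix `P` would still have to be passed to a t-SNE optimizer to obtain the low-dimensional embedding.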

Publisher

Cold Spring Harbor Laboratory
