Development of a fast feature extraction method for SARS-CoV-2 spike sequences using amino acid physicochemical properties

Author:

Oka Hiroya, Noba Kosaku, Sasahara Jun, Hashimoto Takayuki, Yoshimoto Shogo, Niioka Hirohiko, Miyake Jun, Hori Katsutoshi

Abstract

COVID-19 continues to spread, and SARS-CoV-2 mutations keep accumulating in databases, so large genomic datasets are now available. However, the sheer size of these datasets makes it challenging to use all of the sequence data without random sampling. A major difficulty for downstream analyses is the growth in dimensionality that occurs when sequences are converted into numerical values with conventional amino acid representation methods, such as one-hot encoding and k-mer-based approaches, which directly reflect the sequence. Moreover, these representations lack physicochemical characteristics, such as structural information and hydrophilicity; hence, they fail to accurately represent the inherent function of the given sequences. In this study, we utilized the physicochemical properties of amino acids to develop a rapid and efficient approach for extracting feature parameters that are suitable for downstream machine learning processes, such as clustering. A fixed-length feature vector representation of a spike sequence with reduced dimensionality was obtained by converting amino acid residues into physicochemical parameters. Next, t-distributed stochastic neighbor embedding (t-SNE), a method for dimensionality reduction and visualization of high-dimensional data, was performed, followed by density-based spatial clustering of applications with noise (DBSCAN). The results show that by using the physicochemical properties of amino acids, rather than conventional methods that directly represent sequences as numerical values, SARS-CoV-2 spike sequences can be clustered with sufficient accuracy and a shorter runtime. Interestingly, the clusters obtained by using amino acid properties include subclusters that are distinct from those produced by the direct representation of amino acid sequences. A more detailed analysis indicated that the contributing parameters of these novel clusters, which were identified exclusively when using the physicochemical properties of amino acids, differ significantly from one another. This suggests that representing amino acid sequences by physicochemical properties might enable the identification of clusters with enhanced sensitivity compared to conventional methods.

Author summary

One of the major reasons SARS-CoV-2 remains a global threat is the rapid emergence of its variants. While analyzing these variants is crucial for understanding the mechanism of outbreaks, the growth in database size is becoming a barrier to effective analysis. In this study, we provide an approach that allows researchers without vast computational resources to comprehensively analyze the variants of the SARS-CoV-2 spike by representing the sequences using the physicochemical properties of amino acids. The clusters derived using this method not only show an accuracy comparable to that of conventional approaches that directly convert sequences into numerical values but also indicate the potential for more detailed clustering outcomes. The results suggest that our approach is valuable for the rapid identification of characteristic residues in new variants of SARS-CoV-2 and other viruses that may arise in the future.
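The encoding idea in the abstract, replacing each residue with a physicochemical value and summarizing the result as a fixed-length vector, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not state which properties the authors used or how they obtained a fixed length, so this sketch uses the Kyte-Doolittle hydropathy index as the only property and a hypothetical binning scheme (`fixed_length_features`, `n_bins`) that averages the values over contiguous segments.

```python
# Kyte-Doolittle hydropathy index, one well-known physicochemical scale.
# Using this single scale is an assumption for illustration; the paper's
# actual property set is not given in the abstract.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_residues(seq, scale=KD_HYDROPATHY, unknown=0.0):
    """Map each residue to one property value; unknown symbols get `unknown`."""
    return [scale.get(aa, unknown) for aa in seq.upper()]


def fixed_length_features(seq, n_bins=8, scale=KD_HYDROPATHY):
    """Hypothetical fixed-length summary: the mean property value in each of
    n_bins contiguous segments, so the output length does not depend on the
    sequence length."""
    values = encode_residues(seq, scale)
    n = len(values)
    features = []
    for b in range(n_bins):
        start = b * n // n_bins
        end = (b + 1) * n // n_bins
        segment = values[start:end]
        # Empty segments occur when the sequence is shorter than n_bins.
        features.append(sum(segment) / len(segment) if segment else 0.0)
    return features


def one_hot_length(seq, alphabet_size=20):
    """Length of a flattened one-hot encoding, for comparison."""
    return len(seq) * alphabet_size
```

With this sketch, a sequence of any length maps to `n_bins` numbers, whereas a flattened one-hot encoding grows as 20 values per residue. The resulting vectors could then be passed to an embedding and clustering step such as scikit-learn's `sklearn.manifold.TSNE` followed by `sklearn.cluster.DBSCAN`; the parameter choices for those steps are not specified in the abstract.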

Publisher

Cold Spring Harbor Laboratory
