Abstract
COVID-19 continues to spread, SARS-CoV-2 mutations continue to accumulate in databases, and large genomic datasets are now available. However, the sheer size of these datasets makes it challenging to use all of the sequence data without random sampling. A major difficulty for downstream analyses is the increase in dimensionality that occurs when sequences are converted into numerical values using conventional amino acid representation methods, such as one-hot encoding and k-mer-based approaches, that directly reflect the sequences. Moreover, these representations lack physicochemical characteristics, such as structural information and hydrophilicity; hence, they fail to accurately represent the inherent function of the given sequences. In this study, we utilized the physicochemical properties of amino acids to develop a rapid and efficient approach for extracting feature parameters that are suitable for downstream machine learning processes, such as clustering. A fixed-length feature vector representation of a spike sequence with reduced dimensionality was obtained by converting amino acid residues into physicochemical parameters. Next, t-distributed stochastic neighbor embedding (t-SNE), a method for dimensionality reduction and visualization of high-dimensional data, was performed, followed by density-based spatial clustering of applications with noise (DBSCAN). The results show that by using the physicochemical properties of amino acids, rather than conventional methods that directly convert sequences into numerical values, SARS-CoV-2 spike sequences can be clustered with sufficient accuracy and a shorter runtime. Interestingly, the clusters obtained by using amino acid properties include subclusters that are distinct from those produced by the direct representation of amino acid sequences.
A more detailed analysis indicated that the contributing parameters of the novel subclusters, which were identified exclusively when utilizing the physicochemical properties of amino acids, differ significantly from one another. This suggests that representing amino acid sequences by physicochemical properties might enable the identification of clusters with enhanced sensitivity compared to conventional methods.
Author summary
One of the major causes of the global threat of SARS-CoV-2 is the rapid emergence of its variants. While analyzing these variants is crucial for understanding the mechanism of outbreaks, the growth of the databases is becoming a barrier to effective analysis. In this study, we provide an approach that allows researchers without vast computational resources to comprehensively analyze the variants of the SARS-CoV-2 spike by representing the sequences using the physicochemical properties of amino acids. The clusters derived using this method not only demonstrate accuracy comparable to that of conventional approaches that directly convert sequences into numerical values but also indicate the potential for more detailed clustering outcomes. The results suggest that our approach is valuable for the rapid identification of characteristic residues in new variants of SARS-CoV-2 and of other viruses that may arise in the future.
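The dimensionality argument made in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the choice of the Kyte-Doolittle hydropathy index as the single physicochemical property, the function names `one_hot_encode` and `property_encode`, and the toy peptide are all assumptions made for illustration, and the sequences are assumed to be aligned to a common length so that the resulting vectors are fixed-length.

```python
# Kyte-Doolittle hydropathy index, used here as ONE illustrative
# physicochemical property (an assumption; the abstract does not list
# which properties were actually used).
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(KD_HYDROPATHY)  # fixed ordering of the 20 residues


def one_hot_encode(seq):
    """Conventional representation: 20 values per residue, flattened."""
    vec = []
    for aa in seq:
        vec.extend(1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS)
    return vec


def property_encode(seq, scale=KD_HYDROPATHY, unknown=0.0):
    """Physicochemical representation: one value per residue.

    Residues missing from the scale (gaps, ambiguous codes) are mapped
    to `unknown`, a hypothetical default chosen for this sketch.
    """
    return [scale.get(aa, unknown) for aa in seq]


if __name__ == "__main__":
    peptide = "MFVFLVLLPLVSSQ"  # toy peptide, for illustration only
    print("one-hot length: ", len(one_hot_encode(peptide)))
    print("property length:", len(property_encode(peptide)))
```

With a single property, the vector has one entry per residue instead of twenty, which is the kind of reduction that shortens the runtime of downstream steps. Vectors built this way could then be passed to existing t-SNE and DBSCAN implementations, for example `sklearn.manifold.TSNE` and `sklearn.cluster.DBSCAN` in scikit-learn.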
Publisher
Cold Spring Harbor Laboratory