Abstract
Transformer models have achieved excellent results in a variety of tasks, primarily due to the self-attention mechanism. We explore the use of self-attention for detecting coronavirus sequences in high-throughput sequencing data, offering a novel approach for accurately identifying emerging and highly variable coronavirus strains. Coronavirus and human genome data were obtained from the Genomic Data Commons (GDC) and the National Genomics Data Center (NGDC) databases. After preprocessing, a simulated high-throughput sequencing dataset of coronavirus-infected samples was constructed and divided into training, validation, and test sets. The self-attention-based model was trained on the training set and evaluated on the validation and test sets; SARS-CoV-2 genome data were collected separately to serve as an independent test set. The self-attention-based model outperformed traditional bioinformatics methods on both the test set and the independent test set, with a significant improvement in computation speed. The model can sensitively and rapidly detect coronavirus sequences in high-throughput sequencing data while exhibiting excellent generalization ability. It can accurately detect emerging and highly variable coronavirus strains, providing a new approach for identifying such viruses.
Publisher
Cold Spring Harbor Laboratory