Application of Continuous Embedding of Viral Genome Sequences and Machine Learning in the Prediction of SARS-CoV-2 Variants

Abstract

Since the beginning of the novel coronavirus pandemic, Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has spread to 224 countries, with over 430 million confirmed cases and more than 5.97 million deaths worldwide. One crucial reason the spread of the virus was difficult to stop is viral evolution over time. The emergence of new virus variants hinders the development of effective drugs and vaccines. Moreover, new variants can contribute to, for example, increased transmissibility or immune evasion. This has increased the importance of understanding genomic data related to SARS-CoV-2. In this study, we propose sarscov2vec, a new application of continuous vector space representation to genomes of novel coronavirus species. Its core methodology combines a genome feature extraction step with a supervised Machine Learning model, and the tool is designed to distinguish the five most common SARS-CoV-2 variants: Alpha, Beta, Delta, Gamma and Omicron. In this research we used 367,004 unique genome sequence records from the official virus repositories, of which 25,000 sequences were randomly selected and used to train the Natural Language Processing (NLP) algorithm. A further 36,365 samples were processed by a Machine Learning pipeline. Our results show that the final hyperparameter-tuned classification model achieved 99% accuracy on the test set. Furthermore, this study demonstrates that the continuous vector space representation of SARS-CoV-2 genomes can be decomposed into a 2D vector space and visualized as a method of explaining Machine Learning model decisions.
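
The two-stage idea in the abstract (turn each genome into a fixed-length numeric vector, then train a supervised classifier on those vectors) can be illustrated with a minimal sketch. This is not the sarscov2vec implementation: the abstract does not name the NLP algorithm or the classifier, so the sketch substitutes a plain k-mer frequency vector for the learned embedding and a nearest-centroid rule for the tuned model. The k-mer length, the function names, the toy sequences and the labels `variant_x` and `variant_y` are all hypothetical.

```python
from itertools import product

K = 3  # k-mer length; an illustrative assumption, not a value from the paper

# Fixed position for every possible k-mer over the alphabet A, C, G, T.
KMER_INDEX = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}


def kmers(seq, k=K):
    """Split a nucleotide sequence into overlapping k-mer 'words'."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def embed(seq):
    """Map a sequence to a fixed-length vector of k-mer frequencies.

    Stand-in for the learned continuous embedding described in the abstract.
    """
    vec = [0.0] * len(KMER_INDEX)
    counted = 0
    for tok in kmers(seq):
        idx = KMER_INDEX.get(tok)
        if idx is None:  # skip k-mers containing ambiguous bases such as N
            continue
        vec[idx] += 1.0
        counted += 1
    return [v / counted for v in vec] if counted else vec


def fit_centroids(seqs, labels):
    """Average the embeddings of each class (nearest-centroid 'training')."""
    groups = {}
    for seq, label in zip(seqs, labels):
        groups.setdefault(label, []).append(embed(seq))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def predict(centroids, seq):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    v = embed(seq)
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(v, centroids[label])),
    )


# Toy data: made-up sequences and hypothetical labels, not real genomes.
train_seqs = [
    "ACGTACGTACGTACGT",
    "ACGTACGAACGTACGT",
    "TTTTGGGGTTTTGGGG",
    "TTTTGGGCTTTTGGGG",
]
train_labels = ["variant_x", "variant_x", "variant_y", "variant_y"]

centroids = fit_centroids(train_seqs, train_labels)
prediction = predict(centroids, "ACGTACGTACGAACGT")
print(prediction)
```

On real data, the frequency vector would be replaced by the trained embedding and the centroid rule by a tuned classifier evaluated on a held-out test set, as the abstract describes.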

Cite

Tynecki, P., & Lubocki, M. (2022). Application of Continuous Embedding of Viral Genome Sequences and Machine Learning in the Prediction of SARS-CoV-2 Variants. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13293 LNCS, pp. 284–298). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-10539-5_21
