Since the beginning of the novel coronavirus pandemic, Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has spread to 224 countries, with over 430 million confirmed cases and more than 5.97 million deaths worldwide. One crucial reason why the spread of the virus was difficult to stop was its evolution over time. The emergence of new virus variants hinders the development of effective drugs and vaccines. Moreover, new variants can contribute to increased transmissibility or immune evasion. This has increased the importance of understanding genomic data related to SARS-CoV-2. In this study, we propose sarscov2vec, a new application of continuous vector space representation to the genomes of a novel coronavirus species. Combining a genome feature extraction step with a supervised Machine Learning model, the tool is designed to distinguish the five most common SARS-CoV-2 variants: Alpha, Beta, Delta, Gamma and Omicron. We used 367,004 unique genome sequence records from the official virus repositories, of which 25,000 sequences were randomly selected and used to train the Natural Language Processing (NLP) algorithm. A further 36,365 samples were processed by a Machine Learning pipeline. Our results show that the final hyperparameter-tuned classification model achieved 99% accuracy on the test set. Furthermore, this study demonstrates that the continuous vector space representation of SARS-CoV-2 genomes can be decomposed into a 2D vector space and visualized as a method of explaining the Machine Learning model's decisions.
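To make the idea of a continuous vector space representation of a genome concrete, the sketch below turns a nucleotide string into a fixed-length vector. It is an illustration only and not the sarscov2vec implementation: splitting into overlapping k-mers, the choice of k = 3, the 8-dimensional vectors, and the helper names `kmers`, `toy_kmer_vector` and `genome_vector` are all assumptions made here. In particular, the hash-seeded vectors are a stand-in for embeddings that an NLP algorithm would learn from training sequences.

```python
import hashlib
import random


def kmers(sequence, k=3):
    """Split a nucleotide string into overlapping k-mers (the "words")."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def toy_kmer_vector(kmer, dim=8):
    """Return a deterministic pseudo-random vector for one k-mer.

    ASSUMPTION: this replaces a learned embedding lookup. A trained model
    would place similar k-mers close together; this stand-in does not.
    """
    digest = hashlib.sha256(kmer.encode("ascii")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def genome_vector(sequence, k=3, dim=8):
    """Average the k-mer vectors into one fixed-length vector per genome."""
    tokens = kmers(sequence, k)
    if not tokens:
        raise ValueError("sequence is shorter than k")
    total = [0.0] * dim
    for token in tokens:
        for i, value in enumerate(toy_kmer_vector(token, dim)):
            total[i] += value
    return [value / len(tokens) for value in total]


# Made-up fragment for illustration, not a real SARS-CoV-2 record.
fragment = "ATGTTTGTTTTTCTTGTT"
vector = genome_vector(fragment)
```

Vectors of this kind, one per genome, are what a supervised classifier would take as input features, and what a dimensionality reduction step would project to 2D for visualization.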
Tynecki, P., & Lubocki, M. (2022). Application of Continuous Embedding of Viral Genome Sequences and Machine Learning in the Prediction of SARS-CoV-2 Variants. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13293 LNCS, pp. 284–298). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-10539-5_21