Abstract
Background
Analysis of heterogeneous populations such as viral quasispecies is one of the most challenging problems in bioinformatics. Although machine learning models are becoming widely employed for the analysis of sequence data from such populations, their straightforward application is impeded by several challenges: technological limitations and biases, the difficulty of selecting relevant features, and the need to compare genomic datasets of different sizes and structures.
Results
We propose a novel preprocessing approach that transforms irregular genomic data into normalized image data. This representation allows the problems of classifying and comparing heterogeneous populations to be restated as image classification problems, which can be solved using a variety of available machine learning tools. We then apply the proposed approach to two important problems in molecular epidemiology: inference of viral infection stage and detection of viral transmission clusters using next-generation sequencing data. The infection staging method was applied to HCV HVR1 samples collected from 108 recently and 257 chronically infected individuals. The SVM-based image classification approach achieved more than 95% accuracy for both recently and chronically HCV-infected individuals. Clustering was performed on data collected from 33 epidemiologically curated outbreaks, yielding more than 97% accuracy.
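The abstract does not specify how the images are constructed, so the following is only a minimal sketch of the general idea, under an assumed encoding: each population becomes a 4 x L matrix of per-position nucleotide frequencies. The function names `population_to_image` and `image_distance`, the encoding itself, and the sample data are all hypothetical and are not taken from the paper. The sketch shows how populations with different numbers of variants and reads can be mapped to arrays of identical shape.

```python
# Illustrative only: this encoding is an assumption, not the paper's method.
ALPHABET = "ACGT"


def population_to_image(variants):
    """Map {aligned_sequence: read_count} to a 4 x L matrix of
    per-position nucleotide frequencies (rows follow ALPHABET)."""
    length = len(next(iter(variants)))
    if any(len(seq) != length for seq in variants):
        raise ValueError("sequences must be aligned to equal length")
    total = sum(variants.values())
    image = [[0.0] * length for _ in ALPHABET]
    for seq, count in variants.items():
        for pos, base in enumerate(seq):
            if base in ALPHABET:  # gaps and ambiguous bases are skipped
                image[ALPHABET.index(base)][pos] += count / total
    return image


def image_distance(a, b):
    """Euclidean distance between two images of identical shape."""
    return sum(
        (x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)
    ) ** 0.5


# Two hypothetical populations with different numbers of variants and reads.
small = {"ACGT": 10, "ACGA": 10}
large = {"ACGT": 500, "ACGA": 300, "ACGC": 200}

img_small = population_to_image(small)
img_large = population_to_image(large)
```

Because both populations yield arrays of the same shape, such arrays could in principle be flattened and passed to a standard classifier such as an SVM, or compared pairwise for clustering, which are the kinds of downstream tasks the abstract reports.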
Conclusions
The sequence image normalization method allows for a robust conversion of genomic data into numerical data and overcomes several issues associated with applying machine learning methods to viral populations. Image data also help in the visualization of genomic data. Experimental results demonstrate that the proposed method can be successfully applied to different problems in molecular epidemiology and surveillance of viral diseases. Simple binary classifiers and clustering techniques applied to the image data are as accurate as, or more accurate than, other models.
Funder
National Institutes of Health
National Science Foundation
Publisher
Springer Science and Business Media LLC
Cited by
3 articles.