Abstract
Currently, the standard practice for assembling next-generation sequencing (NGS) reads of viral genomes is to summarize thousands of individual short reads into a single consensus sequence, thus obscuring intra-host diversity information that is useful for molecular phylodynamic inference. It is hypothesized that a few viral strains may dominate the intra-host genetic diversity, with a variety of lower-frequency strains comprising the rest of the population. Several software tools currently exist to convert NGS sequence variants into haplotypes. However, previous studies suggest that current approaches to haplotype reconstruction greatly underestimate intra-host diversity. Here, we tested twelve NGS haplotype reconstruction methods using viral populations simulated under realistic evolutionary dynamics. Parameters for the simulated data spanned known diversity estimates for fast-evolving viruses (e.g., HIV-1), both to test the limits of the haplotype reconstruction methods and to ensure coverage of predicted intra-host viral diversity levels. Using those parameters, we simulated HIV-1 viral populations of 216 to 1,185 haplotypes per host at a frequency <7%. All twelve investigated haplotype callers showed variable performance and produced drastically different results that were mainly driven by differences in mutation rate and, to a lesser extent, in effective population size. Most methods were able to accurately reconstruct haplotypes when genetic diversity was low. However, under higher levels of diversity (e.g., those seen in intra-host HIV-1 infections), haplotype reconstruction accuracy was highly variable and, on average, poor. At high diversity levels, most tools severely underestimated the true number of haplotypes, while a few greatly overestimated it. PredictHaplo and PEHaplo produced estimates close to the true number of haplotypes, although their haplotype reconstruction accuracy was worse than that of the other ten tools.
We conclude that haplotype reconstruction from NGS short reads is unreliable due to the high genetic diversity of fast-evolving viruses. Local haplotype reconstruction of longer reads to phase variants may provide a more reliable estimation of viral variants within a population.
Highlights
Haplotype callers for NGS data vary greatly in their performance.
Haplotype caller performance is mainly determined by mutation rate.
Haplotype caller performance is less sensitive to effective population size.
Most haplotype callers perform well with low diversity and poorly with high diversity.
PredictHaplo performs best if genetic diversity is in the range of HIV diversity.
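The abstract separates two questions: how close a caller's haplotype count is to the true count, and how accurately the called haplotypes match the true sequences. The toy sketch below illustrates why these can diverge. It assumes exact-match precision and recall as the accuracy measure, which is an illustrative choice and not necessarily the metric used in the study. The function name and the four-base sequences are hypothetical.

```python
def haplotype_precision_recall(true_haps, called_haps):
    """Exact-match precision and recall of called haplotypes.

    Assumption: a called haplotype counts as correct only if it is
    identical to one of the true haplotypes.
    """
    true_set = set(true_haps)
    called_set = set(called_haps)
    matched = true_set & called_set
    precision = len(matched) / len(called_set) if called_set else 0.0
    recall = len(matched) / len(true_set) if true_set else 0.0
    return precision, recall


# Hypothetical toy population of four true haplotypes.
true_haps = ["ACGT", "ACGA", "TCGT", "ACCT"]

# Caller A underestimates the count, but everything it calls is correct.
called_a = ["ACGT", "ACGA"]

# Caller B reports the right number of haplotypes, but only one is correct.
called_b = ["ACGT", "AGGA", "TCTT", "ACCA"]

for name, called in [("A", called_a), ("B", called_b)]:
    p, r = haplotype_precision_recall(true_haps, called)
    print(f"caller {name}: n_called={len(called)} precision={p:.2f} recall={r:.2f}")
```

In this toy case, caller B matches the true haplotype count while recovering fewer true sequences than caller A, which is the kind of gap between count and accuracy that the abstract reports for some tools.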
Publisher
Cold Spring Harbor Laboratory