Improved detection of low-frequency within-host variants from deep sequencing: A case study with human papillomavirus

Author:

Mishra Sambit K (1,2), Nelson Chase W (1), Zhu Bin (1), Pinheiro Maisa (1), Lee Hyo Jung (1,2), Dean Michael (1), Burdett Laurie (1,2), Yeager Meredith (1,2), Mirabello Lisa (1)

Affiliation:

1. Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, Rockville, MD 20850, USA

2. Cancer Genomics Research Laboratory, Leidos Biomedical Research, Inc., Frederick National Laboratory for Cancer Research, P.O. Box B, Bldg. 430, Frederick, MD 21702, USA

Abstract

High-coverage sequencing allows the study of variants occurring at low frequencies within samples, but is susceptible to false positives caused by sequencing error. Ion Torrent has a very low single nucleotide variant (SNV) error rate and has been employed for the majority of human papillomavirus (HPV) whole genome sequences. However, benchmarking of intrahost SNVs (iSNVs) has been challenging, partly due to limitations imposed by the HPV life cycle. We address this problem by deep sequencing three replicates for each of 31 samples of HPV type 18 (HPV18). Errors, defined as iSNVs observed in only one of three replicates, are dominated by C→T (G→A) changes, independently of trinucleotide context. True iSNVs, defined as those observed in all three replicates, instead show a more diverse SNV type distribution, with particularly elevated C→T rates in CCG context (CCG→CTG; CGG→CAG) and C→A rates in ACG context (ACG→AAG; CGT→CTT). Characterization of true iSNVs allowed us to develop two methods for detecting true variants: (1) VCFgenie, a dynamic binomial filtering tool that uses each variant’s allele count and coverage instead of fixed frequency cut-offs; and (2) a machine learning binary classifier that trains eXtreme Gradient Boosting models on variant features such as quality and trinucleotide context. Each approach outperforms fixed-cut-off filtering of iSNVs, and performance is enhanced when both are used together. Our results provide improved methods for identifying true iSNVs in within-host applications across sequencing platforms, using HPV18 as a case study.
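To illustrate why a binomial test on allele count and coverage can behave differently from a fixed frequency cut-off, the following is a minimal sketch and not VCFgenie's actual implementation. The function names, the per-site error rate of 0.001, the significance level of 0.01, and the 1% frequency cut-off are all assumptions chosen for illustration. The idea shown is that a variant is kept only if its observed allele count would be unlikely under sequencing error alone, given the coverage at that site.

```python
from math import exp, lgamma, log, log1p


def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Terms are computed in log space so large coverages do not overflow.
    """
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log1p(-p)
        )
        total += exp(log_pmf)
    return min(1.0, total)


def passes_fixed_cutoff(alt_count, coverage, min_freq=0.01):
    """Fixed-frequency filter: keep variants at or above min_freq (assumed 1%)."""
    return alt_count / coverage >= min_freq


def passes_binomial_filter(alt_count, coverage, error_rate=0.001, alpha=0.01):
    """Dynamic filter: keep variants whose allele count is improbable under
    error alone. error_rate and alpha are illustrative assumptions."""
    return binom_upper_tail(alt_count, coverage, error_rate) < alpha


# Hypothetical sites given as (alt allele count, coverage).
for alt, cov in [(1, 100), (30, 10000)]:
    print(
        f"alt={alt} cov={cov} freq={alt / cov:.4f} "
        f"fixed={passes_fixed_cutoff(alt, cov)} "
        f"binomial={passes_binomial_filter(alt, cov)}"
    )
```

Under these assumed parameters, a single alternate read at 100x coverage reaches the 1% cut-off but is not distinguishable from error, whereas 30 alternate reads at 10,000x coverage fall below 1% yet are very unlikely to arise from error alone. A real tool would also need to handle multiple testing across sites and platform-specific error rates.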

Funder

Division of Cancer Epidemiology and Genetics, National Cancer Institute

Publisher

Oxford University Press (OUP)

