Fully automated sequence alignment methods are comparable to, and much faster than, traditional methods in large data sets: an example with hepatitis B virus

Author:

Catanach Therese A. (1, 2, 3), Sweet Andrew D. (2, 4), Nguyen Nam-phuong D. (5), Peery Rhiannon M. (6, 7), Debevec Andrew H. (8), Thomer Andrea K. (9), Owings Amanda C. (10), Boyd Bret M. (2, 11), Katz Aron D. (2, 12), Soto-Adames Felipe N. (13, 14), Allen Julie M. (15)

Affiliation:

1. Ornithology Department, Academy of Natural Sciences of Drexel University, Philadelphia, PA, United States of America

2. Illinois Natural History Survey, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America

3. Department of Wildlife and Fisheries Sciences, Texas A&M University, College Station, TX, United States of America

4. Department of Entomology, Purdue University, West Lafayette, IN, United States of America

5. Computer Science and Engineering, University of California, San Diego, La Jolla, CA, United States of America

6. Department of Biology, University of Alberta, Edmonton, Alberta, Canada

7. Department of Plant Biology, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America

8. School of Integrative Biology, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America

9. School of Information, University of Michigan—Ann Arbor, Ann Arbor, MI, United States of America

10. Program in Ecology, Evolution, and Conservation Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States of America

11. Department of Entomology, University of Georgia, Athens, GA, United States of America

12. Department of Entomology, University of Illinois at Urbana-Champaign, Champaign, IL, United States of America

13. Florida State Collection of Arthropods, Florida Department of Agriculture and Consumer Services, Gainesville, FL, United States of America

14. Department of Entomology and Nematology, University of Florida, Gainesville, FL, United States of America

15. Biology Department, University of Nevada, Reno, Reno, NV, United States of America

Abstract

Aligning sequences for phylogenetic analysis (multiple sequence alignment; MSA) is an important but increasingly computationally expensive step, given the recent surge in DNA sequence data. Much of this sequence data is publicly available but can be extremely fragmentary (i.e., a combination of full genomes and genomic fragments), which can compound the computational issues related to MSA. Traditionally, alignments are produced with automated algorithms and then checked and/or corrected “by eye” prior to phylogenetic inference. However, this manual curation is inefficient at the data scales required of modern phylogenetics and results in alignments that are not reproducible. Recently, methods have been developed for fully automating alignments of large data sets, but it is unclear whether these methods produce alignments that result in compatible phylogenies when compared to more traditional alignment approaches that combine automated and manual methods. Here we use approximately 33,000 publicly available sequences from the hepatitis B virus (HBV), a globally distributed and rapidly evolving virus, to compare different alignment approaches. Using one data set composed exclusively of whole genomes and a second that also included sequence fragments, we compared three MSA methods: (1) a purely automated approach using traditional software, (2) an automated approach followed by manual “by eye” editing, and (3) more recent fully automated approaches. To understand how these methods affect phylogenetic results, we compared the tree topologies resulting from these different alignment methods using multiple metrics. We further determined whether the monophyly of existing HBV genotypes was supported in phylogenies estimated from each alignment type and under different statistical support thresholds. Traditional and fully automated alignments produced similar HBV phylogenies. Although there was variability between branch support thresholds, allowing lower support thresholds tended to result in more differences among trees. Therefore, differences between the trees could be best explained by phylogenetic uncertainty unrelated to the MSA method used. Nevertheless, automated alignment approaches did not require human intervention and were therefore considerably less time-intensive than traditional approaches. Because of this, we conclude that fully automated algorithms for MSA are fully compatible with older methods, even in data sets that are extremely difficult to align. Additionally, we found that most HBV diagnostic genotypes did not correspond to evolutionarily sound groups, regardless of alignment type and support threshold. This suggests that there may be errors in genotype classification in the database or that HBV genotypes may need revision.
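
The two kinds of comparison described in the abstract (topological differences between trees, and monophyly of a labelled group) can be illustrated with a minimal sketch. The abstract does not name the metrics used, so the clade-based, Robinson-Foulds-style distance below is an assumption chosen for illustration, not the authors' method. The nested-tuple tree encoding, the function names, and the taxon labels (A1, A2, B1, C1, C2) are all hypothetical.

```python
from typing import FrozenSet, Set, Tuple, Union

# Hypothetical encoding: a rooted tree is a leaf name or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def clades(tree: Tree) -> Set[FrozenSet[str]]:
    """Return the leaf set below every internal node of a rooted tree."""
    found: Set[FrozenSet[str]] = set()

    def leaves(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found


def clade_distance(a: Tree, b: Tree) -> int:
    """Count clades present in exactly one of the two trees
    (a rooted Robinson-Foulds-style distance; an assumed metric)."""
    return len(clades(a) ^ clades(b))


def is_monophyletic(tree: Tree, group: Set[str]) -> bool:
    """True if some internal node has exactly `group` as its leaf set.
    Intended for groups with at least two members."""
    return frozenset(group) in clades(tree)


# Two toy trees standing in for phylogenies from two alignment methods.
tree_a: Tree = ((("A1", "A2"), "B1"), ("C1", "C2"))
tree_b: Tree = ((("A1", "B1"), "A2"), ("C1", "C2"))

print(clade_distance(tree_a, tree_b))            # 2
print(is_monophyletic(tree_a, {"A1", "A2"}))     # True
print(is_monophyletic(tree_b, {"A1", "A2"}))     # False
```

In this toy case the two trees differ in one clade each, and the hypothetical genotype {A1, A2} is monophyletic in only one of them. Collapsing branches below a support threshold before computing the clade sets would be one way to examine the effect of thresholds, but that step is not shown here.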

Funder

National Science Foundation

Extreme Science and Engineering Discovery Environment

Publisher

PeerJ

Subject

General Agricultural and Biological Sciences, General Biochemistry, Genetics and Molecular Biology, General Medicine, General Neuroscience

