Semi-automated assembly of high-quality diploid human reference genomes
Authors:
Jarvis Erich D., Formenti Giulio, Rhie Arang, Guarracino Andrea, Yang Chentao, Wood Jonathan, Tracey Alan, Thibaud-Nissen Francoise, Vollger Mitchell R., Porubsky David, Cheng Haoyu, Asri Mobin, Logsdon Glennis A., Carnevali Paolo, Chaisson Mark J. P., Chin Chen-Shan, Cody Sarah, Collins Joanna, Ebert Peter, Escalona Merly, Fedrigo Olivier, Fulton Robert S., Fulton Lucinda L., Garg Shilpa, Gerton Jennifer L., Ghurye Jay, Granat Anastasiya, Green Richard E., Harvey William, Hasenfeld Patrick, Hastie Alex, Haukness Marina, Jaeger Erich B., Jain Miten, Kirsche Melanie, Kolmogorov Mikhail, Korbel Jan O., Koren Sergey, Korlach Jonas, Lee Joyce, Li Daofeng, Lindsay Tina, Lucas Julian, Luo Feng, Marschall Tobias, Mitchell Matthew W., McDaniel Jennifer, Nie Fan, Olsen Hugh E., Olson Nathan D., Pesout Trevor, Potapova Tamara, Puiu Daniela, Regier Allison, Ruan Jue, Salzberg Steven L., Sanders Ashley D., Schatz Michael C., Schmitt Anthony, Schneider Valerie A., Selvaraj Siddarth, Shafin Kishwar, Shumate Alaina, Stitziel Nathan O., Stober Catherine, Torrance James, Wagner Justin, Wang Jianxin, Wenger Aaron, Xiao Chuanle, Zimin Aleksey V., Zhang Guojie, Wang Ting, Li Heng, Garrison Erik, Haussler David, Hall Ira, Zook Justin M., Eichler Evan E., Phillippy Adam M., Paten Benedict, Howe Kerstin, Miga Karen H.
Abstract
The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society [1,2]. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals [3,4]. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome [5]. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity [6]. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yields the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent–child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.
Publisher
Springer Science and Business Media LLC
Subject
Multidisciplinary
References (90 articles)
1. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
2. Schneider, V. A. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849–864 (2017).
3. Sherman, R. M. & Salzberg, S. L. Pan-genomics in the human genome era. Nat. Rev. Genet. 21, 243–254 (2020).
4. Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21, 597–614 (2020).
5. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
Cited by
74 articles.