Author:
Hijikata Atsushi, Suyama Mikita, Kikugawa Shingo, Matoba Ryo, Naruto Takuya, Enomoto Yumi, Kurosawa Kenji, Harada Naoki, Yanagi Kumiko, Kaname Tadashi, Miyako Keisuke, Takazawa Masaki, Sasai Hideo, Hosokawa Junichi, Itoga Sakae, Yamaguchi Tomomi, Kosho Tomoki, Matsubara Keiko, Kuroki Yoko, Fukami Maki, Adachi Kaori, Nanba Eiji, Tsuchida Naomi, Uchiyama Yuri, Matsumoto Naomichi, Nishimura Kunihiro, Ohara Osamu
Abstract
Next-generation DNA sequencing (NGS) in short-read mode has recently been used for genetic testing in various clinical settings. Data accuracy is crucial in such settings, and several reports on quality control of NGS data, focusing mostly on establishing NGS sequence read accuracy, have been published thus far. Variant calling is another critical source of NGS errors that remains mostly unexplored despite its established significance. In this study, we used a machine-learning-based method to establish an exome-wide benchmark of difficult-to-sequence regions using 10 genome sequence features, on the basis of real-world NGS data accumulated in the Genome Aggregation Database (gnomAD) for the human reference genome sequence (GRCh38/hg38). We used the resulting metric, designated the “UNMET score,” along with other lines of structural information on the human genome, to identify genomic regions that are difficult to sequence using conventional NGS. Thus, the UNMET score could provide appropriate caveats regarding potential sequencing errors in protein-coding exons of the human reference genome sequence GRCh38/hg38 in clinical sequencing.
Publisher
Cold Spring Harbor Laboratory