Author:
Ikemoto Ko, Fujimoto Hinano, Fujimoto Akihiro
Abstract
Background: Long-read sequencing technologies have the potential to overcome the limitations of short reads and provide a comprehensive picture of the human genome. However, it remains hard to characterize repetitive sequences by reconstructing genomic structures at high resolution solely from long reads. Here, we developed a localized assembly method (LoMA) that constructs highly accurate consensus sequences (CSs) from long reads.

Methods: We first developed LoMA by combining minimap2, MAFFT, and our own algorithm, which classifies diploid haplotypes based on structural variants and constructs CSs. Using this tool, we analyzed two human samples (NA18943 and NA19240) sequenced with the Oxford Nanopore sequencer. We defined target regions in each genome based on mapping patterns and then constructed a high-quality catalog of human insertions solely from the long-read data.

Results: The assessment of LoMA showed that the CSs were highly accurate (error rate < 0.3%) compared with the raw data (error rate > 8%) and that LoMA was superior to the previous study. The genome-wide analysis of NA18943 and NA19240 identified 5,516 and 6,542 insertions (≥ 100 bp), respectively. Most insertions (∼80%) were derived from tandem repeats and transposable elements. We also detected processed pseudogenes, insertions in transposable elements, and long insertions (> 10 kbp). Further, our analysis suggested that short tandem duplications were associated with gene expression and transposons.

Conclusions: Our analysis showed that LoMA constructs high-quality sequences from long reads with substantial errors. This study revealed the true structures of insertions with high accuracy and inferred mechanisms for the insertions. Our approach will contribute to future human genome studies. LoMA is available at our GitHub page: https://github.com/kolikem/loma.
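
To illustrate the general idea of building a consensus sequence from error-prone reads that have already been multiply aligned (for example, by a tool such as MAFFT), the following minimal sketch takes a column-wise majority vote. It is an assumption-laden illustration only and is not LoMA's actual algorithm, which additionally classifies haplotypes; the function name `majority_consensus` and the toy reads are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads, gap="-"):
    """Return a consensus by column-wise majority vote.

    aligned_reads: equal-length strings from a multiple sequence alignment,
    with `gap` marking alignment gaps. Columns whose most common symbol is
    a gap are dropped from the consensus.
    NOTE: illustrative sketch, not the LoMA implementation.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # Pick the most frequent symbol in this alignment column.
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)


# Hypothetical toy alignment: one read has a spurious insertion (column 5)
# and another has a spurious deletion (column 3).
toy_reads = ["ACGT-A", "ACGTTA", "AC-T-A", "ACGT-A"]
toy_consensus = majority_consensus(toy_reads)
```

Because independent sequencing errors rarely fall in the same column across reads, a vote of this kind can yield a consensus with far fewer errors than any single raw read, which is the intuition behind the accuracy gain reported in the abstract.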
Publisher
Cold Spring Harbor Laboratory
Cited by
1 article.