Author:
Vasimuddin Md, Misra Sanchit, Aluru Srinivas
Abstract
Rapid advances in next-generation sequencing (NGS) technologies are improving the throughput and cost of sequencing at a rate significantly faster than Moore’s law. This necessitates an equivalent rate of acceleration of NGS secondary analysis, which assembles reads into full genomes and identifies variants between genomes. Conventional hardware improvements can at best accelerate this analysis at the pace of Moore’s law. Moreover, a majority of the software tools used for secondary analysis do not use the hardware efficiently. Therefore, we need hardware that is designed with the computational requirements of secondary analysis in mind, along with software tools that use it efficiently. Here, we take the first step towards these goals by identifying the computational requirements of secondary analysis. We surveyed dozens of software tools from all three major problems in secondary analysis (sequence mapping, de novo assembly, and variant calling) and selected seven popular tools and a workflow for in-depth analysis. We performed runtime profiling of the tools using multiple real datasets and found that the majority of the runtime is dominated by just four building blocks: Smith-Waterman alignment, FM-index based sequence search, de Bruijn graph construction and traversal, and the pairwise hidden Markov model algorithm. These building blocks cover 80.5%-98.2%, 63.9%-99.4% and 72%-93% of the runtime, respectively, for sequence mapping, de novo assembly, and variant calling. The key outcome of this result is that major performance improvements for NGS secondary analysis can be achieved by targeting software and hardware optimizations to just these building blocks.
Publisher
Cold Spring Harbor Laboratory