Author:
Heldenbrand Jacob R., Baheti Saurabh, Bockol Matthew A., Drucker Travis M., Hart Steven N., Hudson Matthew E., Iyer Ravishankar K., Kalmbach Michael T., Klee Eric W., Wieben Eric D., Wiepert Mathieu, Wildman Derek E., Mainzer Liudmila S.
Abstract
The Genome Analysis Toolkit (GATK) continues to be the standard practice in genomic variant calling in both research and the clinic. The toolkit has recently been evolving rapidly. Significant computational performance improvements were introduced in GATK3.8 through collaboration with Intel in 2017. The first release of GATK4 in early 2018 revealed significant rewrites in the code base, as a stepping stone toward a Spark implementation. Because the software continues to be a moving target for optimal deployment in highly productive environments, we present a detailed analysis of these improvements to help the community stay abreast of changes in performance. We re-evaluated the options previously identified as advantageous, such as threading, parallel garbage collection, I/O options and data-level parallelization. Based on our results, we consider the performance and cost trade-offs of using GATK3.8 and GATK4 for different types of analyses.
Publisher
Cold Spring Harbor Laboratory