Abstract
Abstract“Fast is fine, but accuracy is final.”-- Wyatt EarpBackgroundThe extreme diversity of newly sequenced organisms and considerable scale of modern sequence databases lead to a tension between competing needs for sensitivity and speed in sequence annotation, with multiple tools displacing the venerable BLAST software suite on one axis or another. Alignment based on profile hidden Markov models (pHMMs) has demonstrated state of art sensitivity, while recent algorithmic advances have resulted in hyper-fast annotation tools with sensitivity close to that of BLAST.ResultsHere, we introduce a new tool that bridges the gap between advances in these two directions, reaching speeds comparable to fast annotation methods such as MMseqs2 while retaining most of the sensitivity offered by pHMMs. The tool, callednail, implements a heuristic approximation of the pHMM Forward/Backward (FB) algorithm by identifying a sparse subset of the cells in the FB dynamic programming matrix that contains most of the probability mass. The method produces an accurate approximation of pHMM scores and E-values with high speed and small memory requirements. On a protein benchmark,nailrecovers the majority of recall difference between MMseqs2 and HMMER, with run time ∼26x faster than HMMER3 (only ∼2.4x slower than MMseqs2’s sensitive variant).nailis released under the open BSD-3-clause license and is available for download athttps://github.com/TravisWheelerLab/nail.
Publisher
Cold Spring Harbor Laboratory