Abstract
Summary
Numerical or vector representations of DNA sequences have been used to identify sequence characteristics and patterns that are not evident in their character (A, C, G, T) representations. These transformations often reveal a mathematical structure in the sequences that can be captured efficiently with established mathematical methods. One such transformation, the 2-bit format, represents each nucleotide with two bits instead of eight, allowing efficient storage of genomic data. Here we describe a mathematical property of the 2-bit representation of tandemly repeated DNA sequences. Our tool, DiviSSR (pronounced "divisor"), leverages this property and the arithmetic that follows from it to achieve ultrafast and accurate identification of tandem repeats. DiviSSR can process the entire human genome in ~30 s, and short sequence reads at a rate of >1 million reads/s, on a single CPU thread. Our work also highlights the potential of simple mathematical properties of DNA sequences to enable faster algorithms in genomics.
Availability
DiviSSR can be installed directly with Python's pip. The source code and documentation of DiviSSR are available at https://github.com/avvaruakshay/divissr.git.
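The abstract does not spell out the property it refers to. As an illustration only, the sketch below shows one divisibility property that holds for any 2-bit encoding: a window that is a perfect tandem repeat of a motif, read as an integer, is a multiple of a fixed divisor that depends only on the window and motif lengths. The nucleotide-to-bits mapping and the function names (`encode_2bit`, `repeat_divisor`, `is_perfect_repeat`) are assumptions made for this example and are not taken from the DiviSSR source code; DiviSSR's actual formulation may differ.

```python
# Assumed mapping for illustration; any fixed 2-bit code gives the same property.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_2bit(seq: str) -> int:
    """Pack a DNA string into one integer, two bits per nucleotide."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def repeat_divisor(window_len: int, motif_len: int) -> int:
    """Return (4^n - 1) / (4^m - 1) for window length n and motif length m.

    In base 4 this number has a 1 at the last position of every motif copy
    and 0 elsewhere, so multiplying it by a motif value tiles that motif
    across the window.
    """
    if motif_len <= 0 or window_len % motif_len != 0:
        raise ValueError("window length must be a positive multiple of motif length")
    return (4 ** window_len - 1) // (4 ** motif_len - 1)


def is_perfect_repeat(window: str, motif_len: int) -> bool:
    """Check whether `window` consists of exact copies of its first motif."""
    return encode_2bit(window) % repeat_divisor(len(window), motif_len) == 0


if __name__ == "__main__":
    print(is_perfect_repeat("ACGACGACGACG", 3))  # perfect ACG repeat
    print(is_perfect_repeat("ACGACGACGACT", 3))  # last base differs
```

Under these assumptions a single modulo operation replaces a base-by-base comparison of the window against its motif, which suggests why arithmetic on the 2-bit form can be fast.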
Publisher
Cold Spring Harbor Laboratory