Improving bacterial genome assembly using a test of strand orientation
Author:
Greenberg Grant (1),
Shomorony Ilan (1)
Affiliation:
1. Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
Abstract
Summary
The complexity of genome assembly is due in large part to the presence of repeats. In particular, large reverse-complemented repeats can lead to incorrect inversions of large segments of the genome. To detect and correct such inversions in finished bacterial genomes, we propose a statistical test based on tetranucleotide frequency (TNF), which determines whether two segments from the same genome have the same or opposite orientation. In most cases, the test neatly partitions the genome into two segments of roughly equal length with seemingly opposite orientations. These correspond to the segments between the DNA replication origin and terminus, which were previously known to have distinct nucleotide compositions. We show that, in several cases where this balanced partition is not observed, the test identifies a potential inverted misassembly, which is validated by the presence of a reverse-complemented repeat at the boundaries of the inversion. After inverting the sequence between the two repeat copies, the balance of the misassembled genome is restored. Our method identifies 31 potential misassemblies in the NCBI database, several of which are further supported by a reassembly of the read data.
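To illustrate the idea of comparing orientations through TNF, the following is a minimal Python sketch and not the statistical test proposed in the paper. It assumes a simple stand-in rule: compute the tetranucleotide frequency vector of a reference segment, then check whether a second segment or its reverse complement lies closer to it in Euclidean distance. The function names (`tnf`, `reverse_complement`, `same_orientation`) and the choice of distance are assumptions made for this example only.

```python
from collections import Counter
from itertools import product
import math

# All 256 tetranucleotides in a fixed order (defines the vector layout).
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def tnf(seq):
    """Normalized tetranucleotide frequencies of one strand.

    Windows containing characters other than A/C/G/T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        raise ValueError("sequence contains no valid tetranucleotides")
    return [counts[k] / total for k in KMERS]


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def same_orientation(seg_a, seg_b):
    """Hypothetical decision rule (an assumption, not the paper's test).

    Returns True when seg_b's TNF is at least as close to seg_a's TNF
    as the TNF of seg_b's reverse complement is.
    """
    ref = tnf(seg_a)
    d_forward = euclidean(ref, tnf(seg_b))
    d_reverse = euclidean(ref, tnf(reverse_complement(seg_b)))
    return d_forward <= d_reverse
```

In this sketch, applying `same_orientation` to fixed-length windows along a genome against one reference window would give a rough per-window orientation call. A proper test would also need to quantify uncertainty, which this example does not attempt.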
Availability and implementation
A GitHub repository is available at https://github.com/gcgreenberg/Oriented-TNF.git.
Supplementary information
Supplementary data are available at Bioinformatics online.
Funder
Greenberg and Ilan Shomorony
National Science Foundation CAREER Award
Publisher
Oxford University Press (OUP)
Subject
Computational Mathematics,Computational Theory and Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Statistics and Probability