Abstract
Motivation
The quality of reference genomes critically affects analyses of next-generation sequencing experiments. During the construction of a reference genome, contigs are organized into their underlying chromosomes in the scaffolding step. Historically, the quality of scaffolding software has been difficult to evaluate in a systematic and quantitative fashion. To this end, we identified genomic edit distance as a compelling method for evaluating the quality of a scaffold.

Results
We present Edison, a Python implementation of the Double Cut and Join (DCJ) edit distance algorithm. Edison calculates the overall accuracy of a given scaffold using a reference genome and also provides scores that characterize different aspects of scaffolding accuracy, including grouping, ordering, and orientation. All metrics are calculated on a length-weighted basis, which rewards the correct placement of longer contigs over shorter ones. By creating 1000 random assemblies of the S. cerevisiae genome, we show that our scaffolding accuracy provides a more reliable metric than the commonly used N50. Edison can be used to benchmark new scaffolding algorithms, providing insights into the strengths and weaknesses of each approach.

Availability and Implementation
Edison is available under an MIT license at https://github.com/Noble-Lab/edison.
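
To make the contrast between N50 and a length-weighted score concrete, the following is a minimal sketch. It is not Edison's code or API: the function names `n50` and `length_weighted_fraction` and the `(length, is_correct)` input format are assumptions made for illustration only. N50 is computed by its standard definition, and the length-weighted score is shown as the fraction of total contig length that is correctly placed.

```python
def n50(lengths):
    """Standard N50: the largest length L such that sequences of
    length >= L together cover at least half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def length_weighted_fraction(contigs):
    """Hypothetical helper: `contigs` is a list of (length, is_correct)
    pairs; returns the share of total length that is correctly placed."""
    total = sum(length for length, _ in contigs)
    if total == 0:
        return 0.0
    correct = sum(length for length, ok in contigs if ok)
    return correct / total


if __name__ == "__main__":
    # One long contig placed correctly, two short ones misplaced.
    contigs = [(900, True), (50, False), (50, False)]
    print(n50([length for length, _ in contigs]))
    print(length_weighted_fraction(contigs))
```

In this toy input only one of three contigs is correct, yet the length-weighted score is high because that contig carries most of the sequence. N50 depends only on the lengths, so it cannot tell a correct placement from an incorrect one.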
Publisher
Cold Spring Harbor Laboratory