Abstract
Structural variants (SVs) are genomic polymorphisms defined by their length (>50 bp). The usual types of SVs are deletions, insertions, translocations, inversions, and copy number variants. SV detection and genotyping are fundamental given the role of SVs in phenomena such as phenotypic variation and evolutionary events. Consequently, several methods to identify SVs from long-read sequencing data have recently been developed. We present an accurate and efficient algorithm to predict SVs from long-read sequencing data. The algorithm starts by collecting evidence (signatures) of SVs from read alignments. Signatures are then clustered based on a Euclidean graph with coordinates calculated from their lengths and genomic positions. Clustering is performed by the DBSCAN algorithm, which has the advantage of delimiting clusters with high resolution. Clusters are transformed into SVs, and a Bayesian model genotypes each SV based on its supporting evidence. The algorithm is integrated into the single-sample variant detector of the Next Generation Sequencing Experience Platform (NGSEP), which facilitates integration with other functionalities for genomic analysis. For benchmarking, we compared our algorithm against other tools, using VISOR for simulation and the GIAB SV dataset for real data. For indel calls in a simulated Nanopore dataset at 20x depth, the DBSCAN-based algorithm performed best, achieving an F-score of 98%, compared to 97.8% for Dysgu, 97.8% for SVIM, 97.7% for CuteSV, and 96.8% for Sniffles. We believe that this work makes a significant contribution to the development of bioinformatic strategies that maximize the use of long-read sequencing technologies.
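To illustrate the clustering step described above, the following is a minimal sketch of DBSCAN applied to SV signatures. It is not the NGSEP implementation. It assumes each signature is reduced to a hypothetical 2D point (start position, length) and uses plain Euclidean distance; the abstract does not specify how coordinates are derived, and the names `dbscan`, `eps`, and `min_pts`, as well as the example values, are illustrative choices.

```python
from math import hypot

NOISE = -1  # label for signatures that belong to no cluster


def dbscan(points, eps, min_pts):
    """Label each 2D point with a cluster id (0, 1, ...) or NOISE.

    points  : list of (position, length) tuples (assumed coordinates)
    eps     : neighborhood radius in the same units as the coordinates
    min_pts : minimum neighborhood size (including the point itself)
              for a point to count as a core point
    """
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force range query; fine for a sketch, quadratic overall.
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned or marked as noise
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # core point: keep growing the cluster
        cluster += 1
    return labels


# Hypothetical deletion signatures as (start position, length in bp).
signatures = [
    (1000, 300), (1004, 298), (998, 305), (1002, 301),  # one event
    (5000, 120), (5003, 118), (5001, 121),              # a second event
    (9000, 60),                                         # isolated signature
]
labels = dbscan(signatures, eps=20, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

In this toy input, the two groups of mutually close signatures form separate clusters, and the isolated signature is left as noise rather than being forced into a cluster, which is the property that lets density-based clustering delimit candidate SVs sharply.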
Publisher
Cold Spring Harbor Laboratory