Abstract
AbstractThe amount of data stored in genomic sequence databases is growing exponentially, far exceeding traditional indexing strategies’ processing capabilities. Many recent indexing methods organize sequence data into a sequence graph to succinctly represent large genomic data sets from reference genome and sequencing read set databases. These methods typically use De Bruijn graphs as the graph model or the underlying index model, with auxiliary graph annotation data structures to associate graph nodes with various metadata. Examples of metadata can include a node’s source samples (calledlabels), genomic coordinates, expression levels, etc.An important property of these graphs is that the set of sequences spelled by graph walks is a superset of the set of input sequences. Thus, when aligning to graphs indexing samples derived from low-coverage sequencing sets, sequence information present in many target samples can compensate for missing information resulting from a lack of sequence coverage. Aligning a query against an entire sequence graph (as in traditional sequence-to-graph alignment) using state-of-the-art algorithms can be computationally intractable for graphs constructed from thousands of samples, potentially searching through many non-biological combinations of samples before converging on the best alignment.To address this problem, we propose a novel alignment strategy calledmulti-label alignment(MLA) and an algorithm implementing this strategy using annotated De Bruijn graphs within the MetaGraph framework, calledMetaGraph-MLA. MLA extends current sequence alignment scoring models with additionallabel changeoperations for incorporating mixtures of samples into an alignment, penalizing mixtures that are dissimilar in their sequence content. To overcome disconnects in the graph that result from a lack of sequencing coverage, we further extend our graph index to utilize a variable-order De Bruijn graph and introducenode length changeas an operation. In this model, traversal between nodes that share a suffix of length <k– 1 acts as a proxy for inserting nodes into the graph.MetaGraph-MLAconstructs an MLA of a query by chaining single-label alignments using sparse dynamic programming. We evaluateMetaGraph-MLAon simulated data against state-of-the-art sequence-to-graph aligners. We demonstrate increases in alignment lengths for simulated viral Illumina-type (by 6.5%), PacBio CLR-type (by 6.2%), and PacBio CCS-type (by 6.7%) sequencing reads, respectively, and show that the graph walks incorporated into our MLAs originate predominantly from samples of the same strain as the reads’ ground-truth samples. We envisionMetaGraph-MLAas a step towards establishing sequence graph tools for sequence search against a wide variety of target sequence types.
Publisher
Cold Spring Harbor Laboratory
Cited by
1 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献