Abstract
Tandem repeats (TRs) are contiguously repetitive sequences with a high mutation rate. Several human diseases have been associated with TR expansion, a mutation that changes the number of repeat units. Nevertheless, these Variable Number Tandem Repeats (VNTRs) have not been included in many genome-wide studies. The reason is that VNTR genotyping is inaccurate with short-read sequencing, while newer technologies such as long-read sequencing are expensive and lack throughput.
Here, we propose a sequence-based random forest classifier that is able to predict variable expansion of TR regions, as given by an incomplete VNTR annotation from long-read sequencing of 5 haplotypes. The classifier mainly predicted VNTRs using the feature TR length. The second most used feature is a novel finding: the Mfold-predicted likelihood of self-folding, where more stable foldings are correlated with VNTRs. We validated VNTR candidates predicted by this classifier by clustering short-read pileup patterns compared across 17 genomes. TRs labeled VNTR by the classifier showed similar local variance in their pileup profiles.
Contact
diederik.cvb@gmail.com
Supplementary information
Supplementary data are available at bioRxiv.
Publisher
Cold Spring Harbor Laboratory
Cited by
1 article.