Affiliation:
1. Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
2. Terrence Donnelly Centre for Cellular & Biomolecular Research, Toronto, ON M5S 3E1, Canada
Abstract
Cleavage and polyadenylation (CPA) sites define eukaryotic gene ends. CPA sites are associated with five key sequence recognition elements: the upstream UGUA, the polyadenylation signal (PAS), and U-rich sequences; the CA/UA dinucleotide where cleavage occurs; and GU-rich downstream elements (DSEs). Currently, it is not clear whether these sequences are sufficient to delineate CPA sites. Additionally, numerous other sequences and factors have been described, often in the context of promoting alternative CPA sites and preventing cryptic CPA site usage. Here, we dissect the contributions of individual sequence features to CPA using standard discriminative models. We show that models composed only of the five primary CPA sequence features assign the highest probability scores to constitutive CPA sites at the ends of coding genes, relative to the entire pre-mRNA sequence, for 59% of all human genes. U1-hybridizing sequences provide a small boost in performance. Adding all known RBP RNA-binding motifs to the model increases this figure to only 61%, suggesting that additional factors beyond the core CPA machinery play a minimal role in distinguishing real from cryptic sites. To our knowledge, this high effectiveness of established features in predicting human gene ends has not previously been documented.
Funder
Canadian Institutes of Health Research
Publisher
Oxford University Press (OUP)
Subject
Applied Mathematics, Computer Science Applications, Genetics, Molecular Biology, Structural Biology
Cited by
1 article.