Abstract
Ribosome profiling is a deep sequencing technique used to chart translation by means of mRNA ribosome occupancy. It has been instrumental in the detection of non-canonical coding sequences. Because of the complex nature of next-generation sequencing data, existing solutions that seek to identify translated open reading frames (ORFs) from the data are still not perfect. We propose RIBO-former, a new approach featuring several innovations for the de novo annotation of translated coding sequences. RIBO-former is built using recent transformer models that have achieved considerable advancements in the field of natural language processing. The presented deep learning approach makes it possible to omit several pre-processing steps, as features are automatically extracted from the data. We discuss various steps that improve the detection of coding sequences and show that read length information of all mapped reads can be leveraged to improve the predictive performance of the tool. Our results show that RIBO-former outperforms previous methodologies. Additionally, our study finds support for the existence of translated non-canonical ORFs, present along existing coding sequences or on long non-coding RNAs. Furthermore, several polycistronic mRNAs with multiple translated coding regions were detected.
Publisher
Cold Spring Harbor Laboratory