Abstract
In this study, we investigate the impact of introns on the effectiveness of splice site prediction using deep learning models, focusing on Arabidopsis thaliana. We specifically utilize U2-type introns due to their ubiquity in plant genomes and the rich datasets available. We formulate two hypotheses: first, that short introns would lead to a higher effectiveness of splice site prediction than long introns due to reduced spatial complexity; and second, that sequences containing multiple introns would improve prediction effectiveness by providing a richer context for splicing events. Our findings indicate that (1) models trained on datasets with shorter introns consistently outperform those trained on datasets with longer introns, highlighting the importance of intron length in splice site prediction, and (2) models trained with datasets containing multiple introns per sequence demonstrate superior effectiveness over those trained with datasets containing a single intron per sequence. Furthermore, our findings not only align with the two hypotheses we put forward but also confirm existing observations from wet lab experiments regarding the impact of intron length and the number of introns per sequence on splice site prediction effectiveness, suggesting that our computational insights have biological relevance.
Author summary
In this study, we explore how intron characteristics affect the effectiveness of splice site prediction in Arabidopsis thaliana using deep learning. Focusing on U2-type introns due to their prevalence in plant genomes and their relevance for large-scale data analysis, we demonstrate that both the length of these introns and the number of introns present in a sequence substantially influence prediction outcomes.
Our findings highlight that deep learning models trained on data with shorter introns or multiple introns per sequence produce better predictions, aligning with observations from wet lab experiments regarding the impact of intron length and the number of introns per sequence on splice site prediction effectiveness.
Publisher
Cold Spring Harbor Laboratory