The ENCODE4 long-read RNA-seq collection reveals distinct classes of transcript structure diversity

Author:

Reese Fairlie, Williams Brian, Balderrama-Gutierrez Gabriela, Wyman Dana, Çelik Muhammed Hasan, Rebboah Elisabeth, Rezaie Narges, Trout Diane, Razavi-Mohseni Milad, Jiang Yunzhe, Borsari Beatrice, Morabito Samuel, Liang Heidi Yahan, McGill Cassandra J., Rahmanian Sorena, Sakr Jasmine, Jiang Shan, Zeng Weihua, Carvalho Klebea, Weimer Annika K., Dionne Louise A., McShane Ariel, Bedi Karan, Elhajjajy Shaimae I., Upchurch Sean, Jou Jennifer, Youngworth Ingrid, Gabdank Idan, Sud Paul, Jolanki Otto, Strattan J. Seth, Kagda Meenakshi S., Snyder Michael P., Hitz Ben C., Moore Jill E., Weng Zhiping, Bennett David, Reinholdt Laura, Ljungman Mats, Beer Michael A., Gerstein Mark B., Pachter Lior, Guigó Roderic, Wold Barbara J., Mortazavi Ali

Abstract

The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3’ end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts. We sequenced 264 LR-RNA-seq PacBio libraries totaling over 1 billion circular consensus reads (CCS) for 81 unique human and mouse samples. We detect at least one full-length transcript from 87.7% of annotated human protein-coding genes and a total of 200,000 full-length transcripts, 40% of which have novel exon junction chains. To capture and compute on the three sources of transcript structure diversity, we introduce a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript. Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3’ processing are deployed across human tissues, with nearly half of multitranscript protein-coding genes showing a clear bias toward one of the three diversity mechanisms. Evaluated across samples, the predominantly expressed transcript changes for 74% of protein-coding genes. In evolution, the human and mouse transcriptomes are globally similar in types of transcript structure diversity, yet among individual orthologous gene pairs, more than half (57.8%) show substantial differences in mechanism of diversification in matching tissues. This initial large-scale survey of human and mouse long-read transcriptomes provides a foundation for further analyses of alternative transcript usage, and is complemented by short-read and microRNA data on the same samples and by epigenome data elsewhere in the ENCODE4 collection.
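
The triplet idea in the abstract can be illustrated with a small sketch. It assumes each transcript is already labelled with a (transcript start site, exon junction chain, transcript end site) triplet, counts the distinct features of each kind per gene, and normalizes the three counts so they sum to 1, which places the gene on a simplex. The function name `gene_simplex`, the input layout, and the plain count normalization are all assumptions made for illustration; the abstract does not specify the normalization, and the authors' own method may differ.

```python
from collections import defaultdict


def gene_simplex(transcripts):
    """Map each gene to simplex coordinates (tss, ec, tes).

    `transcripts` is an iterable of (gene, tss_id, ec_id, tes_id) tuples,
    one per transcript (hypothetical input layout). Each coordinate is the
    share of distinct features of that kind among all distinct features
    seen for the gene (illustrative normalization, not the paper's).
    """
    seen = defaultdict(lambda: (set(), set(), set()))
    for gene, tss, ec, tes in transcripts:
        tss_set, ec_set, tes_set = seen[gene]
        tss_set.add(tss)
        ec_set.add(ec)
        tes_set.add(tes)

    coords = {}
    for gene, (tss_set, ec_set, tes_set) in seen.items():
        counts = (len(tss_set), len(ec_set), len(tes_set))
        total = sum(counts)
        coords[gene] = tuple(c / total for c in counts)
    return coords


# Toy data: gene A varies only in its exon junction chain,
# gene B varies only in its transcript start site.
example = [
    ("A", 1, 1, 1),
    ("A", 1, 2, 1),
    ("A", 1, 3, 1),
    ("B", 1, 1, 1),
    ("B", 2, 1, 1),
]
simplex = gene_simplex(example)
```

In this toy input, gene A lands closest to the splicing corner of the simplex and gene B closest to the start-site corner, which is the kind of per-gene bias toward one diversity mechanism that the abstract describes.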

Publisher

Cold Spring Harbor Laboratory
