Abstract
The evolution of RNA-seq technologies has yielded datasets of scientific value that are often generated as condition-associated biological replicates within expression studies. With expanding data archives, the opportunity arises to augment replicate numbers when conditions of interest overlap. Despite correction procedures for estimating transcript abundance, transcript-level intra-condition count variation remains a source of ambiguity, as indicated by disjointed results between analysis tools. We present TVscript, a tool that removes reference-based transcripts associated with intra-condition count variation above specified thresholds, and we explore the effects of such variation on differential expression analysis. First, iterative differential expression analysis was performed on simulated counts, in which the levels of intra-condition variation and the sets of over-represented transcripts were explicitly specified. Then counts derived from inter- and intra-study data representing brain samples of dogs, wolves and foxes (wolves vs. dogs and aggressive vs. tame foxes) were used. For simulations, the sensitivity in detecting differentially expressed transcripts increased after removing hyper-variable transcripts, although at levels of intra-condition variation above 5% detection became unreliable. For real data, prior to applying TVscript, ≈20% of the transcripts identified as differentially expressed were associated with high levels of intra-condition variation, an over-representation relative to the reference set. When transcripts harbouring such variation were removed pre-analysis, a discordance of 26 to 40% was observed between the resulting lists of differentially expressed transcripts and those obtained using the non-filtered reference. Removing transcripts with intra-condition variation values within (and above) the 97th and 95th percentiles, for wolves vs. dogs and aggressive vs. tame foxes respectively, maximized the sensitivity in detecting differentially expressed transcripts as a result of alterations within gene-wise dispersion estimates. Our analysis of the real data provides support for seven genes potentially involved in selection for tameness. TVscript is available at: https://sourceforge.net/projects/tvscript/.
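The percentile-based filtering described in the abstract can be illustrated with a minimal sketch. This is not TVscript's implementation: the abstract does not state how intra-condition variation is measured, so the coefficient of variation across replicates is used here purely as an assumption, and the function names `intra_condition_cv` and `keep_mask` are hypothetical. The toy count matrix is invented for illustration only.

```python
import numpy as np

def intra_condition_cv(counts, conditions):
    """Per transcript, the largest coefficient of variation within any condition.

    Assumption: CV (sd / mean across replicates) stands in for the
    intra-condition variation measure, which the abstract does not specify.
    """
    counts = np.asarray(counts, dtype=float)
    conditions = np.asarray(conditions)
    per_condition = []
    for label in np.unique(conditions):
        block = counts[:, conditions == label]   # replicates of one condition
        mean = block.mean(axis=1)
        sd = block.std(axis=1, ddof=1)           # sample standard deviation
        # CV is undefined for a zero mean; score those transcripts as 0 here.
        cv = np.divide(sd, mean, out=np.zeros_like(mean), where=mean > 0)
        per_condition.append(cv)
    return np.max(per_condition, axis=0)

def keep_mask(counts, conditions, percentile=95.0):
    """True for transcripts scoring strictly below the percentile cutoff.

    Transcripts within (and above) the chosen percentile are dropped.
    """
    score = intra_condition_cv(counts, conditions)
    cutoff = np.percentile(score, percentile)
    return score < cutoff

# Toy data: rows are transcripts, columns are samples (3 replicates per condition).
conditions = ["A", "A", "A", "B", "B", "B"]
counts = [
    [10, 10, 10, 20, 20, 20],   # stable within both conditions
    [10, 11, 9, 20, 21, 19],    # mild replicate noise
    [100, 100, 100, 5, 5, 5],   # large between-condition change, no replicate noise
    [1, 50, 100, 2, 60, 90],    # hyper-variable within each condition
    [30, 31, 29, 30, 30, 30],   # stable, no change
]
mask = keep_mask(counts, conditions, percentile=80.0)
kept_rows = np.flatnonzero(mask).tolist()
```

In this sketch the filter scores variation only within conditions, so a transcript with a large between-condition difference but consistent replicates (the third row) is retained, while the fourth row is removed before any differential expression analysis would be run on the filtered matrix.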
Funder
Fundação para a Ciência e a Tecnologia
European Regional Development Fund
Publisher
Public Library of Science (PLoS)