Abstract
All tissue-based gene expression studies are impacted by biological and technical sources of variation. Numerous methods are used to normalize and batch correct these datasets. A more accurate understanding of all causes of variation could further optimize these approaches. We used 17,282 samples from 49 tissues in the Genotype-Tissue Expression (GTEx) dataset (v8) to investigate patterns and causes of expression variation. Transcript expression was normalized to Z-scores, and only the most variable 2% of transcripts were evaluated and clustered based on co-expression patterns. Clustered gene sets were attributed to different biological or technical causes related to metadata elements and histologic images. We identified 522 variable transcript clusters (median 11 per tissue) across the samples. Of these, 64% were confidently explained, 15% were likely explained, 7% had low-confidence explanations, and 14% had no clear cause. Common causes included sex, sequencing contamination, immunoglobulin diversity, and compositional tissue differences. Less common biological causes included death interval (Hardy score), muscle atrophy, diabetes status, and menopause. Technical causes included brain pH and harvesting differences. Many of the causes of variation in bulk tissue expression were identifiable in the Tabula Sapiens dataset of single-cell expression. This is the largest exploration of the underlying sources of tissue expression variation. It uncovered expected and unexpected causes of variable gene expression. These identified sources of variation will inform which metadata to acquire with tissue harvesting and can be used to improve normalization, batch correction, and analysis of both bulk and single-cell RNA-seq data.
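The normalization and clustering steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not state how variability was ranked, which clustering algorithm was used, or what correlation cutoff applied. Here, as assumptions, transcripts are ranked by standard deviation before Z-scoring (after Z-scoring every transcript has unit variance), and transcripts are grouped by a simple linkage rule using a hypothetical Pearson cutoff `min_r`. The function names and the `expr` input layout (transcript name mapped to per-sample expression values) are illustrative only.

```python
import math
from statistics import mean, pstdev


def zscore(values):
    """Z-score one transcript across samples; constant rows map to zeros."""
    mu, sd = mean(values), pstdev(values)
    return [0.0 if sd == 0 else (v - mu) / sd for v in values]


def top_variable(expr, fraction=0.02):
    """Return the most variable fraction of transcript names (at least one).

    Assumption: variability is the standard deviation of the raw values.
    """
    n_keep = max(1, math.ceil(len(expr) * fraction))
    ranked = sorted(expr, key=lambda name: pstdev(expr[name]), reverse=True)
    return ranked[:n_keep]


def pearson(zx, zy):
    """Pearson correlation of two population-Z-scored vectors."""
    return sum(a * b for a, b in zip(zx, zy)) / len(zx)


def coexpression_clusters(expr, fraction=0.02, min_r=0.8):
    """Group the top variable transcripts whose Z-scores correlate >= min_r.

    Assumption: a transcript joins a cluster if it correlates with any
    current member (single-linkage style); min_r is a hypothetical cutoff.
    """
    names = top_variable(expr, fraction)
    z = {name: zscore(expr[name]) for name in names}
    unassigned = list(names)
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            linked = [n for n in unassigned
                      if pearson(z[current], z[n]) >= min_r]
            for n in linked:
                unassigned.remove(n)
            cluster.extend(linked)
            frontier.extend(linked)
        clusters.append(cluster)
    return clusters
```

In practice one would likely run this per tissue on a full expression matrix with an established hierarchical clustering routine; the pure-Python version above only makes the order of operations explicit.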
Publisher
Cold Spring Harbor Laboratory