Abstract
Gene set analysis (GSA) remains a common step in genome-scale studies because it can reveal insights that are not apparent from results obtained for individual genes. Many different computational tools are applied for GSA, and these may be sensitive to different types of signals; however, most methods test whether the distribution of the effect of some experimental condition differs between genes in gene sets of interest. We have developed a unifying framework for GSA that first fits effect size distributions, and then tests for differences in these distributions between gene sets. These differences can be in the proportions of genes that are perturbed or in the sign or size of the effects. Inspired by statistical meta-analysis, we take into account the uncertainty in effect size estimates, so that genes whose effect sizes are estimated with greater uncertainty have less influence on the distribution parameters. We demonstrate, using simulation and by application to real data, that this approach provides significant gains in performance over existing methods. Furthermore, the statistical tests carried out are defined in terms of effect sizes, rather than the results of prior statistical tests measuring these changes, which leads to improved interpretability and greater robustness to variation in sample sizes. We also show that the approach naturally suggests alternative test types that are not usually considered in GSA; it can, for example, be applied to identify differences in effect size distributions between sample subgroups in a gene set of interest. Applying this approach to an analysis of gene expression changes between matched colon tumour and normal samples, we found several gene sets that showed distinct behaviour in patient subgroups with different prognoses.
These may help to explain the clinical differences that have been reported between these patient groups.
Author summary
The role of gene set analysis is to identify groups of genes that are perturbed in a genomics experiment. There are many tools available for this task and they do not all test for the same types of changes. Here we propose a new way to carry out gene set analysis that involves first working out the distribution of the group effect in the gene set and then comparing this distribution to the equivalent distribution in other genes. Tests performed by existing tools for gene set analysis can be related to different comparisons between these distributions of group effects. A unified framework for gene set analysis provides more explicit null hypotheses against which to test sets of genes for different types of responses to the experimental conditions. The results are more interpretable, because the group effect distributions can be compared visually, providing an indication of how the experimental effect differs between the gene sets. We can also apply this method to identify sets of genes that behave atypically in subgroups of samples. This enabled us to identify differences in the expression of several gene sets in colon cancer samples between individuals with reduced mortality and those without this benefit.
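The abstract describes down-weighting genes whose effect sizes are estimated with greater uncertainty. A rough way to see how this works is classical fixed-effect inverse-variance weighting from meta-analysis, sketched below. This is an illustrative stand-in and not the authors' model: the function names `weighted_mean` and `compare_sets`, the normal approximation, and the simple comparison of weighted means are all assumptions made for this example.

```python
import math


def weighted_mean(effects, std_errors):
    """Inverse-variance weighted mean of per-gene effect sizes.

    Hypothetical helper: genes with larger standard errors get
    smaller weights, so they pull the summary less.
    """
    weights = [1.0 / (se ** 2) for se in std_errors]
    total = sum(weights)
    mean = sum(w * e for w, e in zip(weights, effects)) / total
    # Standard error of the weighted mean under a fixed-effect model.
    return mean, math.sqrt(1.0 / total)


def compare_sets(set_effects, set_ses, bg_effects, bg_ses):
    """Two-sided z-test for a difference in weighted mean effect
    between a gene set and background genes (normal approximation,
    assumed here for illustration only)."""
    m1, s1 = weighted_mean(set_effects, set_ses)
    m0, s0 = weighted_mean(bg_effects, bg_ses)
    z = (m1 - m0) / math.sqrt(s1 ** 2 + s0 ** 2)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return z, p


# A noisy gene (standard error 10) barely shifts the summary
# away from the precisely measured gene (standard error 1).
mean, se = weighted_mean([0.0, 10.0], [1.0, 10.0])
```

The test is stated directly in terms of effect sizes and their standard errors, which is the property the abstract links to interpretability and to robustness against differing sample sizes. A comparison of means captures only one kind of difference; differences in the proportion of perturbed genes or in the sign of effects would need a richer distributional model.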
Publisher
Cold Spring Harbor Laboratory