Author:
Cooke Emma J,Savage Richard S,Kirk Paul DW,Darkins Robert,Wild David L
Abstract
Abstract
Background
Post-genomic molecular biology has resulted in an explosion of data, providing measurements for large numbers of genes, proteins and metabolites. Time series experiments have become increasingly common, necessitating the development of novel analysis tools that capture the resulting data structure. Outlier measurements at one or more time points present a significant challenge, while potentially valuable replicate information is often ignored by existing techniques.
Results
We present a generative model-based Bayesian hierarchical clustering algorithm for microarray time series that employs Gaussian process regression to capture the structure of the data. By using a mixture model likelihood, our method permits a small proportion of the data to be modelled as outlier measurements, and adopts an empirical Bayes approach which uses replicate observations to inform a prior distribution of the noise variance. The method automatically learns the optimum number of clusters and can incorporate non-uniformly sampled time points. Using a wide variety of experimental data sets, we show that our algorithm consistently yields higher quality and more biologically meaningful clusters than current state-of-the-art methodologies. We highlight the importance of modelling outlier values by demonstrating that noisy genes can be grouped with other genes of similar biological function. We demonstrate the importance of including replicate information, which we find enables the discrimination of additional distinct expression profiles.
Conclusions
By incorporating outlier measurements and replicate values, this clustering algorithm for time series microarray data provides a step towards a better treatment of the noise inherent in measurements from high-throughput genomic technologies. Timeseries BHC is available as part of the R package 'BHC' (version 1.5), which is available for download from Bioconductor (version 2.9 and above) via http://www.bioconductor.org/packages/release/bioc/html/BHC.html?pagewanted=all.
Publisher
Springer Science and Business Media LLC
Subject
Applied Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Structural Biology
Reference39 articles.
1. Stegle O, Denby KJ, Cooke EJ, Wild DL, Ghahramani Z, Borgwardt KM: A Robust Bayesian Two-Sample Test for Detecting Intervals of Differential Gene Expression in Microarray Time Series. Journal of Computational Biology. 2010, 17: 355-367. 10.1089/cmb.2009.0175.
2. Eisen M, Spellman P, Brown P, Botstein D: Cluster Analysis and Display of Genome-wide Expression. Proceedings of the National Academy of Sciences. 1998, 95: 14863-14868. 10.1073/pnas.95.25.14863.
3. McLachlan GJ, Bean RW, Peel D: A mixture model-based approach to the clustering of microarray expression data. Bioinformatics. 2002, 18: 413-422. 10.1093/bioinformatics/18.3.413.
4. Schliep A, Costa IG, Steinhoff C, Schonhuth A: Analyzing Gene Expression Time-Courses. IEEE/ACM Trans Comput Biol Bioinform. 2005, 2: 179-193. 10.1109/TCBB.2005.31.
5. Beal M, Krishnamurthy P: Gene Expression Time Course Clustering with Countably Infinite Hidden Markov Models. Proceedings of the Proceedings of the Twenty-Second Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI-06). 2006, Arlington, Virginia: AUAI Press, 23-30.
Cited by
51 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献