Improving the accuracy and internal consistency of regression-based clustering of high-dimensional datasets
Author:
Zhang Bo¹, He Jianghua¹, Hu Jinxiang¹, Chalise Prabhakar¹, Koestler Devin C.¹
Affiliation:
1. Department of Biostatistics & Data Science, University of Kansas Medical Center, Kansas City, KS 66160, USA
Abstract
Component-wise Sparse Mixture Regression (CSMR) is a recently proposed regression-based clustering method that shows promise in detecting heterogeneous relationships between molecular markers and a continuous phenotype of interest. However, CSMR can yield inconsistent results when applied to high-dimensional molecular data, which we hypothesize is due in part to inherent limitations of the feature selection method used in the CSMR algorithm. To assess this hypothesis, we explored whether substituting different regularized regression methods, namely Lasso, Elastic Net, Smoothly Clipped Absolute Deviation (SCAD), Minimax Concave Penalty (MCP), and Adaptive-Lasso, within the CSMR framework can improve the clustering accuracy and internal consistency (IC) of CSMR in high-dimensional settings. We calculated the true positive rate (TPR), true negative rate (TNR), IC, and clustering accuracy of our proposed modifications, benchmarked against the existing CSMR algorithm, using an extensive set of simulation studies and real biological datasets. Our results demonstrated that substituting Adaptive-Lasso for the existing feature selection method used in CSMR led to significantly improved IC and clustering accuracy, with strong performance even in high-dimensional scenarios. In conclusion, our modifications of the CSMR method resulted in improved clustering performance and may thus serve as viable alternatives for the regression-based clustering of high-dimensional datasets.
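The feature-selection step and the TPR/TNR metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not reproduce the CSMR mixture algorithm: it fits a single regression component with an adaptive-lasso penalty using plain NumPy coordinate descent. The function names (`lasso_cd`, `adaptive_lasso`, `selection_rates`), the ridge-based initial estimate, and all tuning values are assumptions chosen for illustration.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by lasso coordinate descent."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(X, y, lam, weights=None, n_iter=200):
    """Minimise (1/2n)||y - Xb||^2 + lam * sum_j w_j |b_j| by coordinate descent."""
    n, p = X.shape
    w = np.ones(p) if weights is None else np.asarray(weights, dtype=float)
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n      # (1/n) * x_j' x_j for each feature
    resid = np.asarray(y, dtype=float).copy()  # equals y - X @ beta while beta = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the partial residual (feature j excluded).
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * w[j]) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)  # keep resid = y - X @ beta
            beta[j] = new
    return beta


def adaptive_lasso(X, y, lam, gamma=1.0, ridge_alpha=1.0, eps=1e-8):
    """Adaptive lasso: penalty weights are inverse powers of an initial estimate.

    The ridge initial estimate is an assumption made here so that the sketch
    also runs when there are more features than samples.
    """
    p = X.shape[1]
    beta_init = np.linalg.solve(X.T @ X + ridge_alpha * np.eye(p), X.T @ y)
    weights = 1.0 / (np.abs(beta_init) + eps) ** gamma
    return lasso_cd(X, y, lam, weights=weights)


def selection_rates(beta_hat, beta_true):
    """Return (TPR, TNR) of the selected features against the true support."""
    selected = np.asarray(beta_hat) != 0
    truth = np.asarray(beta_true) != 0
    tpr = (selected & truth).sum() / truth.sum()
    tnr = (~selected & ~truth).sum() / (~truth).sum()
    return tpr, tnr


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))
    beta_true = np.zeros(20)
    beta_true[:3] = [3.0, -2.0, 1.5]       # three truly relevant features
    y = X @ beta_true + 0.5 * rng.standard_normal(100)
    beta_hat = adaptive_lasso(X, y, lam=0.1)
    print(selection_rates(beta_hat, beta_true))
```

In a mixture setting such as CSMR, a selection step of this kind would be applied within each cluster's regression; how the paper integrates it with the clustering iterations is not described in the abstract and is not assumed here.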
Funder
National Cancer Institute (NCI) Cancer Center Support Grant; the Kansas Institute for Precision Medicine COBRE, supported by the National Institute of General Medical Sciences award; the Kansas IDeA Network of Biomedical Research Excellence Bioinformatics Core, supported by the National Institute of General Medical Sciences award
Publisher
Walter de Gruyter GmbH
Subject
Computational Mathematics, Genetics, Molecular Biology, Statistics and Probability