Improving the accuracy and internal consistency of regression-based clustering of high-dimensional datasets

Published: 2023-01-01 Issue: 1 Volume: 22
ISSN: 2194-6302
Container-title: Statistical Applications in Genetics and Molecular Biology
language: en

Author:

Zhang Bo¹, He Jianghua¹, Hu Jinxiang¹, Chalise Prabhakar¹, Koestler Devin C.¹

Affiliation:

1. Department of Biostatistics & Data Science, University of Kansas Medical Center, Kansas City, KS 66160, USA

Abstract

Component-wise Sparse Mixture Regression (CSMR) is a recently proposed regression-based clustering method that shows promise in detecting heterogeneous relationships between molecular markers and a continuous phenotype of interest. However, CSMR can yield inconsistent results when applied to high-dimensional molecular data, which we hypothesize is in part due to inherent limitations associated with the feature selection method used in the CSMR algorithm. To assess this hypothesis, we explored whether substituting different regularized regression methods (i.e., Lasso, Elastic Net, Smoothly Clipped Absolute Deviation (SCAD), Minimax Concave Penalty (MCP), and Adaptive-Lasso) within the CSMR framework can improve the clustering accuracy and internal consistency (IC) of CSMR in high-dimensional settings. We calculated the true positive rate (TPR), true negative rate (TNR), IC, and clustering accuracy of our proposed modifications, benchmarked against the existing CSMR algorithm, using an extensive set of simulation studies and real biological datasets. Our results demonstrated that substituting Adaptive-Lasso within the existing feature selection method used in CSMR led to significantly improved IC and clustering accuracy, with strong performance even in high-dimensional scenarios. In conclusion, our modifications of the CSMR method resulted in improved clustering performance and may thus serve as viable alternatives for the regression-based clustering of high-dimensional datasets.
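
As a rough illustration of the evaluation metrics named in the abstract, the sketch below computes TPR and TNR for a selected feature set and a label-permutation-invariant clustering accuracy. It assumes the standard textbook definitions of these quantities; the function names `selection_rates` and `clustering_accuracy` are hypothetical, and the paper's exact definitions (including its IC measure, which is not reproduced here) may differ.

```python
from itertools import permutations


def selection_rates(true_support, selected, n_features):
    """Return (TPR, TNR) of a selected feature set against the true support.

    Assumed definitions: TPR = TP / (number of truly relevant features),
    TNR = TN / (number of truly irrelevant features).
    """
    true_support, selected = set(true_support), set(selected)
    true_null = set(range(n_features)) - true_support
    tp = len(true_support & selected)   # relevant features that were selected
    tn = len(true_null - selected)      # irrelevant features that were left out
    tpr = tp / len(true_support) if true_support else float("nan")
    tnr = tn / len(true_null) if true_null else float("nan")
    return tpr, tnr


def clustering_accuracy(true_labels, pred_labels):
    """Fraction of samples matched under the best relabeling of clusters.

    Cluster labels are arbitrary, so every one-to-one mapping from predicted
    to true labels is tried. The search is factorial in the number of
    clusters, so this is only suitable for a small number of clusters.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    # Pad with None so surplus predicted clusters can stay unmatched.
    targets = true_ids + [None] * max(0, len(pred_ids) - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(mapping[p] == t for p, t in zip(pred_labels, true_labels))
        best = max(best, hits)
    return best / len(true_labels)


# Toy example (made-up values, not data from the paper).
tpr, tnr = selection_rates({0, 1, 2}, {0, 1, 5}, n_features=10)
acc = clustering_accuracy([0, 0, 1, 1, 1], [1, 1, 0, 0, 1])
```

In the toy example, two of three relevant features are recovered and one of seven irrelevant features is wrongly selected, while four of five samples agree once the two cluster labels are swapped.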

Funder

National Cancer Institute (NCI) Cancer Center Support Grant

The Kansas Institute for Precision Medicine COBRE, supported by the National Institute of General Medical Sciences award

The Kansas IDeA Network of Biomedical Research Excellence Bioinformatics Core, supported by the National Institute of General Medical Sciences award

Publisher

Walter de Gruyter GmbH

Subject

Computational Mathematics, Genetics, Molecular Biology, Statistics and Probability

Link

https://www.degruyter.com/document/doi/10.1515/sagmb-2022-0031/pdf

References (41 articles)

1. Balakrishnan, S., Wainwright, M.J., and Yu, B. (2017). Statistical guarantees for the EM algorithm: from population to sample-based analysis. Ann. Stat. 45: 77–120, https://doi.org/10.1214/16-aos1435.

2. Barretina, J., Caponigro, G., Stransky, N., Venkatesan, K., Margolin, A.A., Kim, S., Wilson, C.J., Lehár, J., Kryukov, G.V., Sonkin, D., et al. (2012). The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483: 603–607, https://doi.org/10.1038/nature11003.

3. Bayazit, Y.A. and Yilmaz, M. (2006). An overview of hereditary hearing loss. ORL J. Otorhinolaryngol. Relat. Spec. 68: 57–63, https://doi.org/10.1159/000091090.

4. Chang, W., Wan, C., Yu, C., Yao, W., Zhang, C., and Cao, S. (2020a). RobMixReg: an R package for robust, flexible and high dimensional mixture regression. bioRxiv, 2020.08.02.233460.

5. Chang, W., Wan, C., Zang, Y., Zhang, C., and Cao, S. (2020b). Supervised clustering of high-dimensional data using regularized mixture modeling. Briefings Bioinf. 22: 1–11, https://doi.org/10.1093/bib/bbaa291.