Abstract
Associations between datasets, each comprising many features, can be discovered through multivariate methods such as Canonical Correlation Analysis (CCA) or Partial Least Squares (PLS). Applying CCA/PLS to high-dimensional datasets raises critical questions about reliability and interpretability. To study this, we developed a generative modeling framework to simulate synthetic datasets, parameterized by dimensionality, variance structure, and association strength. We found that CCA/PLS associations could be highly inaccurate when the number of samples per feature is relatively small. For PLS, profiles of feature weights exhibit a detrimental bias toward leading principal component axes. We confirmed these trends in state-of-the-art neuroimaging datasets, the Human Connectome Project (n≈1000) and UK Biobank (n=20000), finding that only the latter comprised sufficient samples for stable estimates. An analysis of the neuroimaging literature that uses CCA to map brain-behavior relationships also revealed that commonly employed sample sizes yield unstable CCA solutions. Finally, we provide a calculator of the dataset properties required for CCA/PLS stability. Collectively, we characterize how to limit the detrimental effects of overfitting on CCA/PLS stability, and provide recommendations for future studies.
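The dependence on samples per feature can be illustrated with a minimal simulation. The sketch below is not the generative framework described above: it assumes isotropic (identity) within-set covariance and a single true association of strength `r_true` between the first feature of each set, and the function names `simulate` and `first_canonical_corr` are hypothetical. The first canonical correlation is computed from orthonormal bases of the centered data matrices (QR decomposition followed by an SVD).

```python
import numpy as np


def simulate(n, px, py, r_true, rng):
    """Draw n samples of X (px features) and Y (py features).

    Assumption for illustration: all features are standard normal and
    independent, except that X[:, 0] and Y[:, 0] correlate at r_true,
    so the population first canonical correlation equals r_true.
    """
    X = rng.standard_normal((n, px))
    Y = rng.standard_normal((n, py))
    Y[:, 0] = r_true * X[:, 0] + np.sqrt(1.0 - r_true**2) * Y[:, 0]
    return X, Y


def first_canonical_corr(X, Y):
    """In-sample first canonical correlation (requires n > px and n > py)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered data.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(min(s[0], 1.0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    px = py = 20
    r_true = 0.3
    for n in (50, 200, 2000, 20000):
        X, Y = simulate(n, px, py, r_true, rng)
        r_hat = first_canonical_corr(X, Y)
        print(f"n={n:6d}  samples/feature={n / (px + py):7.1f}  "
              f"estimated r={r_hat:.3f}  (true r={r_true})")
```

Under these assumptions, the in-sample estimate is strongly inflated when there are only a few samples per feature and approaches `r_true` as the sample size grows.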
Publisher
Cold Spring Harbor Laboratory
Cited by
44 articles.