Abstract
To solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation that preserves the discriminating information (e.g., whether the individual suffers from a particular disease). Interpretable supervised dimensionality reduction methods that scale to millions of dimensions while carrying strong statistical guarantees are lacking. We introduce an approach that extends principal components analysis by incorporating class-conditional moment estimates into the low-dimensional projection. The simplest version, Linear Optimal Low-rank Projection (LOL), incorporates the class-conditional means. We prove, and substantiate with both synthetic and real data benchmarks, that LOL and its generalizations lead to improved data representations for subsequent classification, while maintaining computational efficiency and scalability. On multiple brain imaging datasets with more than 150 million features, and on several genomics datasets with more than 500,000 features, LOL outperforms other scalable linear dimensionality reduction techniques in terms of accuracy, while requiring only a few minutes on a standard desktop computer.
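To make the construction concrete, the following is a minimal NumPy sketch of a LOL-style projection as the abstract describes it: the estimated class-conditional mean differences are concatenated with the top principal directions of the class-centered data, and the combined basis is orthonormalized. The function `lol_project` and its details are an illustrative reconstruction from this description, not the authors' reference implementation.

```python
import numpy as np

def lol_project(X, y, d):
    """Sketch of a Linear Optimal Low-rank (LOL)-style projection.

    X : (n, p) data matrix; y : (n,) class labels; d : target dimension.
    Returns a (p, d) orthonormal projection matrix A; embed samples via X @ A.
    """
    classes = np.unique(y)                      # sorted unique labels
    # Estimated class-conditional means, one row per class.
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Differences of class means relative to the first class: (p, K-1).
    deltas = (means[1:] - means[0]).T

    # Center each sample by its own class mean, so the PCA step captures
    # within-class variation rather than the between-class mean shift,
    # which `deltas` already encodes.
    Xc = X - means[np.searchsorted(classes, y)]
    # Right singular vectors of the class-centered data = principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    n_pcs = max(d - deltas.shape[1], 0)
    A = np.hstack([deltas, Vt[:n_pcs].T])       # combined (p, >= d) basis

    # Orthonormalize the combined basis and keep the first d directions.
    Q, _ = np.linalg.qr(A)
    return Q[:, :d]

# Example: two-class toy data, projected to d = 5 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))
y = rng.integers(0, 2, size=100)
Z = X @ lol_project(X, y, d=5)                  # (100, 5) embedding
```

The resulting embedding `Z` can be passed to any downstream classifier; because the projection only requires class means, one thin SVD, and a QR factorization, it scales linearly in the number of features, consistent with the efficiency claims above.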
Funder
United States Department of Defense | Defense Advanced Research Projects Agency
Publisher
Springer Science and Business Media LLC
Subject
General Physics and Astronomy; General Biochemistry, Genetics and Molecular Biology; General Chemistry
Cited by
31 articles.