Double data piling: a high-dimensional solution for asymptotically perfect multi-category classification

Published:2024-04-03 Issue:3 Volume:53 Page:704-737
ISSN:1226-3192
Container-title:Journal of the Korean Statistical Society
language:en
Short-container-title:J. Korean Stat. Soc.

Author:

Kim Taehyun, Chang Woonyoung, Ahn Jeongyoun, Jung Sungkyu

Abstract

For high-dimensional classification, interpolation of training data manifests as the data piling phenomenon, in which linear projections of data vectors from each class collapse to a single value. Recent research has revealed an additional phenomenon known as the ‘second data piling’ for independent test data in binary classification, providing a theoretical understanding of asymptotically perfect classification. This paper extends these findings to multi-category classification and provides a comprehensive characterization of the double data piling phenomenon. We define the maximal data piling subspace, which maximizes the sum of pairwise distances between piles of training data in multi-category classification. Furthermore, we show that a second data piling subspace that induces data piling for independent data exists and can be consistently estimated by projecting the negatively-ridged discriminant subspace onto an estimated ‘signal’ subspace. By leveraging this second data piling phenomenon, we propose a bias-correction strategy for class assignments, which asymptotically achieves perfect classification. The present research sheds light on benign overfitting and enhances the understanding of perfect multi-category classification of high-dimensional discrimination with the help of high-dimensional asymptotics.
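The first data piling phenomenon described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's estimator: it assumes a toy two-class Gaussian model (the sample sizes, dimension, and mean shift are made up for illustration) and uses the minimum-norm direction that interpolates centered class labels, which is one simple way to obtain a direction on which the training data pile.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class, p = 10, 500                      # far more variables than observations
labels = np.repeat([0, 1], n_per_class)

def sample(rng):
    """Toy model (an assumption): unit-variance Gaussians, class 1 shifted by 0.5."""
    X = rng.standard_normal((2 * n_per_class, p))
    X[labels == 1] += 0.5
    return X

X_train = sample(rng)

# Minimum-norm w solving Xc @ w = yc, where Xc is the centered training data
# and yc the centered +/-1 label coding. lstsq returns the minimum-norm
# solution when the system is underdetermined (p > n).
Xc = X_train - X_train.mean(axis=0)
y = np.where(labels == 1, 1.0, -1.0)
yc = y - y.mean()
w, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
w /= np.linalg.norm(w)

def within_class_spread(X):
    """Largest within-class standard deviation of the projections X @ w."""
    proj = X @ w
    return max(proj[labels == k].std() for k in (0, 1))

proj_train = X_train @ w
gap = abs(proj_train[labels == 1].mean() - proj_train[labels == 0].mean())
spread_train = within_class_spread(X_train)   # numerically zero: data piling
spread_test = within_class_spread(sample(rng))  # independent data do not pile on w

print(f"gap between training piles: {gap:.3f}")
print(f"within-class spread, training: {spread_train:.2e}")
print(f"within-class spread, test:     {spread_test:.2e}")
```

On the training set each class projects to a single value up to floating-point error, while independent test data projected onto the same direction remain spread out. That gap is the motivation for the second data piling subspace and the bias correction that the abstract describes.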

Funder

National Research Foundation of Korea

Seoul National University

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s42952-024-00263-6.pdf
