Affiliation:
1. University of Lausanne, Lausanne, Switzerland
Abstract
Normality, in the colloquial sense, has historically been considered an aspirational trait, synonymous with ideality. The arithmetic average and, by extension, statistics such as linear regression coefficients have often been used to characterize normality, to summarize samples, and to identify outliers. We provide intuition for the behavior of such statistics in high dimensions, and demonstrate that even for datasets with a relatively low number of dimensions, data begin to exhibit a number of peculiarities which become severe as the number of dimensions increases. Whilst our main goal is to familiarize researchers with these peculiarities, we also show that normality can be better characterized by ‘typicality’, an information-theoretic concept relating to entropy. An application of typicality to both synthetic and real-world data concerning political values reveals that in multi-dimensional space, to be ‘normal’ is actually to be atypical. We briefly explore the ramifications for outlier detection, demonstrating how typicality, in contrast with the popular Mahalanobis distance, represents a viable method for outlier detection.
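The core intuition can be illustrated with a minimal sketch (not the authors' code): for a standard multivariate normal in d dimensions, the mean has the highest density and the smallest Mahalanobis distance, yet it lies far from the typical set, whose members have a negative log-density close to the distribution's entropy. All variable names and the use of an identity-covariance Gaussian below are illustrative assumptions.

```python
# Sketch: the mean of a high-dimensional Gaussian is atypical.
import numpy as np

rng = np.random.default_rng(0)
d = 100                                  # number of dimensions
n = 10_000                               # sample size
X = rng.standard_normal((n, d))          # samples from N(0, I_d)

# For N(0, I_d): squared Mahalanobis distance = squared Euclidean norm,
# and the differential entropy is h = (d/2) * log(2*pi*e).
maha_sq = np.sum(X**2, axis=1)
entropy = 0.5 * d * np.log(2 * np.pi * np.e)

# Per-sample negative log-density; typical points have -log p(x) close to h.
neg_log_p = 0.5 * d * np.log(2 * np.pi) + 0.5 * maha_sq
atypicality = np.abs(neg_log_p - entropy)   # distance from the typical set

print("mean squared norm of samples:", maha_sq.mean())               # concentrates near d
print("squared norm of the sample mean:", np.sum(X.mean(axis=0)**2)) # close to 0

# The origin (the population mean) minimizes the Mahalanobis distance,
# yet its atypicality is |-log p(0) - h| = d/2, far larger than that of
# a typical sample.
print("atypicality of the origin:", np.abs(0.5 * d * np.log(2 * np.pi) - entropy))
print("median atypicality of samples:", np.median(atypicality))
```

Flagging outliers by large Mahalanobis distance would never flag the origin, whereas a typicality-based rule (large |−log p(x) − h|) flags both points that are too far from the mean and points that are implausibly close to it.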
Subject
Sociology and Political Science; Statistics and Probability; Economics, Econometrics and Finance (miscellaneous)