Abstract
Population structure (also called genetic structure or population stratification) is the presence of systematic differences in allele frequencies between subpopulations of a population, arising from non-random mating between individuals. It can be informative about genetic ancestry, and in medical genetics it is an important confounding variable in genome-wide association studies. Recently, many nonlinear dimensionality reduction techniques have been proposed for visualizing population structure. However, an objective comparison of these techniques has so far been missing from the literature. In this paper, we discuss the previously proposed nonlinear techniques and some of their potential weaknesses. We then propose a novel quantitative evaluation methodology for comparing these techniques, based on populations for which the pedigree is known a priori, either through artificial selection or through simulation. Using this evaluation metric, we find graph-based algorithms such as t-SNE and UMAP to be superior to PCA, while neural-network-based methods fall behind.
Publisher
Cold Spring Harbor Laboratory