Author:
Szatkownik Antoine, Furtlehner Cyril, Charpiat Guillaume, Yelmen Burak, Jay Flora
Abstract
Synthetic data generation via generative modeling has recently become a prominent research field in genomics, with applications ranging from functional sequence design to high-quality, privacy-preserving artificial in silico genomes. Following a body of work on Artificial Genomes (AGs) created by various generative models trained on raw genomic input, we propose a conceptually different approach to address the scalability and complexity of genomic data generation in very high dimensions. Our method combines dimensionality reduction, achieved by Principal Component Analysis (PCA), with a Generative Adversarial Network (GAN) that learns in this reduced space. Using this framework, we generated genomic proxy datasets for highly diverse human populations from around the world. We compared the quality of AGs generated by our approach with that of AGs generated by established models, and we report improvements in capturing population structure and linkage disequilibrium, and in metrics related to privacy leakage. Furthermore, we developed a frugal model with orders of magnitude fewer parameters and performance comparable to that of larger models. For quality assessment, we also implemented a new evaluation metric based on information theory to measure local haplotypic diversity, showing that generative models yield higher diversity than real genomes. In addition, we addressed the shrinkage issue associated with PCA and generative modeling, examined its relation to the nearest neighbor resemblance metric, and proposed a resolution. Finally, we evaluated the effect of different binarization methods on the quality of the output AGs.
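The pipeline outlined in the abstract (reduce with PCA, generate in the reduced space, map back, binarize) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy data, the function names, the number of components, and the 0.5 threshold are all assumptions made for illustration, and the GAN is replaced here by a simple per-component Gaussian sampler that only marks where a trained generator would plug in.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return (mean, components) from an SVD of the centered matrix."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def project(X, mean, components):
    """Map samples from SNP space to principal-component scores."""
    return (X - mean) @ components.T

def reconstruct(Z, mean, components):
    """Map principal-component scores back to continuous SNP space."""
    return Z @ components + mean

def binarize(X_cont, threshold=0.5):
    """Hypothetical binarization rule: hard threshold on reconstructed values."""
    return (X_cont > threshold).astype(np.int8)

rng = np.random.default_rng(0)

# Toy stand-in for real haplotypes: 200 samples x 50 binary SNPs (assumed sizes).
real = (rng.random((200, 50)) < 0.3).astype(float)

mean, comps = fit_pca(real, n_components=10)
Z_real = project(real, mean, comps)

# Placeholder generator, NOT a GAN: independent Gaussians fitted to each
# principal component. A trained generator would produce Z_fake instead.
Z_fake = rng.normal(Z_real.mean(axis=0), Z_real.std(axis=0), size=(100, 10))

# Back to SNP space, then to binary alleles.
synthetic = binarize(reconstruct(Z_fake, mean, comps))
```

The point of the sketch is the data flow: the generative model only ever sees the low-dimensional scores, so its size depends on the number of retained components rather than on the number of SNPs.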
Publisher
Cold Spring Harbor Laboratory