Deep convolutional and conditional neural networks for large-scale genomic data generation
Published:2023-10-30
Issue:10
Volume:19
Page:e1011584
ISSN:1553-7358
Container-title:PLOS Computational Biology
language:en
Short-container-title:PLoS Comput Biol
Author:
Yelmen Burak,
Decelle Aurélien,
Boulos Leila Lea,
Szatkownik Antoine,
Furtlehner Cyril,
Charpiat Guillaume,
Jay Flora
Abstract
Applications of generative models for genomic data have gained significant momentum in the past few years, with scopes ranging from data characterization to generation of genomic segments and functional sequences. In our previous study, we demonstrated that generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be used to create novel high-quality artificial genomes (AGs) which can preserve the complex characteristics of real genomes such as population structure, linkage disequilibrium and selection signals. However, a major drawback of these models is scalability, since the large feature space of genome-wide data increases computational complexity vastly. To address this issue, we implemented a novel convolutional Wasserstein GAN (WGAN) model along with a novel conditional RBM (CRBM) framework for generating AGs with high SNP number. These networks implicitly learn the varying landscape of haplotypic structure in order to capture complex correlation patterns along the genome and generate a wide diversity of plausible haplotypes. We performed comparative analyses to assess both the quality of these generated haplotypes and the amount of possible privacy leakage from the training data. As the importance of genetic privacy becomes more prevalent, the need for effective privacy protection measures for genomic data increases. We used generative neural networks to create large artificial genome segments which possess many characteristics of real genomes without substantial privacy leakage from the training dataset. In the near future, with further improvements in haplotype quality and privacy preservation, large-scale artificial genome databases can be assembled to provide easily accessible surrogates of real databases, allowing researchers to conduct studies with diverse genomic data within a safe ethical framework in terms of donor privacy.
Funder
Agence Nationale de la Recherche
Comunidad de Madrid
Banco Santander and the UCM
Fondo Europeo de Desarrollo Regional
Publisher
Public Library of Science (PLoS)
Subject
Computational Theory and Mathematics; Cellular and Molecular Neuroscience; Genetics; Molecular Biology; Ecology; Modeling and Simulation; Ecology, Evolution, Behavior and Systematics
Cited by
4 articles.