Creating artificial human genomes using generative neural networks

Authors

Yelmen Burak, Decelle Aurélien, Ongaro Linda, Marnetto Davide, Tallec Corentin, Montinaro Francesco, Furtlehner Cyril, Pagani Luca, Jay Flora

Abstract

Generative models have shown breakthroughs in a wide spectrum of domains due to recent advancements in machine learning algorithms and increased computational power. Despite these impressive achievements, the ability of generative models to create realistic synthetic data is still under-exploited in genetics and absent from population genetics. Yet a known limitation in the field is the restricted access to many genetic databases due to concerns about violations of individual privacy, even though these databases would provide a rich resource for data mining and integration towards advancing genetic studies. In this study, we demonstrate that deep generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be trained to learn the complex distributions of real genomic datasets and generate novel high-quality artificial genomes (AGs) with little to no privacy loss. We show that our generated AGs replicate characteristics of the source dataset such as allele frequencies, linkage disequilibrium, pairwise haplotype distances and population structure. Moreover, they can also inherit complex features such as signals of selection. To illustrate the promising outcomes of our method, we show that imputation quality for low-frequency alleles can be improved by augmenting reference panels with AGs, and that the RBM latent space provides a relevant encoding of the data, hence allowing further exploration of the reference dataset and features for solving supervised tasks. Generative models and AGs have the potential to become valuable assets in genetic studies by providing a rich yet compact representation of existing genomes and high-quality, easy-access and anonymous alternatives for private databases.

Funder

European Regional Development Fund

Eesti Teadusagentuur

Domaine d’Intérêt Majeur One Health 2017

Atracción de Talento

Laboratoire de Recherche en Informatique

Publisher

Public Library of Science (PLoS)

Subject

Cancer Research; Genetics (clinical); Genetics; Molecular Biology; Ecology, Evolution, Behavior and Systematics


Cited by 70 articles.
