Hyperbolic geometry-based deep learning methods to produce population trees from genotype data

Author:

Patel Aman, Montserrat Daniel Mas, Bustamante Carlos, Ioannidis Alexander

Abstract

The production of population-level trees using the genomic data of individuals is a fundamental task in the field of population genetics. Typically, these trees are produced using methods like hierarchical clustering, neighbor joining, or maximum likelihood. However, such methods are non-parametric: they require all data to be present at the time of tree formation, and the addition of new data points necessitates the regeneration of the entire tree, a potentially expensive process. They also do not easily integrate with larger workflows. In this study, we aim to address these problems by introducing parametric deep learning methods for tree formation from genotype data. Our models specifically create continuous representations of population trees in hyperbolic space, which has previously proven highly effective in embedding hierarchically structured data. We present two different architectures - a multi-layer perceptron (MLP) and a variational autoencoder (VAE) - and we analyze their performance using a variety of metrics along with comparisons to established tree-building methods. Both models tested produce embedding spaces that reflect human evolutionary history. In addition, we demonstrate the generalizability of these models by verifying that addition of new samples to an existing tree occurs in a semantically meaningful manner. Finally, we use Dasgupta’s Cost to compare the quality of trees generated by our models to those produced by established methods. Despite the fact that the benchmark methods are directly fit on the evaluation data, our models are able to outperform some of these and achieve highly comparable performance overall.

Author summary

Tree production is a vital task in population genetics, but current approaches fall prey to several common shortfalls. Most notably, they lack the ability to add new data points after tree generation, and they are often difficult to use in larger pipelines. By leveraging cutting-edge advances pairing deep learning with hyperbolic geometry, we develop multiple models designed to rectify these issues. Through experiments on a dataset of humans from globally widespread ancestries, we demonstrate the generalizability of our models to new data, and we also show strong empirical performance with respect to currently used methods. In addition, we show that the data representations produced by our models are semantically meaningful and reflect known facts about human evolutionary history. Finally, we discuss the additional benefits our models could provide, including improved visualization, greater privacy preservation, and improved integration with downstream machine learning tasks. In conclusion, we present models that are accurate, flexible, and generalizable, with the potential to facilitate a variety of further applications.
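
The abstract uses Dasgupta's Cost as its tree-quality measure without defining it. As a rough illustration only, the standard definition sums, over every pair of leaves, their pairwise similarity multiplied by the number of leaves under their lowest common ancestor; lower values indicate a tree that keeps similar items together. The sketch below is not the authors' code: the nested-tuple tree encoding, the function names `leaves` and `dasgupta_cost`, and the toy similarity matrix are all assumptions made for this example.

```python
from itertools import combinations


def leaves(tree):
    """Return the leaf labels under a nested-tuple tree (hypothetical encoding)."""
    if not isinstance(tree, tuple):
        return [tree]
    out = []
    for child in tree:
        out.extend(leaves(child))
    return out


def dasgupta_cost(tree, sim):
    """Sum over leaf pairs of sim[i][j] * (number of leaves under their LCA)."""
    if not isinstance(tree, tuple):
        return 0.0  # a single leaf contributes no pairs
    child_leaves = [leaves(c) for c in tree]
    n_here = sum(len(c) for c in child_leaves)
    cost = 0.0
    # Pairs whose leaves fall in different children have this node as their LCA.
    for a, b in combinations(child_leaves, 2):
        for i in a:
            for j in b:
                cost += sim[i][j] * n_here
    # Pairs inside the same child are accounted for deeper in the tree.
    return cost + sum(dasgupta_cost(c, sim) for c in tree)


# Toy similarity matrix (made up): samples 0 and 1 are close, as are 2 and 3.
sim = [
    [0.0, 1.0, 0.1, 0.1],
    [1.0, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 1.0],
    [0.1, 0.1, 1.0, 0.0],
]

good_tree = ((0, 1), (2, 3))  # groups similar samples first
bad_tree = ((0, 2), (1, 3))   # separates similar samples at the root

good_cost = dasgupta_cost(good_tree, sim)
bad_cost = dasgupta_cost(bad_tree, sim)
print(good_cost < bad_cost)  # the tree that groups similar samples scores lower
```

In this toy case the tree that merges the two similar pairs first receives the lower cost, which is the sense in which the measure can rank trees built by different methods on the same data.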

Publisher

Cold Spring Harbor Laboratory

Cited by 3 articles.
