EmbedGEM: A framework to evaluate the utility of embeddings for genetic discovery

Published: 2023-11-25

Author:

Mukherjee Sumit, McCaw Zachary R, Pei Jingwen, Merkoulovitch Anna, Tandon Raghav, Soare Tom, Amar David, Somineni Hari, Klein Christoph, Satapati Santhosh, Lloyd David, Probert Christopher, Koller Daphne, O’Dushlaine Colm, Karaletsos Theofanis

Abstract

Machine learning-derived embeddings are compressed representations of high-content data modalities obtained through deep learning models [1]. Embeddings have been hypothesized to capture detailed information about disease states and have been qualitatively shown to be useful in genetic discovery. Despite their promise, embeddings have some drawbacks: i) they are often confounded by covariates, and ii) their disease relevance is hard to ascertain. In this work we describe EmbedGEM (Embedding Genetic Evaluation Methods), a framework to systematically evaluate the utility of embeddings in genetic discovery. Although motivated by applications to embeddings, EmbedGEM is equally applicable to other multivariate traits.

EmbedGEM compares embeddings along two axes: i) heritability of the embeddings, and ii) ability to identify ‘disease relevant’ variants. We use the number of genome-wide significant signals and the mean/median chi-square statistic as proxies for the heritability of multivariate traits. To evaluate disease relevance, we compute polygenic risk scores for each orthogonalized component of the embedding (or multivariate comparators) and evaluate their association with high-confidence disease traits in a held-out set of patients. While we introduce some relatively straightforward ways to evaluate heritability and disease relevance, we foresee that our framework can be easily extended by adding more metrics.

We demonstrate the utility of EmbedGEM by using it to evaluate embedding and non-embedding traits in two separate datasets: i) a synthetic dataset simulated to demonstrate the ability of the framework to correctly rank traits based on their heritability and disease relevance, and ii) data from the UK Biobank focused on NAFLD-relevant traits. EmbedGEM is implemented as an easy-to-use Python-based workflow (https://github.com/insitro/EmbedGEM).
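The two kinds of quantities named in the abstract can be illustrated with a minimal sketch. This is not code from the EmbedGEM repository: the function names, the input layout (per-variant z-scores, per-variant effect sizes, allele dosages), and the 5e-8 significance threshold are assumptions made for illustration. In particular, the sketch counts significant variants directly, whereas a count of independent signals would normally require clumping by linkage disequilibrium first.

```python
import math
import statistics

# Assumption: the conventional genome-wide significance threshold.
GENOME_WIDE_P = 5e-8


def two_sided_p(z):
    """Two-sided p-value of a standard-normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def heritability_proxies(z_scores):
    """Summarise association z-scores for one trait (hypothetical helper).

    Returns the number of variants passing the threshold (no LD clumping)
    and the mean and median 1-df chi-square statistic (z squared).
    """
    chi2 = [z * z for z in z_scores]
    n_hits = sum(1 for z in z_scores if two_sided_p(z) < GENOME_WIDE_P)
    return {
        "n_significant": n_hits,
        "mean_chi2": statistics.fmean(chi2),
        "median_chi2": statistics.median(chi2),
    }


def polygenic_score(dosages, betas):
    """Polygenic score for one individual: sum of dosage times effect size."""
    if len(dosages) != len(betas):
        raise ValueError("dosages and betas must have the same length")
    return sum(d * b for d, b in zip(dosages, betas))


# Toy inputs, invented for illustration only.
summary = heritability_proxies([0.0, 1.0, -2.0, 6.0])
score = polygenic_score([0, 1, 2], [0.1, -0.2, 0.3])
```

In a setting like the one the abstract describes, one such summary and one such score would be computed per orthogonalized component, and the scores would then be tested for association with disease traits in held-out individuals.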

Publisher

Cold Spring Harbor Laboratory

References (31 articles)

1. Representation Learning: A Review and New Perspectives

2. A Fast Learning Algorithm for Deep Belief Nets

3. Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103, 2008.

4. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.

5. Bootstrap your own latent: a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 2020.