Methods for evaluating unsupervised vector representations of genomic regions

Published: 2023-08-29

Author:

Zheng Guangtao, Rymuza Julia, Gharavi Erfaneh, LeRoy Nathan J., Zhang Aidong, Sheffield Nathan C.

Abstract

Background: Representation learning models have become a mainstay of modern genomics. These models are trained to yield vector representations, or embeddings, of various biological entities, such as cells, genes, individuals, or genomic regions. Recent applications of unsupervised embedding approaches have been shown to learn relationships among genomic regions that define functional elements in a genome. Unsupervised representation learning of genomic regions is free of the supervision from curated metadata and can condense rich biological knowledge from publicly available data to region embeddings. However, there exists no method for evaluating the quality of these embeddings in the absence of metadata, making it difficult to assess the reliability of analyses based on the embeddings, and to tune model training to yield optimal results.

Methods: To bridge this gap, we propose four evaluation metrics: the cluster tendency test (CTT), the reconstruction test (RCT), the genome distance scaling test (GDST), and the neighborhood preserving test (NPT). The CTT and RCT are statistical methods that evaluate how well region embeddings can be clustered and how much the embeddings can preserve the information contained in training data. The GDST and NPT exploit the biological tendency of regions close in genomic space to have similar biological functions; they measure how much such information is captured by individual region embeddings and a set of region embeddings.

Results: We demonstrate the utility of these statistical and biological tests for evaluating unsupervised genomic region embeddings and provide guidelines for learning reliable embeddings.

Availability: Code is available at https://github.com/databio/geniml.
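To make the idea behind the genome distance scaling test concrete, the following is a minimal sketch of how one might check whether embedding distance grows with genomic distance. It is not the geniml implementation, and the abstract does not specify the actual procedure; the function name `distance_scaling_slope`, the use of cosine distance, the random pair sampling, and the linear fit are all assumptions made for illustration.

```python
import numpy as np


def distance_scaling_slope(midpoints, embeddings, n_pairs=2000, seed=0):
    """Hypothetical helper: slope of embedding distance vs. genomic distance.

    midpoints  : 1-D array of region midpoints on a single chromosome (bp)
    embeddings : 2-D array, one embedding vector per region
    Returns the slope of a least-squares line fitted over random region pairs.
    """
    rng = np.random.default_rng(seed)
    midpoints = np.asarray(midpoints, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)

    # Sample random pairs of regions and drop pairs of a region with itself.
    n = len(midpoints)
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j
    i, j = i[keep], j[keep]

    # Genomic distance between midpoints, rescaled to [0, 1].
    genomic = np.abs(midpoints[i] - midpoints[j])
    genomic = genomic / genomic.max()

    # Cosine distance between the corresponding embeddings (assumed metric).
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cosine = 1.0 - np.sum(unit[i] * unit[j], axis=1)

    slope, _intercept = np.polyfit(genomic, cosine, deg=1)
    return slope


# Synthetic demonstration: embeddings that vary smoothly with position,
# compared with the same embeddings assigned to regions at random.
rng = np.random.default_rng(1)
mid = np.sort(rng.uniform(0, 1e6, size=300))
angle = mid / 1e6 * (np.pi / 2)
structured = np.column_stack([np.cos(angle), np.sin(angle)])
shuffled = structured[rng.permutation(len(structured))]

slope_structured = distance_scaling_slope(mid, structured)
slope_shuffled = distance_scaling_slope(mid, shuffled)
print(f"structured slope: {slope_structured:.3f}")
print(f"shuffled slope:   {slope_shuffled:.3f}")
```

Under these assumptions, a clearly positive slope suggests that the embeddings capture genomic proximity, while a slope near zero (as for the shuffled embeddings) suggests that they do not.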

Publisher

Cold Spring Harbor Laboratory

References: 25 articles.

1. An integrated encyclopedia of DNA elements in the human genome

2. ChIP–seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions

3. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position

4. Analytical approaches for ATAC-seq data analysis;Current Protocols in Human Genetics,2020

5. Comprehensive genomic characterization defines human glioblastoma genes and core pathways

Cited by 5 articles.

1. Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings;NAR Genomics and Bioinformatics;2024-07-02

2. Joint Representation Learning for Retrieval and Annotation of Genomic Interval Sets;Bioengineering;2024-03-08

3. Joint representation learning for retrieval and annotation of genomic interval sets;2023-08-22

4. Methods for constructing and evaluating consensus genomic interval sets;2023-08-05

5. Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings;2023-08-03