Methods for evaluating unsupervised vector representations of genomic regions

Published:2024-07-02 Issue:3 Volume:6
ISSN:2631-9268
Container-title:NAR Genomics and Bioinformatics
language:en
Author:

Zheng Guangtao¹, Rymuza Julia², Gharavi Erfaneh²³, LeRoy Nathan J²⁴, Zhang Aidong¹³⁴, Sheffield Nathan C²³⁴⁵⁶⁷

Affiliation:

1. Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA

2. Department of Genome Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA

3. School of Data Science, University of Virginia, Charlottesville, VA 22904, USA

4. Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA

5. Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA

6. Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA

7. Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA

Abstract

Representation learning models have become a mainstay of modern genomics. These models are trained to yield vector representations, or embeddings, of various biological entities, such as cells, genes, individuals, or genomic regions. Recent applications of unsupervised embedding approaches have been shown to learn relationships among genomic regions that define functional elements in a genome. Unsupervised representation learning of genomic regions is free of the supervision from curated metadata and can condense rich biological knowledge from publicly available data to region embeddings. However, there exists no method for evaluating the quality of these embeddings in the absence of metadata, making it difficult to assess the reliability of analyses based on the embeddings, and to tune model training to yield optimal results. To bridge this gap, we propose four evaluation metrics: the cluster tendency score (CTS), the reconstruction score (RCS), the genome distance scaling score (GDSS), and the neighborhood preserving score (NPS). The CTS and RCS statistically quantify how well region embeddings can be clustered and how well the embeddings preserve information in training data. The GDSS and NPS exploit the biological tendency of regions close in genomic space to have similar biological functions; they measure how much such information is captured by individual region embeddings in a set. We demonstrate the utility of these statistical and biological scores for evaluating unsupervised genomic region embeddings and provide guidelines for learning reliable embeddings.
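
Note: the intuition that regions close in genomic space should have similar embeddings can be illustrated with a small sketch. The code below is not the paper's definition of the GDSS or NPS; it is a hypothetical illustration that samples same-chromosome region pairs and fits a least-squares slope of embedding distance against genomic distance. The function names (`genomic_distance`, `distance_slope`), the `(chrom, start, end)` tuple layout, and the choice of Euclidean distance are all assumptions made for this example.

```python
import math
import random


def genomic_distance(a, b):
    """Gap in base pairs between two (chrom, start, end) regions; 0 if they overlap."""
    if a[0] != b[0]:
        raise ValueError("regions must be on the same chromosome")
    return max(0, max(a[1], b[1]) - min(a[2], b[2]))


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def distance_slope(regions, embeddings, n_pairs=1000, seed=0):
    """Least-squares slope of embedding distance vs. genomic distance.

    Hypothetical helper: a positive slope suggests that embedding distance
    grows with genomic distance for the sampled pairs.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    attempts = 0
    # Sample random pairs, keeping only those on the same chromosome.
    while len(xs) < n_pairs and attempts < 100 * n_pairs:
        attempts += 1
        i, j = rng.sample(range(len(regions)), 2)
        if regions[i][0] != regions[j][0]:
            continue
        xs.append(genomic_distance(regions[i], regions[j]))
        ys.append(euclidean(embeddings[i], embeddings[j]))
    if len(xs) < 2:
        raise ValueError("not enough same-chromosome pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("all sampled genomic distances are identical")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / var_x


# Toy data (made up): 50 regions of width 100 bp spaced 1 kb apart, with a
# 1-D embedding that tracks position along the chromosome.
toy_regions = [("chr1", k * 1000, k * 1000 + 100) for k in range(50)]
toy_embeddings = [[float(k)] for k in range(50)]
toy_slope = distance_slope(toy_regions, toy_embeddings)
```

In this toy setup the embedding is constructed to follow genomic position, so the fitted slope is positive; embeddings unrelated to position would be expected to give a slope near zero.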

Funder

National Institute of General Medical Sciences

National Human Genome Research Institute

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/nargab/article-pdf/6/3/lqae086/58792719/lqae086.pdf

References: 26 articles.

1. An integrated encyclopedia of DNA elements in the human genome;ENCODE Project Consortium;Nature,2012

2. ChIP–seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions;Furey;Nat. Rev. Genet.,2012

3. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position;Buenrostro;Nat. Methods,2013

4. Analytical approaches for ATAC-seq data analysis;Smith;Curr. Protoc. Hum. Genet.,2020

5. Comprehensive genomic characterization defines human glioblastoma genes and core pathways;Cancer Genome Atlas (TCGA) Research Network;Nature,2008

Cited by 1 article.

1. Methods for constructing and evaluating consensus genomic interval sets;Nucleic Acids Research;2024-08-24