Affiliation:
1. Departamento de informática, LASIGE Faculdade de Ciências da Universidade de Lisboa, 1749 - 016 Lisboa, Portugal
2. Ada Health GmbH Karl-Liebknecht-Str. 1. 10178 Berlin
Abstract
Abstract
The ability to compare entities within a knowledge graph is a cornerstone technique for several applications, ranging from the integration of heterogeneous data to machine learning. It is of particular importance in the biomedical domain, where semantic similarity can be applied to the prediction of protein–protein interactions, associations between diseases and genes, cellular localization of proteins, among others. In recent years, several knowledge graph-based semantic similarity measures have been developed, but building a gold standard data set to support their evaluation is non-trivial.
We present a collection of 21 benchmark data sets that aim at circumventing the difficulties in building benchmarks for large biomedical knowledge graphs by exploiting proxies for biomedical entity similarity. These data sets include data from two successful biomedical ontologies, Gene Ontology and Human Phenotype Ontology, and explore proxy similarities calculated based on protein sequence similarity, protein family similarity, protein–protein interactions and phenotype-based gene similarity. Data sets have varying sizes and cover four different species at different levels of annotation completion. For each data set, we also provide semantic similarity computations with state-of-the-art representative measures.
Database URL: https://github.com/liseda-lab/kgsim-benchmark.
Funder
Fundação para a Ciência e a Tecnologia
Publisher
Oxford University Press (OUP)
Subject
General Agricultural and Biological Sciences,General Biochemistry, Genetics and Molecular Biology,Information Systems
Cited by
10 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献