Biomedical word sense disambiguation with ontologies and metadata: automation meets accuracy

Published:2009-01-21 Issue:1 Volume:10 Page:
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en
Short-container-title:BMC Bioinformatics

Author:

Alexopoulou Dimitra,Andreopoulos Bill,Dietze Heiko,Doms Andreas,Gandon Fabien,Hakenberg Jörg,Khelif Khaled,Schroeder Michael,Wächter Thomas

Abstract

Background: Ontology term labels can be ambiguous and have multiple senses. While this is no problem for human annotators, it is a challenge for automated methods that identify ontology terms in text. Classical approaches to word sense disambiguation use co-occurring words or terms. However, most treat ontologies as simple terminologies, without making use of the ontology structure or the semantic similarity between terms. Another useful source of information for disambiguation is metadata. Here, we systematically compare three approaches to word sense disambiguation, which use ontologies and metadata, respectively.

Results: The 'Closest Sense' method assumes that the ontology defines multiple senses of the term. It computes the shortest path of co-occurring terms in the document to one of these senses. The 'Term Cooc' method defines a log-odds ratio for co-occurring terms, including co-occurrences inferred from the ontology structure. The 'MetaData' approach trains a classifier on metadata. It does not require any ontology, but it does require training data, which the other methods do not. To evaluate these approaches we defined a manually curated training corpus of 2600 documents for seven ambiguous terms from the Gene Ontology and MeSH. Across all conditions, the approaches achieve an 80% success rate on average. The 'MetaData' approach performed best, with 96%, when trained on high-quality data. Its performance deteriorates as the quality of the training data decreases. The 'Term Cooc' approach performs better on the Gene Ontology (92% success) than on MeSH (73% success), as MeSH is not a strict is-a/part-of hierarchy but rather a loose is-related-to hierarchy. The 'Closest Sense' approach achieves an 80% success rate on average.

Conclusion: Metadata is valuable for disambiguation, but requires high-quality training data. Closest Sense requires no training, but it does require a large, consistently modelled ontology, and these are two opposing conditions. Term Cooc achieves greater than 90% success given a consistently modelled ontology. Overall, the results show that well-structured ontologies can play a very important role in improving disambiguation.

Availability: The three benchmark datasets created for the purpose of disambiguation are available in Additional file 1.
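
The 'Closest Sense' idea summarised in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy ontology, the function names `build_graph`, `shortest_path_length`, and `closest_sense`, and the choice to average path lengths over the context terms are all assumptions made for illustration. The abstract only states that the method uses shortest paths between co-occurring terms and the candidate senses.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (term, term) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def closest_sense(graph, senses, context_terms):
    """Return the sense with the smallest mean distance to the context terms.

    Averaging is an assumption of this sketch; context terms that cannot be
    reached from a sense are ignored for that sense.
    """
    best_sense, best_score = None, float("inf")
    for sense in senses:
        distances = []
        for term in context_terms:
            d = shortest_path_length(graph, sense, term)
            if d is not None:
                distances.append(d)
        if not distances:
            continue
        score = sum(distances) / len(distances)
        if score < best_score:
            best_sense, best_score = sense, score
    return best_sense


# Hypothetical toy ontology: two senses of the ambiguous label "nucleus".
toy_edges = [
    ("cell nucleus", "cell part"),
    ("cell part", "chromosome"),
    ("cell part", "cytoplasm"),
    ("brain nucleus", "brain region"),
    ("brain region", "thalamus"),
    ("cell part", "anatomical entity"),
    ("brain region", "anatomical entity"),
]
graph = build_graph(toy_edges)
senses = ["cell nucleus", "brain nucleus"]

print(closest_sense(graph, senses, ["chromosome", "cytoplasm"]))  # cell nucleus
print(closest_sense(graph, senses, ["thalamus"]))                 # brain nucleus
```

In this toy graph, every path between the two senses runs through a shared root, so context terms from one branch are always closer to the sense in that branch. That dependence on a consistent hierarchy is in line with the abstract's remark that Closest Sense needs a large, consistently modelled ontology.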

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/1471-2105-10-28.pdf

References: 51 articles.

1. Schuemie MJ, Kors JA, Mons B: Word sense disambiguation in the biomedical domain: an overview. J Comput Biol 2005, 12(5):554–565.

2. Gale WA, Church KW, Yarowsky D: One sense per discourse. In HLT '91: Proceedings of the workshop on Speech and Natural Language. Morristown, NJ, USA: Association for Computational Linguistics; 1992:233–237.

3. Yarowsky D: One sense per collocation. In HLT '93: Proceedings of the workshop on Human Language Technology. Morristown, NJ, USA: Association for Computational Linguistics; 1993:266–271.

4. Weeber M, Mork JG, Aronson AR: Developing a Test Collection for Biomedical Word Sense Disambiguation. Proc AMIA Symp 2001, 746–750.

5. Automatic extraction of acronym-meaning pairs from MEDLINE databases. Stud Health Technol Inform 2001, 84(Pt 1):371–375.

Cited by 22 articles.

1. The changing landscape of text mining: a review of approaches for ecology and evolution;Proceedings of the Royal Society B: Biological Sciences;2024-07-31

2. A Dataset for Evaluating Contextualized Representation of Biomedical Concepts in Language Models;Scientific Data;2024-05-04

3. Past and future uses of text mining in ecology and evolution;Proceedings of the Royal Society B: Biological Sciences;2022-05-18

4. MeSH-Based Semantic Indexing Approach to Enhance Biomedical Information Retrieval;The Computer Journal;2020-07-09

5. SmartData 4.0: a formal description framework for big data;The Journal of Supercomputing;2018-12-04