A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC

Published:2015-05-05 Issue:5 Volume:22 Page:948-956
ISSN:1527-974X
Container-title:Journal of the American Medical Informatics Association
language:en

Author:

Kors Jan A¹, Clematide Simon², Akhondi Saber A¹, van Mulligen Erik M¹, Rebholz-Schuhmann Dietrich²

Affiliation:

1. Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands

2. Institute of Computational Linguistics, University of Zurich, Zurich, Switzerland

Abstract

Objective: To create a multilingual gold-standard corpus for biomedical concept recognition.

Materials and methods: We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations.

Results: The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79) and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed as well as the best annotator for that language.

Discussion: The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation effort manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques.

Conclusion: To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups covered and the diversity of text genres annotated.
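The agreement figure reported in the abstract is a median of F-scores. As a rough illustration only, the sketch below computes pairwise F-scores between three annotators and takes their median. It assumes exact matching on (start offset, end offset, concept identifier), which the abstract does not specify; the function names, offsets, and concept identifiers are hypothetical placeholders, not data from the corpus.

```python
from itertools import combinations
from statistics import median


def f_score(reference, candidate):
    """F-score between two annotation sets, counting only exact matches.

    Each annotation is a (start, end, concept_id) tuple. Exact matching is
    an assumption of this sketch, not a criterion taken from the paper.
    """
    ref, cand = set(reference), set(candidate)
    if not ref and not cand:
        return 1.0  # two empty sets agree trivially
    true_positives = len(ref & cand)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cand)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def pairwise_agreement(annotators):
    """Map each unordered annotator pair to the F-score between their sets."""
    return {
        (a, b): f_score(annotators[a], annotators[b])
        for a, b in combinations(sorted(annotators), 2)
    }


# Hypothetical annotations for one text unit; identifiers are placeholders.
annotators = {
    "A1": {(0, 8, "C0000001"), (15, 22, "C0000002"), (30, 41, "C0000003")},
    "A2": {(0, 8, "C0000001"), (15, 22, "C0000002")},
    "A3": {(0, 8, "C0000001"), (30, 41, "C0000003"), (50, 55, "C0000004")},
}

scores = pairwise_agreement(annotators)
median_f = median(scores.values())

for pair, score in scores.items():
    print(pair, round(score, 3))
print("median F-score:", round(median_f, 3))
```

Because the F-score is symmetric in precision and recall, it does not matter which annotator of a pair is treated as the reference. The same function could in principle be used to score an annotator, or an automatic system, against the final gold standard.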

Publisher

Oxford University Press (OUP)

Subject

Health Informatics

Link

http://academic.oup.com/jamia/article-pdf/22/5/948/34146393/ocv037.pdf

Cited by 41 articles.

1. Entity normalization in a Spanish medical corpus using a UMLS-based lexicon: findings and limitations;Language Resources and Evaluation;2024-07-02

2. Annotation-preserving machine translation of English corpora to validate Dutch clinical concept extraction tools;Journal of the American Medical Informatics Association;2024-06-27

3. Augmenting a Spanish clinical dataset for transformer-based linking of negations and their out-of-scope references;Natural Language Processing;2024-05-17

4. Impact of Translation on Biomedical Information Extraction: Experiment on Real-Life Clinical Notes;JMIR Medical Informatics;2024-04-04

5. Annotation-preserving machine translation of English corpora to validate Dutch clinical concept extraction tools;2024-03-15