Annotation and initial evaluation of a large annotated German oncological corpus

Author:

Kittner Madeleine (1), Lamping Mario (2, 3), Rieke Damian T (2, 3, 4), Götze Julian (5), Bajwa Bariya (5), Jelas Ivan (3), Rüter Gina (3), Hautow Hanjo (1), Sänger Mario (1), Habibi Maryam (1), Zettwitz Marit (3), Bortoli Till de (3), Ostermann Leonie (5), Ševa Jurica (1), Starlinger Johannes (1), Kohlbacher Oliver (6, 7, 8, 9), Malek Nisar P (5), Keilholz Ulrich (3), Leser Ulf (1)

Affiliation:

1. Knowledge Management for Bioinformatics, Humboldt Universität zu Berlin, Berlin, Germany

2. Department of Hematology, Oncology and Cancer Immunology, Campus Benjamin Franklin, Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany

3. Charité Comprehensive Cancer Center, Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany

4. Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany

5. Innere Medizin I, Universitätsklinikum Tübingen, Tübingen, Germany

6. Institut für Translationale Bioinformatik, Universitätsklinikum Tübingen, Tübingen, Germany

7. Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany

8. Department of Computer Science, University of Tübingen, Tübingen, Germany

9. Biomolecular Interactions, Max Planck Institute for Developmental Biology, Tübingen, Germany

Abstract

Objective: We present the Berlin-Tübingen-Oncology corpus (BRONCO), a large and freely available corpus of shuffled sentences from German oncological discharge summaries, annotated with diagnoses, treatments, medications, and further attributes including negation and speculation. The aim of BRONCO is to foster reproducible and openly available research on information extraction (IE) from German medical texts.

Materials and Methods: BRONCO consists of 200 manually deidentified discharge summaries of cancer patients. Annotation followed a structured and quality-controlled process involving 2 groups of medical experts to ensure consistency, comprehensiveness, and high quality of annotations. We present results of several state-of-the-art techniques for different IE tasks as baselines for subsequent research.

Results: The annotated corpus consists of 11 434 sentences and 89 942 tokens, annotated with 11 124 annotations for medical entities and 3118 annotations of related attributes. We publish 75% of the corpus as a set of shuffled sentences and keep 25% as a held-out data set for unbiased evaluation of future IE tools. On this held-out data set, our baselines reach, depending on the entity type, F1-scores of 0.72–0.90 for named entity recognition, 0.10–0.68 for entity normalization, 0.55 for negation detection, and 0.33 for speculation detection.

Discussion: Medical corpus annotation is a complex and time-consuming task, which makes sharing such resources all the more important.

Conclusion: To our knowledge, BRONCO is the first sizable and freely available German medical corpus. Our baseline results show that more research effort is necessary to lift the quality of information extraction from German medical texts to the level already possible for English.

Funder

German Bundesministerium für Bildung und Forschung

Deutsche Forschungsgemeinschaft

Charité – Universitätsmedizin Berlin and the Berlin Institute of Health

Publisher

Oxford University Press (OUP)

Subject

Health Informatics

Cited by 16 articles.

1. GPT for medical entity recognition in Spanish;Multimedia Tools and Applications;2024-04-23

2. Named Entity Recognition in Italian Lung Cancer Clinical Reports using Transformers;2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM);2023-12-05

3. BELB: a biomedical entity linking benchmark;Bioinformatics;2023-11-01

4. GERNERMED++: Semantic annotation in German medical NLP through transfer-learning, translation and word alignment;Journal of Biomedical Informatics;2023-11

5. Transformers for extracting breast cancer information from Spanish clinical narratives;Artificial Intelligence in Medicine;2023-09
