Analyzing transfer learning impact in biomedical cross-lingual named entity recognition and normalization

Author:

Rivera-Zavala Renzo M.ORCID,Martínez Paloma

Abstract

Abstract Background The volume of biomedical literature and clinical data is growing at an exponential rate. Therefore, efficient access to data described in unstructured biomedical texts is a crucial task for the biomedical industry and research. Named Entity Recognition (NER) is the first step for information and knowledge acquisition when we deal with unstructured texts. Recent NER approaches use contextualized word representations as input for a downstream classification task. However, distributed word vectors (embeddings) are very limited in Spanish and even more for the biomedical domain. Methods In this work, we develop several biomedical Spanish word representations, and we introduce two Deep Learning approaches for pharmaceutical, chemical, and other biomedical entities recognition in Spanish clinical case texts and biomedical texts, one based on a Bi-STM-CRF model and the other on a BERT-based architecture. Results Several Spanish biomedical embeddigns together with the two deep learning models were evaluated on the PharmaCoNER and CORD-19 datasets. The PharmaCoNER dataset is composed of a set of Spanish clinical cases annotated with drugs, chemical compounds and pharmacological substances; our extended Bi-LSTM-CRF model obtains an F-score of 85.24% on entity identification and classification and the BERT model obtains an F-score of 88.80% . For the entity normalization task, the extended Bi-LSTM-CRF model achieves an F-score of 72.85% and the BERT model achieves 79.97%. The CORD-19 dataset consists of scholarly articles written in English annotated with biomedical concepts such as disorder, species, chemical or drugs, gene and protein, enzyme and anatomy. Bi-LSTM-CRF model and BERT model obtain an F-measure of 78.23% and 78.86% on entity identification and classification, respectively on the CORD-19 dataset. Conclusion These results prove that deep learning models with in-domain knowledge learned from large-scale datasets highly improve named entity recognition performance. Moreover, contextualized representations help to understand complexities and ambiguity inherent to biomedical texts. Embeddings based on word, concepts, senses, etc. other than those for English are required to improve NER tasks in other languages.

Funder

Ministerio de Economía y Competitividad

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Structural Biology

Reference63 articles.

1. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(suppl 1):267–70. https://doi.org/10.1093/nar/gkh061.

2. Aronson A, Lang F-M. An overview of metamap: historical perspective and recent advances. J Am Med Inform Assoc JAMIA. 2010;17:229–36. https://doi.org/10.1136/jamia.2009.002733.

3. Lafferty JD, McCallum A, Pereira FCN. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the eighteenth international conference on machine learning. ICML ’01. San Francisco: Morgan Kaufmann Publishers Inc.; 2001. p. 282–9. http://dl.acm.org/citation.cfm?id=645530.655813.

4. Segura-Bedmar I, Martinez P, Sanchez-Cisneros D. The 1st ddiextraction-2011 challenge task: extraction of drug-drug interactions from biomedical texts, vol. 2011; 2011. p. 1–9.

5. Segura-Bedmar I, Martínez P, Herrero-Zazo M. Lessons learnt from the ddiextraction-2013 shared task. J Biomed Inform. 2014;51:152–64. https://doi.org/10.1016/j.jbi.2014.05.007.

Cited by 2 articles. 订阅此论文施引文献 订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献

1. A comparative analysis of Spanish Clinical encoder-based models on NER and classification tasks;Journal of the American Medical Informatics Association;2024-03-15

2. Biomedical Named Entity Recognition Using Transformers with biLSTM + CRF and Graph Convolutional Neural Networks;2022 International Conference on INnovations in Intelligent SysTems and Applications (INISTA);2022-08-08

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3