Data Augmentation and Transfer Learning for Cross-lingual Named Entity Recognition in the Biomedical Domain

Author:

Lancheros Brayan Stiven1,Corpas-Pastor Gloria2,Mitkov Ruslan1

Affiliation:

1. University of Wolverhampton

2. Universidad de Malaga, IUITLM

Abstract

Abstract Given the increase in production of data for the biomedical field and the unstoppable growth of the internet, the need for Information Extraction (IE) techniques has skyrocketed. Named Entity Recognition (NER) is one of such IE tasks useful for professionals in different areas. There are several settings where biomedical NER is needed, for instance, extraction and analysis of biomedical literature, relation extraction, organisation of biomedical documents, and knowledge-base completion. However, the computational treatment of entities in the biomedical domain has faced a number of challenges including its high cost of annotation, ambiguity, and lack of biomedical NER datasets in languages other than English. These difficulties have hampered data development, affecting both the domain itself and its multilingual coverage. The purpose of this study is to overcome the scarcity of biomedical data for NER in Spanish, for which only two datasets exist, by developing a robust bilingual NER model. Inspired by back-translation, this paper leverages the progress in Neural Machine Translation (NMT) to create a synthetic version of the CRAFT (Colorado Richly Annotated Full-Text) dataset in Spanish. Additionally, a new CRAFT dataset is constructed by replacing 20% of the entities in the original dataset generating a new augmented dataset. Further, we evaluate two training methods: concatenation of datasets and continuous training to assess the transfer learning capabilities of transformers using the newly obtained datasets. The best performing NER system in the development set achieved an F-1 score of 86.39%. The novel methodology proposed in this paper presents the first bilingual NER system and it has the potential to improve applications across under-resourced languages.

Publisher

Research Square Platform LLC

Reference36 articles.

1. Bada, M., Eckert, M., Evans, D., Garcia, K., Shipley, K., Sitnikov, D., Baumgartner, W. A. Jr., Cohen, K. B., Verspoor, K., Blake, J. A., & Hunter, L. E. (2012). Concept Annotation in the CRAFT Corpus. BMC Bioinformatics [online]. 2012 Jul 9;13:161. doi: 10.1186/1471-2105-13-161. [PubMed:22776079]

2. Basaldella, M., Furrer, L., Tasso, C., & Rinaldi, F. (2017). Entity recognition in the biomedical domain using a hybrid approach. Journal of biomedical semantics, 8(1) 51 [online]. Available at: https://doi.org/10.1186/s13326-017-0157-6

3. Beltagy, I., Lo, K., & Cohan, A. (2019). “SCIBERT: A Pretrained Language Model for Scientific Text.” Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing [online]. pages 3615–3620, Hong Kong, China, November 3–7, 2019. Available at: https://aclanthology.org/D19-1371.pdf

4. Carrino, C. P., Armengol-Estapé, J., Gutiérrez-Fandiño, A., Llop-Palao, J., Pàmies, M., Gonzalez-Agirre, A., & Villegas, M. (2021). Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario [online]. Available at: arXiv:2109.03570.

5. Cho, H., & Lee, H. (2019). Biomedical named entity recognition using deep neural networks with contextual information. BMC Bioinformatics [online]. pp. 20, 735 (2019). Available at: https://doi.org/10.1186/s12859-019-3321-4

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3