Data Augmentation and Transfer Learning for Cross-lingual Named Entity Recognition in the Biomedical Domain
Author:
Lancheros Brayan Stiven1, Corpas-Pastor Gloria2, Mitkov Ruslan1
Affiliation:
1. University of Wolverhampton 2. Universidad de Malaga, IUITLM
Abstract
With the growing volume of biomedical data and the continued growth of the internet, the need for Information Extraction (IE) techniques has risen sharply. Named Entity Recognition (NER) is one such IE task, useful to professionals in many areas. Biomedical NER is needed in several settings, for instance the extraction and analysis of biomedical literature, relation extraction, the organisation of biomedical documents, and knowledge-base completion. However, the computational treatment of entities in the biomedical domain faces a number of challenges, including the high cost of annotation, ambiguity, and the lack of biomedical NER datasets in languages other than English. These difficulties have hampered data development, affecting both the domain itself and its multilingual coverage. The purpose of this study is to overcome the scarcity of biomedical NER data in Spanish, for which only two datasets exist, by developing a robust bilingual NER model. Inspired by back-translation, this paper leverages progress in Neural Machine Translation (NMT) to create a synthetic Spanish version of the CRAFT (Colorado Richly Annotated Full-Text) dataset. Additionally, an augmented CRAFT dataset is constructed by replacing 20% of the entities in the original dataset. We further evaluate two training methods, concatenation of datasets and continuous training, to assess the transfer learning capabilities of transformers on the newly obtained datasets. The best-performing NER system achieved an F1 score of 86.39% on the development set. The methodology proposed in this paper presents the first bilingual NER system and has the potential to improve applications across under-resourced languages.
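The entity-replacement augmentation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe how replacements are chosen, so the following is an assumption rather than the authors' procedure: it supposes token-level data with BIO tags and swaps each entity mention, with probability 0.2, for another mention of the same type drawn from the dataset. The function names (`extract_spans`, `build_lexicon`, `augment`) are hypothetical.

```python
import random
from collections import defaultdict


def extract_spans(tags):
    """Return (start, end, label) entity spans from a BIO tag list; end is exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        if tag.startswith("I-") and start is not None and tag[2:] == label:
            continue  # still inside the current entity
        if start is not None:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def build_lexicon(dataset):
    """Collect every mention (as a token list) per entity type."""
    lexicon = defaultdict(list)
    for tokens, tags in dataset:
        for s, e, label in extract_spans(tags):
            lexicon[label].append(tokens[s:e])
    return lexicon


def augment(tokens, tags, lexicon, rate=0.2, rng=random):
    """Replace each entity with a same-type mention with probability `rate`.

    Assumption: same-type sampling; the paper's selection rule is not stated.
    """
    new_tokens, new_tags, cursor = [], [], 0
    for s, e, label in extract_spans(tags):
        # Copy the untouched context before this entity.
        new_tokens.extend(tokens[cursor:s])
        new_tags.extend(tags[cursor:s])
        candidates = lexicon.get(label)
        if candidates and rng.random() < rate:
            mention = rng.choice(candidates)
        else:
            mention = tokens[s:e]
        # Re-tag the (possibly longer or shorter) mention in BIO format.
        new_tokens.extend(mention)
        new_tags.extend(["B-" + label] + ["I-" + label] * (len(mention) - 1))
        cursor = e
    new_tokens.extend(tokens[cursor:])
    new_tags.extend(tags[cursor:])
    return new_tokens, new_tags
```

Because the replacement mention may differ in length from the original, the tags are regenerated for each mention rather than copied, which keeps tokens and labels aligned.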
Publisher
Research Square Platform LLC