Abstract
Creating and curating knowledge resources is a central activity in the biomedical domain. In recent years, automated methods for knowledge base construction have flourished and have enabled large-scale construction and curation of such resources. In the biological domain, techniques such as next-generation sequencing produce new data at an exponential rate, making purely manual curation of knowledge resources infeasible. The major technology for automating knowledge base construction is Information Extraction (IE), specifically tasks such as Named Entity Recognition and Relation Extraction. The major hurdle for IE methods is the availability of labelled training data, which can be prohibitively expensive and difficult to obtain because labelling requires domain experts. Active learning aims to minimize the cost of manual labelling by requiring it only for smaller, more useful portions of the data. With this motivation, we devised a method to quickly construct highly curated datasets that enable biomedical knowledge base construction. The method, named BioAct, is based on a partnership between automatic annotation methods (leveraging SciBERT with other machine learning models) and subject matter experts, and it uses active learning to create training datasets in the biological domain. The contribution of this work is twofold: in addition to the BioAct method itself, we publicly release an annotated dataset on antimicrobial resistance, produced by a team of subject matter experts using BioAct. We also simulate a knowledge base construction task using the MegaRes and CARD knowledge bases to provide insight and lessons learned about the usefulness of the annotated dataset for this task.
Publisher
Cold Spring Harbor Laboratory