Biomedical and clinical English model packages for the Stanza Python NLP library

Published:2021-06-22 Issue:9 Volume:28 Page:1892-1899
ISSN:1527-974X
Container-title:Journal of the American Medical Informatics Association
Language:en

Author:

Zhang Yuhao¹, Zhang Yuhui², Qi Peng², Manning Christopher D³, Langlotz Curtis P⁴

Affiliation:

1. Biomedical Informatics Training Program, Stanford University, Stanford, California, USA

2. Computer Science Department, Stanford University, Stanford, California, USA

3. Computer Science and Linguistics Departments, Stanford University, Stanford, California, USA

4. Department of Radiology, Stanford University, Stanford, California, USA

Abstract

Objective: The study sought to develop and evaluate neural natural language processing (NLP) packages for the syntactic analysis and named entity recognition (NER) of biomedical and clinical English text.

Materials and Methods: We implement and train biomedical and clinical English NLP pipelines by extending the widely used Stanza library, which was originally designed for general NLP tasks. Our models are trained with a mix of public datasets, such as the CRAFT treebank, and a private corpus of radiology reports annotated with 5 radiology-domain entities. The resulting pipelines are fully based on neural networks and perform tokenization, part-of-speech tagging, lemmatization, dependency parsing, and NER for both biomedical and clinical text. We compare our systems against popular open-source NLP libraries such as CoreNLP and scispaCy, state-of-the-art models such as the BioBERT models, and winning systems from the BioNLP CRAFT shared task.

Results: For syntactic analysis, our systems achieve much better performance than the released scispaCy models and CoreNLP models retrained on the same treebanks, and are on par with the winning system from the CRAFT shared task. For NER, our systems substantially outperform scispaCy, and are better than or on par with the state-of-the-art performance from BioBERT, while being much more computationally efficient.

Conclusions: We introduce biomedical and clinical NLP packages built for the Stanza library. These packages offer performance similar to the state of the art and are also optimized for ease of use. To facilitate research, we make all our models publicly available. We also provide an online demonstration (http://stanza.run/bio).

Publisher

Oxford University Press (OUP)

Subject

Health Informatics

Link

http://academic.oup.com/jamia/article-pdf/28/9/1892/39731803/ocab090.pdf

References: 46 articles.

1. Biomedical language processing: what’s beyond PubMed?;Hunter;Mol Cell,2006

2. Use of electronic health records in U.S. hospitals;Jha;N Engl J Med,2009

3. Literome: PubMed-scale genomic knowledge base in the cloud;Poon;Bioinformatics,2014

4. BioBERT: a pre-trained biomedical language representation model for biomedical text mining;Lee;Bioinformatics,2020

5. AskHERMES: An online question answering system for complex clinical questions;Cao;J Biomed Inform,2011

Cited by 54 articles.

1. Fine-tuning coreference resolution for different styles of clinical narratives;Journal of Biomedical Informatics;2024-01

2. Deep learning for report generation on chest X-ray images;Computerized Medical Imaging and Graphics;2024-01

3. A Reduced Proteomic Signature in Critically Ill Covid-19 Patients Determined With Plasma Antibody Micro-array and Machine Learning;2023-11-14

4. Machine Learning Approaches for Identification of Potential Biomarkers from Cancer Omics Data;2023-10-28

5. Comparing research trends with patenting activities in the biomedical sector: The case of dementia;Technological Forecasting and Social Change;2023-10