Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

Published:2022-01-31 Issue:1 Volume:3 Page:1-23
ISSN:2691-1957
Container-title:ACM Transactions on Computing for Healthcare
language:en
Short-container-title:ACM Trans. Comput. Healthcare

Author:

Gu Yu¹,Tinn Robert¹,Cheng Hao¹,Lucas Michael¹,Usuyama Naoto¹,Liu Xiaodong¹,Naumann Tristan¹,Gao Jianfeng¹,Poon Hoifung¹

Affiliation:

1. Microsoft Research, Redmond, WA

Abstract

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this article, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition. To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB .

Funder

National Science Foundation

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3458754

Reference: 62 articles.

1. Publicly Available Clinical

2. BioCreative III interactive task: An overview;Arighi Cecilia N.;BMC Bioinformatics,2011

3. Cancer hallmarks analytics tool (CHAT): A text mining approach to organize and evaluate scientific literature on cancer;Baker Simon;Bioinformatics,2017

Cited by 630 articles.

1. RIscoper 2.0: A deep learning tool to extract RNA biomedical relation sentences from literature;Computational and Structural Biotechnology Journal;2024-12

2. Foundation models in gastrointestinal endoscopic AI: Impact of architecture, pre-training approach and data efficiency;Medical Image Analysis;2024-12

3. Hugging Face's impact on medical applications of artificial intelligence;Computational and Structural Biotechnology Reports;2024-12

4. Framework for automation of short answer grading based on domain-specific pre-training;Engineering Applications of Artificial Intelligence;2024-11

5. A pre-trained language model for emergency department intervention prediction using routine physiological data and clinical narratives;International Journal of Medical Informatics;2024-11