NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition

Published:2006-12 Issue:S5 Volume:7 Page:
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en
Short-container-title:BMC Bioinformatics

Author:

Tsai Richard Tzong-Han,Sung Cheng-Lung,Dai Hong-Jie,Hung Hsieh-Chuan,Sung Ting-Yi,Hsu Wen-Lian

Abstract

Background: Biomedical named entity recognition (Bio-NER) is a challenging problem because, in general, biomedical named entities of the same category (e.g., proteins and genes) do not follow one standard nomenclature. They have many irregularities and sometimes appear in ambiguous contexts. In recent years, machine-learning (ML) approaches have become increasingly common and now represent the cutting edge of Bio-NER technology. This paper addresses three problems faced by ML-based Bio-NER systems. First, most ML approaches usually employ singleton features that comprise one linguistic property (e.g., the current word is capitalized) and at least one class tag (e.g., B-protein, the beginning of a protein name). However, such features may be insufficient in cases where multiple properties must be considered. Adding conjunction features that contain multiple properties can be beneficial, but it would be infeasible to include all conjunction features in an NER model since memory resources are limited and some features are ineffective. To resolve the problem, we use a sequential forward search algorithm to select an effective set of features. Second, variations in the numerical parts of biomedical terms (e.g., "2" in the biomedical term IL2) cause data sparseness and generate many redundant features. In this case, we apply numerical normalization, which solves the problem by replacing all numerals in a term with one representative numeral to help classify named entities. Third, the assignment of NE tags does not depend solely on the target word's closest neighbors, but may depend on words outside the context window (e.g., a context window of five consists of the current word plus two preceding and two subsequent words). We use global patterns generated by the Smith-Waterman local alignment algorithm to identify such structures and modify the results of our ML-based tagger. This is called pattern-based post-processing.
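The numerical normalization step described in the Background can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which representative numeral is used or whether multi-digit numbers collapse to a single digit, so the choice of "1", the run-collapsing behaviour, and the name `normalize_numerals` are all assumptions made here for illustration.

```python
import re

# Assumption: "1" stands in for every numeral. The abstract only says
# "one representative numeral" without naming it.
REPRESENTATIVE_NUMERAL = "1"


def normalize_numerals(token: str) -> str:
    """Replace each run of digits in a token with the representative numeral.

    Collapsing a whole run (e.g. "12" -> "1") is an assumption of this
    sketch; a per-digit replacement would be an equally plausible reading.
    """
    return re.sub(r"\d+", REPRESENTATIVE_NUMERAL, token)


if __name__ == "__main__":
    for term in ["IL2", "IL12", "p53", "kinase"]:
        print(term, "->", normalize_numerals(term))
```

Under these assumptions, IL2 and IL12 map to the same surface form, so features built from the normalized token are shared between them instead of being learned separately for each numeric variant.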
Results: To develop our ML-based Bio-NER system, we employ conditional random fields, which have performed effectively in several well-known tasks, as our underlying ML model. Adding selected conjunction features, applying numerical normalization, and employing pattern-based post-processing improve the F-scores by 1.67%, 1.04%, and 0.57%, respectively. The combined increase of 3.28% yields a total score of 72.98%, which is better than the baseline system that only uses singleton features.

Conclusion: We demonstrate the benefits of using the sequential forward search algorithm to select effective conjunction feature groups. In addition, we show that numerical normalization can effectively reduce the number of redundant and unseen features. Furthermore, the Smith-Waterman local alignment algorithm can help ML-based Bio-NER deal with difficult cases that need longer context windows.
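For readers unfamiliar with the Smith-Waterman local alignment algorithm named in the abstract, the following Python sketch aligns two token sequences and returns the best-scoring local alignment. The scoring values (`match=2`, `mismatch=-1`, `gap=-1`), the linear gap penalty, and the function name are assumptions chosen for illustration. The abstract does not describe how the authors score alignments or how they turn alignments into global patterns, so this shows only the generic algorithm.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for two token sequences.

    Gaps are represented by None in the aligned output. The scoring
    parameters are illustrative assumptions, not values from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; local alignment never lets a cell drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(None)
            i -= 1
        else:
            out_a.append(None)
            out_b.append(b[j - 1])
            j -= 1
    return best, out_a[::-1], out_b[::-1]


if __name__ == "__main__":
    s1 = ["x", "A", "B", "C", "y"]
    s2 = ["z", "A", "B", "C", "w"]
    print(smith_waterman(s1, s2))
```

Positions where the two aligned sequences agree could serve as the fixed parts of a pattern, with gaps or mismatches treated as wildcards. That reading is a hypothetical one, not something the abstract specifies.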

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/1471-2105-7-S5-S11.pdf

References: 39 articles.

1. Hu ZZ, Mani I, Hermoso V, Liu H, Wu CH: iProLINK: an integrated protein resource for literature mining. Comput Biol Chem 2004, 28: 409–416. 10.1016/j.compbiolchem.2004.09.010

2. Cohen KB, Hunter L: Natural Language Processing and Systems Biology. In Artificial Intelligence and Systems Biology. Springer. Edited by: Dubitzky W, Azuaje F; 2005.

3. Chinchor N: Message Understanding Conference Proceedings. Message Understanding Conference 1998.

4. Shatkay H, Feldman R: Mining the biomedical literature in the genomic era: an overview. Journal of Computational Biology 2003, 10(6):821–855. 10.1089/106652703322756104

5. Pakhomov S: Semi-supervised maximum entropy based approach to acronym and abbreviation normalization in medical text. the 40th Annual Meeting of the Association for Computational Linguistics (ACL) 2002.

Cited by 79 articles.

1. Advancing entity recognition in biomedicine via instruction tuning of large language models;Bioinformatics;2024-03-21

2. OpenDeID Pipeline for Unstructured Electronic Health Record Text Notes Based on Rules and Transformers: Deidentification Algorithm Development and Validation Study;Journal of Medical Internet Research;2023-12-06

3. Chinese Clinical Named Entity Recognition From Electronic Medical Records Based on Multisemantic Features by Using Robustly Optimized Bidirectional Encoder Representation From Transformers Pretraining Approach Whole Word Masking and Convolutional Neural Networks: Model Development and Validation;JMIR Medical Informatics;2023-05-10

4. OpenDeID Pipeline for Unstructured Electronic Health Record Text Notes Based on Rules and Transformers: Deidentification Algorithm Development and Validation Study (Preprint);2023-04-13

5. An optimization based feature extraction and machine learning techniques for named entity identification;Optik;2023-02