Learning Statistical Models for Annotating Proteins with Function Information using Biomedical Text

Published: 2005-05 Issue: S1 Volume: 6
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en
Short-container-title:BMC Bioinformatics

Author:

Ray Soumya, Craven Mark

Abstract

Background: The BioCreative text mining evaluation investigated the application of text mining methods to the task of automatically extracting information from text in biomedical research articles. We participated in Task 2 of the evaluation. For this task, we built a system to automatically annotate a given protein with codes from the Gene Ontology (GO) using the text of an article from the biomedical literature as evidence.

Methods: Our system relies on simple statistical analyses of the full-text article provided. We learn n-gram models for each GO code using statistical methods and use these models to hypothesize annotations. We also learn a set of Naïve Bayes models that identify textual clues of possible connections between the given protein and a hypothesized annotation. These models are used to filter and rank the predictions of the n-gram models.

Results: We report experiments evaluating the utility of various components of our system on a set of data held out during development, and experiments evaluating the utility of external data sources that we used to learn our models. Finally, we report our evaluation results from the BioCreative organizers.

Conclusion: We observe that, on the test data, our system performs quite well relative to the other systems submitted to the evaluation. From other experiments on the held-out data, we observe that (i) the Naïve Bayes models were effective in filtering and ranking the initially hypothesized annotations, and (ii) our learned models were significantly more accurate when external data sources were used during learning.
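The abstract describes a two-stage pipeline but gives no implementation details, so the following is only a minimal sketch of the general idea, not the authors' method. Everything in it is an assumption made for illustration: the function and class names, the frequency-ratio criterion for selecting n-grams, the add-one smoothing, the thresholds, and the toy data. Stage one keeps n-grams that are relatively more frequent in text associated with a GO code than in background text and uses them to hypothesize codes for an article. Stage two uses a small Naive Bayes classifier over evidence tokens to filter and rank those hypotheses.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """Return all unigrams and bigrams of a token list as tuples."""
    return [tuple(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def learn_ngram_model(code_docs, background_docs, min_ratio=2.0):
    """Keep n-grams that are relatively more frequent in a code's documents
    than in background text (hypothetical selection criterion)."""
    code_counts = Counter(g for d in code_docs for g in ngrams(d))
    bg_counts = Counter(g for d in background_docs for g in ngrams(d))
    code_total = sum(code_counts.values())
    bg_total = sum(bg_counts.values())
    model = {}
    for gram, count in code_counts.items():
        # Add-one smoothing on the background frequency avoids division by zero.
        ratio = (count / code_total) / ((bg_counts[gram] + 1) / (bg_total + 1))
        if ratio >= min_ratio:
            model[gram] = math.log(ratio)
    return model

def hypothesize(article_tokens, models):
    """Score each code by the summed weights of its n-grams found in the article."""
    present = set(ngrams(article_tokens))
    scores = {}
    for code, model in models.items():
        score = sum(w for gram, w in model.items() if gram in present)
        if score > 0:
            scores[code] = score
    return scores

class NaiveBayes:
    """Two-class multinomial Naive Bayes with Laplace smoothing."""
    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def fit(self, examples):
        for tokens, label in examples:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_odds(self, tokens):
        """Log-odds that the tokens support a protein-annotation link."""
        total_docs = sum(self.doc_counts.values())
        score = 0.0
        for label, sign in ((True, 1), (False, -1)):
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                if tok in self.vocab:  # ignore words never seen in training
                    log_p += math.log((self.word_counts[label][tok] + 1) / denom)
            score += sign * log_p
        return score

def filter_and_rank(scores, evidence, nb, threshold=0.0):
    """Drop hypothesized codes with weak evidence; rank the rest by log-odds."""
    kept = [(code, nb.log_odds(evidence.get(code, []))) for code in scores]
    kept = [(code, lo) for code, lo in kept if lo > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    model = learn_ngram_model(
        [["protein", "kinase", "activity"], ["kinase", "phosphorylates", "substrate"]],
        [["protein", "binds", "dna"], ["the", "protein", "is", "expressed"]],
    )
    nb = NaiveBayes().fit([
        (["protein", "has", "kinase", "activity"], True),
        (["protein", "was", "purified"], False),
    ])
    scores = hypothesize(["the", "kinase", "phosphorylates", "tau"], {"GO:0016301": model})
    print(filter_and_rank(scores, {"GO:0016301": ["kinase", "activity"]}, nb))
```

In this sketch the evidence passed to `filter_and_rank` is a hypothetical mapping from each candidate code to tokens drawn from text near a mention of the protein; how such evidence is actually extracted is not stated in the abstract.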

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/1471-2105-6-S1-S18.pdf

References: 12 articles.

1. The Gene Ontology Consortium: Gene Ontology: tool for the unification of biology. Nature Genetics 2000, 25: 25–29. 10.1038/75556

2. Porter MF: An Algorithm for Suffix Stripping. Program 1980, 14(3):127–130.

3. National Library of Medicine: Unified Medical Language System. 1999. [http://www.nlm.nih.gov/research/umls/umlsmain.html]

4. Bairoch A, Apweiler R: The SWISS-PROT Protein Sequence Data Bank and its Supplement TrEMBL. Nucleic Acids Research 1997, 25: 31–36. 10.1093/nar/25.1.31

5. Wain HM, Bruford EA, Lovering RC, Lush MJ, Wright MW, Povey S: Guidelines for Human Gene Nomenclature. Genomics 2002, 79: 464–470. 10.1006/geno.2002.6748

Cited by 30 articles.

1. Exploring Multiple Instance Learning (MIL): A brief survey;Expert Systems with Applications;2024-09

2. Multiple-instance Learning from Triplet Comparison Bags;ACM Transactions on Knowledge Discovery from Data;2024-02-12

3. Multiple instance learning from similarity-confidence bags;Pattern Recognition;2024-02

4. Multi-instance Embedding Learning Through High-level Instance Selection;Advances in Knowledge Discovery and Data Mining;2022

5. Multi-Instance Ensemble Learning With Discriminative Bags;IEEE Transactions on Systems, Man, and Cybernetics: Systems;2021