Automatic document classification of biological literature

Published:2006-08-07 Issue:1 Volume:7 Page:
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en
Short-container-title:BMC Bioinformatics

Author:

Chen David, Müller Hans-Michael, Sternberg Paul W

Abstract

Background: Document classification is a wide-spread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature.

Results: We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first uses a support vector machine-trained classifier, followed by a novel, phrase-based clustering algorithm. This clustering step autonomously creates cluster labels that are descriptive and understandable by humans. This clustering engine performed better on a standard test-set (Reuters 21578) compared to previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.

Conclusion: We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
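The abstract reports clustering quality as an F-value (0.55 vs. 0.49) without stating which variant was used. As a rough illustration only, the sketch below computes the common balanced F-measure, the harmonic mean of precision and recall, for a single cluster. The function names and document IDs are hypothetical and are not taken from the paper.

```python
def precision_recall(predicted: set, relevant: set) -> tuple[float, float]:
    """Precision and recall of a predicted document set against a reference set."""
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical example: one cluster of four documents, three truly relevant ones.
cluster = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc1", "doc2", "doc5"}
p, r = precision_recall(cluster, relevant)
score = f_measure(p, r)
```

Here two of the four clustered documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3). The score is always at or below the larger of the two values, so it penalizes a cluster that is strong on only one of them.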

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/1471-2105-7-370.pdf

Reference: 25 articles.

1. Andrade MA, Bork P: Automated extraction of information in molecular biology. FEBS Lett 2000, 476: 12–17.

2. De Bruijn B, Martin J: Getting to the (c)ore of knowledge: Mining biomedical literature. Int J Med Inf 2002, 67: 7–18.

3. Staab S (editor): Mining information for functional genomics. IEEE Intell Syst 2002, 17: 66.

4. Jensen LJ, Saric J, Bork P: Literature mining for the biologist: from information retrieval to biological discovery. Nature Reviews Genetics 2006, 7: 119–129.

5. Muller HM, Kenny EE, Sternberg PW: Textpresso: An ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2004, 2: e309.

Cited by 33 articles.

1. Automated Identification of Immunocompromised Status in Critically Ill Children;Methods of Information in Medicine;2022-04-05

2. Review of feature extraction approaches on biomedical text classification;International Journal of Advanced and Applied Sciences;2020-04

3. Informationsextraktion und kartografische Visualisierung von Haushaltsplänen mit AutoML-Methoden;Künstliche Intelligenz in Wirtschaft & Gesellschaft;2020

4. Bibliographic automatic classification algorithm based on semantic space transformation;Multimedia Tools and Applications;2019-03-08

5. An effective biomedical document classification scheme in support of biocuration: addressing class imbalance;Database;2019-01-01