Automating document classification for the Immune Epitope Database

Published: 2007-07-26 Issue: 1 Volume: 8
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en
Short-container-title:BMC Bioinformatics

Author:

Wang Peng, Morgan Alexander A, Zhang Qing, Sette Alessandro, Peters Bjoern

Abstract

Background: The Immune Epitope Database contains information on immune epitopes curated manually from the scientific literature. As in similar projects in other knowledge domains, significant effort is spent on identifying which articles are relevant for this purpose.

Results: Here we report our experience in automating this process using Naïve Bayes classifiers trained on 20,910 abstracts classified by domain experts. The performance of the basic classifier was improved by a) utilizing information stored in PubMed beyond the abstract itself, b) applying standard feature selection criteria, and c) extracting domain-specific feature patterns that, for example, identify peptide sequences. We have integrated the classifier into the curation process, where it determines whether an abstract is clearly relevant, clearly irrelevant, or cannot be classified with certainty; in the last case, the abstract is classified manually. Testing this classification scheme on an independent dataset, we achieve 95% sensitivity and specificity in the 51.1% of abstracts that were automatically classified.

Conclusion: By implementing text classification, we have sped up the reference selection process without sacrificing the sensitivity or specificity of the human expert classification. This study provides both practical recommendations for users of text classification tools and a large dataset that can serve as a benchmark for tool developers.
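The triage scheme described in the abstract (automatic classification when the classifier is confident, manual review otherwise) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy training abstracts, the `PEPTIDE_RE` pattern, the `__PEPTIDE__` marker token, and the 0.05/0.95 thresholds in `triage` are all assumptions chosen for illustration. The sketch uses a multinomial Naive Bayes model with Laplace smoothing, written with the Python standard library only.

```python
import math
import re
from collections import Counter

# Assumption: a peptide-like string is a run of 8+ one-letter amino acid codes.
PEPTIDE_RE = re.compile(r"\b[ACDEFGHIKLMNPQRSTVWY]{8,}\b")

def tokenize(text):
    """Lowercase word tokens, plus a marker token if a peptide-like string occurs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if PEPTIDE_RE.search(text):
        tokens.append("__PEPTIDE__")  # hypothetical domain-specific feature
    return tokens

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def posterior_relevant(self, doc):
        """Return P(relevant | doc); tokens outside the vocabulary are ignored."""
        n_docs = sum(self.doc_counts.values())
        tokens = [t for t in tokenize(doc) if t in self.vocab]
        log_scores = {}
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            denom = total + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[c] / n_docs)
            for tok in tokens:
                score += math.log((self.word_counts[c][tok] + self.alpha) / denom)
            log_scores[c] = score
        # Normalize in a numerically stable way.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        return exp["relevant"] / sum(exp.values())

def triage(p_relevant, low=0.05, high=0.95):
    """Three-way decision; the thresholds are illustrative, not from the paper."""
    if p_relevant >= high:
        return "relevant"
    if p_relevant <= low:
        return "irrelevant"
    return "manual"

# Toy training data (invented for this sketch).
TRAIN = [
    ("T cell epitope mapping of the peptide SIINFEKL binding to MHC", "relevant"),
    ("Antibody epitope recognition of peptide antigen", "relevant"),
    ("Crop yield and soil nitrogen in wheat fields", "irrelevant"),
    ("Traffic flow modelling on urban roads", "irrelevant"),
]
clf = NaiveBayes().fit([d for d, _ in TRAIN], [l for _, l in TRAIN])

for text in ["epitope peptide MHC binding epitope peptide",
             "wheat soil nitrogen crop yield",
             "the study of urban"]:
    print(text, "->", triage(clf.posterior_relevant(text)))
```

Widening the gap between `low` and `high` sends more abstracts to manual review in exchange for fewer automatic errors; in practice such thresholds would be tuned on held-out data.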

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/1471-2105-8-269.pdf

References: 34 articles.

1. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS: UniProt: the Universal Protein knowledgebase. Nucleic acids research 2004, 32(Database issue):D115–9. 10.1093/nar/gkh131

2. GeneRIF [http://www.ncbi.nlm.nih.gov/projects/GeneRIF/GeneRIFhelp.html]

3. Eppig JT, Bult CJ, Kadin JA, Richardson JE, Blake JA, Anagnostopoulos A, Baldarelli RM, Baya M, Beal JS, Bello SM, Boddy WJ, Bradt DW, Burkart DL, Butler NE, Campbell J, Cassell MA, Corbani LE, Cousins SL, Dahmen DJ, Dene H, Diehl AD, Drabkin HJ, Frazer KS, Frost P, Glass LH, Goldsmith CW, Grant PL, Lennon-Pierce M, Lewis J, Lu I, Maltais LJ, McAndrews-Hill M, McClellan L, Miers DB, Miller LA, Ni L, Ormsby JE, Qi D, Reddy TB, Reed DJ, Richards-Smith B, Shaw DR, Sinclair R, Smith CL, Szauter P, Walker MB, Walton DO, Washburn LL, Witham IT, Zhu Y: The Mouse Genome Database (MGD): from genes to mice--a community resource for mouse biology. Nucleic acids research 2005, 33(Database issue):D471–5. 10.1093/nar/gki113

4. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic acids research 2004, 32(Database issue):D277–80. 10.1093/nar/gkh063

5. Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic acids research 2002, 30(1):303–305. 10.1093/nar/30.1.303

Cited by 39 articles.

1. A meta-analysis of epitopes in prostate-specific antigens identifies opportunities and knowledge gaps;Human Immunology;2023-11

2. Functional relationship of SNP (Ala490Thr) of an epigenetic gene EZH2 results in the progression and poor survival of ER+/tamoxifen treated breast cancer patients;Journal of Genetics;2021-10

3. The Cancer Epitope Database and Analysis Resource: A Blueprint for the Establishment of a New Bioinformatics Resource for Use by the Cancer Immunology Community;Frontiers in Immunology;2021-08-24

4. DECAB-LSTM: Deep Contextualized Attentional Bidirectional LSTM for cancer hallmark classification;Knowledge-Based Systems;2020-12

5. The Immune Epitope Database and Analysis Resource Program 2003–2018: reflections and outlook;Immunogenetics;2019-11-25