Data preparation and interannotator agreement: BioCreAtIvE Task 1B

Published:2005-05 Issue:S1 Volume:6
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en
Short-container-title:BMC Bioinformatics

Author:

Colosimo Marc E,Morgan Alexander A,Yeh Alexander S,Colombe Jeffrey B,Hirschman Lynette

Abstract

Background: We prepared and evaluated training and test materials for an assessment of text mining methods in molecular biology. The goal of the assessment was to evaluate the ability of automated systems to generate a list of unique gene identifiers from PubMed abstracts for the three model organisms Fly, Mouse, and Yeast. This paper describes the preparation and evaluation of answer keys for training and testing. These consisted of lists of normalized gene names found in the abstracts, generated by adapting the gene list for the full journal articles found in the model organism databases. For the training dataset, the gene list was pruned automatically to remove gene names not found in the abstract; for the testing dataset, it was further refined by manual annotation by annotators provided with guidelines. A critical step in interpreting the results of an assessment is to evaluate the quality of the data preparation. We did this by careful assessment of interannotator agreement and the use of answer pooling of participant results to improve the quality of the final testing dataset.

Results: Interannotator analysis on a small dataset showed that our gene lists for Fly and Yeast were good (87% and 91% three-way agreement) but the Mouse gene list had many conflicts (mostly omissions), which resulted in errors (69% interannotator agreement). By comparing and pooling answers from the participant systems, we were able to add an additional check on the test data; this allowed us to find additional errors, especially in Mouse. This led to a 1% change in the Yeast and Fly "gold standard" answer keys, but to an 8% change in the Mouse answer key.

Conclusion: We found that clear annotation guidelines are important, along with careful interannotator experiments, to validate the generated gene lists. Also, abstracts alone are a poor resource for identifying genes in a paper, containing only a fraction of the genes mentioned in the full text (25% for Fly, 36% for Mouse). We found that there are intrinsic differences between the model organism databases related to the number of synonymous terms and also to curation criteria. Finally, we found that answer pooling was much faster and allowed us to identify more conflicting genes than interannotator analysis.
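The abstract does not say how answer pooling was carried out, so the following is only a minimal sketch of one way such a check could work. It assumes that the answer key and each participant submission map an abstract ID to a set of gene identifiers. The function name `pool_answers`, the `min_votes` threshold, and the sample identifiers are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def pool_answers(gold, submissions, min_votes=2):
    """Flag answer-key entries worth a manual re-check (illustrative only).

    gold        -- dict: abstract ID -> set of gene IDs in the answer key
    submissions -- list of dicts with the same layout, one per system
    min_votes   -- hypothetical threshold: how many systems must report
                   a gene before it is flagged as a possible omission
    """
    possible_omissions = {}  # genes several systems found, key lacks
    possible_errors = {}     # genes in the key that no system found
    for doc_id, gold_genes in gold.items():
        votes = Counter()
        for submission in submissions:
            votes.update(submission.get(doc_id, set()))
        missing = {g for g, n in votes.items()
                   if n >= min_votes and g not in gold_genes}
        unsupported = {g for g in gold_genes if votes[g] == 0}
        if missing:
            possible_omissions[doc_id] = missing
        if unsupported:
            possible_errors[doc_id] = unsupported
    return possible_omissions, possible_errors


# Made-up example data: one abstract, three participant systems.
gold = {"abstract-1": {"G1", "G2"}}
submissions = [
    {"abstract-1": {"G1", "G3"}},
    {"abstract-1": {"G1", "G3"}},
    {"abstract-1": {"G4"}},
]
omissions, errors = pool_answers(gold, submissions)
print(omissions)  # {'abstract-1': {'G3'}}
print(errors)     # {'abstract-1': {'G2'}}
```

In this sketch the flagged genes are only candidates: a human annotator would still decide whether each one belongs in the answer key.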

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/1471-2105-6-S1-S12.pdf

Reference 9 articles.

1. Hirschman L, Colosimo M, Morgan A, Yeh A: Overview of BioCreAtIvE Task 1B: Normalized Gene Lists. BMC Bioinformatics 2005, 6(Suppl 1):S11. 10.1186/1471-2105-6-S1-S11

2. Jensen RA: Orthologs and paralogs – we need to get it right. Genome Biol 2001, 2(8):INTERACTIONS1002. 10.1186/gb-2001-2-8-interactions1002

3. Mewes HW, Albermann K, Bahr M, Frishman D, Gleissner A, Hani J, Heumann K, Kleine K, Maierl A, Oliver SG, et al.: Overview of the yeast genome. Nature 1997, 387(6632 Suppl):7–65.

4. The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res 2003, 31(1):172–175. 10.1093/nar/gkg094

5. The FlyBase Database [http://flybase.org/]

Cited by 25 articles.

1. A framework for evaluating automatic indexing or classification in the context of retrieval;Journal of the Association for Information Science and Technology;2015-10-22

2. ExportAid: database of RNA elements regulating nuclear RNA export in mammals;Bioinformatics;2014-09-30

3. Harmonization of gene/protein annotations: towards a gold standard MEDLINE;Bioinformatics;2012-03-13

4. A dictionary‐based approach to normalizing gene names in one domain of knowledge from the biomedical literature;Journal of Documentation;2012-01-13

5. Prioritization of data quality dimensions and skills requirements in genome annotation work;Journal of the American Society for Information Science and Technology;2011-10-04