A dictionary‐based approach to normalizing gene names in one domain of knowledge from the biomedical literature
Authors:
Galvez, Carmen; de Moya‐Anegón, Félix
Abstract
Purpose: Gene term variation is a shortcoming in text‐mining applications based on biomedical literature‐based knowledge discovery. The purpose of this paper is to propose a technique for normalizing gene names in biomedical literature.

Design/methodology/approach: Under this proposal, the normalized forms can be characterized as a unique gene symbol, defined as the official symbol or normalized name. The unification method involves five stages: collection of the gene term, using the resources provided by the Entrez Gene database; encoding of gene‐naming terms in a table or binary matrix; design of a parametrized finite‐state graph (P‐FSG); automatic generation of a dictionary; and matching based on dictionary look‐up to transform the gene mentions into the corresponding unified form.

Findings: The findings show that the approach yields a high percentage of recall. Precision is only moderately high, basically due to ambiguity problems between gene‐naming terms and words and abbreviations in general English.

Research limitations/implications: The major limitation of this study is that biomedical abstracts were analyzed instead of full‐text documents. The number of under‐normalization and over‐normalization errors is reduced considerably by limiting the realm of application to biomedical abstracts in a well‐defined domain.

Practical implications: The system can be used for practical tasks in biomedical literature mining. Normalized gene terms can be used as input to literature‐based gene clustering algorithms, for identifying hidden gene‐to‐disease, gene‐to‐gene and gene‐to‐literature relationships.

Originality/value: Few systems for gene term variation handling have been developed to date. The technique described performs gene name normalization by dictionary look‐up.
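
The dictionary look‐up idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the sample records in `GENE_TERMS` are illustrative stand‐ins for terms that would come from Entrez Gene, and the hypothetical `variants` helper is a much simpler substitute for the binary matrix and parametrized finite‐state graph (P‐FSG) stages. The function names `build_dictionary` and `normalize` are likewise assumptions made for this example.

```python
import re

# Illustrative sample records (assumption): official symbol -> aliases/names.
# In the paper, such terms are collected from the Entrez Gene database.
GENE_TERMS = {
    "TP53": ["p53", "tumor protein p53", "LFS1"],
    "BRCA1": ["breast cancer 1", "BRCC1"],
}


def variants(term):
    """Generate simple orthographic variants: space, hyphen, or no separator.

    A simplified stand-in for variant generation; not the paper's P-FSG.
    """
    parts = re.split(r"[\s\-]+", term.strip())
    return {sep.join(parts).lower() for sep in (" ", "-", "")}


def build_dictionary(gene_terms):
    """Map every lower-cased variant of every term to its official symbol."""
    lookup = {}
    for symbol, aliases in gene_terms.items():
        for term in [symbol, *aliases]:
            for variant in variants(term):
                lookup[variant] = symbol
    return lookup


def normalize(text, lookup):
    """Replace gene mentions in text with official symbols.

    Longer dictionary entries are tried first, and a mention must not be
    embedded inside a longer alphanumeric token.
    """
    keys = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![A-Za-z0-9])("
        + "|".join(re.escape(k) for k in keys)
        + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: lookup[m.group(1).lower()], text)


lookup = build_dictionary(GENE_TERMS)
print(normalize("The p53 protein (tumor protein p53) interacts with BRCC1.", lookup))
# -> The TP53 protein (TP53) interacts with BRCA1.
```

Case‐insensitive matching of short symbols against running text is also where the precision problem mentioned in the findings would show up: a short alias that coincides with an ordinary English word or abbreviation would be normalized as a gene mention by a sketch like this one.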
Subject
Library and Information Sciences, Information Systems
Cited by
7 articles.