Discovering semantic features in the literature: a foundation for building functional associations

Published: 2006-01-26
Volume: 7
Issue: 1
ISSN: 1471-2105
Container-title: BMC Bioinformatics
Language: en
Short-container-title: BMC Bioinformatics

Author:

Chagoyen Monica, Carmona-Saez Pedro, Shatkay Hagit, Carazo Jose M, Pascual-Montano Alberto

Abstract

Background: Experimental techniques such as DNA microarray, serial analysis of gene expression (SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins at different levels. As in any other experimental approach, it is necessary to analyze these data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research.

Results: We present a method to create literature profiles for large sets of genes or proteins based on common semantic features extracted from a corpus of relevant documents. These profiles can be used to establish pair-wise similarities among genes, utilized in gene/protein classification or can be even combined with experimental measurements. Semantic features can be used by researchers to facilitate the understanding of the commonalities indicated by experimental results. Our approach is based on non-negative matrix factorization (NMF), a machine-learning algorithm for data analysis, capable of identifying local patterns that characterize a subset of the data. The literature is thus used to establish putative relationships among subsets of genes or proteins and to provide coherent justification for this clustering into subsets. We demonstrate the utility of the method by applying it to two independent and vastly different sets of genes.

Conclusion: The presented method can create literature profiles from documents relevant to sets of genes. The representation of genes as additive linear combinations of semantic features allows for the exploration of functional associations as well as for clustering, suggesting a valuable methodology for the validation and interpretation of high-throughput experimental data.
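
To make the idea of representing genes as additive combinations of semantic features concrete, the following is a minimal sketch of NMF on a toy gene-by-term count matrix. It uses the classic multiplicative-update rule for squared error, which may differ from the variant and preprocessing used in the paper. The matrix `V`, the gene and term labels, the number of features `k`, and all function names are illustrative assumptions, not taken from the article.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared Frobenius distance between V and the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative V (genes x terms) into W (genes x k) and H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start, so every entry stays non-negative.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

def cosine(x, y):
    """Cosine similarity between two gene profiles (rows of W)."""
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return sum(a * b for a, b in zip(x, y)) / norm if norm else 0.0

# Hypothetical data: rows are genes, columns are term counts from their literature.
genes = ["geneA", "geneB", "geneC", "geneD"]
terms = ["kinase", "phosphorylation", "ribosome", "translation"]
V = [[2, 1, 0, 0],
     [4, 2, 0, 0],
     [0, 0, 1, 3],
     [0, 0, 2, 6]]

W, H = nmf(V, k=2)
# Each row of H is a semantic feature (weights over terms);
# each row of W is a gene's literature profile over those features.
for gene, profile in zip(genes, W):
    print(gene, [round(x, 3) for x in profile])
print("similarity geneA/geneB:", round(cosine(W[0], W[1]), 3))
```

In this sketch the rows of `W` play the role of literature profiles, and `cosine` is one possible way to compare them pair-wise; other similarity measures would fit the same pattern.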

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/1471-2105-7-41.pdf

References: 46 articles.

1. Shatkay H, Feldman R: Mining the biomedical literature in the genomic era: An overview. J Comput Biol 2003, 10: 821–855.

2. Dobrokhotov PB, Goutte C, Veuthey AL, Gaussier E: Combining NLP and probabilistic categorisation for document and term selection for Swiss-Prot medical annotation. Bioinformatics 2003, 19 Suppl 1: i91-i94.

3. Hearst MA: Untangling text data mining. Proc 37th annual meeting of the Association for Computational Linguistics 1999, 3–10.

4. Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression. Nat Genet 2001, 28: 21–28.

5. Jelier R, Jenster G, Dorssers LC, van der Eijk CC, van Mulligen EM, Mons B, Kors JA: Co-occurrence based meta-analysis of scientific texts: retrieving biological relationships between genes. Bioinformatics 2005, 21: 2049–2058.

Cited by 61 articles.

1. Shifting Pattern Biclustering and Boolean Reasoning Symmetry; Symmetry; 2023-10-26

2. PMIDigest: Interactive Review of Large Collections of PubMed Entries to Distill Relevant Information; Genes; 2023-04-19

3. Theoretical backgrounds of Boolean reasoning-based binary n-clustering; Knowledge and Information Systems; 2022-07-16

4. Consensus Algorithm for Bi-clustering Analysis; Computational Science – ICCS 2022; 2022

5. A statistical framework for non-negative matrix factorization based on generalized dual divergence; Neural Networks; 2021-08