Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks

Published:2021-10-15 Issue:10 Volume:16 Page:e0258623
ISSN:1932-6203
Container-title:PLOS ONE
language:en
Short-container-title:PLoS ONE

Author:

Alachram Halima, Chereda Hryhorii, Beißbarth Tim, Wingender Edgar, Stegmaier Philip

Abstract

Biomedical and life science literature is an essential way to publish experimental results. With the rapid growth of the number of new publications, the amount of scientific knowledge represented in free text is increasing remarkably. There has been much interest in developing techniques that can extract this knowledge and make it accessible to aid scientists in discovering new relationships between biological entities and answering biological questions. Making use of the word2vec approach, we generated word vector representations based on a corpus consisting of over 16 million PubMed abstracts. We developed a text mining pipeline to produce word2vec embeddings with different properties and performed validation experiments to assess their utility for biomedical analysis. An important pre-processing step consisted of substituting synonymous terms with their preferred terms in biomedical databases. Furthermore, we extracted gene-gene networks from two embedding versions and used them as prior knowledge to train Graph-Convolutional Neural Networks (Graph-CNNs) on large breast cancer gene expression data and on other cancer datasets. The performance of the resulting models was compared to that of Graph-CNNs trained with protein-protein interaction (PPI) networks or with networks derived using other word embedding algorithms. We also assessed the effect of corpus size on the variability of word representations. Finally, we created a web service with a graphical and a RESTful interface to extract and explore relations between biomedical terms using annotated embeddings. Comparisons to biological databases showed that relations between entities such as known PPIs, signaling pathways and cellular functions, or narrower disease ontology groups correlated with higher cosine similarity. Graph-CNNs trained with word2vec-embedding-derived networks performed sufficiently well on the metastatic event prediction tasks compared to other networks. Such performance was good enough to validate the utility of our generated word embeddings in constructing biological networks. Word representations produced by text mining algorithms such as word2vec are therefore able to capture biologically meaningful relations between entities. Our generated embeddings are publicly available at https://github.com/genexplain/Word2vec-based-Networks/blob/main/README.md.
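
The abstract relates cosine similarity between word vectors to known biological relations and mentions gene-gene networks extracted from embeddings. As a rough illustration of that idea, the sketch below computes cosine similarity between a few term vectors and keeps pairs above a cutoff as network edges. The vectors, the gene symbols chosen, the `threshold` value, and the function names are all assumptions made for this example; the abstract does not specify how the authors derived their networks, so this should not be read as their procedure.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def build_similarity_network(embeddings, threshold):
    """Return (term_a, term_b, similarity) edges with similarity >= threshold.

    `embeddings` maps a term to its vector. The thresholding rule is a
    hypothetical choice for illustration, not the method from the paper.
    """
    edges = []
    for a, b in combinations(sorted(embeddings), 2):
        sim = cosine_similarity(embeddings[a], embeddings[b])
        if sim >= threshold:
            edges.append((a, b, sim))
    return edges


# Toy 3-dimensional vectors invented for this example; real word2vec
# embeddings typically have hundreds of dimensions.
toy_embeddings = {
    "BRCA1": [1.0, 0.0, 0.0],
    "BRCA2": [1.0, 1.0, 0.0],
    "TP53": [0.0, 0.0, 1.0],
}

network = build_similarity_network(toy_embeddings, threshold=0.5)
for term_a, term_b, sim in network:
    print(f"{term_a} -- {term_b}: {sim:.3f}")
```

With real embeddings, the same loop would be applied to vectors loaded from the published files, and the cutoff would need to be tuned against a reference such as a PPI database.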

Funder

Bundesministerium für Bildung und Forschung

Publisher

Public Library of Science (PLoS)

Subject

Multidisciplinary

Cited by 10 articles.

1. Artificial intelligence for drug repurposing against infectious diseases;Artificial Intelligence Chemistry;2024-12

2. Representation Learning of Biological Concepts: A Systematic Review;Current Bioinformatics;2024-01

3. Multi-scale Global Consistency Residue Feature Enhancement based Protein Structure Analysis;Proceedings of the 2023 9th International Conference on Communication and Information Processing;2023-12-14

4. Non-Overlapping Block Processing of Cancer Genes Data for Earlier Prediction of Breast Cancer Diseases using Regression Algorithms;2023 IEEE International Conference on ICT in Business Industry & Government (ICTBIG);2023-12-08

5. Evaluation of input data modality choices on functional gene embeddings;NAR Genomics and Bioinformatics;2023-10-11