Automated recognition of functional compound-protein relationships in literature

Author:

Döring Kersten, Qaseem Ammar, Telukunta Kiran K, Becer Michael, Thomas Philippe, Günther Stefan

Abstract

Motivation

Much effort has been invested in the identification of protein-protein interactions using text mining and machine learning methods. The extraction of functional relationships between chemical compounds and proteins from literature has received much less attention, and no ready-to-use open-source software has so far been available for this task.

Method

We created a new benchmark dataset of 2,753 sentences from abstracts containing annotations of proteins, small molecules, and their relationships. Two kernel methods, the shallow linguistic kernel and the all-paths graph kernel, were applied to classify these relationships as functional or non-functional. Furthermore, the benefit of interaction verbs in sentences was evaluated.

Results

In cross-validation on our benchmark dataset, the all-paths graph kernel (AUC value: 84.2%, F1 score: 81.8%) shows slightly better results than the shallow linguistic kernel (AUC value: 81.6%, F1 score: 79.7%). Both models achieve state-of-the-art performance in the research area of relation extraction. Furthermore, combining the shallow linguistic and all-paths graph kernels could further increase the overall performance. We used each of the two kernels to identify functional relationships in all PubMed abstracts (28 million) and provide the results, including recorded processing time.

Availability

The software for the tested kernels, the benchmark, the processed 28 million PubMed abstracts, all evaluation scripts, as well as the scripts for processing the complete PubMed database are freely available at https://github.com/KerstenDoering/CPI-Pipeline.

Author summary

Text mining aims at organizing large sets of unstructured text data to provide efficient information extraction. Particularly in the area of drug discovery, knowledge about small molecules and their interactions with proteins is of crucial importance for understanding drug effects on cells, tissues, and organisms. This data is normally hidden in written articles, which are published in journals with a focus on life sciences. In this publication, we show how text mining methods can be used to extract data about functional interactions between small molecules and proteins from texts. We created a new dataset of annotated sentences from scientific abstracts for the purpose of training two diverse machine learning methods (kernels), and successfully classified compound-protein pairs as functional or non-functional relations (i.e., no interaction). Our newly developed benchmark dataset and the pipeline for information extraction are freely available for download. Furthermore, we show that the software can easily be scaled up to process large datasets by applying the approach to 28 million abstracts.
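The results above are reported as AUC values and F1 scores for a binary decision (functional vs. non-functional compound-protein pair). As a rough illustration of what these two metrics measure, the following minimal sketch computes both from scratch. It is not the authors' evaluation code: the function names `f1_score` and `auc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made up for this example.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (1 = functional relationship, 0 = non-functional)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not taken from the benchmark).
labels = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.7, 0.2, 0.1]
# Hypothetical threshold of 0.5 turns classifier scores into hard predictions.
preds = [1 if s >= 0.5 else 0 for s in scores]

print(f"F1:  {f1_score(labels, preds):.4f}")
print(f"AUC: {auc(labels, scores):.4f}")
```

F1 depends on the chosen decision threshold, whereas AUC only depends on how the classifier ranks positive against negative examples, which is one reason the two are commonly reported together.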

Publisher

Cold Spring Harbor Laboratory
