A comprehensive evaluation of large language models in mining gene relations and pathway knowledge

Author:

Azam Muhammad (1, 2), Chen Yibo (1, 2, 3), Arowolo Micheal Olaolu (1, 2), Liu Haowang (1, 2), Popescu Mihail (1, 3, 4), Xu Dong (1, 2, 3)

Affiliation:

1. Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA

2. Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA

3. Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA

4. Department of Biomedical Informatics, Biostatistics and Medical Epidemiology, University of Missouri, Columbia, Missouri, USA

Abstract

Understanding complex biological pathways, including gene–gene interactions and gene regulatory networks, is critical for exploring disease mechanisms and drug development. Manual literature curation of biological pathways cannot keep up with the exponential growth of new discoveries in the literature. Large‐scale language models (LLMs) trained on extensive text corpora contain rich biological information, and they can be mined as a biological knowledge graph. This study assesses 21 LLMs, including both application programming interface (API)‐based models and open‐source models, in their capacity to retrieve biological knowledge. The evaluation focuses on predicting gene regulatory relations (activation, inhibition, and phosphorylation) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway components. Results indicated a significant disparity in model performance. The API‐based models GPT‐4 and Claude‐Pro showed superior performance, with F1 scores of 0.4448 and 0.4386 for gene regulatory relation prediction, and Jaccard similarity indices of 0.2778 and 0.2657 for KEGG pathway prediction, respectively. Open‐source models lagged behind their API‐based counterparts; among them, Falcon‐180b and llama2‐7b had the highest F1 scores for gene regulatory relations, 0.2787 and 0.1923, respectively. For KEGG pathway recognition, the Jaccard similarity index was 0.2237 for Falcon‐180b and 0.2207 for llama2‐7b. Our study suggests that LLMs are informative in gene network analysis and pathway mapping, but their effectiveness varies, necessitating careful model selection. This work also provides a case study and insight into using LLMs as knowledge graphs. Our code is publicly available on GitHub (Muh‐aza).
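The abstract reports two metrics: an F1 score for gene regulatory relation prediction and a Jaccard similarity index for KEGG pathway components. The following is a minimal sketch of how such set-based scores can be computed. It assumes that predictions and references are compared as exact-match sets; the function names, the triple representation (source, target, relation), and the placeholder gene names are all hypothetical and are not taken from the authors' code, whose exact matching and averaging protocol is not described here.

```python
def precision_recall_f1(predicted, reference):
    """Set-based precision, recall, and F1 for predicted relation triples.

    Assumption: a prediction counts as correct only on an exact match
    of the (source, target, relation) triple.
    """
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def jaccard_index(predicted, reference):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two gene sets."""
    pred, ref = set(predicted), set(reference)
    union = pred | ref
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(pred & ref) / len(union)


# Toy data with placeholder names (illustrative only, not real results).
reference_relations = {
    ("GENE_A", "GENE_B", "activation"),
    ("GENE_B", "GENE_C", "inhibition"),
    ("GENE_C", "GENE_D", "phosphorylation"),
}
predicted_relations = {
    ("GENE_A", "GENE_B", "activation"),
    ("GENE_B", "GENE_C", "inhibition"),
    ("GENE_A", "GENE_D", "activation"),
}

reference_pathway = {"GENE_B", "GENE_C", "GENE_D"}
predicted_pathway = {"GENE_A", "GENE_B", "GENE_C"}

precision, recall, f1 = precision_recall_f1(predicted_relations, reference_relations)
jaccard = jaccard_index(predicted_pathway, reference_pathway)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
print(f"Jaccard={jaccard:.4f}")
```

In this toy case two of three predicted relations match the reference, and the two pathway gene sets share two of four distinct genes. Both metrics penalize missing and spurious items alike, which is one reason scores for open-ended knowledge retrieval tend to stay well below 1.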

Funder

National Institute of General Medical Sciences

National Institute of Diabetes and Digestive and Kidney Diseases

U.S. National Library of Medicine

Publisher

Wiley

