Abstract
Extracting biological interactions from published literature helps us understand complex biological systems, accelerate research, and support decision-making in drug or treatment development. Despite efforts to automate the extraction of biological relations using text mining tools and machine learning pipelines, manual curation continues to serve as the gold standard. However, the rapidly increasing volume of literature pertaining to biological relations poses challenges in its manual curation and refinement. These challenges are further compounded because only a small fraction of the published literature is relevant to biological relation extraction, and the embedded sentences of relevant sections have complex structures, which can lead to incorrect inference of relationships. To overcome these challenges, we propose GIX, an automated and robust Gene Interaction Extraction framework, based on pre-trained Large Language models fine-tuned through extensive evaluations on various gene/protein interaction corpora including LLL and RegulonDB. GIX identifies relevant publications with minimal keywords, optimises sentence selection to reduce computational overhead, simplifies sentence structure while preserving meaning, and provides a confidence factor indicating the reliability of extracted relations. GIX’s Stage-2 relation extraction method performed well on benchmark protein/gene interaction datasets, assessed using 10-fold cross-validation, surpassing state-of-the-art approaches. We demonstrated that the proposed method, although fully automated, performs as well as manual relation extraction, with enhanced robustness. We also observed GIX’s capability to augment existing datasets with new sentences, incorporating newly discovered biological terms and processes. Further, we demonstrated GIX’s real-world applicability in inferring E. coli gene circuits.
Publisher
Public Library of Science (PLoS)
Reference61 articles.
1. Biomedical Relation Extraction: From Binary to Complex.;D Zhou;Computational and mathematical methods in medicine.,2014
2. Neural network-based approaches for biomedical relation classification: A review;Y Zhang;Journal of Biomedical Informatics,2019
3. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges.;A Singhal;Database. 2016,2016
4. A statistical analysis of the TRANSFAC database.;GB Fogel;BioSystems.,2005