ChIP-GPT: a managed large language model for robust data extraction from biomedical database records

Published:2024-01-22 Issue:2 Volume:25
ISSN:1467-5463
Container-title:Briefings in Bioinformatics
language:en

Author:

Cinquin Olivier¹

Affiliation:

1. Department of Developmental and Cell Biology, University of California at Irvine, 4203 McGaugh Hall, Irvine, CA 92697, United States

Abstract

Increasing volumes of biomedical data are amassing in databases. Large-scale analyses of these data have wide-ranging applications in biology and medicine. Such analyses require tools to characterize and process entries at scale. However, existing tools, mainly centered on extracting predefined fields, often fail to comprehensively process database entries or correct evident errors, a task humans can easily perform. These tools also lack the ability to reason like domain experts, hindering their robustness and analytical depth. Recent advances with large language models (LLMs) provide a fundamentally new way to query databases. But while a tool such as ChatGPT is adept at answering questions about manually input records, challenges arise when scaling up this process. First, interactions with the LLM need to be automated. Second, limitations on input length may require a record pruning or summarization pre-processing step. Third, to behave reliably as desired, the LLM needs either well-designed, short, ‘few-shot’ examples, or fine-tuning based on a larger set of well-curated examples. Here, we report ChIP-GPT, based on fine-tuning of the generative pre-trained transformer (GPT) model Llama and on a program prompting the model iteratively and handling its generation of answer text. This model is designed to extract metadata from the Sequence Read Archive, emphasizing the identification of chromatin immunoprecipitation (ChIP) targets and cell lines. When trained with 100 examples, ChIP-GPT demonstrates 90–94% accuracy. Notably, it can seamlessly extract data from records with typos or absent field labels. Our proposed method is easily adaptable to customized questions and different databases.
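
The pre-processing and prompting steps named in the abstract (pruning a record to fit an input-length limit, then wrapping it in a short few-shot prompt) can be illustrated with a minimal Python sketch. This is not the ChIP-GPT implementation: the function names `prune_record` and `build_prompt`, the prompt layout, and the use of a whitespace word count as a stand-in for the model's token count are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def prune_record(fields: Dict[str, str], max_words: int) -> Dict[str, str]:
    """Keep fields in their original order until the word budget is spent.

    Assumption: a whitespace word count approximates the model's token count.
    """
    kept: Dict[str, str] = {}
    used = 0
    for name, value in fields.items():
        remaining = max_words - used
        if remaining <= 0:
            break  # budget exhausted: drop all later fields
        words = value.split()
        kept[name] = " ".join(words[:remaining])  # truncate the last field that fits
        used += min(len(words), remaining)
    return kept


def format_record(fields: Dict[str, str]) -> str:
    """Render a record as one 'name: value' line per field."""
    return "\n".join(f"{name}: {value}" for name, value in fields.items())


def build_prompt(
    examples: List[Tuple[Dict[str, str], str]],
    record: Dict[str, str],
    question: str,
) -> str:
    """Assemble a few-shot prompt: solved examples first, then the open query."""
    parts = []
    for example_record, answer in examples:
        parts.append(
            f"Record:\n{format_record(example_record)}\n"
            f"Question: {question}\nAnswer: {answer}\n"
        )
    # The final block is left unanswered for the model to complete.
    parts.append(f"Record:\n{format_record(record)}\nQuestion: {question}\nAnswer:")
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical records, invented for this sketch only.
    solved = [({"title": "CTCF ChIP-seq in K562"}, "CTCF")]
    query = {
        "title": "H3K27ac ChIP-seq in HeLa cells",
        "protocol": "cells were fixed with formaldehyde and sonicated",
    }
    print(build_prompt(solved, prune_record(query, 6), "What is the ChIP target?"))
```

Truncating in field order is the simplest possible pruning policy; a summarization step, as the abstract also mentions, would replace `prune_record` without changing how the prompt is assembled.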

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/bib/article-pdf/25/2/bbad535/56579821/bbad535.pdf

Cited by 1 article.

1. Artificial Intelligence in Newborn Medicine;Newborn;2024-06-21