Abstract
Background
Large language models (LLMs) capable of efficiently screening and identifying studies that fulfill specific criteria, and of extracting data from publications, would streamline literature reviews and enhance knowledge discovery by reducing the burden on human reviewers.
Methods
We created an automated pipeline using the OpenAI GPT-4 32K API (version “2023-05-15”) to evaluate the accuracy of GPT-4 when answering questions about published studies on HIV drug resistance (HIVDR), with and without an instruction sheet containing specialized HIVDR knowledge. We designed 60 questions pertaining to HIVDR and created markdown versions of 60 published HIVDR studies indexed in PubMed. We presented the 60 studies to GPT-4 in four configurations: (1) all 60 questions simultaneously; (2) all 60 questions simultaneously with the instruction sheet; (3) each of the 60 questions individually; and (4) each of the 60 questions individually with the instruction sheet.
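A minimal sketch of how such a pipeline might call the Azure-hosted GPT-4 32K API (version “2023-05-15”) for one study, covering both the all-questions-at-once and one-question-at-a-time configurations, is shown below. The deployment name, environment variables, prompt wording, and temperature setting are assumptions for illustration and are not specified in the abstract.

```python
import os
from openai import AzureOpenAI  # assumes the openai>=1.0 Python SDK

# Hypothetical configuration; endpoint, key, and deployment name are assumptions.
client = AzureOpenAI(
    api_version="2023-05-15",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
)

def ask(study_markdown: str, questions: list[str], instructions: str | None = None) -> str:
    """Submit one study (as markdown) plus one or more HIVDR questions to GPT-4 32K."""
    system = "You answer questions about the HIV drug resistance study provided by the user."
    if instructions:
        # Optional specialized HIVDR instruction sheet (configurations 2 and 4).
        system += "\n\n" + instructions
    prompt = study_markdown + "\n\n" + "\n".join(
        f"Q{i + 1}. {q}" for i, q in enumerate(questions)
    )
    response = client.chat.completions.create(
        model="gpt-4-32k",  # Azure deployment name (assumed)
        temperature=0,      # keep replicates as reproducible as possible
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

# Configuration 1: all 60 questions in a single call.
# answers_batch = ask(study_md, all_60_questions)
# Configuration 3: each question submitted individually.
# answers_single = [ask(study_md, [q]) for q in all_60_questions]
```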
Results
GPT-4 achieved a median accuracy of 87%, which was 24% higher than when the answers to the studies were permuted. The standard deviation across three replicates for the 60 questions ranged from 0 to 5.3%, with a median of 1.2%. The instruction sheet did not increase GPT-4’s accuracy. GPT-4 was more likely to provide false-positive answers when the 60 questions were submitted individually than when they were submitted together.
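To illustrate how the reported accuracy and permutation control could be scored, a minimal sketch follows. The exact-match scoring, the way answer keys are shuffled across studies, and all variable names are assumptions, not details taken from the study.

```python
import random
import statistics

def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of questions answered correctly for one study (exact-match scoring assumed)."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def permuted_baseline(all_predicted: list[list[str]],
                      all_gold: list[list[str]],
                      seed: int = 0) -> float:
    """Score each study's answers against a randomly permuted answer key (assumed control)."""
    rng = random.Random(seed)
    shuffled = all_gold[:]
    rng.shuffle(shuffled)
    return statistics.median(accuracy(p, g) for p, g in zip(all_predicted, shuffled))

# Median accuracy over the 60 studies:
# median_acc = statistics.median(accuracy(p, g) for p, g in zip(all_predicted, all_gold))
```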
Conclusions
GPT-4’s inability to make use of the instruction sheet suggests that more sophisticated prompt-engineering approaches or the fine-tuning of an open-source model are required to further improve its ability to answer questions about highly specialized research studies.
Publisher
Research Square Platform LLC