Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs

Published:2024-02-20 Issue:1 Volume:7 Page:
ISSN:2398-6352
Container-title:npj Digital Medicine
language:en
Short-container-title:npj Digit. Med.

Author:

Wang Li, Chen Xi, Deng XiangWen, Wen Hao, You MingKe, Liu WeiZhi, Li Qi, Li Jian

Abstract

The use of large language models (LLMs) in clinical medicine is currently thriving. Effectively transferring LLMs’ pertinent theoretical knowledge from computer science to their application in clinical medicine is crucial. Prompt engineering has shown potential as an effective method in this regard. To explore the application of prompt engineering in LLMs and to examine the reliability of LLMs, different styles of prompts were designed and used to ask different LLMs about their agreement with the American Academy of Orthopedic Surgeons (AAOS) osteoarthritis (OA) evidence-based guidelines. Each question was asked 5 times. We compared the consistency of the findings with guidelines across different evidence levels for different prompts and assessed the reliability of different prompts by asking the same question 5 times. gpt-4-Web with ROT prompting had the highest overall consistency (62.9%) and a significant performance for strong recommendations, with a total consistency of 77.5%. The reliability of the different LLMs for different prompts was not stable (Fleiss kappa ranged from −0.002 to 0.984). This study revealed that different prompts had variable effects across various models, and the gpt-4-Web with ROT prompt was the most consistent. An appropriate prompt could improve the accuracy of responses to professional medical questions.
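
The two measures named in the abstract can be illustrated with a small sketch. This is not the authors' code: the abstract does not state the response categories or the exact computation, so the labels (`"agree"`, `"disagree"`) and the function names `consistency_rate` and `fleiss_kappa` are hypothetical. The sketch treats the 5 repeated answers to each question as 5 "raters" and applies the standard Fleiss' kappa formula, where values near 0 indicate chance-level agreement and values near 1 indicate almost perfect agreement.

```python
import math
from collections import Counter


def consistency_rate(responses, guideline):
    """Fraction of all responses that match the guideline label.

    responses: list of questions, each a list of repeated answer labels.
    guideline: list with the guideline's label for each question.
    (Hypothetical helper; the labels are illustrative assumptions.)
    """
    total = sum(len(answers) for answers in responses)
    matches = sum(
        answer == target
        for answers, target in zip(responses, guideline)
        for answer in answers
    )
    return matches / total


def fleiss_kappa(responses):
    """Standard Fleiss' kappa for a fixed number of ratings per item.

    Here each question is an item and each repeated ask is one rating.
    Returns NaN when every rating falls in a single category, because
    expected agreement is then 1 and kappa is undefined.
    """
    n_items = len(responses)
    n_ratings = len(responses[0])
    if n_ratings < 2 or any(len(r) != n_ratings for r in responses):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in responses]
    categories = {label for item in responses for label in item}

    # Observed agreement: mean over items of the pairwise agreement P_i.
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_ratings)
        / (n_ratings * (n_ratings - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Expected agreement from the overall category proportions.
    p_expected = sum(
        (sum(c[k] for c in counts) / (n_items * n_ratings)) ** 2
        for k in categories
    )
    if p_expected == 1.0:
        return math.nan
    return (p_observed - p_expected) / (1 - p_expected)


# Illustrative data only: two questions, each asked 5 times.
example = [
    ["agree", "agree", "agree", "agree", "disagree"],
    ["agree", "disagree", "disagree", "disagree", "disagree"],
]
example_guideline = ["agree", "disagree"]

rate = consistency_rate(example, example_guideline)
kappa = fleiss_kappa(example)
```

In this toy data the answers match the guideline most of the time, yet kappa stays low because the repeated asks disagree with each other within each question. This shows why the abstract reports consistency and reliability as separate measures.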

Publisher

Springer Science and Business Media LLC

Link

https://www.nature.com/articles/s41746-024-01029-4.pdf

References: 43 articles.

1. Lee, P., Bubeck, S. & Petro, J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N. Engl. J. Med. 388, 1233–1239 (2023).

2. Waisberg, E. et al. GPT-4: a new era of artificial intelligence in medicine. Ir. J. Med. Sci. 192, 3197–3200 (2023).

3. Scanlon, M., Breitinger, F., Hargreaves, C., Hilgert, J.-N. & Sheppard, J. ChatGPT for digital forensic investigation: The good, the bad, and the unknown. Forensic Science International: Digital Investigation (2023).

4. Kanjee, Z., Crowe, B. & Rodman, A. Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge. JAMA 330, 78–80 (2023).

5. Cai, L. Z. et al. Performance of Generative Large Language Models on Ophthalmology Board Style Questions. Am. J. Ophthalmol. 254, 141–149 (2023).

Cited by 19 articles.

1. ChatGPT compared to national guidelines for management of ovarian cancer: Did ChatGPT get it right? – A Memorial Sloan Kettering Cancer Center Team Ovary study;Gynecologic Oncology;2024-10

2. Conceptual review of outcome metrics and measures used in clinical evaluation of artificial intelligence in radiology;La radiologia medica;2024-09-03

3. Large language model to multimodal large language model: A journey to shape the biological macromolecules to biological sciences and medicine;Molecular Therapy - Nucleic Acids;2024-09

4. Encouragement vs. liability: How prompt engineering influences ChatGPT-4's radiology exam performance;Clinical Imaging;2024-09

5. Which curriculum components do medical students find most helpful for evaluating AI outputs?;2024-08-26