Author:
Amoreena Most, Mengxuan Hu, Huibo Yang, Tianming Liu, Xianyan Chen, Sheng Li, Steven Xu, Zhengliang Liu, Andrea Sikora
Abstract
The purpose of this study was to compare the performance of ChatGPT (GPT-3.5), ChatGPT (GPT-4), Claude2, Llama2-7b, and Llama2-13b on 219 multiple-choice questions focusing on critical care pharmacotherapy. To further assess whether prompt engineering could improve LLM reasoning and performance, responses were examined using a zero-shot approach, Chain-of-Thought (CoT) prompting, and a custom-built GPT (PharmacyGPT). A set of 219 multiple-choice questions covering critical care pharmacotherapy topics used in Doctor of Pharmacy curricula at two accredited colleges of pharmacy was compiled for this study. A total of five LLMs were evaluated: ChatGPT (GPT-3.5), ChatGPT (GPT-4), Claude2, Llama2-7b, and Llama2-13b. The primary outcome was response accuracy. Of the five LLMs tested, GPT-4 showed the highest average accuracy rate at 71.6%. Response variance was also assessed, as a larger variance indicates lower consistency and reduced confidence in a model's answers. Llama2-13b had the lowest variance (0.070) of all the LLMs but performed with an accuracy of only 41.5%. Following the analysis of overall accuracy, performance on knowledge- vs. skill-based questions was assessed. All five LLMs demonstrated higher accuracy on knowledge-based questions than on skill-based questions. GPT-4 had the highest accuracy for both knowledge- and skill-based questions, at 87% and 67%, respectively. Response accuracy from LLMs in the domain of clinical pharmacy can be improved by using prompt engineering techniques.
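The evaluation protocol described above (multiple-choice accuracy under zero-shot vs. CoT prompting) can be illustrated with a short harness. The sketch below is not the authors' code: the question item, the `query_model` stub, and the prompt templates are all assumptions standing in for any of the evaluated LLM APIs; it only shows how the two prompt styles differ and how accuracy would be scored.

```python
import re
from typing import Callable, Dict, List

# Hypothetical question item; the actual 219-question bank is not public.
QUESTIONS: List[Dict] = [
    {
        "stem": "Which agent is a sedative-hypnotic commonly used for ICU sedation?",
        "choices": {"A": "Lorazepam", "B": "Propofol", "C": "Haloperidol", "D": "Phenylephrine"},
        "answer": "B",
    },
]

# Two prompt templates: a direct zero-shot prompt and a Chain-of-Thought prompt.
ZERO_SHOT = "Answer the multiple-choice question. Respond with the letter only.\n\n{q}"
COT = ("Answer the multiple-choice question. Let's think step by step, "
       "then state the final answer as 'Answer: <letter>'.\n\n{q}")

def format_question(item: Dict) -> str:
    choices = "\n".join(f"{k}. {v}" for k, v in item["choices"].items())
    return f"{item['stem']}\n{choices}"

def extract_letter(response: str) -> str:
    # Take the last standalone A-D letter, which handles both prompt styles
    # (a bare letter, or reasoning text ending in "Answer: X").
    letters = re.findall(r"\b([A-D])\b", response.upper())
    return letters[-1] if letters else ""

def accuracy(model: Callable[[str], str], template: str) -> float:
    correct = sum(
        extract_letter(model(template.format(q=format_question(item)))) == item["answer"]
        for item in QUESTIONS
    )
    return correct / len(QUESTIONS)

if __name__ == "__main__":
    # `model` would wrap a real API call to GPT-3.5/4, Claude2, or Llama2;
    # a canned stub is used here so the sketch runs standalone.
    stub = lambda prompt: "Propofol is a first-line sedative. Answer: B"
    print(f"zero-shot accuracy: {accuracy(stub, ZERO_SHOT):.1%}")
    print(f"CoT accuracy:       {accuracy(stub, COT):.1%}")
```

Scoring the extracted letter rather than the raw completion keeps the comparison fair across prompt styles, since CoT responses contain reasoning text before the final answer.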
Publisher
Cold Spring Harbor Laboratory