Decoding the NCCN Guidelines With AI: A Comparative Evaluation of ChatGPT-4.0 and Llama 2 in the Management of Thyroid Carcinoma

Published: 2024-08-13
ISSN: 0003-1348
Container-title: The American Surgeon™
Language: en

Author:

Pandya Shivam¹, Bresler Tamir E.¹, Wilson Tyler¹, Htway Zin², Fujita Manabu¹³

Affiliation:

1. Department of Surgery, Los Robles Regional Medical Center, Thousand Oaks, CA, USA

2. Department of Laboratory, Los Robles Regional Medical Center, Thousand Oaks, CA, USA

3. General Surgical Associates, Thousand Oaks, CA, USA

Abstract

Introduction: Artificial intelligence (AI) has emerged as a promising tool in the delivery of health care. ChatGPT-4.0 (OpenAI, San Francisco, CA) and Llama 2 (Meta, Menlo Park, CA) have each gained attention for their use in various medical applications.

Objective: This study aims to evaluate and compare the effectiveness of ChatGPT-4.0 and Llama 2 in assisting with complex clinical decision making in the diagnosis and treatment of thyroid carcinoma.

Participants: We reviewed the National Comprehensive Cancer Network® (NCCN) Clinical Practice Guidelines for the management of thyroid carcinoma and formulated up to 3 complex clinical questions for each decision-making page. ChatGPT-4.0 and Llama 2 were queried in a reproducible manner. The answers were scored on a Likert scale: 5) correct; 4) correct, with missing information requiring clarification; 3) correct, but unable to complete answer; 2) partially incorrect; 1) absolutely incorrect. Score frequencies were compared, and subgroup analysis was conducted on Correctness (defined as scores 1-2 vs 3-5) and Accuracy (scores 1-3 vs 4-5).

Results: In total, 58 pages of the NCCN Guidelines® were analyzed, generating 167 unique questions. There was no statistically significant difference between ChatGPT-4.0 and Llama 2 in terms of overall score (Mann-Whitney U test; mean rank = 160.53 vs 174.47, P = 0.123), Correctness (P = 0.177), or Accuracy (P = 0.891).

Conclusion: ChatGPT-4.0 and Llama 2 demonstrate a limited but substantial capacity to assist with complex clinical decision making relating to the management of thyroid carcinoma, with no significant difference in their effectiveness.
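The scoring analysis described in the abstract can be illustrated with a minimal Python sketch. It computes tie-averaged ranks, per-group mean ranks, and the Mann-Whitney U statistics for two sets of Likert scores, and splits scores at the Correctness (1-2 vs 3-5) and Accuracy (1-3 vs 4-5) cutoffs. The function names and the score lists are hypothetical and purely illustrative; they are not the study's data or code, and the sketch does not compute a P value. As a sanity check on the reported figures, the two mean ranks (160.53 and 174.47) average to 167.5, which equals (334 + 1) / 2 and is consistent with 167 scored answers per model.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mean_ranks_and_u(a, b):
    """Return (mean rank of a, mean rank of b, U for a, U for b)."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    rank_sum_b = sum(ranks[len(a):])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = rank_sum_b - len(b) * (len(b) + 1) / 2
    return rank_sum_a / len(a), rank_sum_b / len(b), u_a, u_b


def dichotomize(scores, cutoff):
    """Return (count below cutoff, count at or above cutoff)."""
    high = sum(1 for s in scores if s >= cutoff)
    return len(scores) - high, high


# Hypothetical Likert scores (1-5) for two models; NOT data from the study.
model_a = [5, 4, 2, 5, 3, 1]
model_b = [5, 5, 3, 4, 4, 2]

mean_a, mean_b, u_a, u_b = mean_ranks_and_u(model_a, model_b)
correctness_a = dichotomize(model_a, 3)  # scores 1-2 vs 3-5
accuracy_a = dichotomize(model_a, 4)     # scores 1-3 vs 4-5
```

In practice a statistics package would supply the P value; the sketch only shows how mean ranks and the two dichotomized subgroups are derived from the same scores.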

Publisher

SAGE Publications

Link

https://journals.sagepub.com/doi/pdf/10.1177/00031348241269430
