Use of a large language model with instruction‐tuning for reliable clinical frailty scoring

Author:

Kee Xiang Lee Jamie (1), Sng Gerald Gui Ren (2, 3), Lim Daniel Yan Zheng (3, 4), Tung Joshua Yi Min (3, 5), Abdullah Hairil Rizal (3, 6), Chowdury Anupama Roy (1)

Affiliation:

1. Department of Geriatric Medicine, Singapore General Hospital, Singapore, Singapore

2. Department of Endocrinology, Singapore General Hospital, Singapore, Singapore

3. Data Science and Artificial Intelligence Laboratory, Singapore General Hospital, Singapore, Singapore

4. Department of Gastroenterology, Singapore General Hospital, Singapore, Singapore

5. Department of Urology, Singapore General Hospital, Singapore, Singapore

6. Department of Anaesthesiology, Singapore General Hospital, Singapore, Singapore

Abstract

Background: Frailty is an important predictor of health outcomes, characterized by increased vulnerability due to physiological decline. The Clinical Frailty Scale (CFS) is commonly used for frailty assessment but may be influenced by rater bias. Artificial intelligence (AI), particularly large language models (LLMs), offers a promising method for efficient and reliable frailty scoring.

Methods: The study used seven standardized patient scenarios to evaluate the consistency and reliability of CFS scoring by OpenAI's GPT‐3.5‐turbo model. Two methods were tested: a basic prompt and an instruction‐tuned prompt incorporating the CFS definition, a directive for accurate responses, and temperature control. The outputs were compared using the Mann–Whitney U test, with Fleiss' Kappa used to assess inter‐rater reliability, and were also compared with historic human scores of the same scenarios.

Results: The LLM's median scores were similar to those of human raters, with differences of no more than one point. Significant differences in score distributions were observed between the basic and instruction‐tuned prompts in five out of seven scenarios. The instruction‐tuned prompt showed high inter‐rater reliability (Fleiss' Kappa of 0.887) and produced consistent responses in all scenarios. Difficulty in scoring was noted in scenarios with less explicit information on activities of daily living (ADLs).

Conclusions: This study demonstrates the potential of LLMs to score clinical frailty consistently and with high reliability. It also shows that prompt engineering via instruction‐tuning can be a simple but effective approach for optimizing LLMs in healthcare applications. The LLM may overestimate frailty scores when less information about ADLs is provided, possibly because it is less subject to implicit assumptions and extrapolation than humans. Future research could explore the integration of LLMs in clinical research and frailty‐related outcome prediction.
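
For readers unfamiliar with the reliability statistic reported above, the following is a minimal Python sketch of Fleiss' Kappa, treating each repeated LLM response to a scenario as one "rater". It is an illustration only and not the authors' analysis code: the function name `fleiss_kappa`, the variable `hypothetical_runs`, and all scores in it are invented for this example, and the abstract does not state how many responses were collected per scenario.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' Kappa for a list of subjects, each rated the same number of times.

    ratings: list of lists; ratings[i] holds the categorical scores given to
    subject i (here: repeated CFS scores for one scenario).
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    # Count how often each category was assigned to each subject.
    categories = sorted({score for row in ratings for score in row})
    counts = [[Counter(row)[c] for c in categories] for row in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over subjects.
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_subject) / n_subjects

    # Chance agreement from the overall proportion of each category.
    total = n_subjects * n_raters
    proportions = [
        sum(row[j] for row in counts) / total for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        # Only one category was ever used, so Kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Hypothetical data: 3 scenarios, 5 repeated CFS scores each (not study data).
    hypothetical_runs = [
        [3, 3, 3, 3, 3],
        [6, 6, 6, 7, 6],
        [7, 7, 7, 7, 7],
    ]
    print(f"Fleiss' Kappa: {fleiss_kappa(hypothetical_runs):.3f}")
```

Kappa is 1 when every repeated response agrees within each scenario and falls toward 0 (or below) as agreement approaches what the overall score distribution would produce by chance.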

Publisher

Wiley
