BACKGROUND
ChatGPT-4 is the most advanced large language model (LLM) to date; prior iterations have passed medical licensing exams, provided clinical decision support, and improved diagnostics. Although limited, past studies of ChatGPT's performance found that the AI could pass American Heart Association (AHA) Advanced Cardiac Life Support (ACLS) exams with modifications. ChatGPT-4's accuracy has not been studied in more complex clinical scenarios. As heart disease and cardiac arrest remain leading causes of morbidity and mortality in the United States, finding technologies that help increase adherence to ACLS algorithms, which improves survival outcomes, is critical.
OBJECTIVE
Our study examines the accuracy of ChatGPT-4 in following ACLS guidelines for bradycardia and cardiac arrest.
METHODS
We evaluated the accuracy of ChatGPT-4's responses to two simulations based on the 2020 AHA ACLS guidelines, with three primary outcomes of interest: the mean individual step accuracy, the accuracy score per simulation attempt, and the accuracy score for each algorithm. For each simulation step, ChatGPT-4 was scored as correct (1 point) or incorrect (0 points). Each simulation was conducted 20 times.
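A minimal sketch of how the three accuracy measures could be derived from per-step scores is shown below; the attempt scores and step counts are hypothetical placeholders, not the study's data.

```python
from statistics import mean

# Hypothetical per-step scores for one simulation (1 = correct, 0 = incorrect);
# in the study design, each simulation was run 20 times against the 2020 AHA
# ACLS algorithm steps.
attempts = [
    [1, 1, 0, 1, 1, 0, 1],  # attempt 1
    [1, 0, 0, 1, 1, 1, 1],  # attempt 2
    # ... additional attempts ...
]

# Accuracy score per simulation attempt: fraction of steps answered correctly.
per_attempt = [sum(a) / len(a) for a in attempts]

# Mean individual step accuracy: how often each algorithm step was correct
# across all attempts.
per_step = [mean(step_scores) for step_scores in zip(*attempts)]

# Accuracy score for the algorithm: mean accuracy across all attempts.
algorithm_accuracy = mean(per_attempt)

print(per_attempt, per_step, algorithm_accuracy)
```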
RESULTS
ChatGPT-4's average accuracy was 69% (IQR: 7; 67%–75%) for cardiac arrest and 43% (IQR: 17; 33%–50%) for bradycardia. We found that ChatGPT-4's outputs varied despite consistent input, the same actions were persistently missed, repetitive overemphasis hindered guidance, and erroneous medication information was presented.
CONCLUSIONS
This study highlights the need for consistent and reliable guidance to prevent potential medical errors, and for further optimization of ChatGPT-4 to enhance its reliability and effectiveness in clinical practice.