BACKGROUND
With the rapid advancement of artificial intelligence (AI) across various fields, evaluating its application in specialized medical contexts has become crucial. ChatGPT, a large language model developed by OpenAI, has shown potential in diverse applications, including medicine.
OBJECTIVE
This study aims to compare the performance of ChatGPT with that of attending neurologists on a real neurology specialist examination conducted in the Valencian Community, Spain, in order to assess the AI's capabilities and limitations in specialized medical knowledge.
METHODS
We conducted a comparative analysis using the results of the 2022 neurology specialist examination from 120 neurologists and the responses generated by ChatGPT versions 3.5 and 4. The examination consisted of 80 multiple-choice questions focused on clinical neurology and health legislation. Questions were classified according to Bloom's Taxonomy. Statistical analysis of performance, including the kappa coefficient for response consistency, was performed.
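For reference, the kappa coefficient mentioned here is presumably the standard Cohen's kappa for agreement between two sets of categorical responses; a minimal formulation, assuming that convention, is

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the observed proportion of agreement between the two response sets and $p_e$ is the proportion of agreement expected by chance, so that $\kappa = 1$ indicates perfect agreement and $\kappa = 0$ indicates agreement no better than chance.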
RESULTS
Human participants achieved a median score of 5.91, with 32 neurologists failing to pass. ChatGPT-3.5 ranked 116th out of 122, answering 54.5% of the questions correctly (score 3.94). ChatGPT-4 showed marked improvement, ranking 17th with 81.8% of answers correct (score 7.57), surpassing several human specialists. No significant differences were observed in performance on lower-order versus higher-order questions. Additionally, ChatGPT-4 demonstrated higher inter-rater reliability, with a kappa coefficient of 0.73 compared with 0.69 for ChatGPT-3.5.
CONCLUSIONS
This study underscores the evolving capabilities of AI in medical knowledge assessment, particularly in specialized fields. ChatGPT-4's performance, surpassing the median human score on a rigorous neurology examination, marks a notable advancement and suggests its potential as an effective tool in specialized medical education and assessment.