BACKGROUND
The rapid growth of studies evaluating ChatGPT's performance on examinations has swamped the medical education community. However, these studies span examinations ranging from low to high stakes, which affects the reliability and validity of their findings. To ensure reliability and reach a consensus, we synthesized the evidence on ChatGPT's performance in high-stakes examinations, namely, National Licensing Medical Examinations (NLMEs).
OBJECTIVE
To evaluate ChatGPT’s performance on NLMEs and assess whether it could obtain a license to practice medicine in various countries.
METHODS
We searched the PubMed and Scopus databases for studies that evaluated ChatGPT's performance on NLMEs, supplemented by searches of reference lists and Google Scholar. Studies were screened, and ChatGPT's accuracy rate (performance) was extracted, along with other study characteristics.
RESULTS
We identified 37 studies that evaluated ChatGPT's performance across 18 NLMEs. Most studies assessed ChatGPT on the NLMEs of the United States, China, and Japan. While the majority of studies used official question datasets, others relied on unofficial datasets from third parties, and only a few applied prompting techniques. GPT-4 outperformed GPT-3.5 on all NLMEs and passed all of them. GPT-4 also exceeded the average performance of examinees in most studies, except for the Japanese NLME.
CONCLUSIONS
Current evidence suggests that ChatGPT can pass 18 NLMEs, surpassing almost all candidates, and could, hypothetically, be granted a "global medical license." Further research should move toward assessing newer models such as GPT-4o and exploring ChatGPT's potential for NLME development and validation. Moreover, our findings represent a call for reimagining assessment in medical education.
CLINICALTRIAL
Not applicable