BACKGROUND
Adherence to evidence-based practice is indispensable in healthcare. Recently, the utility of artificial intelligence (AI)-based models in healthcare has been evaluated extensively. However, the lack of consensus guidelines for the design and reporting of these studies poses challenges to the interpretation and synthesis of evidence.
OBJECTIVE
To propose a preliminary framework forming the basis of comprehensive guidelines to standardize reporting of AI-based studies in healthcare education and practice.
METHODS
A systematic literature review was conducted on Scopus, PubMed, and Google Scholar. Published records with “ChatGPT”, “Bing”, or “Bard” in the title were retrieved. The methodologies employed in the included records were examined carefully to identify common pertinent themes and gaps in reporting. A panel discussion followed to establish a unified and thorough reporting checklist. Two independent raters then tested the finalized checklist on the included records, with Cohen’s κ used to evaluate inter-rater reliability.
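For reference, Cohen’s κ adjusts the observed agreement between the two raters for the agreement expected by chance alone; with observed agreement $p_o$ and chance-expected agreement $p_e$, it is defined as

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

so that κ = 1 indicates perfect agreement and κ = 0 indicates agreement no better than chance.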
RESULTS
The final dataset that formed the basis for theme identification and analysis comprised a total of 34 records. The finalized checklist included nine pertinent themes collectively referred to as “METRICS”: (1) Model used and its exact settings; (2) Evaluation approach for the generated content; (3) Timing of testing the model; (4) Transparency of the data source; (5) Range of tested topics; (6) Randomization of selecting the queries; (7) Individual factors in selecting the queries and inter-rater reliability; (8) Count of queries executed to test the model; and (9) Specificity of the prompts and language used. The overall mean METRICS score was 3.0 (SD 0.58). Inter-rater reliability was acceptable, with Cohen’s κ ranging from 0.558 to 0.962 (P<.001 for all nine items). Per item, the highest average score was recorded for the “Model” item, followed by the “Specificity of the prompts and language used” item, while the lowest scores were recorded for the “Randomization of selecting the queries” item (classified as suboptimal) and the “Individual factors in selecting the queries and inter-rater reliability” item (classified as satisfactory).
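As a minimal sketch of how summary statistics and agreement coefficients of this kind can be computed (all values below are hypothetical placeholders, not the study’s data; scikit-learn’s cohen_kappa_score is assumed to be available):

# Minimal sketch with hypothetical values, not the study's actual data.
import statistics
from sklearn.metrics import cohen_kappa_score

# Hypothetical overall METRICS scores for a few records
overall_scores = [3.2, 2.8, 3.5, 2.9, 3.1]
print(f"mean = {statistics.mean(overall_scores):.2f}, "
      f"SD = {statistics.stdev(overall_scores):.2f}")

# Hypothetical per-record scores on one checklist item from two raters
rater_a = [5, 4, 4, 3, 5, 2, 4, 3]
rater_b = [5, 4, 3, 3, 5, 2, 4, 4]
print(f"Cohen's kappa = {cohen_kappa_score(rater_a, rater_b):.3f}")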
CONCLUSIONS
The variability observed in methodologies and reporting highlights the need for standardized reporting algorithms for AI-based studies in healthcare. The proposed METRICS checklist could be a helpful preliminary step toward establishing a universally accepted approach to standardize reporting in AI-based healthcare studies, a swiftly evolving research topic.