Comparison of Medical Research Abstracts Written by Surgical Trainees and Senior Surgeons or Generated by Large Language Models

Published:2024-08-02 Issue:8 Volume:7 Page:e2425373
ISSN:2574-3805
Container-title:JAMA Network Open
language:en
Short-container-title:JAMA Netw Open

Author:

Holland Alexis M.¹,Lorenz William R.¹,Cavanagh Jack C.²,Smart Neil J.³,Ayuso Sullivan A.¹,Scarola Gregory T.¹,Kercher Kent W.¹,Jorgensen Lars N.⁴,Janis Jeffrey E.⁵,Fischer John P.⁶,Heniford B. Todd¹

Affiliation:

1. Division of Gastrointestinal and Minimally Invasive Surgery, Department of Surgery, Atrium Health Carolinas Medical Center, Charlotte, North Carolina

2. Department of Economics, Massachusetts Institute of Technology, Cambridge

3. Division of Colorectal Surgery, Department of Surgery, Royal Devon & Exeter Hospital, Exeter, Devon, United Kingdom

4. Department of Clinical Medicine, University of Copenhagen, Bispebjerg & Frederiksberg Hospital, Copenhagen, Denmark

5. Division of Plastic and Reconstructive Surgery, The Ohio State University Wexner Medical Center, Columbus

6. Division of Plastic Surgery, University of Pennsylvania Health System, Philadelphia

Abstract

Importance: Artificial intelligence (AI) has permeated academia, especially OpenAI Chat Generative Pretrained Transformer (ChatGPT), a large language model. However, little has been reported on its use in medical research.

Objective: To assess a chatbot’s capability to generate and grade medical research abstracts.

Design, Setting, and Participants: In this cross-sectional study, ChatGPT versions 3.5 and 4.0 (referred to as chatbot 1 and chatbot 2) were coached to generate 10 abstracts by providing background literature, prompts, analyzed data for each topic, and 10 previously presented, unassociated abstracts to serve as models. The study was conducted between August 2023 and February 2024 (including data analysis).

Exposure: Abstract versions utilizing the same topic and data were written by a surgical trainee or a senior physician or generated by chatbot 1 and chatbot 2 for comparison. The 10 training abstracts were written by 8 surgical residents or fellows, edited by the same senior surgeon, at a high-volume hospital in the Southeastern US with an emphasis on outcomes-based research. Abstract comparison was then based on 10 abstracts written by 5 surgical trainees within the first 6 months of their research year, edited by the same senior author.

Main Outcomes and Measures: The primary outcome measurements were the abstract grades using 10- and 20-point scales and ranks (first to fourth). Abstract versions by chatbot 1, chatbot 2, junior residents, and the senior author were compared and judged by blinded surgeon-reviewers as well as both chatbot models. Five academic attending surgeons from Denmark, the UK, and the US, with extensive experience in surgical organizations, research, and abstract evaluation served as reviewers.

Results: Surgeon-reviewers were unable to differentiate between abstract versions. Each reviewer ranked an AI-generated version first at least once. Abstracts demonstrated no difference in their median (IQR) 10-point scores (resident, 7.0 [6.0-8.0]; senior author, 7.0 [6.0-8.0]; chatbot 1, 7.0 [6.0-8.0]; chatbot 2, 7.0 [6.0-8.0]; P = .61), 20-point scores (resident, 14.0 [12.0-17.0]; senior author, 15.0 [13.0-17.0]; chatbot 1, 14.0 [12.0-16.0]; chatbot 2, 14.0 [13.0-16.0]; P = .50), or rank (resident, 3.0 [1.0-4.0]; senior author, 2.0 [1.0-4.0]; chatbot 1, 3.0 [2.0-4.0]; chatbot 2, 2.0 [1.0-3.0]; P = .14). The abstract grades given by chatbot 1 were comparable to the surgeon-reviewers’ grades. However, chatbot 2 graded more favorably than the surgeon-reviewers and chatbot 1. Median (IQR) chatbot 2-reviewer grades were higher than surgeon-reviewer grades of all 4 abstract versions (resident, 14.0 [12.0-17.0] vs 16.9 [16.0-17.5]; P = .02; senior author, 15.0 [13.0-17.0] vs 17.0 [16.5-18.0]; P = .03; chatbot 1, 14.0 [12.0-16.0] vs 17.8 [17.5-18.5]; P = .002; chatbot 2, 14.0 [13.0-16.0] vs 16.8 [14.5-18.0]; P = .04). When comparing the grades of the 2 chatbots, chatbot 2 gave higher median (IQR) grades for abstracts than chatbot 1 (resident, 14.0 [13.0-15.0] vs 16.9 [16.0-17.5]; P = .003; senior author, 13.5 [13.0-15.5] vs 17.0 [16.5-18.0]; P = .004; chatbot 1, 14.5 [13.0-15.0] vs 17.8 [17.5-18.5]; P = .003; chatbot 2, 14.0 [13.0-15.0] vs 16.8 [14.5-18.0]; P = .01).

Conclusions and Relevance: In this cross-sectional study, trained chatbots generated convincing medical abstracts, undifferentiable from resident or senior author drafts. Chatbot 1 graded abstracts similarly to surgeon-reviewers, while chatbot 2 was less stringent. These findings may assist surgeon-scientists in successfully implementing AI in medical research.

Publisher

American Medical Association (AMA)

Link

https://jamanetwork.com/journals/jamanetworkopen/articlepdf/2821876/holland_2024_oi_240796_1721926251.56589.pdf
