Exploring the potential of ChatGPT in medical dialogue summarization: a study on consistency with human preferences

Authors

Liu Yong, Ju Shenggen, Wang Junfeng

Abstract

Background

Telemedicine has experienced rapid growth in recent years, aiming to enhance medical efficiency and reduce the workload of healthcare professionals. During the COVID-19 pandemic, it became especially crucial, enabling remote screenings and access to healthcare services while maintaining social distancing. Online consultation platforms have emerged, but demand has strained the availability of medical professionals, directly motivating research and development in automated medical consultation. In particular, there is a need for efficient and accurate medical dialogue summarization algorithms that condense lengthy conversations into shorter versions focused on the relevant medical facts. The recent success of large language models such as Generative Pre-trained Transformer 3 (GPT-3) has prompted a paradigm shift in natural language processing (NLP) research. In this paper, we explore its impact on medical dialogue summarization.

Methods

We present the performance and evaluation results of two approaches on a medical dialogue dataset. The first approach is based on fine-tuned pre-trained language models, namely BERT-based summarization (BERTSUM) and Bidirectional and Auto-Regressive Transformers (BART). The second approach uses a large language model (LLM), GPT-3.5, with in-context learning (ICL). Evaluation is conducted using automated metrics such as ROUGE and BERTScore.

Results

Compared with the BART and ChatGPT models, the summaries generated by the BERTSUM model not only exhibit significantly lower ROUGE and BERTScore values but also fail to pass manual evaluation on any metric. The BART model achieved the highest ROUGE and BERTScore values among all evaluated models, surpassing ChatGPT: its ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore values were 14.94%, 53.48%, 32.84%, and 6.73% higher, respectively, than ChatGPT's best results. However, in the manual evaluation by medical experts, the summaries generated by the BART model performed satisfactorily only on the "Readability" metric, with fewer than 30% passing manual evaluation on the other metrics. Compared with the BERTSUM and BART models, the ChatGPT model was clearly favored by the human medical experts.

Conclusion

On one hand, the GPT-3.5 model can control the style and content of medical dialogue summaries through different prompts; the generated content is not only better received than the results of some human experts but also more comprehensible, making it a promising avenue for automated medical dialogue summarization. On the other hand, automated evaluation mechanisms such as ROUGE and BERTScore fall short of fully assessing the outputs of large language models like GPT-3.5, so more appropriate evaluation criteria need to be investigated.
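As a concrete illustration of the automated evaluation described in Methods, the following minimal Python sketch computes ROUGE and BERTScore for a candidate summary against a reference. It assumes the open-source rouge-score and bert-score packages (the paper does not specify which implementations it used), and the dialogue summaries below are hypothetical placeholders, not examples from the study's dataset.

# pip install rouge-score bert-score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

# Hypothetical reference and candidate summaries; in the study, the candidate
# would be produced by BERTSUM, BART, or GPT-3.5 with in-context learning.
reference = "Patient reports a dry cough lasting two weeks; doctor advises a chest X-ray."
candidate = "The patient has had a dry cough for two weeks and is advised to get a chest X-ray."

# ROUGE-1/2/L: n-gram and longest-common-subsequence overlap with the reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, result in scorer.score(reference, candidate).items():
    print(f"{name}: F1 = {result.fmeasure:.4f}")

# BERTScore: semantic similarity computed from contextual token embeddings.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore: F1 = {F1.mean().item():.4f}")

Note that ROUGE rewards lexical overlap with the reference, which helps explain the paper's finding that abstractive LLM outputs can score lower on automated metrics while still being preferred by medical experts.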

Funder

Key Project of the National Natural Science Foundation of China

Publisher

Springer Science and Business Media LLC
