Author:
Liu Yong,Ju Shenggen,Wang Junfeng
Abstract
Abstract
Background
Telemedicine has experienced rapid growth in recent years, aiming to enhance medical efficiency and reduce the workload of healthcare professionals. During the COVID-19 pandemic in 2019, it became especially crucial, enabling remote screenings and access to healthcare services while maintaining social distancing. Online consultation platforms have emerged, but the demand has strained the availability of medical professionals, directly leading to research and development in automated medical consultation. Specifically, there is a need for efficient and accurate medical dialogue summarization algorithms to condense lengthy conversations into shorter versions focused on relevant medical facts. The success of large language models like generative pre-trained transformer (GPT)-3 has recently prompted a paradigm shift in natural language processing (NLP) research. In this paper, we will explore its impact on medical dialogue summarization.
Methods
We present the performance and evaluation results of two approaches on a medical dialogue dataset. The first approach is based on fine-tuned pre-trained language models, such as bert-based summarization (BERTSUM) and bidirectional auto-regressive Transformers (BART). The second approach utilizes a large language models (LLMs) GPT-3.5 with inter-context learning (ICL). Evaluation is conducted using automated metrics such as ROUGE and BERTScore.
Results
In comparison to the BART and ChatGPT models, the summaries generated by the BERTSUM model not only exhibit significantly lower ROUGE and BERTScore values but also fail to pass the testing for any of the metrics in manual evaluation. On the other hand, the BART model achieved the highest ROUGE and BERTScore values among all evaluated models, surpassing ChatGPT. Its ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore values were 14.94%, 53.48%, 32.84%, and 6.73% higher respectively than ChatGPT’s best results. However, in the manual evaluation by medical experts, the summaries generated by the BART model exhibit satisfactory performance only in the “Readability” metric, with less than 30% passing the manual evaluation in other metrics. When compared to the BERTSUM and BART models, the ChatGPT model was evidently more favored by human medical experts.
Conclusion
On one hand, the GPT-3.5 model can manipulate the style and outcomes of medical dialogue summaries through various prompts. The generated content is not only better received than results from certain human experts but also more comprehensible, making it a promising avenue for automated medical dialogue summarization. On the other hand, automated evaluation mechanisms like ROUGE and BERTScore fall short in fully assessing the outputs of large language models like GPT-3.5. Therefore, it is necessary to research more appropriate evaluation criteria.
Funder
key project of the National Natural Science Foundation
Publisher
Springer Science and Business Media LLC
Cited by
4 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献