BACKGROUND
In healthcare settings, especially in high-pressure environments such as emergency situations, the ability to document and communicate patient information rapidly and accurately is crucial. Manual documentation is often time-consuming and prone to errors, which can adversely affect patient outcomes. To address these challenges, there is growing interest in integrating advanced technologies, especially Large Language Models (LLMs), into medical communication systems. However, deploying LLMs in clinical environments presents unique challenges, including the need to ensure the accuracy of medical content and to mitigate the risk of generating irrelevant or misleading information.
OBJECTIVE
This paper aims to address these challenges by developing a Natural Language Processing (NLP) pipeline for extracting information from German rescue service treatment dialogues. The objectives are twofold: (1) to generate realistic, medically relevant dialogues for which the ground truth is known, and (2) to accurately extract essential information from these dialogues to populate emergency protocols.
METHODS
This study uses the MIMIC-IV-ED dataset, a de-identified, publicly available resource, to generate synthetic dialogue data for emergency department scenarios. By selecting and anonymizing data from 100 patients, we created a baseline for generating realistic dialogues and evaluating an NLP pipeline. We applied the Post Randomization Method (PRAM) for non-mechanical data perturbation, preserving both patient privacy and data utility. Dialogue generation was conducted in two stages: initial generation using the "Zephyr-7b-beta" model, followed by refinement and translation into German using GPT-4 Turbo. A Retrieval-Augmented Generation (RAG) approach was developed for extracting relevant information from these dialogues, involving chunking, embedding, and dynamic prompt templates. The generated dialogues were evaluated through manual review and sentiment analysis to check that they maintained clinical relevance and emotional accuracy.
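As a rough illustration of the PRAM step described above, the following Python sketch perturbs a categorical attribute by keeping each value with a fixed probability and otherwise replacing it with a different category. The function name `pram`, the parameter `keep_prob`, and the uniform choice among the remaining categories are assumptions made for illustration; the transition probabilities actually used in the study are not stated here.

```python
import random


def pram(values, categories, keep_prob, rng=None):
    """Perturb categorical values in the spirit of PRAM.

    Each value is kept with probability `keep_prob`; otherwise it is
    replaced by a different category drawn uniformly at random.
    The uniform off-diagonal transition is an illustrative assumption.
    """
    rng = rng or random.Random()
    perturbed = []
    for value in values:
        if rng.random() < keep_prob:
            # Keep the original category.
            perturbed.append(value)
        else:
            # Swap to one of the other categories.
            others = [c for c in categories if c != value]
            perturbed.append(rng.choice(others))
    return perturbed
```

A higher `keep_prob` preserves more of the original distribution (utility), while a lower value makes it harder to link a perturbed record back to a patient (privacy).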
RESULTS
The data generation pipeline produced 100 dialogues, with the initial English dialogues averaging 2,000 tokens and the refined German dialogues averaging 4,000 tokens. Manual evaluation identified redundancies and overly formal language in the German dialogues. Sentiment analysis revealed a reduction in negative sentiment from 67% to 59% and an increase in positive sentiment from 27% to 38% after refinement. This shift may hamper text extraction, because more positive wording may not align well with identifying critical topics such as suicidal thoughts. The RAG-based extraction system achieved high precision and recall for both nominal and numerical features in the initial dialogues, with F1-scores ranging from 86.21% to 100%. However, performance declined in the refined dialogues, with notable drops in precision, particularly for "Diagnosis" (60.82%) and "Pain Score" (57.61%).
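For reference, precision, recall, and F1-score for a single extracted feature follow from counts of correct, spurious, and missed extractions. The sketch below uses the standard definitions; the function name `precision_recall_f1` and its arguments are illustrative and do not describe the study's evaluation code.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) from extraction counts.

    true_pos:  extracted values that match the ground truth
    false_pos: extracted values that do not match the ground truth
    false_neg: ground-truth values the system failed to extract
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Because F1 is a harmonic mean, a drop in precision alone, as reported for the refined dialogues, lowers F1 even when recall is unchanged.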
CONCLUSIONS
The results show that the system processes structured, well-defined quantitative information robustly and efficiently. However, the findings also reveal limitations in the system's ability to handle nuanced clinical language, particularly in languages other than English and Chinese.