LLM-AIx: An open source pipeline for Information Extraction from unstructured medical text based on privacy preserving Large Language Models

Published: 2024-09-03

Author:

Wiest Isabella Catharina, Wolf Fabian, Leßmann Marie-Elisabeth, van Treeck Marko, Ferber Dyke, Zhu Jiefu, Boehme Heiko, Bressem Keno K., Ulrich Hannes, Ebert Matthias P., Kather Jakob Nikolas

Abstract

In clinical science and practice, text data such as clinical letters or procedure reports is stored in an unstructured way. Such data cannot be used directly for quantitative investigations, and any manual review or structured information retrieval is time-consuming and costly. The capabilities of Large Language Models (LLMs) mark a paradigm shift in natural language processing and offer new possibilities for structured Information Extraction (IE) from medical free text. This protocol describes a workflow for LLM-based information extraction (LLM-AIx) that enables extraction of predefined entities from unstructured text using privacy-preserving LLMs. By converting unstructured clinical text into structured data, LLM-AIx addresses a critical barrier in clinical research and practice, where the efficient extraction of information is essential for improving clinical decision-making, enhancing patient outcomes, and facilitating large-scale data analysis.

The protocol consists of four main processing steps: 1) problem definition and data preparation, 2) data preprocessing, 3) LLM-based IE, and 4) output evaluation. LLM-AIx can be integrated on local hospital hardware without transferring any patient data to external servers. As example tasks, we applied LLM-AIx to the anonymization of fictitious clinical letters from patients with pulmonary embolism. Additionally, we extracted symptoms and the laterality of the pulmonary embolism from these fictitious letters. We demonstrate troubleshooting for potential problems within the pipeline with an IE task on a real-world dataset, 100 pathology reports from The Cancer Genome Atlas Program (TCGA), for TNM stage extraction. LLM-AIx can be executed without any programming knowledge via an easy-to-use interface and in no more than a few minutes or hours, depending on the LLM selected.
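The four steps named in the abstract can be illustrated with a minimal Python sketch. This is not the LLM-AIx code or its interface: the field names, the `fake_llm` stand-in for a locally hosted model, the example letters, and the per-field accuracy metric are all assumptions made purely for illustration.

```python
import json
from typing import Callable

# Step 1 (problem definition): hypothetical target fields for the
# pulmonary embolism example; the real protocol defines its own entities.
FIELDS = ("shortness_of_breath", "chest_pain", "laterality")


def preprocess(letter: str) -> str:
    """Step 2: collapse runs of whitespace in the free text."""
    return " ".join(letter.split())


def build_prompt(text: str) -> str:
    """Step 3a: ask the model for a JSON object with the predefined keys."""
    return (
        "Extract the following fields from the clinical letter and answer "
        f"with a JSON object with keys {list(FIELDS)}.\n\nLetter:\n{text}"
    )


def extract(letter: str, llm: Callable[[str], str]) -> dict:
    """Step 3b: query the model and keep only the predefined fields."""
    raw = llm(build_prompt(preprocess(letter)))
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}
    # Fields the model did not return are reported as None.
    return {field: parsed.get(field) for field in FIELDS}


def evaluate(predictions: list, ground_truth: list) -> dict:
    """Step 4: per-field accuracy against manually annotated labels."""
    n = len(ground_truth)
    return {
        field: sum(p[field] == g[field] for p, g in zip(predictions, ground_truth)) / n
        for field in FIELDS
    }


def fake_llm(prompt: str) -> str:
    """Keyword-based stand-in for a locally hosted LLM (assumption)."""
    text = prompt.split("Letter:\n", 1)[1].lower()
    return json.dumps({
        "shortness_of_breath": "dyspnea" in text,
        "chest_pain": "chest pain" in text,
        "laterality": "left" if "left" in text else "right",
    })


# Fictitious letters and labels, invented for this example only.
letters = [
    "Patient reports dyspnea.  CT shows embolism in the left pulmonary artery.",
    "Chest pain since yesterday; right-sided pulmonary embolism confirmed.",
]
gold = [
    {"shortness_of_breath": True, "chest_pain": False, "laterality": "left"},
    {"shortness_of_breath": False, "chest_pain": True, "laterality": "right"},
]

predictions = [extract(letter, fake_llm) for letter in letters]
scores = evaluate(predictions, gold)
```

In a real deployment, `fake_llm` would be replaced by a call to a model running on local hardware, so that no letter text leaves the hospital network.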

Publisher

Cold Spring Harbor Laboratory
