Validation of Semantic Analyses of Unstructured Medical Data for Research Purposes

Author:

Pokora Roman Michael (1), Le Cornet Lucian (1, 2), Daumke Philipp (3), Mildenberger Peter (4), Zeeb Hajo (5), Blettner Maria (6)

Affiliation:

1. Institute for Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University Mainz, Mainz

2. Studienzentrale, Nationales Centrum für Tumorerkrankungen Heidelberg, Heidelberg

3. Averbis GmbH, Freiburg

4. Klinik und Poliklinik für Diagnostische und Interventionelle Radiologie, University Medical Center of the Johannes Gutenberg University Mainz, Mainz

5. Leibniz-Institut für Präventionsforschung und Epidemiologie (BIPS), Prevention and Evaluation, Bremen

6. Institut für Medizinische Biometrie, Epidemiologie und Informatik, Johannes-Gutenberg Universität Mainz, Mainz

Abstract

Background: Secondary data often contain unstructured free text. The aim of this study was to validate a text mining system for extracting unstructured medical data for research purposes.

Methods: From the 7,102 CT findings of a radiological department, 1,000 were randomly selected. Two physicians manually assigned these to predefined groups. For automated tagging and reporting, the text analysis software Averbis Extraction Platform (AEP) was used. Special features of the system are a morphological analysis that decomposes compound words and the recognition of noun phrases, abbreviations and negated statements. Based on the extracted standardized keywords, findings reports were assigned to the given findings groups using machine learning methods. To assess the reliability and validity of the automated process, the automated mapping and the two independent manual mappings were compared for matches in multiple runs.

Results: Manual classification was too time-consuming. With automated keywording, classification according to ICD-10 proved unsuitable for our data, and keyword search alone did not deliver reliable results. Computer-aided text mining combined with machine learning produced reliable results. Inter-rater reliability was very high, both between the two manual classifications and between the machine and manual classification. The two manual classifications agreed in 93% of all findings (kappa coefficient 0.89 [95% confidence interval (CI) 0.87–0.92]). The automatic classification agreed with the independent second manual classification in 86% of all findings (kappa coefficient 0.79 [95% CI 0.75–0.81]).

Discussion: The classification by the AEP software was very good. In our study, however, its misclassifications followed a systematic pattern: most occurred in findings that indicate an increased risk of cancer. The free-text structure of the findings raises concerns about the feasibility of a purely automated analysis. The combination of human intellect and intelligent, adaptive software appears most suitable for mining unstructured but important textual information for research.
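The two agreement measures reported in the abstract, percent agreement and the kappa coefficient, can be illustrated with a minimal sketch. It assumes that the kappa reported is Cohen's kappa for two raters, which the abstract does not state explicitly. The function names, the group labels and the ten toy assignments are hypothetical and serve only as illustration; they are not the study data, and the confidence interval calculation is not shown.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Share of findings that both raters assigned to the same group."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of findings")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both raters pick the same group
    # if each assigned groups independently with their own marginal rates.
    p_expected = sum(
        counts_a[group] * counts_b[group]
        for group in counts_a.keys() | counts_b.keys()
    ) / (n * n)
    if p_expected == 1.0:
        # Both raters used a single identical group for every finding.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data: ten findings, three made-up groups.
manual_rater = ["normal", "normal", "suspicious", "other", "normal",
                "suspicious", "other", "normal", "suspicious", "normal"]
automated = ["normal", "normal", "suspicious", "other", "suspicious",
             "suspicious", "other", "normal", "normal", "normal"]

agreement = percent_agreement(manual_rater, automated)
kappa = cohen_kappa(manual_rater, automated)
print(f"agreement: {agreement:.2f}, kappa: {kappa:.2f}")
```

Kappa is lower than the raw agreement because it subtracts the agreement expected by chance, which is why the abstract reports both figures for each comparison.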

Publisher

Georg Thieme Verlag KG

Subject

Public Health, Environmental and Occupational Health
