BACKGROUND
The detection, analysis, and organization of unstructured data within Electronic Health Records (EHRs) through Natural Language Processing (NLP) and machine learning (ML) have become vital to fully leveraging the information created during clinical practice.
OBJECTIVE
The aim of this study was to assess the performance of EHRead® (a technology that applies NLP and ML) in identifying mentions of cardiovascular (CV)-phenotypes in patients' EHRs, as well as specific CV-related linguistic variables at the patient and record levels.
METHODS
This was a validation study using data from three hospitals in Spain, covering 2012 to 2017. A predefined set of clinical entities grouped under different CV-phenotypes, and CV-related linguistic variables, were extracted from unstructured EHRs. The reliability of the annotation guidelines was validated by the Inter-Annotator Agreement (IAA), and a gold standard corpus was developed to evaluate EHRead®'s performance in terms of Precision (P), Recall (R), and F1-score.
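As an illustration of the evaluation metrics named above, the following minimal Python sketch computes Precision, Recall, and F1-score by comparing a set of system detections against a gold standard. It is not the study's evaluation code: the function name `evaluate`, the `Scores` container, the (record, phenotype) pair representation, and the toy data are all hypothetical assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    precision: float
    recall: float
    f1: float


def evaluate(gold: set, predicted: set) -> Scores:
    """Compare predicted items against gold-standard items (hypothetical helper)."""
    tp = len(gold & predicted)   # detections that match the gold standard
    fp = len(predicted - gold)   # detections absent from the gold standard
    fn = len(gold - predicted)   # gold-standard items the system missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return Scores(precision, recall, f1)


# Toy data, invented for illustration only: (record id, phenotype) pairs.
gold = {
    ("rec1", "heart failure"),
    ("rec1", "hypertension"),
    ("rec2", "stroke"),
    ("rec3", "atrial fibrillation"),
}
predicted = {
    ("rec1", "heart failure"),
    ("rec2", "stroke"),
    ("rec3", "atrial fibrillation"),
    ("rec3", "angina"),
}

scores = evaluate(gold, predicted)
print(f"P={scores.precision:.2f} R={scores.recall:.2f} F1={scores.f1:.2f}")
```

In this toy case three of the four detections are correct and three of the four gold-standard items are found, so all three metrics equal 0.75. The same comparison could be applied between two annotators' label sets to express agreement as an F1-score.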
RESULTS
The number of correctly identified CV-phenotype mentions was 249 out of 280 (P=0.93), 53 out of 57 (P=0.98), and 165 out of 178 (P=0.99) for sites 01, 02, and 03, respectively, with an F1-score >0.80 in almost all CV-phenotype detections. The IAA had an F1-score ≥0.96 across all sites. EHRead® demonstrated a high R-value when detecting patients and records containing the CV-related linguistic variables.
CONCLUSIONS
Our study demonstrated the high performance of EHRead® technology in identifying CV-phenotypes and in retrieving patients and records in the CV domain in Spanish. This research lays the methodological foundations for future clinical research to generate deep real-world insights relevant to patient care and to enable better treatment decisions.