Facilitating clinical research through automation: Combining optical character recognition with natural language processing

Published:2022-05-24 Issue:5 Volume:19 Page:504-511
ISSN:1740-7745
Container-title:Clinical Trials
language:en
Short-container-title:Clinical Trials

Author:

Hom Julie¹, Nikowitz Janet¹, Ottesen Rebecca¹, Niland Joyce C¹

Affiliation:

1. Department of Diabetes & Cancer Discovery Science, City of Hope, Duarte, CA, USA

Abstract

Background/Aims: Performance status is crucial for most clinical research, as an eligibility criterion, a comorbidity covariate, or a trial endpoint. Yet information on performance status often is embedded as free text within a patient’s electronic medical record, rather than coded directly, thereby making this concept extremely difficult to extract for research. Furthermore, performance status information frequently resides in outside reports, which are scanned into the electronic medical record along with thousands of clinic notes. The image format of scanned documents also is a major obstacle to the search and retrieval of information, as natural language processing cannot be applied to unstructured text within an image. We, therefore, utilized optical character recognition software to convert images to a searchable format, allowing the application of natural language processing to identify pertinent performance status data elements within scanned electronic medical records.

Methods: Our study cohort consisted of 189 subjects diagnosed with diffuse large B-cell lymphoma for whom performance status was a required data element for analysis of prognostic factors related to recurrence and survival. Manual abstraction of performance status was previously conducted by a clinical Subject Matter Expert, serving as the gold standard. Leveraging our data warehouse, we extracted relevant scanned electronic medical record documents and applied optical character recognition to these images using the ABBYY FineReader software. The Linguamatics i2e natural language processing software was then used to run queries for performance status against the corpus of electronic medical record documents. We evaluated our optical character recognition/natural language processing pipeline for accuracy and reduction in data extraction effort.
Results: We found that there was high accuracy and reduced time for extraction of performance status data by applying our optical character recognition/natural language processing pipeline. The transformed scanned documents from a random sample of patients yielded excellent precision, recall, and F score, with <1% incorrect results. Time savings from a second cohort showed that median time to review documents for patients with performance status data present was reduced by a third. The major time savings was in the review of those documents that in fact did not contain performance status information: median of 18 minutes versus 108 minutes for manual review, an 83% reduction in data abstraction effort.

Conclusion: By applying this optical character recognition/natural language processing pipeline, we achieved significant operational improvement and reduced time for information retrieval to support clinical research. Our study demonstrated that optical character recognition software provides an effective mechanism to transform scanned electronic medical record images to allow the application of natural language processing, yielding highly accurate data abstraction. We conclude that our optical character recognition/natural language processing pipeline can greatly facilitate research data abstraction by providing a highly focused data review, eliminating unnecessary manual review of the entire chart, and thus freeing time for abstracting other data elements requiring more human interpretation.
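
The evaluation measures named in the abstract (precision, recall, F score, and the percentage reduction in review time) can be illustrated with a short sketch. This is not the authors' code. The function names are hypothetical, and the confusion counts in the usage lines are invented placeholders, since the abstract does not report them. Only the 18-minute and 108-minute medians come from the abstract.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive,
    and false-negative counts, judged against a gold standard."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def percent_reduction(baseline: float, new: float) -> float:
    """Percentage reduction of `new` relative to `baseline`."""
    return 100.0 * (baseline - new) / baseline


# Median review times reported in the abstract for documents that
# contained no performance status information (minutes).
manual_minutes = 108
pipeline_minutes = 18
print(f"{percent_reduction(manual_minutes, pipeline_minutes):.0f}% reduction")  # 83% reduction

# Hypothetical counts, for illustration only (not from the study).
p, r, f = precision_recall_f1(tp=95, fp=1, fn=4)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

The reduction works out to (108 − 18) / 108 ≈ 83%, matching the figure stated in the abstract.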

Publisher

SAGE Publications

Subject

Pharmacology, General Medicine

Link

http://journals.sagepub.com/doi/pdf/10.1177/17407745221093621

Cited by 7 articles.

1. A machine learning framework for extracting information from biological pathway images in the literature;Metabolic Engineering;2024-11

2. A machine learning framework for extracting information from biological pathway images in the literature;2024-06-03

3. Development and Practical Applications of Computational Intelligence Technology;BioMedInformatics;2024-02-22

4. Development of novel optical character recognition system to reduce recording time for vital signs and prescriptions: A simulation-based study;PLOS ONE;2024-01-19

5. Performance of natural language processing in identifying adenomas from colonoscopy reports: a systematic review and meta-analysis;iGIE;2023-09