Facilitating clinical research through automation: Combining optical character recognition with natural language processing

Author:

Hom Julie1ORCID,Nikowitz Janet1,Ottesen Rebecca1,Niland Joyce C1ORCID

Affiliation:

1. Department of Diabetes & Cancer Discovery Science, City of Hope, Duarte, CA, USA

Abstract

Background/Aims Performance status is crucial for most clinical research, as an eligibility criterion, a comorbidity covariate, or a trial endpoint. Yet information on performance status often is embedded as free text within a patient’s electronic medical record, rather than coded directly, thereby making this concept extremely difficult to extract for research. Furthermore, performance status information frequently resides in outside reports, which are scanned into the electronic medical record along with thousands of clinic notes. The image format of scanned documents also is a major obstacle to the search and retrieval of information, as natural language processing cannot be applied to unstructured text within an image. We, therefore, utilized optical character recognition software to convert images to a searchable format, allowing the application of natural language processing to identify pertinent performance status data elements within scanned electronic medical records. Methods Our study cohort consisted of 189 subjects diagnosed with diffuse large B-cell lymphoma for whom performance status was a required data element for analysis of prognostic factors related to recurrence and survival. Manual abstraction of performance status was previously conducted by a clinical Subject Matter Expert, serving as the gold standard. Leveraging our data warehouse, we extracted relevant scanned electronic medical record documents and applied optical character recognition to these images using the ABBYY FineReader software. The Linguamatics i2e natural language processing software was then used to run queries for performance status against the corpus of electronic medical record documents. We evaluated our optical character recognition/natural language processing pipeline for accuracy and reduction in data extraction effort. Results We found that there was high accuracy and reduced time for extraction of performance status data by applying our optical character recognition/natural language processing pipeline. The transformed scanned documents from a random sample of patients yielded excellent precision, recall, and F score, with <1% incorrect results. Time savings from a second cohort showed that median time to review documents for patients with performance status data present was reduced by a third. The major time savings was in the review of those documents that in fact did not contain performance status information: median of 18 minutes versus 108 minutes for manual review, an 83% reduction in data abstraction effort. Conclusion By applying this optical character recognition/natural language processing pipeline, we achieved significant operational improvement and reduced time for information retrieval to support clinical research. Our study demonstrated that optical character recognition software provides an effective mechanism to transform scanned electronic medical record images to allow the application of natural language processing, yielding highly accurate data abstraction. We conclude that our optical character recognition/natural language processing pipeline can greatly facilitate research data abstraction by providing a highly focused data review, eliminating unnecessary manual review of the entire chart, and thus freeing time for abstracting other data elements requiring more human interpretation.

Publisher

SAGE Publications

Subject

Pharmacology,General Medicine

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3