Impact of possible errors in natural language processing-derived data on downstream epidemiologic analysis

Author:

Lan Zhou (1, 2), Turchin Alexander (2, 3)

Affiliation:

1. Center for Clinical Investigation, Brigham & Women’s Hospital, Boston, MA 02115, United States

2. Harvard Medical School, Boston, MA 02115, United States

3. Division of Endocrinology, Brigham & Women’s Hospital, Boston, MA 02115, United States

Abstract

Objective: To assess the impact of potential errors in natural language processing (NLP) on the results of epidemiologic studies.

Materials and Methods: We utilized data from three outcomes research studies in which the primary predictor variable was generated using NLP. For each of these studies, Monte Carlo simulations were applied to generate datasets simulating potential errors in the NLP-derived variables. We then fit the original regression models to these partially simulated datasets and compared the distribution of coefficient estimates to the original study results.

Results: Among the four models evaluated, the mean change in the point estimate of the relationship between the predictor variable and the outcome ranged from −21.9% to 4.12%. In three of the four models, the significance of this relationship was not eliminated in any of the 500 simulations; in the remaining model it was eliminated in 12% of simulations. Mean changes in the estimates for confounder variables ranged from 0.27% to 2.27%, and the significance of their relationship with the outcome was eliminated in between 0% and 9.25% of simulations. No variable underwent a shift in the direction of its interpretation.

Discussion: The impact of simulated NLP errors on the results of epidemiologic studies was modest, with only small changes in effect estimates and no changes in the interpretation of the findings (direction and significance of association with the outcome) for either the NLP-generated variables or the other variables in the models.

Conclusion: NLP errors are unlikely to affect the results of studies that use NLP as the source of data.
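The simulation procedure described in Materials and Methods can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the study: the data are synthetic, the NLP-derived predictor is treated as binary, errors are injected using hypothetical sensitivity and specificity values, and a one-predictor least-squares fit stands in for the original regression models. The sketch tracks only the change in the coefficient, not its statistical significance.

```python
import random
import statistics


def ols_slope(x, y):
    """Slope b of the least-squares line y = a + b*x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def perturb(x, sensitivity, specificity, rng):
    """Inject simulated NLP errors into a binary variable.

    A true 1 is kept with probability `sensitivity` (otherwise it becomes
    a false negative); a true 0 is kept with probability `specificity`
    (otherwise it becomes a false positive).
    """
    out = []
    for xi in x:
        if xi == 1:
            out.append(1 if rng.random() < sensitivity else 0)
        else:
            out.append(0 if rng.random() < specificity else 1)
    return out


def simulate(x, y, sensitivity, specificity, n_sims=500, seed=0):
    """Refit the model on n_sims perturbed copies of x.

    Returns the original slope, the list of simulated slopes, and the
    mean percent change of the simulated slopes relative to the original.
    """
    rng = random.Random(seed)
    original = ols_slope(x, y)
    slopes = [
        ols_slope(perturb(x, sensitivity, specificity, rng), y)
        for _ in range(n_sims)
    ]
    pct_changes = [100.0 * (s - original) / original for s in slopes]
    return original, slopes, statistics.fmean(pct_changes)


# Synthetic stand-in data (hypothetical): binary predictor, continuous outcome.
data_rng = random.Random(42)
x = [1 if data_rng.random() < 0.4 else 0 for _ in range(2000)]
y = [2.0 * xi + data_rng.gauss(0, 1) for xi in x]

# Hypothetical error rates: 90% sensitivity, 95% specificity.
original, slopes, mean_pct = simulate(x, y, sensitivity=0.9, specificity=0.95)
```

In this toy setting, random misclassification of the predictor pulls the estimated slope toward zero, so `mean_pct` comes out negative. A fuller version would refit each study's actual model and also record how often the predictor's significance is lost across simulations.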

Funder

Patient-Centered Outcomes Research Institute

Publisher

Oxford University Press (OUP)

Subject

Health Informatics

