Evaluation of Structured, Semi-Structured, and Free-Text Electronic Health Record Data to Classify Hepatitis C Virus (HCV) Infection

Author:

Fong Allan (1), Hughes Justin (2), Gundapenini Sravya (1, 3), Hack Benjamin (4), Barkhordar Mahdi (2), Huang Sean Shenghsiu (5), Visconti Adam (2, 6), Fernandez Stephen (1), Fishbein Dawn (1, 7)

Affiliation:

1. MedStar Health Research Institute, Hyattsville, MD 20782, USA

2. MedStar Health, Columbia, MD 20037, USA

3. School of Medicine, Ross University, Miramar, FL 33027, USA

4. School of Medicine, Georgetown University, Washington, DC 20007, USA

5. Department of Health Management and Policy, School of Health, Georgetown University, Washington, DC 20007, USA

6. Department of Family Medicine, MedStar Georgetown University, Washington, DC 20010, USA

7. MedStar Washington Hospital Center, Washington, DC 20010, USA

Abstract

Evaluation of the United States Centers for Disease Control and Prevention (CDC)-defined HCV-related risk factors is not consistently performed as part of routine care, rendering risk-based testing susceptible to clinician bias and missed diagnoses. This work uses natural language processing (NLP) and machine learning to identify patients who are at high risk for HCV infection. Models were developed and validated to predict patients with newly identified HCV infection (detectable RNA or reported HCV diagnosis). We evaluated models with three types of variables: structured (structured-based model), semi-structured and free-text notes (text-based model), and all variables (full-set model). We applied each model to three stratifications of data: patients with no history of HCV prior to 2020, patients with a history of HCV prior to 2020, and all patients. We used XGBoost and ten-fold cross-validation, scored by the C-statistic, to evaluate the generalizability of the models. There were 3564 unique patients, 487 of whom had HCV infection. The average C-statistics of the structured-based, text-based, and full-set models for all patients were 0.777 (95% CI: 0.744–0.810), 0.677 (95% CI: 0.631–0.723), and 0.774 (95% CI: 0.735–0.813), respectively. For patients with no history of HCV prior to 2020, the full-set model performed slightly better than the structured-based model and similarly to the text-based model, with average C-statistics of 0.780, 0.774, and 0.759, respectively. NLP identified six additional risk factors that were inconsistently coded in structured elements: incarceration, needlestick, substance use or abuse, sexually transmitted infections, piercings, and tattoos. The availability of model options (structured-based or text-based models) with similar performance can provide deployment flexibility in situations where data is limited.
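The evaluation described in the abstract, a ten-fold cross-validated C-statistic averaged over folds, can be illustrated with a minimal standard-library sketch. Several parts are assumptions and are not taken from the paper: `fit_mean_difference` is a toy stand-in for XGBoost, the folds are stratified by outcome, and the 95% interval is a normal approximation over the per-fold values (the abstract does not say how its intervals were computed). All function names are hypothetical.

```python
import random
import statistics


def c_statistic(labels, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds with roughly equal positive rates (assumption)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def fit_mean_difference(train_x, train_y):
    """Toy stand-in for XGBoost: weight each feature by its class-mean difference."""
    pos = [x for x, y in zip(train_x, train_y) if y == 1]
    neg = [x for x, y in zip(train_x, train_y) if y == 0]
    weights = [
        statistics.mean(r[j] for r in pos) - statistics.mean(r[j] for r in neg)
        for j in range(len(train_x[0]))
    ]
    return lambda row: sum(w * v for w, v in zip(weights, row))


def cross_validated_c(features, labels, fit, k=10, seed=0):
    """Return (mean C-statistic, (lower, upper)) over k held-out folds."""
    per_fold = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        score = fit([features[i] for i in train], [labels[i] for i in train])
        per_fold.append(
            c_statistic(
                [labels[i] for i in held_out],
                [score(features[i]) for i in held_out],
            )
        )
    mean = statistics.mean(per_fold)
    # Normal-approximation interval across folds; an assumption, not the paper's method.
    half_width = 1.96 * statistics.stdev(per_fold) / len(per_fold) ** 0.5
    return mean, (mean - half_width, mean + half_width)
```

To compare variable sets in the way the abstract does, `cross_validated_c` would be called once per feature matrix (structured only, text-derived only, and both combined) with the same labels and seed, so that every model is scored on identical folds.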

Funder

Gilead Sciences, Inc.

Publisher

MDPI AG

Subject

General Medicine

