A hybrid approach to scalable real-world data curation by machine learning and human experts

Published: 2023-03-08

Authors:

Waskom Michael L., Tan Katherine, Wiberg Holly, Cohen Aaron B., Wittmershaus Brett, Shapiro Will

Abstract

Objective: Machine learning has the potential to increase the scale of real-world data curated from electronic health records, but maintaining a high standard of data quality is important to avoid biasing downstream analyses. To increase scale without compromising quality, we propose a hybrid data curation methodology that employs both manual abstraction by clinical experts and automated extraction by machine learning models.

Materials and Methods: Our methodology makes the determination about when to employ manual abstraction using a confidence score associated with each model output. We describe a process for selecting confidence thresholds based on simulations validated against a reference-standard labeled dataset. To establish the fitness of our methodology for retrospective research, we apply it to a multi-variable cohort selection task on a large real-world oncology database.

Results: Only small amounts of manual abstraction are required for hybrid curation to achieve expert-level error rates. In fact, the hybrid methodology can even reduce error rates relative to manual abstraction in some cases. We further demonstrate that demographic characteristics of a research cohort defined using hybrid variables are comparable to one curated with conventional methods.

Discussion: Our methodology is general and makes few assumptions about the clinical variable or machine learning model. A key requirement is the availability of reference standard labels for calibrating the tradeoff between abstraction effort and data quality.

Conclusion: Incorporating machine learning into real-world data curation using hybrid methodology holds the promise to scale practicable cohort sizes while maintaining data fitness for research purposes.
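The confidence-based routing described under Materials and Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy labels, and the threshold search are all assumptions made for illustration, and the search evaluates candidates directly on a small labeled set rather than running the simulations the abstract mentions. It only shows the general idea of accepting a model output when its confidence meets a threshold and sending the record to manual abstraction otherwise.

```python
from typing import List, Optional, Sequence, Tuple


def hybrid_labels(model_labels: Sequence[int], confidences: Sequence[float],
                  expert_labels: Sequence[int], threshold: float) -> List[int]:
    """Keep the model label when confidence >= threshold, else use the expert label."""
    return [m if c >= threshold else e
            for m, c, e in zip(model_labels, confidences, expert_labels)]


def error_rate(labels: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of labels that disagree with the reference standard."""
    return sum(1 for a, b in zip(labels, reference) if a != b) / len(reference)


def abstraction_fraction(confidences: Sequence[float], threshold: float) -> float:
    """Fraction of records routed to manual abstraction at this threshold."""
    return sum(1 for c in confidences if c < threshold) / len(confidences)


def select_threshold(model_labels: Sequence[int], confidences: Sequence[float],
                     expert_labels: Sequence[int], reference: Sequence[int],
                     target_error: float) -> Optional[Tuple[float, float]]:
    """Return (threshold, abstraction fraction) for the lowest candidate threshold
    whose hybrid error rate on the labeled set is within target_error.

    Hypothetical simplification: candidates are the observed confidence values
    plus +inf (everything abstracted), scanned in ascending order so that the
    first hit is the one needing the least manual effort.
    """
    candidates = sorted(set(confidences)) + [float("inf")]
    for t in candidates:
        hybrid = hybrid_labels(model_labels, confidences, expert_labels, t)
        if error_rate(hybrid, reference) <= target_error:
            return t, abstraction_fraction(confidences, t)
    return None  # even full manual abstraction misses the target


# Toy data (made up for illustration, not from the paper).
reference     = [1, 0, 1, 1, 0, 0, 1, 0]
model_labels  = [1, 0, 0, 1, 0, 1, 1, 0]   # wrong on records 2 and 5
confidences   = [0.95, 0.90, 0.55, 0.85, 0.99, 0.60, 0.80, 0.92]
expert_labels = [1, 0, 1, 1, 0, 0, 1, 1]   # wrong on record 7

expert_error = error_rate(expert_labels, reference)
print(select_threshold(model_labels, confidences, expert_labels,
                       reference, target_error=expert_error))
```

In this toy setup, matching the expert error rate requires abstracting only one record in eight, and raising the threshold to 0.80 removes both model errors without inheriting the expert's mistake on a high-confidence record. Real confidence scores would need to be checked against reference-standard labels before any threshold is trusted.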

Publisher

Cold Spring Harbor Laboratory

References (28 articles)

1. Real-world data: towards achieving the achievable in cancer care

2. Real-world Data for Clinical Evidence Generation in Oncology

3. Randomized controlled trials and real-world data: differences and similarities to untangle literature data

4. The use of real-world data in cancer drug development

5. Opportunities and challenges in leveraging electronic health record data in oncology

Cited by 4 articles.

1. Raising the Bar for Real-World Data in Oncology: Approaches to Quality Across Multiple Dimensions; JCO Clinical Cancer Informatics; 2024-03

2. The emerging role of real-world data in oncology care in Japan; ESMO Real World Data and Digital Oncology; 2023-12

3. Approach to machine learning for extraction of real-world data variables from electronic health records; Frontiers in Pharmacology; 2023-09-15

4. A Natural Language Processing Algorithm to Improve Completeness of ECOG Performance Status in Real-World Data; Applied Sciences; 2023-05-18