Using natural language processing to construct a metastatic breast cancer cohort from linked cancer registry and electronic medical records data

Author:

Ling Albee Y (1, 2), Kurian Allison W (3, 4), Caswell-Jin Jennifer L (3), Sledge George W (3), Shah Nigam H (2, 5), Tamang Suzanne R (2, 6)

Affiliation:

1. Biomedical Informatics Training Program, Stanford University, Stanford, CA

2. Department of Biomedical Data Science, Stanford University, Stanford, CA

3. Department of Medicine, Stanford University School of Medicine, Stanford, CA

4. Department of Health Research and Policy, Stanford University School of Medicine, Stanford, CA

5. Center for Biomedical Informatics Research, Stanford University, CA

6. Center for Population Health Sciences, Stanford University, CA

Abstract

Objectives: Most population-based cancer databases lack information on metastatic recurrence. Electronic medical records (EMR) and cancer registries contain complementary information on cancer diagnosis, treatment and outcome, yet are rarely used synergistically. To construct a cohort of metastatic breast cancer (MBC) patients, we applied natural language processing techniques within a semisupervised machine learning framework to linked EMR-California Cancer Registry (CCR) data.

Materials and Methods: We studied all female patients treated at Stanford Health Care with an incident breast cancer diagnosis from 2000 to 2014. Our database consisted of structured fields and unstructured free-text clinical notes from EMR, linked to CCR, a component of the Surveillance, Epidemiology and End Results Program (SEER). We identified de novo MBC patients from CCR and extracted information on distant recurrences from patient notes in EMR. Furthermore, we trained a regularized logistic regression model for recurrent MBC classification and evaluated its performance on a gold standard set of 146 patients.

Results: There were 11 459 breast cancer patients in total and the median follow-up time was 96.3 months. We identified 1886 MBC patients, 512 (27.1%) of whom were de novo MBC patients and 1374 (72.9%) were recurrent MBC patients. Our final MBC classifier achieved an area under the receiver operating characteristic curve (AUC) of 0.917, with sensitivity 0.861, specificity 0.878, and accuracy 0.870.

Discussion and Conclusion: To enable population-based research on MBC, we developed a framework for retrospective case detection combining EMR and CCR data. Our classifier achieved good AUC, sensitivity, and specificity without expert-labeled examples.
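The evaluation metrics named in the abstract (AUC, sensitivity, specificity, accuracy) can be illustrated with a minimal sketch. This is not the authors' code: the function name `classification_metrics`, the 0.5 decision threshold, and the toy labels and scores are all hypothetical and chosen only to show how such figures are typically computed against a gold standard.

```python
def classification_metrics(y_true, y_score, threshold=0.5):
    """Compute sensitivity, specificity, accuracy, and AUC for binary labels.

    y_true:  gold-standard labels (1 = recurrent MBC, 0 = not), hypothetical here
    y_score: classifier probabilities in [0, 1]
    threshold: assumed cut-off for turning a score into a predicted label
    """
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # AUC as the probability that a randomly chosen positive case
    # receives a higher score than a randomly chosen negative case
    # (ties count as one half).
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)

    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
        "auc": wins / (len(pos) * len(neg)),
    }


# Toy data for illustration only; not taken from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
metrics = classification_metrics(y_true, y_score)
print(metrics)
```

Sensitivity, specificity, and accuracy depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the abstract reports it separately.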

Funder

Breast Cancer Research Foundation

the Suzanne Pride Bryan Fund for Breast Cancer Research

the BRCA Foundation

the Jan Weimer Junior Faculty Chair in Breast Oncology

the Susan and Richard Levy Gift Fund

the Regents of the University of California’s California Breast Cancer Research Program

National Cancer Institute’s Surveillance, Epidemiology and End Results Program

Cancer Prevention Institute of California

California Department of Health Services

California Health and Safety Code Section

University of Southern California

Public Health Institute

Centers for Disease Control and Prevention’s National Program of Cancer Registries

ASCO Young Investigator Award

Conquer Cancer Foundation and a Damon Runyon Physician-Scientist Training Award

University or State of California

National Cancer Institute

Centers for Disease Control and Prevention

Publisher

Oxford University Press (OUP)

Subject

Health Informatics
