Affiliation:
1. Biomedical Informatics Training Program, Stanford University, Stanford, CA
2. Department of Biomedical Data Science, Stanford University, Stanford, CA
3. Department of Medicine, Stanford University School of Medicine, Stanford, CA
4. Department of Health Research and Policy, Stanford University School of Medicine, Stanford, CA
5. Center for Biomedical Informatics Research, Stanford University, Stanford, CA
6. Center for Population Health Sciences, Stanford University, Stanford, CA
Abstract
Objectives
Most population-based cancer databases lack information on metastatic recurrence. Electronic medical records (EMR) and cancer registries contain complementary information on cancer diagnosis, treatment and outcome, yet are rarely used synergistically. To construct a cohort of metastatic breast cancer (MBC) patients, we applied natural language processing techniques within a semisupervised machine learning framework to linked EMR-California Cancer Registry (CCR) data.
Materials and Methods
We studied all female patients treated at Stanford Health Care with an incident breast cancer diagnosis from 2000 to 2014. Our database consisted of structured fields and unstructured free-text clinical notes from EMR, linked to CCR, a component of the Surveillance, Epidemiology and End Results Program (SEER). We identified de novo MBC patients from CCR and extracted information on distant recurrences from patient notes in EMR. Furthermore, we trained a regularized logistic regression model for recurrent MBC classification and evaluated its performance on a gold standard set of 146 patients.
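As an illustration of the classification step, the following is a minimal sketch of a regularized logistic regression fitted by batch gradient descent. It is not the authors' implementation: the L2 penalty, the two toy features, the labels, and all function names and hyperparameters are assumptions made for this example, since the abstract does not specify the penalty type or the features used.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logreg_l2(X, y, lam=0.1, lr=0.1, epochs=2000):
    """Minimize mean log loss plus (lam / 2) * ||w||^2 by batch gradient descent.

    The bias term is left unpenalized. The L2 penalty is an assumption;
    the source only says "regularized".
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Gradient of the penalized objective: data term plus lam * w.
        w = [wj - lr * (gw / n + lam * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    # Estimated probability that a patient belongs to the positive class.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical toy features per patient:
# [mentions of distant metastasis, negated mentions]. Labels are made up.
X = [[3, 0], [2, 0], [4, 1], [0, 2], [0, 1], [1, 3]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logreg_l2(X, y)
```

The penalty keeps the weights bounded even when the toy data are perfectly separable, which is the practical reason to regularize when labels are noisy or features are numerous.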
Results
The cohort comprised 11 459 breast cancer patients, with a median follow-up time of 96.3 months. We identified 1886 MBC patients, of whom 512 (27.1%) had de novo MBC and 1374 (72.9%) had recurrent MBC. Our final MBC classifier achieved an area under the receiver operating characteristic curve (AUC) of 0.917, with sensitivity 0.861, specificity 0.878, and accuracy 0.870.
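For readers less familiar with the reported metrics, the sketch below shows how sensitivity, specificity, accuracy, and AUC are conventionally computed from gold standard labels and classifier scores. The labels, scores, 0.5 threshold, and function names are hypothetical and chosen only for illustration; they are not the study's data or code.

```python
def confusion_counts(y_true, y_pred):
    # Count true/false positives and negatives for binary labels (1 = positive).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),   # fraction of true cases detected
        "specificity": tn / (tn + fp),   # fraction of non-cases correctly excluded
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

def auc(y_true, scores):
    # Probability that a randomly chosen positive scores higher than a
    # randomly chosen negative; ties count as one half.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical gold standard labels and classifier scores.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

Sensitivity, specificity, and accuracy depend on the chosen decision threshold, whereas AUC summarizes ranking quality across all thresholds.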
Discussion and Conclusion
To enable population-based research on MBC, we developed a framework for retrospective case detection that combines EMR and CCR data. Our classifier achieved an AUC above 0.9, with sensitivity and specificity both above 0.85, without expert-labeled training examples.
Funder
Breast Cancer Research Foundation
the Suzanne Pride Bryan Fund for Breast Cancer Research
the BRCA Foundation
the Jan Weimer Junior Faculty Chair in Breast Oncology
the Susan and Richard Levy Gift Fund
the Regents of the University of California’s California Breast Cancer Research Program
National Cancer Institute’s Surveillance, Epidemiology and End Results Program
Cancer Prevention Institute of California
California Department of Health Services
California Health and Safety Code Section
University of Southern California
Public Health Institute
Centers for Disease Control and Prevention’s National Program of Cancer Registries
ASCO Young Investigator Award
Conquer Cancer Foundation and a Damon Runyon Physician-Scientist Training Award
University or State of California
National Cancer Institute
Centers for Disease Control and Prevention
Publisher
Oxford University Press (OUP)